Archive

Posts Tagged ‘tracker’

Speaking at GNOME.Asia 2012 (Hong Kong)

Only 3 weeks left for GNOME.Asia, this year in the beautiful city of Hong Kong.

According to the current schedule I’ll be giving two talks on Saturday June 9th:

  • An introduction to Tracker, SPARQL and whatnot:
    Tracker is not (only) a search engine: it is a semantic data storage, powered by RDF ontologies, Nepomuk, and the SPARQL protocol. Tracker is one of the core building blocks of the MeeGo Harmattan operating system; and since release 2.30, it is also a blessed external dependency in GNOME. But it wasn’t until GNOME 3 and the first release of GNOME Documents that it started to get some really good attention… and there is more to come!
  • ModemManager revamped: now supporting LTE/4G modems:
    ModemManager, along with NetworkManager, provides easy-to-use broadband modem connections in the GNOME desktop. With new requirements coming along with the new LTE (4G) communication standards, ModemManager got not only a face-lift, but also a deep surgery to improve its codebase and the way it supports vendor-specific plugins.

    The new ModemManager comes with a new GDBus-powered interface; built-in support for LTE and mixed CDMA+LTE devices; dynamic interfaces and per-interface state machines; helper libmm-common and libmm-glib libraries; a handy “mmcli” command line interface utility; and a new plugin development strategy which is port-type agnostic and based on GIO async calls.

    This talk is an introduction to ModemManager, with special focus on LTE and all the new details and features coming with the new codebase.

Thanks to my employer, Lanedo GmbH, for sponsoring my attendance!

Unaccenting words, or at least trying to…

July 21, 2011 Leave a comment

So today I found that Tracker in MeeGo packaging was still depending on libunac, while it shouldn’t. And that has reminded me that I had a blog post still unfinished about why and how we removed the libunac dependency in Tracker… so here it goes :-)

One of the features supported in Tracker is doing FTS searches for words without considering ‘accents’. Of course, we’re not talking about accent as in the specific pronunciation of words relative to a location or nation. Our ‘unaccenting’ mechanism, as we call it, refers to the process of removing combining diacritical marks from characters, so that users can look for words with or without these marks. Therefore, this ‘unaccenting’ applies not only to diacritics in Latin alphabets, but also to other alphabets like Arabic, Greek, Hebrew or Korean which also have special combining diacritical marks.

In the previous 0.8 stable series of Tracker, the unaccenting mechanism was completely done by the ‘unac’ library. We were not really convinced that unac was a good option in our case, as it involved extra conversions from UTF-8 to UTF-16 and back, and measurements showed that it was one of the most time consuming steps during FTS parsing. In order to improve the situation, and as we already did ourselves some Unicode normalization work before passing the work to unac, we ended up writing our own unaccenting mechanism in Tracker for 0.10.

The method is applied to all our three Unicode-support backends (GNU libunistring, ICU and GLib), and roughly involves just two steps:

  • Apply a compatibility decomposition to the word (NFKD normalization).
  • Remove all combining diacritical marks, this is, all Unicode points within the following ranges:
    • Basic range: [U+0300,U+036F]
    • Supplement: [U+1DC0,U+1DFF]
    • For Symbols: [U+20D0,U+20FF]
    • Half marks: [U+FE20,U+FE2F]

Instead of NFKD, NFD decomposition can also be used in the method, but as the main purpose of the unaccenting is a Full Text Search in Tracker, compatibility of Unicode points is also a desired feature which we could get in the same step.

Looking at an example may explain it easier. Consider the French word “école”, which has a diacritic on top of the first ‘e’ character. This accented ‘e’ character can be encoded in UTF-8 in either a composed or decomposed way:

  • (NFC, composed) [0xC3 0xA9] 0×63 0x6F 0x6C 0×65
  • (NFD, decomposed) 0×65 [0xCC 0x81] 0×63 0x6F 0x6C 0×65

The UTF-8 encoding of the composed way (NFC or NFKC) will (probably) always need less bytes than the decomposed (NFD or NFKD) counterpart. This is because the accented ‘e’ character will be represented in composed way as a single Unicode point: ‘é’ U+00E9 (UTF-8: [0xC3 0xA9]). In the decomposed way, the same accented ‘e’ character is represented as a base character ‘e’ U+0065 (UTF-8: 0×65) plus a combining mark ‘ ́ ‘ U+0301 (UTF-8: [0xCC 0x81]).

For either of the previous two representations of ‘école’, the removal of combining diacritical marks is as we have already described:

  • First, get the word NFKD-normalized (or NFD if point compatibility is not needed):
    • (NFC) [0xC3 0xA9] 0×63 0x6F 0x6C 0×65 —>
      (NFKD) 0×65 [0xCC 0x81] 0×63 0x6F 0x6C 0×65
    • (NFD) 0×65 [0xCC 0x81] 0×63 0x6F 0x6C 0×65 —>
      (NFKD) 0×65 [0xCC 0x81] 0×63 0x6F 0x6C 0×65
  • Once we have the word decomposed, we can now just walk each unicode point found in the string, and remove those which end up falling into one of the ranges applicable to diacritics. In this case, only the accent on top of the ‘e’ character is found and removed: U+0301 (UTF-8: [0xCC 0x81]):
    • (NFKD) 0×65 [0xCC 0x81] 0×63 0x6F 0x6C 0×65 —>
      (unaccented) 0×65 0×63 0x6F 0x6C 0×65

This new method not only worked perfectly in all the test cases we could think of, it was even much faster than using the unac library (up to 73% faster in the best case, and same speed in more complex cases).

Lanedo, Tracker, Devhelp, GHM, GUADEC…

July 16, 2010 Leave a comment

So, starting on April 1st (not a fools’ day joke!), I’m now employed by Lanedo GmbH, and full-time working on the Gnome Tracker project, mainly involved in FTS and the filesystem miner.

It was great to work with everyone in Azetti Networks. Keep Jardin and Waterfall alive!

I also hacked a new feature for Gnome Devhelp, the possibility of enabling or disabling specific books, which was integrated into git master last week, and will make it for the 3.0 release.

Oh, and I’ll be attending both the GHM and GUADEC in The Hague!

Follow

Get every new post delivered to your Inbox.