GNOME.Asia 2012

June 25, 2012

Finally back home after attending GNOME.Asia 2012 and some vacation time afterwards.

 

CC BY-NC-SA 2.0, Sammy Fung

The conference started on Friday afternoon, with a design workshop managed by Allan Day, Jakub Steiner and William Jon McCann. It was pretty interesting, truth be told, and I particularly enjoyed how Jakub played with Inkscape to create the icon of “GNOME Ball” (now we just need a program for the icon). I truly want to spend more time with Inkscape now, and try to draw something better than my previous attempts to draw an icon/logo.

 

CC BY-NC-SA 2.0, Sammy Fung

I gave two talks during the first day of the conference, one about Tracker (slides here) and another one about ModemManager (slides here). The talks were recorded, but official videos of the talks are not yet available on the Internet.

 

Thanks to the usual jet lag, I also found time to develop the application-menu support for Devhelp during my stay in Hong Kong. The code is currently available in the ‘application-menu’ branch in the gnome git repo, and tracked at GB#677927.

 

Finally, thanks to my employer, Lanedo GmbH, for sponsoring my attendance at this great conference!

Speaking at GNOME.Asia 2012 (Hong Kong)

Only 3 weeks left until GNOME.Asia, this year in the beautiful city of Hong Kong.

According to the current schedule I’ll be giving two talks on Saturday June 9th:

  • An introduction to Tracker, SPARQL and whatnot:
    Tracker is not (only) a search engine: it is a semantic data store, powered by RDF ontologies, Nepomuk, and the SPARQL protocol. Tracker is one of the core building blocks of the MeeGo Harmattan operating system; and since release 2.30, it is also a blessed external dependency in GNOME. But it wasn’t until GNOME 3 and the first release of GNOME Documents that it started to get some really good attention… and there is more to come!
  • ModemManager revamped: now supporting LTE/4G modems:
    ModemManager, along with NetworkManager, provides easy-to-use broadband modem connections in the GNOME desktop. With new requirements coming along with the new LTE (4G) communication standards, ModemManager got not only a face-lift, but also a deep surgery to improve its codebase and the way it supports vendor-specific plugins.

    The new ModemManager comes with a new GDBus-powered interface; built-in support for LTE and mixed CDMA+LTE devices; dynamic interfaces and per-interface state machines; helper libmm-common and libmm-glib libraries; a handy “mmcli” command line interface utility; and a new plugin development strategy which is port-type agnostic and based on GIO async calls.

    This talk is an introduction to ModemManager, with special focus on LTE and all the new details and features coming with the new codebase.

Thanks to my employer, Lanedo GmbH, for sponsoring my attendance!

Unaccenting words, or at least trying to…

July 21, 2011

So today I found that the Tracker packaging in MeeGo still depended on libunac, when it shouldn’t. That reminded me that I still had an unfinished blog post about why and how we removed the libunac dependency in Tracker… so here it goes :-)

One of the features supported in Tracker is doing FTS searches for words without considering ‘accents’. Of course, we’re not talking about accent as in the specific pronunciation of words relative to a location or nation. Our ‘unaccenting’ mechanism, as we call it, refers to the process of removing combining diacritical marks from characters, so that users can look for words with or without these marks. Therefore, this ‘unaccenting’ applies not only to diacritics in Latin alphabets, but also to other alphabets like Arabic, Greek, Hebrew or Korean which also have special combining diacritical marks.

In the previous 0.8 stable series of Tracker, unaccenting was done entirely by the ‘unac’ library. We were never really convinced that unac was a good fit for our case: it involved extra conversions from UTF-8 to UTF-16 and back, and measurements showed that it was one of the most time-consuming steps during FTS parsing. To improve the situation, and since we were already doing some Unicode normalization ourselves before handing the word over to unac, we ended up writing our own unaccenting mechanism in Tracker for 0.10.

The method is implemented in all three of our Unicode-support backends (GNU libunistring, ICU and GLib), and roughly involves just two steps:

  • Apply a compatibility decomposition to the word (NFKD normalization).
  • Remove all combining diacritical marks, that is, all Unicode code points within the following ranges:
    • Basic range: [U+0300,U+036F]
    • Supplement: [U+1DC0,U+1DFF]
    • For Symbols: [U+20D0,U+20FF]
    • Half marks: [U+FE20,U+FE2F]
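The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code we have in Tracker (which is written in C on top of the backends mentioned above); the names COMBINING_RANGES, is_combining_diacritic and unaccent are made up for this example.

```python
import unicodedata

# The four combining diacritical mark ranges listed above (inclusive).
COMBINING_RANGES = (
    (0x0300, 0x036F),  # basic range
    (0x1DC0, 0x1DFF),  # supplement
    (0x20D0, 0x20FF),  # for symbols
    (0xFE20, 0xFE2F),  # half marks
)


def is_combining_diacritic(ch):
    """Return True if the code point falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)


def unaccent(word):
    """Hypothetical helper: NFKD-decompose, then drop combining marks."""
    # Step 1: compatibility decomposition (NFKD normalization).
    decomposed = unicodedata.normalize("NFKD", word)
    # Step 2: keep every code point that is not a combining diacritical mark.
    return "".join(ch for ch in decomposed if not is_combining_diacritic(ch))


print(unaccent("\u00c1ngel"))  # "Ángel" with a composed U+00C1
```

Since the word is decomposed first, it does not matter whether the input arrives composed or decomposed: both end up as the same base characters once the marks are dropped.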

NFD decomposition could be used instead of NFKD, but since the main purpose of unaccenting in Tracker is Full Text Search, folding compatibility-equivalent code points together is also a desired feature, and NFKD gives us that in the same step.
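To see what the compatibility part buys us, here is a small Python comparison of the two decompositions. The sample word is just something I picked for illustration: it starts with the ‘ﬁ’ ligature (U+FB01), which only has a compatibility decomposition, and ends with a composed ‘é’ (U+00E9).

```python
import unicodedata

# "ﬁancé": U+FB01 (fi ligature) + "anc" + U+00E9 (composed e with acute)
word = "\ufb01anc\u00e9"

nfd = unicodedata.normalize("NFD", word)
nfkd = unicodedata.normalize("NFKD", word)

# NFD splits the accented 'e' but leaves the ligature untouched.
print([f"U+{ord(ch):04X}" for ch in nfd])
# NFKD splits the accented 'e' and also expands the ligature to 'f' + 'i'.
print([f"U+{ord(ch):04X}" for ch in nfkd])
```

With NFD, a search for “fiance” would still miss this word after removing the accent, because the ligature remains a single code point; with NFKD the ligature is expanded into plain ‘f’ and ‘i’ before the marks are removed.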

An example makes this easier to explain. Consider the French word “école”, which has a diacritic on top of the first ‘e’ character. This accented ‘e’ character can be encoded in UTF-8 in either a composed or a decomposed way:

  • (NFC, composed) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65
  • (NFD, decomposed) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65

The UTF-8 encoding of the composed form (NFC or NFKC) will (probably) always need fewer bytes than its decomposed (NFD or NFKD) counterpart. This is because the accented ‘e’ character is represented in the composed form as a single Unicode code point: ‘é’ U+00E9 (UTF-8: [0xC3 0xA9]). In the decomposed form, the same accented ‘e’ character is represented as a base character ‘e’ U+0065 (UTF-8: 0x65) plus the combining acute accent U+0301 (UTF-8: [0xCC 0x81]).
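If you want to check these byte sequences yourself, a quick way is to normalize the word both ways and dump the UTF-8 bytes. A small Python sketch follows; utf8_hex is just a helper name invented for this example.

```python
import unicodedata


def utf8_hex(text):
    """Hypothetical helper: format the UTF-8 bytes of a string as hex."""
    return " ".join(f"0x{b:02X}" for b in text.encode("utf-8"))


word = "\u00e9cole"  # "école", written with the composed U+00E9

composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

print("NFC:", utf8_hex(composed))    # 0xC3 0xA9 0x63 0x6F 0x6C 0x65
print("NFD:", utf8_hex(decomposed))  # 0x65 0xCC 0x81 0x63 0x6F 0x6C 0x65
```

The composed form takes 6 bytes and the decomposed one 7, as the single two-byte ‘é’ becomes a one-byte ‘e’ followed by a two-byte combining mark.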

For either of the previous two representations of ‘école’, the removal of combining diacritical marks is as we have already described:

  • First, get the word NFKD-normalized (or NFD if compatibility equivalence is not needed):
    • (NFC) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65 -->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
    • (NFD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 -->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
  • Once we have the word decomposed, we can just walk over each Unicode code point found in the string, and remove those which fall into one of the ranges applicable to diacritics. In this case, only the accent on top of the ‘e’ character is found and removed: U+0301 (UTF-8: [0xCC 0x81]):
    • (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 -->
      (unaccented) 0x65 0x63 0x6F 0x6C 0x65

This new method not only worked perfectly in all the test cases we could think of, it was also much faster than using the unac library (up to 73% faster in the best case, and the same speed in more complex cases).
