Monthly Archives: July 2011

Unaccenting words, or at least trying to…

So today I found that Tracker in MeeGo packaging was still depending on libunac, while it shouldn’t. And that has reminded me that I had a blog post still unfinished about why and how we removed the libunac dependency in Tracker… so here it goes :-)

One of the features supported in Tracker is doing FTS searches for words without considering ‘accents’. Of course, we’re not talking about accent as in the specific pronunciation of words relative to a location or nation. Our ‘unaccenting’ mechanism, as we call it, refers to the process of removing combining diacritical marks from characters, so that users can look for words with or without these marks. Therefore, this ‘unaccenting’ applies not only to diacritics in Latin alphabets, but also to other alphabets like Arabic, Greek, Hebrew or Korean which also have special combining diacritical marks.

In the previous 0.8 stable series of Tracker, the unaccenting mechanism was done entirely by the ‘unac’ library. We were not really convinced that unac was a good option in our case, as it involved extra conversions from UTF-8 to UTF-16 and back, and measurements showed that it was one of the most time-consuming steps during FTS parsing. In order to improve the situation, and as we were already doing some Unicode normalization work ourselves before passing the word to unac, we ended up writing our own unaccenting mechanism in Tracker for 0.10.

The method is applied in all three of our Unicode-support backends (GNU libunistring, ICU and GLib), and roughly involves just two steps:

  • Apply a compatibility decomposition to the word (NFKD normalization).
  • Remove all combining diacritical marks, that is, all Unicode points within the following ranges:
    • Basic range: [U+0300,U+036F]
    • Supplement: [U+1DC0,U+1DFF]
    • For Symbols: [U+20D0,U+20FF]
    • Half marks: [U+FE20,U+FE2F]

Instead of NFKD, NFD decomposition can also be used in the method, but as the main purpose of the unaccenting is a Full Text Search in Tracker, compatibility of Unicode points is also a desired feature which we could get in the same step.
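
To make the difference between the two decompositions concrete, here is a minimal sketch using Python's standard `unicodedata` module (not one of the libunistring, ICU or GLib backends that Tracker actually uses). The sample word is just an illustration: it contains the ‘ﬁ’ ligature (U+FB01), which only has a compatibility decomposition, and a precomposed ‘é’.

```python
import unicodedata

# Illustrative input: "fiancé" written with the U+FB01 ligature and a precomposed U+00E9.
word = "\ufb01anc\u00e9"

nfd = unicodedata.normalize("NFD", word)
nfkd = unicodedata.normalize("NFKD", word)

# NFD splits 'é' into 'e' + U+0301 but keeps the ligature as a single code point.
print([f"U+{ord(c):04X}" for c in nfd])
# NFKD additionally replaces the ligature with plain 'f' + 'i'.
print([f"U+{ord(c):04X}" for c in nfkd])
```

With NFKD, a search for "fiance" can match the ligature spelling too, which is the kind of compatibility a full text search benefits from.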

An example may make this easier to follow. Consider the French word “école”, which has a diacritic on top of the first ‘e’ character. This accented ‘e’ character can be encoded in UTF-8 in either a composed or a decomposed way:

  • (NFC, composed) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65
  • (NFD, decomposed) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65

The UTF-8 encoding of the composed form (NFC or NFKC) will (probably) always need fewer bytes than its decomposed (NFD or NFKD) counterpart. This is because the accented ‘e’ character is represented in the composed form as a single Unicode point: ‘é’ U+00E9 (UTF-8: [0xC3 0xA9]). In the decomposed form, the same accented ‘e’ character is represented as a base character ‘e’ U+0065 (UTF-8: 0x65) plus a combining acute accent U+0301 (UTF-8: [0xCC 0x81]).
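
The two byte sequences can be checked quickly with Python's standard `unicodedata` module. This is only a sketch; the `utf8_hex` helper is a made-up name for illustration.

```python
import unicodedata

word = "\u00e9cole"  # "école" with the precomposed 'é' (U+00E9)

def utf8_hex(text):
    """Return the UTF-8 bytes of text as a list of hex strings (hypothetical helper)."""
    return [f"0x{b:02X}" for b in text.encode("utf-8")]

composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

print(utf8_hex(composed))    # 6 bytes, starting with 0xC3 0xA9
print(utf8_hex(decomposed))  # 7 bytes, starting with 0x65 0xCC 0x81
```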

For either of the previous two representations of ‘école’, the removal of combining diacritical marks is as we have already described:

  • First, get the word NFKD-normalized (or NFD if point compatibility is not needed):
    • (NFC) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65 -->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
    • (NFD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 -->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
  • Once we have the word decomposed, we can just walk each Unicode point found in the string, and remove those which fall into one of the ranges applicable to diacritics. In this case, only the accent on top of the ‘e’ character is found and removed: U+0301 (UTF-8: [0xCC 0x81]):
    • (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 -->
      (unaccented) 0x65 0x63 0x6F 0x6C 0x65
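
The two steps can be sketched in a few lines of Python. This is not Tracker's actual code (which is written in C on top of libunistring, ICU or GLib); it relies on the standard `unicodedata` module, and the names `COMBINING_RANGES`, `is_combining_mark` and `unaccent` are made up for this illustration. The ranges are the four listed earlier in this post.

```python
import unicodedata

# Combining diacritical mark ranges, as listed in the post (inclusive).
COMBINING_RANGES = (
    (0x0300, 0x036F),  # basic range
    (0x1DC0, 0x1DFF),  # supplement
    (0x20D0, 0x20FF),  # for symbols
    (0xFE20, 0xFE2F),  # half marks
)

def is_combining_mark(ch):
    """True if the code point of ch falls in one of the ranges above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in COMBINING_RANGES)

def unaccent(word, form="NFKD"):
    """Decompose the word, then drop every combining diacritical mark."""
    decomposed = unicodedata.normalize(form, word)
    return "".join(ch for ch in decomposed if not is_combining_mark(ch))

# The two encodings of "école" from the example above.
nfc_input = bytes([0xC3, 0xA9, 0x63, 0x6F, 0x6C, 0x65]).decode("utf-8")
nfd_input = bytes([0x65, 0xCC, 0x81, 0x63, 0x6F, 0x6C, 0x65]).decode("utf-8")

for text in (nfc_input, nfd_input):
    # Both inputs end up as the same five bytes.
    print(" ".join(f"0x{b:02X}" for b in unaccent(text).encode("utf-8")))
```

Passing `form="NFD"` instead keeps compatibility characters untouched, for cases where point compatibility is not wanted.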

This new method not only worked perfectly in all the test cases we could think of, it was also much faster than using the unac library (up to 73% faster in the best case, and the same speed in more complex cases).

ModemManager, now with Iridium satellite network support

ModemManager and the Iridium satellite network

I recently sent a new ‘iridium’ plugin for review upstream, this time for Iridium modems. The plugin was developed using an Iridium 9522B Satellite Transceiver modem connected through RS232, properly handled by ModemManager’s plugin system thanks to the extended RS232 support available in git master. The ‘iridium’ plugin handles these modems like any other GSM modem, even though Iridium has nothing to do with GSM technologies.

Iridium is a constellation of 66 active (plus spares) LEO satellites orbiting at an altitude of 781 km, which gives phone and network coverage to every point on Earth. It was initially planned as a constellation of 77 satellites, and was therefore named ‘Iridium’ after the chemical element with atomic number 77. The name didn’t change to ‘Dysprosium’ when it was redesigned to maintain only 66 active satellites, and it’s no wonder why.

Even though the Iridium modems expose a GSM-modem-like AT command set, several special things needed to be considered. For example, IP address setup via PPP needed more than the 20s hardcoded in NetworkManager, due to the extreme latency of the satellite network. Therefore, NM was also updated to allow ModemManager plugins to specify their own ‘IpTimeout’ value.

See my email to the NM mailing list for further information on how to use ModemManager with Iridium support.

Ammonit Measurement GmbH sponsoring some hardware for ModemManager development

At Lanedo we have worked with Ammonit Measurement GmbH to help them improve ModemManager’s handling of Wavecom, Cinterion and Iridium modems. The guys at Ammonit were kind enough to sponsor some modems, so that I can spend my free time developing and improving ModemManager, as well as testing the modems before stable releases (Dan will probably be happy about that):

  • Sierra Wireless Fastrack Xtend FXT009 (GPRS modem, USB, handled by the ‘wavecom’ plugin)
  • Cinterion TC63i (GPRS modem, RS232, handled by the ‘cinterion’ plugin)

So, thanks Ammonit!
