
Unaccenting words, or at least trying to…

So today I found that the Tracker packaging in MeeGo was still depending on libunac, even though it shouldn’t. That reminded me that I had an unfinished blog post about why and how we removed the libunac dependency in Tracker… so here it goes :-)

One of the features supported in Tracker is doing FTS searches for words without considering ‘accents’. Of course, we’re not talking about accents as in the specific pronunciation of words relative to a location or nation. Our ‘unaccenting’ mechanism, as we call it, refers to the process of removing combining diacritical marks from characters, so that users can look for words with or without these marks. Therefore, this ‘unaccenting’ applies not only to diacritics in Latin alphabets, but also to other alphabets such as Arabic, Greek, Hebrew or Korean, which also have special combining diacritical marks.

In the previous 0.8 stable series of Tracker, the unaccenting mechanism was done entirely by the ‘unac’ library. We were not really convinced that unac was a good option in our case, as it involved extra conversions from UTF-8 to UTF-16 and back, and measurements showed it was one of the most time-consuming steps during FTS parsing. In order to improve the situation, and since we were already doing some Unicode normalization ourselves before handing the input to unac, we ended up writing our own unaccenting mechanism in Tracker for the 0.10 series.

The method is implemented in all three of our Unicode-support backends (GNU libunistring, ICU and GLib), and roughly involves just two steps (a minimal sketch in C follows the list):

  • Apply a compatibility decomposition to the word (NFKD normalization).
  • Remove all combining diacritical marks, that is, all Unicode code points within the following ranges:
    • Basic range: [U+0300,U+036F]
    • Supplement: [U+1DC0,U+1DFF]
    • For Symbols: [U+20D0,U+20FF]
    • Half marks: [U+FE20,U+FE2F]
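
To make those two steps concrete, here is a minimal sketch built on the GLib backend’s Unicode API (g_utf8_normalize() plus code point iteration). The helper names are just for illustration, not Tracker’s actual code:

#include <glib.h>
#include <string.h>

/* TRUE if the code point is a combining diacritical mark in one of
 * the four ranges listed above. */
static gboolean
is_combining_diacritic (gunichar c)
{
  return (c >= 0x0300 && c <= 0x036F) ||   /* basic range */
         (c >= 0x1DC0 && c <= 0x1DFF) ||   /* supplement */
         (c >= 0x20D0 && c <= 0x20FF) ||   /* for symbols */
         (c >= 0xFE20 && c <= 0xFE2F);     /* half marks */
}

/* Illustrative unaccenting helper: NFKD-normalize the UTF-8 word and
 * drop every combining diacritical mark. Caller frees the result. */
static gchar *
unaccent_word (const gchar *word)
{
  gchar *nfkd;
  GString *out;
  const gchar *p;

  /* Step 1: compatibility decomposition (NFKD). */
  nfkd = g_utf8_normalize (word, -1, G_NORMALIZE_NFKD);
  if (!nfkd)
    return NULL;

  /* Step 2: copy every code point that is not a combining mark. */
  out = g_string_sized_new (strlen (nfkd));
  for (p = nfkd; *p; p = g_utf8_next_char (p))
    {
      gunichar c = g_utf8_get_char (p);

      if (!is_combining_diacritic (c))
        g_string_append_unichar (out, c);
    }

  g_free (nfkd);
  return g_string_free (out, FALSE);
}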

Instead of NFKD, NFD decomposition could also be used in the method, but as the main purpose of the unaccenting is Full Text Search in Tracker, compatibility of Unicode points is also a desired feature, and we get it in the same step.
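
As a quick illustration of why that matters for search (this snippet is just an example, not Tracker code): NFKD folds compatibility characters such as the ‘ﬁ’ ligature (U+FB01) into plain ‘f’ + ‘i’, while NFD leaves them alone:

#include <glib.h>

int
main (void)
{
  /* "ﬁle", starting with the ligature U+FB01 (UTF-8: 0xEF 0xAC 0x81). */
  const gchar *word = "\xEF\xAC\x81" "le";

  gchar *nfd  = g_utf8_normalize (word, -1, G_NORMALIZE_NFD);
  gchar *nfkd = g_utf8_normalize (word, -1, G_NORMALIZE_NFKD);

  /* NFD keeps the ligature; NFKD decomposes it to "file", so an FTS
   * query for "file" can match. */
  g_print ("NFD:  %s\nNFKD: %s\n", nfd, nfkd);

  g_free (nfd);
  g_free (nfkd);
  return 0;
}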

An example may make this clearer. Consider the French word “école”, which has a diacritic on top of the first ‘e’ character. This accented ‘e’ character can be encoded in UTF-8 in either a composed or a decomposed way:

  • (NFC, composed) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65
  • (NFD, decomposed) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65

The UTF-8 encoding of the composed form (NFC or NFKC) will (probably) always need fewer bytes than its decomposed (NFD or NFKD) counterpart. This is because the accented ‘e’ character is represented in the composed form as a single Unicode point: ‘é’ U+00E9 (UTF-8: [0xC3 0xA9]). In the decomposed form, the same accented ‘e’ character is represented as a base character ‘e’ U+0065 (UTF-8: 0x65) plus a combining acute accent U+0301 (UTF-8: [0xCC 0x81]).
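
A small snippet (again, just for illustration) shows the two byte sequences side by side, by normalizing the word both ways and dumping the UTF-8 bytes:

#include <glib.h>

static void
dump_bytes (const gchar *label, const gchar *s)
{
  g_print ("%s:", label);
  for (; *s; s++)
    g_print (" 0x%02X", (guint) (guchar) *s);
  g_print ("\n");
}

int
main (void)
{
  /* "école" with the composed 'é' (U+00E9, UTF-8: 0xC3 0xA9). */
  const gchar *word = "\xC3\xA9" "cole";

  gchar *nfc = g_utf8_normalize (word, -1, G_NORMALIZE_NFC);
  gchar *nfd = g_utf8_normalize (word, -1, G_NORMALIZE_NFD);

  dump_bytes ("NFC", nfc);   /* 0xC3 0xA9 0x63 0x6F 0x6C 0x65 */
  dump_bytes ("NFD", nfd);   /* 0x65 0xCC 0x81 0x63 0x6F 0x6C 0x65 */

  g_free (nfc);
  g_free (nfd);
  return 0;
}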

For either of the previous two representations of ‘école’, the removal of combining diacritical marks proceeds as described above:

  • First, get the word NFKD-normalized (or NFD if point compatibility is not needed):
    • (NFC) [0xC3 0xA9] 0x63 0x6F 0x6C 0x65 ->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
    • (NFD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 ->
      (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65
  • Once we have the word decomposed, we can simply walk each Unicode point found in the string and remove those that fall into one of the ranges applicable to diacritics. In this case, only the accent on top of the ‘e’ character is found and removed: U+0301 (UTF-8: [0xCC 0x81]):
    • (NFKD) 0x65 [0xCC 0x81] 0x63 0x6F 0x6C 0x65 ->
      (unaccented) 0x65 0x63 0x6F 0x6C 0x65
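
And running the unaccent_word() sketch from earlier on both representations of “école” gives the same result (illustrative code only):

/* Usage of the earlier unaccent_word() sketch.
 * Build: gcc unaccent.c $(pkg-config --cflags --libs glib-2.0) */
int
main (void)
{
  /* Composed input: 'é' as U+00E9 (UTF-8: 0xC3 0xA9). */
  gchar *from_nfc = unaccent_word ("\xC3\xA9" "cole");
  /* Decomposed input: 'e' U+0065 followed by U+0301 (UTF-8: 0xCC 0x81). */
  gchar *from_nfd = unaccent_word ("e\xCC\x81" "cole");

  /* Both print "ecole": 0x65 0x63 0x6F 0x6C 0x65. */
  g_print ("%s %s\n", from_nfc, from_nfd);

  g_free (from_nfc);
  g_free (from_nfd);
  return 0;
}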

This new method not only worked perfectly in all the test cases we could think of, it was also much faster than using the unac library (up to 73% faster in the best case, and the same speed in more complex cases).
