A heuristic-based method exploiting Unicode tables
Romanization is the task of transforming languages and scripts into a common Latin representation.
Let’s take for example the Japanese word “アメリカ”. If you have never studied Japanese, you probably don’t know what it means. Let’s romanize it: we get “amerika.”
Now, can you guess what “アメリカ” means?
You guessed it: America.
Romanization is not translation. It is much simpler than translation but can help you to understand words, or even sentences, in languages you don’t know.
In natural language processing (NLP), romanization is a very cheap way to enable word or sentence comparisons in different languages, for instance using methods or metrics that exploit string similarity.
The Levenshtein distance between “アメリカ” and “America” is 7, the maximum possible for these two strings, since we have to replace 4 characters and add 3 to transform “アメリカ” into “America”. It doesn’t tell us that the two words are actually very close in meaning and pronunciation. If we romanize “アメリカ” into “amerika”, the Levenshtein distance between “amerika” and “America” drops to 2 (counting the a/A case difference), indicating that the two words are relatively close to each other.
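To make the two distances concrete, here is a minimal Python sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein` is my own choice for illustration; any standard implementation should give the same numbers.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


print(levenshtein("アメリカ", "America"))  # 7
print(levenshtein("amerika", "America"))  # 2
```

The comparison is case-sensitive. Lowercasing both strings first would bring the second distance down to 1.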
In this article, I will present a simple but highly efficient method for romanization, based on Unicode tables and rules, that works for any language. This method was proposed by Hermjakob et al. (2018). It doesn’t require any machine learning. Consequently, it is lightweight, runs on a modest CPU, and doesn’t require any data other than the Unicode tables.
The objective is to map the Unicode characters of any language to ASCII characters.
Unicode is a standard for the consistent encoding of text in most of the world’s writing systems and languages, including extinct ones. Version 15.0 contains 149,186 characters covering 161 scripts.
ASCII is another encoding standard. Its first edition dates back to 1963, when computers had very limited capabilities. The ASCII standard only covers 95 printable characters.
Romanized texts won’t contain any diacritics. For instance, the French word “café” will be mapped to “cafe”.
Note: At this point, I must add that romanization is not transliteration. In other words, romanization won’t create any missing vowels for languages whose writing leaves some vowels implicit (e.g., Arabic).
Now that I have clarified the objective, let’s discuss how this heuristic-based approach works.
Unicode tables contain explicit character descriptions. The list of Unicode characters on Wikipedia presents these descriptions. The approach proposed by Hermjakob et al. (2018) exploits them to map the characters.
Let’s have a look at some of these descriptions, for instance for Vietnamese characters:
Vietnamese uses a lot of diacritics, i.e., the small marks attached to the Latin letters. As we can see in this Unicode table, the description explicitly indicates how the character is formed. For instance, “ỏ” is a “Latin Small Letter O with hook above.”
Given this description, the heuristic-based romanization simply makes a rule that “ỏ” is the Latin letter “o”. All the Vietnamese characters in the screenshot above are romanized into “o”.
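As a rough illustration of this rule, the sketch below reads character names from Python’s built-in unicodedata module and keeps only the base letter when the name has the form “LATIN SMALL/CAPITAL LETTER X WITH ...”. The function name and the regular expression are my own assumptions. This is not uroman’s code, only the idea applied to Latin letters with diacritics.

```python
import re
import unicodedata

# Matches names such as "LATIN SMALL LETTER O WITH HOOK ABOVE".
_LATIN_WITH_MARK = re.compile(r"^LATIN (SMALL|CAPITAL) LETTER ([A-Z]) WITH ")


def strip_latin_diacritics(text: str) -> str:
    """Replace Latin letters carrying diacritics with their base ASCII letter."""
    out = []
    # NFC composes "e" + combining accent into one character with its own name.
    for ch in unicodedata.normalize("NFC", text):
        match = _LATIN_WITH_MARK.match(unicodedata.name(ch, ""))
        if match:
            case, letter = match.groups()
            out.append(letter.lower() if case == "SMALL" else letter)
        else:
            out.append(ch)  # anything else is left untouched
    return "".join(out)


print(unicodedata.name("ỏ"))               # LATIN SMALL LETTER O WITH HOOK ABOVE
print(strip_latin_diacritics("ò ó ỏ õ ọ"))  # o o o o o
print(strip_latin_diacritics("café"))       # cafe
```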
Let’s have another example with Cyrillic characters used for instance in Russian.
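This romanization also applies other heuristics to the character descriptions. For instance, if the letter name in the description consists of consonants followed by one or more vowels (such as “SHA” for “Ш”), the romanization keeps only the consonants.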
Here, for these Cyrillic characters, the romanization simply maps as follows given the descriptions in the Unicode tables:
- “Х” to “h”
- “Ц” to “ts”
- “Ч” to “ch”
- “Ш” to “sh”
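The four mappings above can be reproduced by taking the last word of each character’s Unicode name (HA, TSE, CHE, SHA) and dropping its trailing vowels. The following is a minimal sketch under that assumption. The helper name is hypothetical, and a real romanizer needs many more rules and exceptions than this.

```python
import re
import unicodedata

# One or more consonants followed by one or more vowels, e.g. "SHA" or "TSE".
_CONSONANTS_THEN_VOWELS = re.compile(r"([B-DF-HJ-NP-TV-Z]+)[AEIOU]+")


def consonants_from_name(ch: str):
    """Return the leading consonants of the letter name, or None if the rule doesn't apply."""
    name = unicodedata.name(ch, "")
    token = name.split()[-1] if name else ""
    match = _CONSONANTS_THEN_VOWELS.fullmatch(token)
    return match.group(1).lower() if match else None


for ch in "ХЦЧШ":
    print(ch, unicodedata.name(ch), "->", consonants_from_name(ch))
```

For a letter such as “А” (CYRILLIC CAPITAL LETTER A), the pattern does not match and the function returns None, which is where additional rules would take over.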
The Vietnamese and Cyrillic examples I presented are very simple cases. In practice, for this approach to be universal, many more rules have to be created.
Moreover, for some families of characters, there aren’t any descriptions in the Unicode tables. This is the case for instance for Chinese, Korean, and Egyptian hieroglyphs, for which we have to make separate sets of romanization rules using other resources such as the Mandarin pinyin table for Chinese characters.
There is also the special case of numbers. A romanizer should map the numbers of any language to the same numbers for consistency.
We can choose to romanize the numbers to Western Arabic numerals, i.e., 0 to 9. For some languages, the mapping is trivial because each digit can be mapped to its corresponding Western Arabic numeral. For instance, in Bengali, the number “২৪৫৬” will be romanized into 2456 by the following one-to-one transformation:
- “২” to “2”
- “৪” to “4”
- “৫” to “5”
- “৬” to “6”
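This one-to-one case is easy to sketch in Python, because the Unicode database stores a decimal value for every decimal digit character and unicodedata.decimal exposes it. The function name below is my own. Characters without a decimal value are left unchanged.

```python
import unicodedata


def normalize_digits(text: str) -> str:
    """Map every Unicode decimal digit to its ASCII counterpart."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        out.append(ch if value is None else str(value))
    return "".join(out)


print(normalize_digits("২৪৫৬"))  # 2456
```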
For other languages, the counting system can differ a lot. This is the case for Chinese, Japanese, or Amharic, among others. We have to create another set of rules for these languages, as in the following examples:
- In Amharic: ፲፱፻፸ is made of four numerals, ፲, ፱, ፻, and ፸, which are mapped to 10, 9, 100, and 70 and then combined as (10 + 9) × 100 + 70 into the romanized sequence 1970.
- In Japanese: ３２万 is made of three characters, ３, ２, and 万, which are mapped to 3, 2, and 10000 and then combined as 32 × 10000 into the romanized sequence 320000.
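To show the kind of rule involved, here is a small sketch that handles these two examples. The value table is hand-written for the characters above only, and the combination logic (digits are positional, tens are added, values of 100 or more multiply what precedes them) is a simplification I am assuming for illustration. It is not uroman’s rule set and will not cover every number in these languages.

```python
# Hand-written values, limited to the characters used in the two examples.
NUMBER_VALUES = {
    "፲": 10, "፱": 9, "፻": 100, "፸": 70,  # Ethiopic numerals
    "３": 3, "２": 2, "万": 10000,         # fullwidth digits and a kanji multiplier
}


def combine_number(text: str) -> int:
    """Combine numeral characters into one integer (simplified rules)."""
    total = 0
    current = 0
    prev_was_digit = False
    for ch in text:
        value = NUMBER_VALUES[ch]
        if value >= 100:
            # Multiplier: scales the group accumulated so far.
            total += max(current, 1) * value
            current = 0
            prev_was_digit = False
        elif value < 10 and prev_was_digit:
            # Consecutive digits are read positionally: 3, 2 -> 32.
            current = current * 10 + value
        else:
            # Tens and isolated digits are added: 10 + 9 -> 19.
            current += value
            prev_was_digit = value < 10
    return total + current


print(combine_number("፲፱፻፸"))  # 1970
print(combine_number("３２万"))   # 320000
```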
Fortunately, we don’t have to implement all these rules. Ulf Hermjakob (USC Information Sciences Institute) has released a “universal” romanizer, called “uroman”, that implements them.
The software is written in Perl and can be run on any machine with a Perl interpreter, using only a CPU. It is open source and can be used without restrictions.
It takes as input the sequence of characters to romanize and yields as output the romanized version.
All the rules for romanization are listed in a separate file which can be modified to add new rules if needed.
For a demonstration, we will try to understand some Russian without translating it.
I will romanize the following phrases:
- Компьютер
- Графическая карта
- Америка
- Кофе
- Россия
- есть хлеб
- бег по утрам
If you don’t know Russian, you probably can’t guess what they mean.
I saved the phrases into a file “phrases.txt”, with one phrase per line, and then gave the file phrases.txt as input to uroman with the following command:
bin/uroman.pl < phrases.txt
It will generate the romanized phrases as follows:
- Компьютер → kompyuter
- Графическая карта → graficheskaya karta
- Америка → amerika
- Кофе → kofe
- Россия → rossiya
- есть хлеб → est khleb
- бег по утрам → beg po utram
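If you want to call the romanizer from a Python pipeline rather than from the shell, a thin wrapper around the same command is enough, since the tool reads standard input and writes standard output. This is only a sketch: the function name is hypothetical, and the default command assumes that perl is on the PATH and that the script lives at bin/uroman.pl, as in the command above.

```python
import subprocess


def run_romanizer(text: str, command=("perl", "bin/uroman.pl")) -> str:
    """Send text to the command's standard input and return its standard output."""
    result = subprocess.run(
        list(command),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise an error if the command fails
    )
    return result.stdout


# Example call (requires uroman to be available at the assumed path):
# print(run_romanizer("Компьютер\nесть хлеб\n"))
```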
Now, we may guess their meaning. I could do it for the first 5 entries:
- “kompyuter” is “computer”
- “graficheskaya karta” is “graphics card”
- “amerika” is “America”
- “kofe” is “coffee”
- “rossiya” is “Russia”
Most of these are loanwords whose pronunciation has stayed close to that of the language they were borrowed from. Your ability to guess the meaning of romanized words will depend a lot on the languages you already know.
The last two phrases are much more challenging. They are cases where romanization isn’t very useful if you aren’t familiar with a language close to Russian.
- “est khleb” is “eating bread”
- “beg po utram” is “running in the morning”
This was just an example with Russian. You can do the same with any language.
Limits of this approach
While this approach is universal, it isn’t accurate for every language.
For instance, the Chinese character “雨” (rain) will be romanized into “yu”.
But what about the same character in Japanese, which has the same meaning, but with a very different pronunciation?
This approach doesn’t make the distinction. If you romanize a Japanese text with this approach, you will get a romanized version closer to the Chinese pronunciation. There isn’t any language identification to prevent it.
The paper presenting this romanization approach received the Best Demo Paper award at ACL 2018.
This method is simple and universal. It simplifies a lot of natural language processing tasks by converting any language to the same writing script.
However, if you need an accurate romanizer for a specific language, I wouldn’t recommend it. A sequence-to-sequence deep learning method would yield more accurate results if you have suitable training data and the required computational resources.