What Languages Does Senzing Support?
Senzing uses UTF-8, which allows text in most of the world's languages to be captured correctly. Beyond ingesting and storing data, Senzing analytics go further, taking domain, culture, and cross-script differences into account.
Foundationally, Senzing uses ICU (International Components for Unicode) to normalize data for cross-script comparisons, e.g., the same name written in Latin and Arabic scripts. ICU often does a reasonable job of this. In certain areas, Senzing goes beyond ICU for cross-script matching.
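To illustrate the general idea of normalizing before comparing across scripts, here is a minimal Python sketch. It is not Senzing's implementation and it does not call ICU: the `CYRILLIC_TO_LATIN` table and the `comparison_key` function are hypothetical stand-ins, built on the standard library only, for what an ICU transliterator does far more thoroughly.

```python
import unicodedata

# Hypothetical, deliberately tiny Cyrillic-to-Latin table. This is an
# assumption for illustration; ICU's transliteration rules are far richer.
CYRILLIC_TO_LATIN = {
    "а": "a", "в": "v", "д": "d", "и": "i", "л": "l",
    "м": "m", "н": "n", "о": "o", "р": "r",
}


def comparison_key(name: str) -> str:
    """Reduce a name to a rough lowercase, accent-free Latin key."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks so that "í" compares equal to "i".
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive, comparison-oriented lower().
    lowered = stripped.casefold()
    # Map any known Cyrillic letters to Latin; leave everything else alone.
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in lowered)


if __name__ == "__main__":
    for candidate in ("Владимир", "Vladimír", "VLADIMIR"):
        print(candidate, "->", comparison_key(candidate))
```

All three spellings reduce to the same key, which is the property a cross-script comparison relies on. A character-by-character table like this breaks down quickly on real data, which is why a full transliteration library is used in practice.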
Advanced Personal Name Comparisons
Because names can be particularly tricky, Senzing also uses IBM’s InfoSphere Global Name Management for culturally aware name comparison. This world-class name library uses spelling patterns and country-of-association information to determine the cultural provenance of a name. Name search strategies use this cultural information to decide how best to analyze a single name or compare two names. The following three cultural groups are supported:
- Southwest Asian
Additionally, these cultural categories are supported:
- A "Generic" category is used to support all other cultures
The following native scripts are supported:
- Russian (Cyrillic)
- Japanese (Kana)
- Korean (Hangul)
- Mandarin (Hanzi)
- Khmer (Cambodian) -- Coming in v3.0
- And a general fall-back to ICU transliteration
More about Japanese support: We handle personal names written in Japanese Kana when compared with other forms (typically Kana or Romanized), with full multi-cultural capabilities. We also do some handling of Kana-to-Kana forms of business names, though it is not exhaustive. Japanese Kanji is not handled by Senzing; if provided, it is treated as if it were Chinese Hanzi.
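Because Kanji would be treated as Hanzi, it may help to know which names in a data set contain Han characters before loading them. The sketch below is one way to check, using only Python's standard `unicodedata` module. The function names `scripts_in` and `needs_kana_reading` are hypothetical, and flagging such names for a Kana or Romanized reading is an assumption about a sensible workflow, not a documented Senzing requirement.

```python
import unicodedata


def scripts_in(text: str) -> set:
    """Return coarse script labels found in text, based on Unicode character names."""
    found = set()
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            # Japanese Kanji and Chinese Hanzi share these code points,
            # so they cannot be told apart from the characters alone.
            found.add("Han")
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            found.add("Kana")
        elif name.startswith("HANGUL"):
            found.add("Hangul")
    return found


def needs_kana_reading(name: str) -> bool:
    """True if the name contains Han characters (Kanji or Hanzi)."""
    return "Han" in scripts_in(name)


if __name__ == "__main__":
    for candidate in ("山田 花子", "やまだ はなこ", "ヤマダ ハナコ"):
        print(candidate, scripts_in(candidate), needs_kana_reading(candidate))
```

Spaces, punctuation, and Latin letters are ignored here, and half-width Katakana is not covered; a production check would likely want a fuller script classification.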
Advanced Address Comparisons
Addresses can also be particularly tricky, as they tend to have many data quality issues. Senzing has some capability to handle native scripts for addresses; this is most effective for native-to-native (not native-to-Romanized) processing. This is an area where we are eager for feedback and eager to invest in improvements.
Another way to handle cross-script address comparison is to use an address hygiene product, such as IBM InfoSphere QualityStage Address Verification Interface, to Romanize the addresses and then provide both the native-script and Romanized versions to Senzing.
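As a rough sketch of what providing both versions might look like, the Python below builds a single JSON record carrying a native-script address and its Romanized form. Everything here is an assumption for illustration: the attribute names (`DATA_SOURCE`, `RECORD_ID`, `NAME_FULL`, `ADDR_FULL`, and the `ADDRESSES` list) reflect one reading of Senzing's JSON record format and should be checked against the Senzing entity specification, and `demo_romanizer` is a hypothetical placeholder for a real address hygiene product.

```python
import json
from typing import Callable


def demo_romanizer(native_address: str) -> str:
    """Hypothetical stand-in for an external address hygiene product."""
    # A fixed lookup keeps the sketch runnable without any external service.
    lookup = {"東京都千代田区丸の内1-1": "1-1 Marunouchi, Chiyoda-ku, Tokyo"}
    return lookup[native_address]


def build_record(record_id: str, name: str, native_address: str,
                 romanize: Callable[[str], str]) -> dict:
    """Build one record that carries both address versions."""
    return {
        "DATA_SOURCE": "CUSTOMERS",  # assumed data source code
        "RECORD_ID": record_id,
        "NAME_FULL": name,
        "ADDRESSES": [
            {"ADDR_FULL": native_address},            # native script
            {"ADDR_FULL": romanize(native_address)},  # Romanized version
        ],
    }


def to_json_line(record: dict) -> str:
    """Serialize as UTF-8-friendly JSON, keeping native characters readable."""
    return json.dumps(record, ensure_ascii=False)


if __name__ == "__main__":
    rec = build_record("1001", "山田 花子", "東京都千代田区丸の内1-1", demo_romanizer)
    print(to_json_line(rec))
```

Passing the Romanizer in as a function keeps the record-building step independent of whichever hygiene product is actually used.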