What Languages Does Senzing Support?
Senzing uses UTF-8, which allows text in most of the world's languages to be captured correctly. Beyond ingesting and storing data, Senzing analytics go further, taking domain, culture, and cross-script differences into consideration.
Foundationally, Senzing uses ICU (International Components for Unicode) to normalize data for cross-script comparisons, for example the same name written in Latin and Arabic scripts. ICU often does a reasonable job of this. In certain areas, Senzing goes beyond ICU for cross-script matching.
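To make the idea of normalization before comparison concrete, here is a minimal sketch using only Python's standard `unicodedata` module. It is not Senzing's implementation and it does not use ICU: it only folds compatibility forms, diacritics, and case within a script. True cross-script transliteration (for example Arabic to Latin) needs ICU itself. The function name `normalize_for_compare` is hypothetical.

```python
import unicodedata


def normalize_for_compare(text: str) -> str:
    """Fold compatibility forms, combining marks, and case (illustrative only)."""
    # NFKD splits characters into base letters plus combining marks and
    # maps compatibility forms (such as full-width letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop combining marks so that an accented letter matches its base letter.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a stronger form of lower() intended for caseless matching.
    return stripped.casefold()


if __name__ == "__main__":
    # A plain spelling and a full-width, accented spelling of the same name.
    print(normalize_for_compare("Jose") == normalize_for_compare("ＪＯＳÉ"))
```

Two values that normalize to the same string are candidates for a match; values in different scripts would still differ after this step, which is the gap that transliteration fills.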
Advanced Name Comparisons
Because names can be particularly tricky, Senzing also uses IBM’s InfoSphere Global Name Management for culturally aware name comparison. This name library uses spelling patterns and country-of-association information to determine the cultural provenance of a name. Name search strategies then use this cultural information to decide how best to analyze a single name or compare two names. The following three cultural groups are supported:
- Southwest Asian
Additionally, these cultural categories are supported:
- A "Generic" category is used to support all other cultures
The following native scripts are supported:
- Russian (Cyrillic)
- Japanese (Kana)
- Korean (Hangul)
- Mandarin (Hanzi)
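As a rough illustration of how the script of an incoming value might be identified before it is routed to script-specific handling, the sketch below inspects Unicode character names with Python's standard library. This is an assumption-laden example, not how Senzing detects scripts; the function name `scripts_in` and the label strings are hypothetical.

```python
import unicodedata

# Prefix of the Unicode character name -> label used in this sketch.
SCRIPT_PREFIXES = {
    "CYRILLIC": "Cyrillic",
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "HANGUL": "Hangul",
    "CJK": "Han",
    "LATIN": "Latin",
}


def scripts_in(text: str) -> set:
    """Return the set of script labels found among the letters of text."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, and punctuation
        name = unicodedata.name(ch, "")
        for prefix, label in SCRIPT_PREFIXES.items():
            if name.startswith(prefix):
                found.add(label)
                break
        else:
            found.add("Other")
    return found


if __name__ == "__main__":
    for sample in ["Иван", "タナカ", "김", "王", "John"]:
        print(sample, scripts_in(sample))
```

Note that Unicode assigns Han ideographs to a single shared block regardless of language, so a check like this cannot tell Chinese and Japanese ideographs apart on its own.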
More about Japanese support: Senzing handles personal names in Japanese Kana compared with other forms (typically Kana or Romanized), with full multi-cultural capabilities. Kana-to-Kana comparison of business names is handled to some extent, though not exhaustively. Japanese Kanji is not handled by Senzing; if provided, it is treated as if it were Chinese Hanzi.
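One simple piece of Kana-to-Kana comparison is folding Hiragana and Katakana to a single form before comparing. The sketch below relies on the fact that the Hiragana letters U+3041 to U+3096 sit exactly 0x60 code points below their Katakana counterparts. It is an illustration only, not Senzing's method, and the function name `to_katakana` is hypothetical.

```python
HIRAGANA_START = 0x3041  # HIRAGANA LETTER SMALL A
HIRAGANA_END = 0x3096    # HIRAGANA LETTER SMALL KE
KANA_OFFSET = 0x60       # distance from a Hiragana letter to its Katakana twin


def to_katakana(text: str) -> str:
    """Map Hiragana letters to Katakana; leave every other character unchanged."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


if __name__ == "__main__":
    # The same surname written in Hiragana and in Katakana.
    print(to_katakana("たなか") == to_katakana("タナカ"))
```

Folding both sides to one Kana form lets an exact or fuzzy comparison ignore which syllabary was used; it does nothing for Kanji or Romanized forms.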
Advanced Address Comparisons
Addresses can also be particularly tricky, as they tend to have many data quality issues. One way to handle cross-script address comparison is to use an address hygiene product such as IBM InfoSphere QualityStage Address Verification Interface.
For more address hygiene options, search our Knowledge Center for “Address Hygiene.”