What Languages Does Senzing Support?
Senzing uses UTF-8, which allows text in most of the world's languages to be captured properly. Beyond ingesting and storing data, Senzing analytics go further, taking into consideration domain, culture, and cross-script differences.
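As a small illustration of why UTF-8 matters here, the sketch below checks that names written in several scripts survive an encode/decode round trip. It uses only the Python standard library; the sample names and the helper `utf8_round_trip` are illustrative assumptions and are not part of any Senzing API.

```python
# Illustrative sample names in Latin, Arabic, Hangul, and Cyrillic scripts.
names = ["José", "محمد", "김민준", "Владимир"]


def utf8_round_trip(text: str) -> bool:
    """Return True if the text is unchanged after encoding to UTF-8 and decoding back."""
    return text.encode("utf-8").decode("utf-8") == text


for name in names:
    encoded = name.encode("utf-8")
    # Non-ASCII characters take more than one byte each in UTF-8.
    print(f"{name}: {len(name)} characters, {len(encoded)} bytes, "
          f"round trip ok = {utf8_round_trip(name)}")
```

Data that arrives in another encoding would need to be converted to UTF-8 before loading; how that is done depends on the source system.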
Foundationally, Senzing uses ICU (International Components for Unicode) to normalize data for cross-script comparisons, e.g., the same name in Latin and Arabic scripts. ICU often does a reasonable job at this. In certain areas, Senzing goes beyond ICU for cross-script matching.
The best news is that Senzing's entity-centric learning capabilities allow it to learn attribute variations (including script variations) even when it can't match the attributes in their provided forms.
Advanced Personal Name Comparisons
Because names can be particularly tricky, Senzing also uses IBM’s InfoSphere Global Name Management for culturally-aware name comparison. This world-class name library uses spelling patterns and country-of-association information to determine the cultural provenance of a name. Name search strategies use this cultural information to decide how best to analyze a single name or compare two names. The following three cultural groups are supported:
- Southwest Asian
  - Afghan
  - Arabic
  - Farsi
  - Pakistani
- European
  - Anglo
  - French
  - German
  - Hispanic
- Han
  - Chinese
  - Korean
  - Vietnamese
Additionally, these cultures are supported:
- Indian
- Indonesian
- Japanese
- Polish
- Portuguese
- East Slavic (Ukrainian, Belarusian, Russian)
- Thai
- Turkish
- Yoruban
- A "Generic" category is used to support all other cultures
The following native scripts are supported:
- Latin
- Arabic
- East Slavic (Cyrillic)
- Greek
- Hindi (Devanagari)
- Japanese (Kana)
- Korean (Hangul)
- Mandarin (Hanzi)
- Khmer (Cambodian), as of v3.2
- Burmese (Myanmar), as of v3.6
- And a general fall-back to ICU transliteration
More about Japanese support: We handle personal names in Japanese Kana compared with other forms (typically Kana or Romanized). This includes full multicultural capabilities. Japanese Kanji is not handled by Senzing and, if provided, is treated as if it were Chinese Hanzi.
Organizational Names
In general, robust cross-script matching of organizational names requires reference data containing multiple versions of each name. This is because organizations are not consistent in how they carry a name into another script: some represent the name phonetically (transliterate), some translate it (or translate parts of it), and some rebrand when moving into new markets or scripts. Fortunately, many data providers and services offer this kind of data enrichment.
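One way to picture "multiple versions of a name" is a single record that carries every known form of the organization's name. The sketch below builds such a record as JSON. The attribute names (`DATA_SOURCE`, `RECORD_ID`, `NAMES`, `NAME_TYPE`, `NAME_ORG`), the type labels, and the company itself are assumptions for illustration; check them against the entity specification for the Senzing version in use.

```python
import json


def build_org_record(data_source, record_id, name_versions):
    """Build one record that carries every known version of an organization name.

    name_versions is a list of (label, name) pairs, for example the Latin-script
    name plus a native-script name obtained from a reference data provider.
    Attribute names are assumed, not confirmed by this article.
    """
    return {
        "DATA_SOURCE": data_source,
        "RECORD_ID": record_id,
        "NAMES": [
            {"NAME_TYPE": label, "NAME_ORG": name}
            for label, name in name_versions
        ],
    }


# Hypothetical organization with a Latin-script and a Cyrillic-script name.
record = build_org_record(
    "VENDORS",
    "1001",
    [("PRIMARY", "Example Trading Co."), ("NATIVE", "Экзампл Трейдинг")],
)

# ensure_ascii=False keeps the native script readable in the JSON output.
print(json.dumps(record, ensure_ascii=False, indent=2))
```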
Advanced Address Comparisons
Addresses can also be particularly tricky, as they tend to have many data quality issues. Senzing has some capability to handle native scripts for addresses, and it is most effective for native-to-native (rather than native-to-Romanized) processing. This is an area where we are eager for feedback and eager to invest in improvements.
Another way to handle cross-script address comparison is to use an address hygiene product to Romanize the addresses, and then provide both the native-script and Romanized versions to Senzing.
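The approach above can be sketched as a small enrichment step that runs before loading. In this sketch, `fake_romanize` is a hypothetical stand-in for an address hygiene product, and the attribute names `ADDRESSES` and `ADDR_FULL` are assumptions that should be verified against the entity specification for the Senzing version in use.

```python
import json
from typing import Callable


def add_address_versions(record: dict, native_address: str,
                         romanize: Callable[[str], str]) -> dict:
    """Return a copy of the record carrying the native and Romanized address."""
    enriched = dict(record)
    addresses = [{"ADDR_FULL": native_address}]
    romanized = romanize(native_address)
    # Only add the Romanized form when the hygiene step actually produced one.
    if romanized != native_address:
        addresses.append({"ADDR_FULL": romanized})
    enriched["ADDRESSES"] = addresses
    return enriched


def fake_romanize(address: str) -> str:
    """Hypothetical stand-in for an address hygiene product (lookup table only)."""
    lookup = {"東京都千代田区千代田1-1": "1-1 Chiyoda, Chiyoda-ku, Tokyo"}
    return lookup.get(address, address)


base = {"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "2001"}
enriched = add_address_versions(base, "東京都千代田区千代田1-1", fake_romanize)
print(json.dumps(enriched, ensure_ascii=False, indent=2))
```

Keeping the native form alongside the Romanized one preserves native-to-native comparison while also giving Romanized records something to compare against.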