What languages does G2 support?
G2 uses UTF-8 encoding, which allows text in most of the world's languages to be captured properly. Beyond ingesting and storing data, G2’s analytics also take domain, culture, and cross-script differences into account.
Foundationally, G2 uses ICU (International Components for Unicode) to normalize data for cross-script comparisons, e.g., the same name written in Latin and Arabic scripts. ICU often does a reasonable job at this (check out this demonstration page to test ICU yourself). In certain areas, G2 goes beyond ICU for cross-script matching.
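To make "normalizing data for comparison" more concrete, here is a minimal sketch using only Python's standard unicodedata module. It is not G2's implementation and it does not call ICU; the function name normalize_for_comparison is hypothetical. It covers only the same-script part of the problem (case, diacritics, full-width forms). Converting between scripts, such as Arabic to Latin, is a separate transliteration step that ICU provides through transforms such as "Any-Latin", and that step is deliberately left out here.

```python
import unicodedata


def normalize_for_comparison(text: str) -> str:
    """Hypothetical helper: fold case, compatibility forms, and diacritics.

    This is a stand-in for same-script normalization only. It does NOT
    transliterate between scripts the way ICU transforms can.
    """
    # NFKD splits accented letters into base letter + combining mark and
    # maps compatibility characters (e.g. full-width letters) to plain ones.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks left behind by the decomposition.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold and collapse runs of whitespace to a single space.
    return " ".join(stripped.casefold().split())


if __name__ == "__main__":
    # Same-script variants collapse to one comparable form.
    print(normalize_for_comparison("José  MÜLLER"))
    # A name in Arabic script is left in Arabic script, so it still will not
    # match its Latin spelling without a transliteration step.
    print(normalize_for_comparison("محمد") == normalize_for_comparison("Muhammad"))
```

The second comparison stays unequal, which is the gap that cross-script tooling such as ICU is meant to close.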
Advanced Name Comparisons
Because names can be particularly tricky, G2 also uses IBM’s InfoSphere Global Name Management for culturally-aware name comparison. This world-class name library uses spelling patterns and country-of-association information to determine the cultural provenance of a name. Name search strategies use this cultural information to make decisions about how best to analyze a single name or compare two names. The following three cultural groups are supported:
- Southwest Asian
Additionally, these cultural categories are supported:
- A "Generic" category is used to support all other cultures
The following native scripts are supported:
- Russian (Cyrillic)
- Japanese (Kana)
- Korean (Hangul)
- Mandarin (Hanzi)
More about Japanese support:
We handle personal names in Japanese Kana compared with other forms (typically Kana or Romanized), including full multi-cultural capabilities. We do some handling of Kana-to-Kana forms of business names, though it is not exhaustive. Japanese Kanji is not handled by G2; if provided, it is treated as if it were Chinese Hanzi.
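One reason Kanji ends up being treated like Hanzi is that Unicode encodes both with the same CJK unified ideograph code points, so the characters alone do not say which language they belong to, whereas Kana has its own code points. The sketch below illustrates this with Python's standard unicodedata module. It is only an illustration under that assumption, not how G2 detects scripts, and the function name scripts_in and its labels are hypothetical.

```python
import unicodedata

# Hypothetical mapping from Unicode character-name prefixes to script labels.
_PREFIX_TO_SCRIPT = {
    "HIRAGANA": "Kana",
    "KATAKANA": "Kana",
    "CJK UNIFIED IDEOGRAPH": "Han",  # shared by Japanese Kanji and Chinese Hanzi
    "HANGUL": "Hangul",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "LATIN": "Latin",
}


def scripts_in(text: str) -> set:
    """Return the set of script labels found among the letters of `text`."""
    found = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation
        name = unicodedata.name(ch, "")
        for prefix, script in _PREFIX_TO_SCRIPT.items():
            if name.startswith(prefix):
                found.add(script)
                break
        else:
            found.add("Other")
    return found


if __name__ == "__main__":
    print(scripts_in("たなか"))  # a name written in Hiragana
    print(scripts_in("田中"))    # a Japanese name written in Kanji
    print(scripts_in("王伟"))    # a Chinese name written in Hanzi
```

The Kanji and Hanzi examples yield the same label, so telling them apart would need context beyond the characters themselves.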
Advanced Address Comparisons
Addresses can also be particularly tricky, as they tend to have many data quality issues. One way to handle cross-script address comparison is to use an address hygiene product such as InfoSphere QualityStage Address Verification Interface.
For more address hygiene options, search our Knowledge Center for “Address Hygiene.”