Senzing G2 accepts data encoded in UTF-8. However, as a context-analyzing entity resolution (ER) engine, it must understand that data, not just load and store it. For this purpose, the engine has cross-script capabilities, so Romanized data can match against data in native script. The G2 Globalization FAQ has more information on G2 globalization capabilities.
Data sources with native script often also contain Romanized data. When this happens, the best practice is to include both the Romanized and native script versions of the data in the same record. This lets the engine use every form of the data and apply its enhancements to each representation.
Assume there is a data file for the G2Loader that was Romanized and presented as follows:
If the original native script for the name is available, it is useful to add it as a second name with a different usage type. Usage types can be created arbitrarily; for this purpose, a NAME usage type of NATIVE is appropriate.
1,Valdamir Vasilievich Gogol,Влади́мир Васи́льевич Го́голь,M,1911-12-13,123456789,RUS
This same pattern can be used for addresses and other features that may be available in multiple scripts.
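As a rough illustration of the two-name pattern, the sketch below turns the data row shown earlier into a JSON record that carries both names, each with its own usage type. The attribute names (RECORD_ID, NAME_LIST, NAME_TYPE, NAME_FULL) reflect the Senzing Generic Entity Specification as we understand it and should be checked against that specification. The PRIMARY usage type for the Romanized name and the build_record helper are assumptions made for this example. The remaining columns are left out because their headers are not shown here.

```python
import csv
import io
import json

# The data row from the example above. The file's header row is not shown in
# this article, so only the first three columns are interpreted here.
ROW = "1,Valdamir Vasilievich Gogol,Влади́мир Васи́льевич Го́голь,M,1911-12-13,123456789,RUS"


def build_record(row_text):
    """Build a record holding both the Romanized and the native-script name.

    Hypothetical helper: attribute names are assumed from the Senzing
    Generic Entity Specification, and PRIMARY is an assumed usage type.
    """
    fields = next(csv.reader(io.StringIO(row_text)))
    record_id, romanized_name, native_name = fields[0], fields[1], fields[2]
    return {
        "RECORD_ID": record_id,
        "NAME_LIST": [
            # Romanized form of the name
            {"NAME_TYPE": "PRIMARY", "NAME_FULL": romanized_name},
            # Original native-script form, tagged with the NATIVE usage type
            {"NAME_TYPE": "NATIVE", "NAME_FULL": native_name},
        ],
    }


record = build_record(ROW)
# ensure_ascii=False keeps the Cyrillic text readable instead of \u escapes
print(json.dumps(record, ensure_ascii=False))
```

Keeping both names in one record, rather than loading two separate records, is what allows the engine to treat them as two forms of the same name.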