Senzing accepts data encoded in UTF-8. Because it is a context-analyzing entity resolution engine, it must also understand that data rather than simply load and store it. To that end, the engine has cross-script capabilities, so Romanized data can match against data in native script. The Globalization article has more information on these capabilities.
Data sources with native script often also contain Romanized data. When this happens, the best practice is to include both the Romanized and native-script versions of the data in the same record. This lets the engine use multiple forms of the data and benefit from enhancements in all representations.
Assume a Romanized data file for ingestion looks like this:
RECORD_ID,PRIMARY_NAME_FULL,GENDER,DATE_OF_BIRTH,PASSPORT_NUMBER,PASSPORT_COUNTRY
1,Valdamir Gogol,M,1911-12-13,123456789,RUS
If the original native script for the name is available, it is useful to add it as a second name with a different usage type. Usage types can be created arbitrarily, and for this purpose a NAME usage type of NATIVE is appropriate.
RECORD_ID,PRIMARY_NAME_FULL,NATIVE_NAME_FULL,GENDER,DATE_OF_BIRTH,PASSPORT_NUMBER,PASSPORT_COUNTRY
1,Valdamir Vasilievich Gogol,Влади́мир Васи́льевич Го́голь,M,1911-12-13,123456789,RUS
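As a rough illustration, the two-name file above could be turned into JSON records with a few lines of Python. This is only a sketch using the standard library: the `rows_to_records` helper is hypothetical, the `DATA_SOURCE` key with its `EXAMPLE` value is an assumed placeholder that does not appear in the sample file, and no Senzing API is called. The column names are taken directly from the CSV header, so the PRIMARY and NATIVE names travel together in one record.

```python
import csv
import io
import json

# Sample data copied from the article; in practice this would be a UTF-8 file
# opened with open(path, encoding="utf-8", newline="").
CSV_TEXT = """\
RECORD_ID,PRIMARY_NAME_FULL,NATIVE_NAME_FULL,GENDER,DATE_OF_BIRTH,PASSPORT_NUMBER,PASSPORT_COUNTRY
1,Valdamir Vasilievich Gogol,Влади́мир Васи́льевич Го́голь,M,1911-12-13,123456789,RUS
"""


def rows_to_records(csv_text, data_source="EXAMPLE"):
    """Hypothetical helper: map each CSV row to one dict, skipping empty values.

    data_source is an assumed placeholder, not a value from the article.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {"DATA_SOURCE": data_source}
        for column, value in row.items():
            value = (value or "").strip()  # drop alignment padding, if any
            if value:
                record[column] = value
        records.append(record)
    return records


records = rows_to_records(CSV_TEXT)
for record in records:
    # ensure_ascii=False keeps the native script readable in the output
    print(json.dumps(record, ensure_ascii=False))
```

Rows that have no native-script name simply omit `NATIVE_NAME_FULL`, so the same helper works for mixed files.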
This same pattern can be used for addresses and other features that may be available in multiple scripts.
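The sketch below shows one way the pattern might extend to an address. It assumes that `ADDR_FULL` is the attribute used for a single-field address and that the usage-type prefix works for it the same way it does for `NAME_FULL`. The `tag_by_usage` helper is hypothetical, and both address values are made-up placeholders rather than data from the article.

```python
def tag_by_usage(attribute, values_by_usage):
    """Hypothetical helper: build {USAGE_ATTRIBUTE: value} for each non-empty value."""
    tagged = {}
    for usage, value in values_by_usage.items():
        value = (value or "").strip()
        if value:
            tagged[f"{usage.upper()}_{attribute}"] = value
    return tagged


record = {"RECORD_ID": "1"}

# Names from the article, one per usage type.
record.update(tag_by_usage("NAME_FULL", {
    "PRIMARY": "Valdamir Vasilievich Gogol",
    "NATIVE": "Влади́мир Васи́льевич Го́голь",
}))

# Placeholder addresses (not from the article), Romanized and native script.
record.update(tag_by_usage("ADDR_FULL", {
    "PRIMARY": "10 Nikitsky Boulevard, Moscow",
    "NATIVE": "Никитский бульвар, 10, Москва",
}))

print(sorted(record))
```

Keeping the usage type as a prefix means each script variant stays a separate, clearly labeled value within the same record.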