Does G2 do data cleansing?
G2 is not a data cleansing tool. Data cleansing operations (homegrown, tools, or combinations) should be applied to your data before it is submitted to G2. What we do is standardize the data received to make sure it is formatted consistently. This is easy enough with dates, ID numbers, phone numbers, etc. It’s a bit harder with names and addresses.
In terms of data quality, what fields tend to be the most problematic for G2?
Names are tricky. Do your best to use the first and last name fields, as this will produce better results than just placing the whole name in the NAME_FULL field.
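As a minimal sketch of that advice, the mapping below prefers separate name fields and falls back to NAME_FULL only when the source has nothing better. The source column names (first_name, last_name, full_name) are hypothetical, and the target names NAME_FIRST and NAME_LAST are assumptions; only NAME_FULL is named above, so check the attribute names against your G2 documentation.

```python
def map_name(source_row):
    """Map a source row's name columns to name attributes.

    Assumptions: first_name/last_name/full_name are hypothetical source
    columns; NAME_FIRST and NAME_LAST are assumed attribute names.
    """
    first = (source_row.get("first_name") or "").strip()
    last = (source_row.get("last_name") or "").strip()
    if first or last:
        # Prefer the parsed fields whenever the source provides them.
        mapped = {}
        if first:
            mapped["NAME_FIRST"] = first
        if last:
            mapped["NAME_LAST"] = last
        return mapped
    # Fall back to the full name only when no parsed fields exist.
    full = (source_row.get("full_name") or "").strip()
    return {"NAME_FULL": full} if full else {}
```

Rows that carry both parsed and full names keep only the parsed form, so the same person is not presented two different ways.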
Addresses have historically been one of the most burdensome fields to deal with. Nearly all analytical engines require addresses to be parsed and standardized, which is difficult and can require expensive and time-consuming software. NEW TO G2v2: we now prefer a single ADDR_FULL address for scoring and have been seeing excellent results. This eliminates the burden of processing addresses and is key to fast time to value. Obviously, the quality of addresses is important, and leveraging tools to improve that quality is always helpful, but we are achieving excellent results without it.
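If your source stores addresses in separate columns, one way to produce a single ADDR_FULL value is simply to join the non-empty parts. This is a rough sketch, not a required step; the source column names (street, city, state, postal_code, country) are hypothetical.

```python
def build_addr_full(row, parts=("street", "city", "state", "postal_code", "country")):
    """Join the non-empty address columns of a row into one ADDR_FULL string.

    The column names in `parts` are hypothetical; pass your own.
    """
    pieces = []
    for part in parts:
        value = str(row.get(part) or "").strip()
        if value:
            pieces.append(value)
    # Return nothing at all rather than an empty address.
    return {"ADDR_FULL": ", ".join(pieces)} if pieces else {}
```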
What are some commercial address hygiene software options?
There are many vendors on the market who would love to help you. Some offer a service whereby you send them your data and they send you back cleaned up addresses. Others will sell you software that runs in your operation. Some options include:
What about codified fields?
Codified fields like Place of Birth and Citizenship, which are supposed to be a country, need to use the same code whenever the values represent the same country. Likewise, gender should always be coded the same way, e.g., “M” and “F”. This means that if one source uses “Male” and “Female”, you will need to transform these to “M” and “F” before presenting them to G2.
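A transformation like this is usually a small lookup applied before the load file is written. The sketch below assumes hypothetical field names (GENDER, PLACE_OF_BIRTH, CITIZENSHIP) and tiny illustrative lookup tables; the country codes shown are just one possible convention, and what matters is that every source ends up using the same one.

```python
# Hypothetical lookup tables; extend them to cover the values in your sources.
GENDER_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}
COUNTRY_CODES = {
    "us": "US", "usa": "US", "united states": "US",
    "de": "DE", "germany": "DE", "deutschland": "DE",
}


def normalize_code(value, table):
    """Return the canonical code for value, or None if it is not recognized."""
    if value is None:
        return None
    return table.get(value.strip().lower())


def normalize_record(record):
    """Return a copy of record with gender and country fields recoded.

    Field names are assumptions. Unrecognized values are left unchanged
    so they can be reviewed rather than silently dropped.
    """
    out = dict(record)
    for field, table in (
        ("GENDER", GENDER_CODES),
        ("PLACE_OF_BIRTH", COUNTRY_CODES),
        ("CITIZENSHIP", COUNTRY_CODES),
    ):
        if field in out:
            code = normalize_code(out[field], table)
            if code is not None:
                out[field] = code
    return out
```

For example, a record with GENDER "Female" and CITIZENSHIP "USA" comes back with "F" and "US", while an unknown value such as "X" passes through untouched.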
Any other “No No’s” to be aware of?
The ENTITY_KEY field is the unique key used by the source system. It must be unique within a data source; otherwise very different entities end up being resolved together incorrectly (when this happens at scale we call these “fur balls”).
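It is cheap to check for this before loading. The sketch below counts each (data source, ENTITY_KEY) pair and reports any that appear more than once; the DATA_SOURCE field name is an assumption used for illustration.

```python
from collections import Counter


def find_duplicate_keys(records, source_field="DATA_SOURCE", key_field="ENTITY_KEY"):
    """Return {(source, key): count} for keys repeated within a data source.

    The default source_field name is an assumption; adjust it to your mapping.
    """
    counts = Counter((r.get(source_field), r.get(key_field)) for r in records)
    return {pair: n for pair, n in counts.items() if n > 1}
```

The same key appearing in two different data sources is not reported, since uniqueness is only required within a source.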
What happens if G2 is fed incorrectly mapped data or records of exceptionally poor quality?
Well, it depends. Certainly, the “garbage in, garbage out” adage applies. Some data quality issues may literally cause the engine to hiccup. We have tried to code defensively against such blunders, e.g., every record having the same phone number. That said, there almost surely remain crazy data quality conditions that might cause the G2 engine to go insane.
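You can catch the “same phone number on every record” kind of problem yourself before loading. The sketch below flags any value that accounts for more than a chosen share of the populated values in a field; the thresholds and the PHONE_NUMBER field name in the usage note are illustrative assumptions, not G2 settings.

```python
from collections import Counter


def overused_values(records, field, max_share=0.05, min_count=2):
    """Return {value: count} for values that dominate a field.

    A value is flagged when it appears at least min_count times and makes
    up more than max_share of the populated values. Both thresholds are
    illustrative defaults.
    """
    values = [r[field] for r in records if r.get(field)]
    if not values:
        return {}
    counts = Counter(values)
    return {
        value: n
        for value, n in counts.items()
        if n >= min_count and n / len(values) > max_share
    }
```

Calling overused_values(records, "PHONE_NUMBER") on a load file where a default or placeholder number was stamped on most rows would surface that number for review.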
If we have loaded bad data by accident, how do we fix this?
There are several easy options to fix bad data. One popular method is using the special Prune message as described in detail here. Another option is to perform a ‘delete’ operation on the records in error, then re-load the records after the load file is corrected. Worst case, if almost all the data in G2 is in error, just erase everything and re-load.