This article walks you through mapping and loading data into Senzing using the CSV files attached to this article. If you want to follow along, download them to a directory of your choice.
- The Excel spreadsheet is the one marked up in the video.
- The -raw.csv file is the one that actually gets processed.
- The .py Python script is the one we build in the video.
- The .g2c file has the configuration required to load this data.
- The _mapped.json files are the output of the Python script and are ready to load into Senzing.
Please read below, watch this video tutorial, or both. Here we go!
The source data you want to load into Senzing will range from the simple to the complex. Simple is a basic CSV file like the employee list shown below ...
... to the complex, like the OFAC List which looks like this ...
Let's start with the simple, but realize that no matter how complex the data is, there are only two things to do:
- Decide what fields belong to what entity
- Map them to Senzing attributes based on the Generic Entity Specification
This is how we mapped it ... (the new row 1 has the corresponding Senzing attributes)
Things to note ...
- A data source is required, but there is no column for one. This is OK because we can specify a data source for the whole file when we load it.
- Column A, the employee number, was mapped to RECORD_ID because it appears to be unique.
- Columns B through K have one to one mappings with Senzing attributes. Note the use of the PRIMARY label for the name attributes and HOME for the address attributes.
- Columns L-N are payload attributes that don't even need to be registered in Senzing since they do not help resolution. But they are useful to display to users when presented with a match.
- The employer name column is mapped to a group association attribute rather than a name attribute as it is NOT the name of the employee.
- Always map to NAME_ORG if you know it's an organization name. Otherwise map to NAME_FULL or NAME_LAST, NAME_FIRST, etc.
- Always specify a label for names, addresses, and phones. Labels not only help group the components of an address that belong together, they can also help you find the information you want. For example, you may want to see the latest HOME address or the most current CELL phone.
- Always specify the PRIMARY label on a name even if there are no AKAs. A primary name should be considered before an AKA in any best name calculation.
- Always specify the BUSINESS label for the one physical location of an organization. It will keep chain stores and other subsidiaries from resolving even if they have the same name, phone number, and website.
- Don't be afraid to add new attributes when needed. These will mostly be industry-specific IDs that did not make it into the list we ship with. Don't try to put them all into a generic attribute like OTHER_ID. You will likely want to set their behaviors independently, and it's nice to see exactly what matched in the match_key.
- Be conservative when setting behaviors for new attributes. For instance, on a new kind of ID, set it first to a basic F1. Only add the exclusive behavior when you determine it should break matches even between entities that might otherwise resolve. Only add the stable behavior when you determine that it should cement matches between entities that might otherwise not resolve.
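The mapping notes above can be sketched in a few lines of Python. This is only an illustration, not the script built in the video: the CSV column names (emp_num, last_name, and so on) are hypothetical stand-ins for the headers in the attached file, and the attribute names are our best reading of the Generic Entity Specification, so check them against the spec before relying on them. The sketch sets one data source for the whole file, applies the PRIMARY and HOME labels as attribute prefixes, sends the employer name to a group association, and passes payload columns through unchanged.

```python
import csv
import io
import json

# Hypothetical CSV column names on the left; the real headers are in the
# attached -raw.csv file. Attribute names on the right are assumed from the
# Generic Entity Specification.
COLUMN_MAP = {
    "emp_num": "RECORD_ID",
    "last_name": "PRIMARY_NAME_LAST",
    "first_name": "PRIMARY_NAME_FIRST",
    "addr1": "HOME_ADDR_LINE1",
    "city": "HOME_ADDR_CITY",
    "state": "HOME_ADDR_STATE",
    "zip": "HOME_ADDR_POSTAL_CODE",
    "dob": "DATE_OF_BIRTH",
    # The employer is not the employee's name, so it is NOT mapped to NAME_ORG.
    "employer_name": "GROUP_ASSOCIATION_ORG_NAME",
    # Payload columns keep their own names; they do not help resolution.
    "job_title": "job_title",
    "hire_date": "hire_date",
}


def map_row(row, data_source):
    """Turn one CSV row (a dict) into a flat JSON-ready record."""
    # The file has no data source column, so one value covers every row.
    record = {"DATA_SOURCE": data_source}
    for column, attribute in COLUMN_MAP.items():
        value = (row.get(column) or "").strip()
        if value:  # leave empty values out of the record
            record[attribute] = value
    if "GROUP_ASSOCIATION_ORG_NAME" in record:
        record["GROUP_ASSOCIATION_TYPE"] = "EMPLOYER"
    return record


def map_csv(csv_text, data_source):
    """Map CSV text to a list of JSON strings, one per input row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [json.dumps(map_row(row, data_source)) for row in reader]


if __name__ == "__main__":
    sample = (
        "emp_num,last_name,first_name,employer_name,job_title\n"
        "1001,Smith,Ann,ABC Company,Analyst\n"
    )
    for line in map_csv(sample, "EMPLOYEES"):
        print(line)
```

Keeping the whole mapping in one dictionary makes it easy to compare against the marked-up row 1 of the spreadsheet and to move a column between an attribute and payload later.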
Take note ...
By far the most common cause of overmatching is mapping attributes to an entity they don't belong to. For every field, ask yourself: does this attribute really belong to the entity I am trying to map? If the answer is no, either include it as payload or map it to the entity it does belong to and relate this entity to it.
Please refer to the video linked above for more information on mapping!