This article guides you through mapping and loading data into Senzing using the CSV file and supporting files attached to this article. If you want to follow along, download them to a directory of your choice.
- The Excel spreadsheet is the one marked up in the video.
- The -raw.csv file is the one that actually gets processed.
- The .py Python script is the one we build in the video.
- The .g2c file has the configuration required to load this data.
- The _mapped.json files are created by the Python script and are ready to load into Senzing.
Important note!
- The video below references ENTITY_TYPE, which has been replaced with RECORD_TYPE.
- You can map the employer name to either GROUP_ASSOCIATION_ORG_NAME or EMPLOYER_NAME.
Please read below and/or watch the video tutorial. Here we go!
The source data you want to load into Senzing will range from simple to complex. Simple is a basic CSV file like the employee list shown below ...
... to the complex, like the OFAC List, which looks like this ...
Let's start with the simple, but realize no matter how complex, there are only two things to do:
- Decide what fields belong to what entity
- Map them to Senzing attributes based on the Generic Entity Specification
This is how we mapped it ... (the new row 1 has the corresponding Senzing attributes)
Things to note ...
- A data source is required, but there is no column for one. This is OK because we can specify a data source for the whole file when we load it.
- Column A, the employee number, was mapped to RECORD_ID because it appears to be unique.
- Columns B through K have one-to-one mappings with Senzing attributes. Note the use of the PRIMARY label for the name attributes and the HOME label for the address attributes.
- Columns L-N are payload attributes that don't even need to be registered in Senzing since they do not help resolution, but they are useful to display to users when presented with a match.
- The employer name column is mapped to a group association attribute rather than a name attribute, as it is NOT the employee's name.
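The notes above can be sketched in a few lines of Python. This is a minimal illustration, not the script built in the video: the source column names (emp_num, last_name, and so on), the sample rows, the EMPLOYEES data source code, and the PERSON record type are all assumptions made for the example. The Senzing attribute names follow the label-prefix pattern described above and should be checked against the Generic Entity Specification.

```python
import csv
import io
import json

# Hypothetical source column -> Senzing attribute mapping.
# The left-hand column names are invented for this sketch.
COLUMN_MAP = {
    "emp_num": "RECORD_ID",             # appears to be unique
    "last_name": "PRIMARY_NAME_LAST",   # PRIMARY label on name parts
    "first_name": "PRIMARY_NAME_FIRST",
    "addr1": "HOME_ADDR_LINE1",         # HOME label on address parts
    "city": "HOME_ADDR_CITY",
    "state": "HOME_ADDR_STATE",
    "zip": "HOME_ADDR_POSTAL_CODE",
    "employer": "EMPLOYER_NAME",        # NOT a name attribute of the person
    "job_title": "job_title",           # payload: kept for display only
}


def map_row(row, data_source):
    """Turn one CSV row (a dict) into a Senzing-style JSON record (a dict)."""
    # The data source is supplied once for the whole file, not per column.
    record = {"DATA_SOURCE": data_source, "RECORD_TYPE": "PERSON"}
    for column, attribute in COLUMN_MAP.items():
        value = (row.get(column) or "").strip()
        if value:  # skip empty cells instead of emitting blank attributes
            record[attribute] = value
    return record


def map_csv(text, data_source):
    """Map every row of CSV text and return a list of records."""
    reader = csv.DictReader(io.StringIO(text))
    return [map_row(row, data_source) for row in reader]


SAMPLE = (
    "emp_num,last_name,first_name,addr1,city,state,zip,employer,job_title\n"
    "1001,Smith,Jane,123 Main St,Las Vegas,NV,89101,ABC Company,Analyst\n"
    "1002,Jones,Bob,,,,,ABC Company,\n"
)

records = map_csv(SAMPLE, "EMPLOYEES")
for rec in records:
    print(json.dumps(rec))  # one JSON record per line
```

Writing one JSON record per line keeps the output easy to inspect before loading, and dropping empty cells avoids sending blank attribute values.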
Best practices
There are multiple schools of thought here, depending on how you use Senzing. Our Chief Architect, Jeff Butcher, is the master of consuming data directly from Senzing with as rich a context as possible. Our Mr. Performance, Brian Macy, prefers to optimize Senzing for entity resolution and join that insight with other systems/data for analysis.
So here are your two [nearly the same] schools of thought...
Butcher's suggestions
- Always map to NAME_ORG if you know it's an organization name. Otherwise, map to NAME_FULL or NAME_LAST/NAME_FIRST, etc.
- Always specify a label for names, addresses, and phones. Not only do they help group the components of an address that belong together, but they can also be used to help find the information you want, like you may want to see the latest HOME address or the most current CELL phone.
- Always specify the PRIMARY label on a name, even if there are no AKAs. A primary name should be considered before an AKA in any best name calculation.
- Always specify the BUSINESS label for the one physical location of an organization. It will keep chain stores and other subsidiaries from resolving even if they have the same name, phone number, and website.
- Don't be afraid to add new attributes when needed. Contact support@senzing.com for free 30-minute training on how to do this. These will mostly be industry-specific IDs that did not make it into the list we ship with. Don't try to put them all into a generic attribute like OTHER_ID. You will likely want to set their behaviors independently, and it's nice to see precisely what matched in the match_key.
- Be conservative when setting behaviors for new attributes. For instance, on a new kind of ID, set it first to a basic F1. Only add the exclusive behavior when you determine it should break matches even between entities that might otherwise resolve. Only add the stable behavior when you determine it should cement matches between entities that might otherwise not resolve.
Macy's suggestions
- Always map to NAME_ORG if you know it's an organization name. Otherwise, map to NAME_FULL or NAME_LAST/NAME_FIRST, etc.
- Only specify labels if they are meaningful to you. Otherwise, you get a copy of the same value for every different label. The exception is that mobile phones should have the label MOBILE to increase their importance, and the physical locations of an organization should be labeled BUSINESS (see the BUSINESS label suggestion below). No other labels have meaning to the Senzing analytics, so only use labels if they are essential to you.
- Senzing puts an entity name on the records for quickly differentiating entities. The logic is that if you have names with the label PRIMARY, it will pick the longest one. Once you load your data and most records have PRIMARY labels on names, you get the longest one anyway. If you really want the "best" name, that is something you will want to select more intelligently yourself (which source? latest? native script? ...).
- Always specify the BUSINESS label for the one physical location of an organization. It will keep chain stores and other subsidiaries from resolving even if they have the same name, phone number, and website.
- Don't be afraid to add new attributes when needed. Contact support@senzing.com for free 30-minute training on how to do this. These will mostly be industry-specific IDs that did not make it into the list we ship with. Don't try to put them all into a generic attribute like OTHER_ID. You will likely want to set their behaviors independently, and it's nice to see exactly what matched in the match_key.
- Be conservative when setting behaviors for new attributes. For instance, on a new kind of ID, set it first to a basic F1. Only add the exclusive behavior when you determine it should break matches even between entities that might otherwise resolve. Only add the stable behavior when you determine that it should cement matches between entities that might otherwise not resolve.
Take note ...
By far the most common cause of overmatching is mapping attributes to an entity that don't belong to it. On every field, ask yourself: "Does this attribute really belong to the entity I am trying to map?" If the answer is no, either include it as a payload or map it to the entity it does belong to and relate this entity to it.
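As a rough sketch of the second option, the function below splits one employee row into two records: the person, and the employer as its own organization entity, with a relationship pointing from the person to the employer. Everything specific here is an assumption for illustration: the row keys, the EMPLOYERS data source, the EMPLOYER domain, the EMPLOYED_BY role, and the way the employer key is derived from the name. The REL_ANCHOR_* and REL_POINTER_* attribute names reflect our reading of the Generic Entity Specification and should be verified there before use.

```python
def split_record(row, data_source):
    """Return [employee_record, employer_record] for one hypothetical row.

    The employer's name is mapped to NAME_ORG on the employer record only,
    so it can never be mistaken for a name of the employee.
    """
    employer_name = row["employer"].strip()
    # Assumption: a normalized name is good enough as the employer's key.
    employer_key = " ".join(employer_name.upper().split())

    employer = {
        "DATA_SOURCE": "EMPLOYERS",        # hypothetical data source code
        "RECORD_ID": employer_key,
        "RECORD_TYPE": "ORGANIZATION",
        "NAME_ORG": employer_name,
        # Anchor: the identity that other records can point to.
        "REL_ANCHOR_DOMAIN": "EMPLOYER",
        "REL_ANCHOR_KEY": employer_key,
    }
    employee = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": row["emp_num"],
        "RECORD_TYPE": "PERSON",
        "PRIMARY_NAME_FULL": row["name"],
        # Pointer: relates this person to the employer anchored above.
        "REL_POINTER_DOMAIN": "EMPLOYER",
        "REL_POINTER_KEY": employer_key,
        "REL_POINTER_ROLE": "EMPLOYED_BY",
    }
    return [employee, employer]


example = split_record(
    {"emp_num": "1001", "name": "Jane Smith", "employer": "ABC  Company"},
    "EMPLOYEES",
)
```

With this shape, the employer's attributes live only on the employer record, so two employees of the same company do not resolve to each other just because they share an employer.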
Please refer to the video linked above for more information on mapping!