Entities change every time you load data. Periodically, you will want to take a snapshot to calculate and review the following statistical reports ...
- dataSourceSummary: This report shows the matches, possible matches and relationships within each data source.
- crossSourceSummary: This report shows the matches, possible matches and relationships across data sources.
- entitySizeBreakdown: This report categorizes entities by their size (how many records they contain) and selects a list of entities to review that may be of interest due to multiple names, addresses, DOBs, etc.
Please follow the instructions below and/or watch this video tutorial.
To take a snapshot ...
- If you installed directly on Linux, navigate to the <your project>/g2/python directory.
- If you installed the senzing-up docker image, execute the <install directory>/docker-bin/senzing-console.sh shell script. From there, navigate to /opt/senzing/g2/python.
If you are not sure where either of these are, please review Exploratory Data Analysis 1 - Loading the truth set demo.
From the python directory type ...
./G2Snapshot.py -o demo/truth/demo-snap-v1
Note that the -o (output_file_root) value does not include an extension; the program automatically appends .json to create the snapshot report file used by the G2Explorer program.
Or if you are taking the snapshot to perform an audit, type ...
./G2Snapshot.py -o demo/truth/demo-snap-v1 -a
The -a (--for_audit) flag also writes out a .csv file used by the G2Audit program described later in this series.
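The file naming behavior described above can be sketched as follows (the echo lines are illustrative only; G2Snapshot.py itself appends the extensions):

```shell
# Illustrative only: -o takes a file root, not a file name.
# A root of demo/truth/demo-snap-v1 yields these output files:
root="demo/truth/demo-snap-v1"
echo "${root}.json"   # snapshot report, read by G2Explorer
echo "${root}.csv"    # audit file, only written when -a is supplied
```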
Your screen should look like this ...
It's a good idea to set up snapshot and audit directories for your project and to name your snapshots something meaningful.
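For example, one possible layout (the directory names and date-stamped naming scheme here are just a suggestion, not anything the tools require):

```shell
# Suggested layout only; G2Snapshot.py does not require any particular path.
mkdir -p demo/snapshots demo/audits

# A date-stamped file root makes it easy to compare snapshots over time.
snapshot_root="demo/snapshots/truthset-$(date +%Y-%m-%d)"
echo "$snapshot_root"

# Then take the snapshot against that root, e.g.:
#   ./G2Snapshot.py -o "$snapshot_root" -a
```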
./G2Explorer.py -s demo/truth/demo-snap-v1.json
Your screen should look like this ... (be sure to type help at the (g2) prompt)
Data Source Summary
From the (g2) prompt, type ...
dataSourceSummary
Your screen should look like this ...
- The Records column shows the number of records in each data source.
- The Entities column shows the number of distinct entities they were reduced to.
- The Compression column computes the percent of duplicate records found. In the above example, the customer file contained 50% duplicates.
- The Singletons column shows the number of entities that didn't match any others.
- The Duplicates column shows how many entities there are with more than one record.
- The Ambiguous column shows how many relationships were created because an entity could have matched more than one other entity equally well.
- The Possibles column shows how many entities almost matched because important attributes both agree and disagree. For instance, they might share a name and address but have different dates of birth.
- The Relationships column shows how many entities were related because they only have lesser attributes in common, like addresses and phone numbers.
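The Compression figure above can be reproduced by hand: it is the share of records that turned out to be duplicates of an already-seen entity. A quick sketch of the arithmetic (the 100-record / 50-entity numbers are made up to match the 50% example):

```shell
# Illustrative arithmetic only: compression = (records - entities) / records.
# 100 hypothetical customer records resolving to 50 entities -> 50% duplicates.
awk 'BEGIN { records = 100; entities = 50;
             printf "%.0f%%\n", (records - entities) * 100 / records }'
```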
At the (g2) prompt, type "help dataSourceSummary" to find out how you can drill into the various statistics to see examples of entities fitting each criterion. In fact, you can type "help <anyCommand>" to learn how to use any command.
To review the 33 duplicate customers, at the (g2) prompt type ...
dataSourceSummary CUSTOMERS duplicates
Your screen should look like this ...
You are now stepping through examples of customer duplicates. At the prompt, type ...
- P for the prior example
- N for the next example
- S to scroll if the table appears cut off
- D to display the detail view of the entity
- W to show the why screen for this example
- E to export the JSON records in the example to a file
- Q to quit browsing the examples in the chosen category
Go ahead and step through some records by pressing "N". Show the detail by pressing "D" or ask why by pressing "W" on any examples you have questions about. Finally, press "Q" to quit when done.
Cross Source Summary
From the (g2) prompt, type ...
crossSourceSummary
Your screen should look like this ...
To review the 5 customers on the watchlist, at the (g2) prompt type ...
crossSourceSummary CUSTOMERS WATCHLIST duplicates
Your screen should look like this ...
Just like with the dataSourceSummary, you can step through records by pressing "N" for next or "P" for previous. Show the detail by pressing "D" or ask why by pressing "W" on any examples you have questions about. Finally, press "Q" to quit when done.
Entity Size Breakdown
From the (g2) prompt, type ...
entitySizeBreakdown
Your screen should look like this ...
- The first two columns tell you how many entities there are of each size. An entity's size refers to how many records resolved to it. For example ...
- there are 15 entities that only have 1 record each, aka "singletons"
- there are 28 entities with 2 records each
- there are 3 entities with 3 records each
- The next two columns show the count of entities flagged for review and the reasons to review them. For example, we really only expect a person entity to have 1 date of birth and 1 gender, yet it might legitimately have up to 4 or 5 addresses or phones. So any entity that has more than the expected number of an attribute gets flagged for review. For example, in this data set ...
- there is only 1 entity to review for having more than the expected number of genders.
Entities flagged for review will fall into one of three categories ...
- Explainable due to typos and misspellings or just plain bad data.
- Intentional obfuscation in an attempt to avoid detection (aka fraud). A criminal trying to hide their identity uses certain names with certain addresses, identifiers, dates of birth, etc. Eventually they use the wrong name with the wrong identifier and we resolve them together.
- Overmatched and should be manually pulled apart. There is always a balance to be struck between undermatching and overmatching: tighten scoring thresholds too much and you miss matches (false negatives); loosen them too much and you get overmatching (false positives).
Our out-of-the-box settings and scoring thresholds favor false negatives, preferring to demote potential false positives to possible matches, where they are still linked to the entity and easily accessible.
However, there are some data anomalies that can cause overmatching, particularly on organizations, where there is often not enough information (such as a tax_id) to keep them apart, or on really dirty data, where an address or other feature presented as belonging to one entity really belongs to another.
Usually these are the exception rather than the rule! To manually force the records in an overmatched entity apart, see the article Forcing records together or apart.
To drill into entitySizeBreakdown examples, from the (g2) prompt type ...
entitySizeBreakdown = 6
Your screen should look like this ...
Notice this entity did not get flagged for review even though it has multiple names and SSNs. That is because all of those names score high enough against each other to be considered the same name, and likewise for the SSNs.
Next try ...
entitySizeBreakdown review
Your screen should look like this ...
Then press "W" for why, and your screen should look like this ...
Do you agree these records belong to the same entity? If not, you could force them apart or even demote rule 111 (in the why result row above) to be a possible match. The Senzing engine is completely tunable!
To learn how to compare snapshots, continue to the next article, Exploratory Data Analysis 4 - Comparing ER results.