The G2Audit utility was designed to compare Senzing ER results with ...
- Postulated ER results from a truth set (see How to create an entity resolution truth set)
- ER results from other engines
- Prior Senzing snapshots of the same data set after tuning the Senzing engine
The process of auditing a truth set consists of …
- Load a data set
- Take a snapshot
- Compare the snapshot with the expected results (truth set key) or a prior snapshot
- Explore the results!
For this tutorial, we will be using the truth set that ships with Senzing.
- If you installed directly on Linux, navigate to the <your project>/g2/python directory.
- If you installed the senzing-up docker image, execute the <install directory>/docker-bin/senzing-console.sh shell script. From there, navigate to /opt/senzing/g2/python.
If you are not sure where either of these is, please review Exploratory Data Analysis 1 - Loading the truth set demo.
From the python directory, your directory structure should look like this ...
Comparing the results
Execute the following commands from the python directory ...
Step 1 - Load the truth set data file 1
Note that we are purging the database with -P and naming the data_source TRUTH_SET1 with the /? directive after the file name.
If you received errors on this step it's likely due to the environment not being properly initialized. Try executing the following statement from the python directory ...
If you are still receiving errors, something went wrong during installation. Please review the prior installation documents or file a support ticket here
Step 2 - Take a snapshot
If you have any trouble with this step, go back and review Exploratory Data Analysis 3 - Taking a snapshot
Step 3 - Compare the snapshot to the truth set key file 1
--newer_csv_file demo/truth/truthset-person-v1-set1-snapshot.csv \
--prior_csv_file demo/truth/truthset-person-v1-set1-key.csv \
The comparison process compares the new snapshot (newer_csv_file) with the truth set's key (prior_csv_file) and writes the statistics and complete results to the (output_file_root).
Note that the --output_file_root does not have an extension because two files will be generated: a json file that holds the statistics and samples, and a csv file that holds all the results so they can be imported into a database for further analysis.
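As a minimal sketch of that last point, the csv results can be pulled into a local SQLite database with nothing but the Python standard library. The function name, the table name audit_results and the database file are hypothetical choices for illustration. The column names are taken from whatever header row the csv file contains, so nothing about the file layout is assumed here.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, conn, table_name="audit_results"):
    """Load every row of a CSV file into a SQLite table.

    Column names come from the CSV header row; all values are stored as text.
    The table name is a hypothetical default, not something G2Audit defines.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        # Quote column names so headers with spaces or reserved words still work.
        columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn.execute('DROP TABLE IF EXISTS "{}"'.format(table_name))
        conn.execute('CREATE TABLE "{}" ({})'.format(table_name, columns))
        # Skip rows whose column count does not match the header.
        rows = [row for row in reader if len(row) == len(header)]
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table_name, placeholders), rows
        )
    conn.commit()
    return len(rows)
```

To use it, open a connection with sqlite3.connect("audit.db") (any file name will do), pass the path of the csv file that G2Audit generated, and then query the table with ordinary SQL.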
The final output of the audit process should look like this ...
computing statistics ...
67.0 prior positives
0 new positives. <--no unexpected matches
0 new negatives <--no unexpected missed matches
1.0 f1-score. <-- 1.0 = 100%, a perfect score!
45 prior entities
45 new entities
45 common entities
0 merged entities
0 split entities
0 overlapped entities
process completed successfully!
For more information on what these statistics mean, please review Understanding the G2Audit statistics
The screen will look slightly different on earlier versions of Senzing. The statistics were simplified in Senzing version 2.1, but the data is basically the same.
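To make the relationship between these counts and the f1-score concrete, here is a small sketch of the standard pairwise precision, recall and F1 formulas. The mapping of G2Audit's counts onto true positives, false positives and false negatives is an assumption made for illustration (see the comments). It is not taken from the G2Audit source.

```python
def pairwise_scores(prior_positives, new_positives, new_negatives):
    """Return (precision, recall, f1) from pairwise counts.

    Assumed mapping (an illustration, not confirmed by the tutorial):
      true positives  = prior_positives - new_negatives
      false positives = new_positives   (pairs made that were not expected)
      false negatives = new_negatives   (expected pairs that were not made)
    """
    true_positives = prior_positives - new_negatives
    predicted = true_positives + new_positives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / prior_positives if prior_positives else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts from the output above: 67 prior positives, 0 new positives, 0 new negatives.
print(pairwise_scores(67, 0, 0))  # (1.0, 1.0, 1.0)
```

Under this mapping, any new positive lowers precision and any new negative lowers recall, so the F1 score drops below 1.0 as soon as either count is non-zero.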
Create differences to review
Since the default results are perfect, let’s make some changes to the expected results.
First make a copy of the current key ...
cp demo/truth/truthset-person-v1-set1-key.csv demo/truth/truthset-person-v1-set1-key-edited.csv
Then, using your favorite editor, make the following changes to demo/truth/truthset-person-v1-set1-key-edited.csv.
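After editing, it can be worth confirming that only the rows you meant to change are different. The sketch below compares the original key with the edited copy row by row. The function name is hypothetical, and nothing is assumed about the columns in the key file: each row is compared as a whole.

```python
import csv


def changed_rows(original_path, edited_path):
    """Return (removed, added).

    removed: rows present only in the original file
    added:   rows present only in the edited file
    Rows are compared as whole tuples, so any edited column counts as a change.
    """
    def read_rows(path):
        with open(path, newline="", encoding="utf-8") as handle:
            return [tuple(row) for row in csv.reader(handle)]

    original = read_rows(original_path)
    edited = read_rows(edited_path)
    original_set, edited_set = set(original), set(edited)
    removed = [row for row in original if row not in edited_set]
    added = [row for row in edited if row not in original_set]
    return removed, added
```

Calling changed_rows("demo/truth/truthset-person-v1-set1-key.csv", "demo/truth/truthset-person-v1-set1-key-edited.csv") from the python directory should list only the rows for the records you edited, once as they were and once as they are now.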
Re-run the comparison against the edited key
No need to reload or take a new snapshot. We only changed the postulated groupings of records 1004, 1005, 1025 and 1026.
--newer_csv_file demo/truth/truthset-person-v1-set1-snapshot.csv \
--prior_csv_file demo/truth/truthset-person-v1-set1-key-edited.csv \
You will notice you now have some new positives and negatives!
Perform the audit
Next go into the G2Explorer to review the results ...
--snapshot_json_file demo/truth/truthset-person-v1-set1-snapshot.json \
Hint: Type "help" at this prompt at any time to see a list of things you can do!
To analyze audit statistics, type “auditSummary”. You should see the following screen …
Again, the screen will look slightly different on earlier versions of Senzing. The statistics were simplified, but the data is basically the same. The numbers of prior and new positives are higher because the prior "accuracy" calculations were replaced with "pairwise" statistics, which is truer to the traditional analysis.
- The first thing to understand about this screen is that matches you expected but didn’t get are the splits.
- But you may also find that Senzing made matches that you did not expect! These are the merges. Sometimes they are pleasant surprises, and sometimes they are the unintended result of using too close a name or address on different clusters, in which case the truth set should be corrected.
- The second thing to notice is that although 7 new positives were created, only 1 entity was affected, which is often the case.
- Statistics such as Precision, Recall and F1 scores are based on records. However, it is best to review the entities that got split or merged or both!
Reviewing the unexpected matches
To view the merged entities, type “auditSummary merge”. You should see the following screen …
This gives you a summary of the reasons why entities were merged. Usually the sub-category is a match_key, such as NAME+ADDRESS or NAME+PHONE. But sometimes there are multiple reasons as is the case with this example.
To see the actual entities merged, type “auditSummary merge 1” for the list of entities affected by sub-category 1 in this list. You should see the following screen …
Hint: press "S" to scroll any time the report runs off the end of the screen. The report will then be placed into a program that allows you to use the arrow keys to scroll up, down, left and right. Be sure to press q to quit the scrolling program when done.
Now you can see the 2 positives that we just created by changing the cluster_id of records 1004 and 1005. It may seem odd that the action we took was to split the entity TI-1000-1, yet the audit reported a merge. But when you think about it, this makes sense: we said the records should not be part of the entity, but the engine said they should. Therefore it’s a merge.
You may also wonder how adding these two records to the group resulted in 7 new positives. This is due to the pairwise calculation for precision, recall and F1. Record 1004 now matches 1001, 1002 and 1003 (3 pairs). Record 1005 now matches 1001, 1002, 1003 and 1004 (4 pairs). Together that makes 7 new pairs.
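The pairwise count can be reproduced with a few lines of Python. This is only an illustration of the counting idea, not G2Audit's own code. It assumes the edited key places records 1004 and 1005 in clusters of their own, while the engine keeps all five records in one entity.

```python
from itertools import combinations


def pairs(clusters):
    """Return the set of unordered record pairs implied by a list of clusters."""
    result = set()
    for cluster in clusters:
        result.update(combinations(sorted(cluster), 2))
    return result


# Assumed for illustration: the edited key isolates 1004 and 1005,
# while the engine resolves all five records into a single entity.
expected = [{1001, 1002, 1003}, {1004}, {1005}]
resolved = [{1001, 1002, 1003, 1004, 1005}]

# Pairs the engine made that the key did not expect.
new_positives = pairs(resolved) - pairs(expected)
print(len(new_positives))  # 7
print(sorted(new_positives))
```

Five records in one entity imply 10 pairs, and the edited key only expects the 3 pairs among 1001, 1002 and 1003, which leaves the 7 new positives.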
The next usual action is to ask the system why it merged those records. The prompt “Select (P)revious, (N)ext, (S)croll, (W)hy, (E)xport, (Q)uit ...” tells you the actions you can take. Typing “W” for why should return the following screen …
There is a lot to learn about this screen and what it means. But it is rather obvious that those last two columns are the records that we said should not be part of this entity. Yet the name and address scores are quite high, in the 90s, and nothing else detracts.
This is where you decide what you believe more: the postulated result in the truth set or what this why screen tells you. Sometimes you might even decide to adjust your truth set to what Senzing engine returns!
The next thing to do is type “N” or “P” at the prompt to scroll through the list of entities in this sub-category, asking why on each one that doesn’t seem obvious. Finally, type Q to quit when you are done viewing entities in this category.
What the why screen is telling you
When you are back to the regular prompt, type “help why” for an explanation of what the why results colors and symbols mean. You should see the following screen …
Aside from the red, green, yellow score indicators, the bracket legend is particularly useful for when entities don’t match. For instance …
- The ~ indicates that there are just too many entities with that value for it to be useful. For instance, the name Joe Smith may have 100s of entities and the date of birth 1/2/1980 may have 100s of entities, so we stop using them for finding candidate matches. But this is why we create all those name and address keys … because likely there are only a few Joe Smiths born on 1/2/1980.
- The ! indicates that so many entities use this value that it is likely garbage data, like the name “test customer” or the address “unknown”.
- The # indicates that the value was suppressed because there is a more complete value that better represents the entity. For instance, if “Patrick Smith” and “Patricia Smith” both have an aka of “P Smith”, the best name match is between Patrick and Patricia, not between the shorter P Smith values! Please note this happens more often than you might think on actual data sets!
Hint: Note that this help indicates you can ask why for any entities at any time. You don’t have to perform an audit first. The viewer can be used whenever someone wants to know why records matched or didn’t match. That makes it a nice back-end tool for more knowledgeable support staff who investigate user requests to understand why a match was or was not made, and who can help tune the engine if need be.
Reviewing the missed matches
This is what you came for! Why didn't Senzing make the matches I expected it to?
So next type “auditSummary split” and “auditSummary split 1” to view the matches you did not get. You should see the following screens …
Here is the entity we changed to have the same cluster_id. Let’s ask the system why by typing “W” at the prompt. You should see the following screen …
At first glance, you might wonder why these records did not resolve. It’s all green! But look at the relationship row highlighted above. They only share a name and date of birth and the name is only close, not exact.
However, the match was not missed! Senzing just classified it as a relationship. You can always ask Senzing to show you the possible matches to any entity whenever you like. We strive to only put records together when we are sure. But we make the possible matches, the ones that fell just short, available if the mission or the user wants to loosen up what they want to see to make the right decision.
There will likely be matches you wish were made but were not, as well as matches you didn't think would be made but were. Perfection is not so easy to attain, especially using made-up data.
This is where the precision, recall and F1 scores come in: they show how far you are from the expected result. Having these in the 90 percent range is what you want to see.
But rest assured, if you are not getting good scores in your audit even after reviewing the why results, something can be done about it! The engine can be tuned: thresholds can be changed, additional keys can be created, and rules can be added. If this is the case for you, please contact us and we will help you through it.