This is the process used for Senzing versions prior to 2.0. If you are using version 2.0 or above, please see the updated article Exploratory Data Analysis 4 - Comparing ER results.
The process of auditing a truth set consists of …
- Load the truth set data (usually a CSV file)
- Take a snapshot
- Audit the truth set against the snapshot
- Explore the results
Creating a truth set
Attached is an example truth set with the following columns ...
In the example file above …
- The red cluster_id column groups records that should resolve together. It is a purely made-up alphanumeric value.
- The blue data_source, entity_type, and record_id columns are required by Senzing and form the unique key of a record. Please note: record_ids must be unique within a data source!
- The black columns are the data fields for each record and must use our pre-defined column names, described in the article Generic Entity Specification. (A made-up sample is sketched below.)
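For illustration, a few rows of such a truth set might look like the following sketch. The values are invented here, though NAME_FULL, DATE_OF_BIRTH, and ADDR_FULL are standard Senzing attribute names; refer to the attached file for the real example.

```
cluster_id,data_source,entity_type,record_id,NAME_FULL,DATE_OF_BIRTH,ADDR_FULL
TI-1000-1,TEST,PERSON,1001,Robert Smith,1/2/1980,123 Main St Las Vegas NV
TI-1000-1,TEST,PERSON,1002,Bob Smith,1/2/1980,123 Main Street Las Vegas NV
TI-1000-2,TEST,PERSON,1003,Mary Jones,3/4/1975,456 Elm Ave Dallas TX
```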
The audit program needs to know which fields in the truth set file represent the unique key of a record, as well as which field is the cluster_id that shows how records should be grouped into entities. The companion map file is a simple JSON structure named after the original file with a .map extension. I have attached a truth set CSV and its associated map file for you to refer to. Here is the format …
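As a sketch only, such a map file might look something like this; the key names here are assumptions, so consult the attached .map file for the exact schema:

```json
{
    "data_source_field": "data_source",
    "record_id_field": "record_id",
    "cluster_field": "cluster_id"
}
```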
Where to get the POC utilities
The snapshot and viewer are on the Senzing GitHub site, located here …
The poc_audit.py program can only be emailed to you directly. Please contact support if you would like a copy.
Place all 3 python scripts in the python directory of your Senzing project (along with G2Loader.py, G2Command.py, etc.).
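Once copied, that directory should contain something like the following. The /project root is an assumption here, chosen to match the paths used later in this article:

```
/project/python/
    G2Loader.py      <- standard Senzing project scripts
    G2Command.py
    poc_snapshot.py  <- downloaded from the Senzing GitHub site
    poc_viewer.py
    poc_audit.py     <- obtained from Senzing support
```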
Performing the Audit
Download the files attached to this article and update the shell script to point to where you placed them. The script assumes you placed them in a directory named /project/data.
# The full script is attached to this article; each ... below marks arguments elided in this excerpt.
echo Step 1 - Load the truth set into Senzing
python3 G2Loader.py -f /project/data/truthset-person-v1-set1-data.csv

echo Step 2 - Take a snapshot of the senzing result
python3 poc_snapshot.py ...

echo Step 3 - Audit the Senzing result against original truth set
python3 poc_audit.py \
    --prior_csv_file /project/data/truthset-person-v1-set1-data.csv \
    --newer_csv_file /project/audit/truthset-person-v1-set1-snapshot.csv \
    ...

echo Step 4 - View the statistics and drill into examples to find out why
python3 poc_viewer.py \
    --snapshot_json_file /project/audit/truthset-person-v1-set1-snapshot.json \
    ...
Reviewing the results
If you run the updated shell script, you should see the following result ...
Step 3 - Audit the Senzing result against original truth set
loading /project/data/truthset-person-v1-set1-data.csv ...
loading /project/audit/truthset-person-v1-set1-snapshot.csv ...
45 entities processed at 10:43pm, 8343552 per second, complete!
computing statistics ...
45 prior entities
45 new entities
45 common entities
33 prior clusters
33 new clusters
33 common clusters
67.0 prior pairs
67.0 new pairs
67.0 common pairs
78 prior positives
0 new positives <-- no unexpected matches
0 new negatives <-- no unexpected missed matches
1.0 f1-score <-- 1.0 is 100%, a perfect score
While precision, recall, and F1 scores are computed at several levels, the truth set statistics are best represented by the bolded set at the bottom. Please see the following article for more information on what these statistics mean: Understanding the POC audit statistics
Create differences to review
Since the default results are perfect, let’s make some changes to the expected results. Make these changes to the truthset-person-v1-set1-data.csv downloaded from this article by replacing the current character with an X as shown below.
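For instance, one of the changes discussed later in this article alters the cluster_id of rows 5 and 6. As a sketch (the record values here are invented; only the cluster_id change matters):

```
row 5 before:  TI-1000-1,TEST,PERSON,1005,...
row 5 after:   TI-1000-X,TEST,PERSON,1005,...
row 6 before:  TI-1000-1,TEST,PERSON,1006,...
row 6 after:   TI-1000-X,TEST,PERSON,1006,...
```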
Then re-run steps 3 and 4 in the shell script. Notice that the audit results now show …
80 prior positives
2 new positives
1 new negatives
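To make these numbers concrete, here is a minimal sketch in Python of how pairwise precision, recall, and F1 could be derived from such counts. It assumes "new positives" count unexpected pairs (false positives) and "new negatives" count missed pairs (false negatives); see Understanding the POC audit statistics for the definitions the audit program actually uses.

```python
def pairwise_scores(prior_positives, new_positives, new_negatives):
    # pairs both the truth set and Senzing agree on (assumed definition)
    true_positives = prior_positives - new_negatives
    precision = true_positives / (true_positives + new_positives)
    recall = true_positives / prior_positives
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# counts from the modified run above: 80 prior positives, 2 new, 1 missed
print(pairwise_scores(80, 2, 1))  # approximately (0.975, 0.988, 0.981)
```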
Analyzing the results
Step 4 of the shell script launches the poc_viewer, which presents you with a (poc) prompt.
Hint: Type "help" at this prompt at any time to see a list of things you can do!
To analyze the stats in the poc_viewer, type “auditSummary”. You should see the following screen …
- The first thing to understand is that matches you expected but didn’t get are the splits.
- But you may also find that Senzing made matches you did not expect! These are the merges. Sometimes they are pleasant surprises, and sometimes they are the unintended result of using too similar a name or address on different clusters, in which case the truth set should be corrected.
- The second thing to notice is that although 2 new positives were created, only 1 entity was affected, which often happens.
- Statistics such as precision, recall, and F1 scores are based on records. However, it is best to review the entities that got split, merged, or both!
Reviewing the unexpected matches
To view the merged entities, type “auditSummary merge”. You should see the following screen …
This gives you a summary of the reasons entities were merged. Usually the sub-category is a match_key, such as NAME+ADDRESS or NAME+PHONE. But sometimes there are multiple reasons, as is the case in this example.
To see the actual entities merged, type “auditSummary merge 1” for the list of entities affected by sub-category 1. You should see the following screen …
Hint: press "S" to scroll any time the report runs off the end of the screen. The report will then be placed into a program that lets you use the arrow keys to scroll up, down, left, and right. Be sure to press "q" to quit the scrolling program when done.
Now you can see the 2 positives we just created by changing the cluster_id of rows 5 and 6 in the spreadsheet. It may seem odd that the action we took was to split the entity TI-1000-1, yet the audit reports a merge. But this makes sense when you think about it: we said the records should not be part of the entity, but the engine said they should. Therefore it’s a merge.
The next usual action is to ask the system why it merged those records. The prompt “Select (P)revious, (N)ext, (S)croll, (W)hy, (E)xport, (Q)uit ...” tells you the actions you can take. Typing “W” for why should return the following screen …
There is a lot to learn about this screen and what it means. But it is fairly obvious that the last two columns are the records we said should not be part of this entity. Yet the name and address scores are quite high, in the 90s, and nothing else detracts.
This is where you decide which you believe more: the postulated result in the truth set or what this why screen tells you. Sometimes you might even decide to adjust your truth set to match what the Senzing engine returns!
The next thing to do is type “N” or “P” at the prompt to scroll through the list of entities in this sub-category, asking why on each one that doesn’t seem obvious. Finally, type “Q” to quit when you are done viewing entities in this category.
What the why screen is telling you
When you are back at the regular prompt, type “help why” for an explanation of what the colors and symbols in the why results mean. You should see the following screen …
Aside from the red, green, and yellow score indicators, the bracket legend is particularly useful when entities don’t match. For instance …
- The ~ indicates that there are just too many entities with that value for it to be useful. For instance, the name Joe Smith may have 100s of entities and the date of birth 1/2/1980 may have 100s of entities, so we stop using them for finding candidate matches. But this is why we create all those name and address keys … because there are likely only a few Joe Smiths born on 1/2/1980.
- The ! indicates that so many entities use this value that it is likely garbage data, like the name “test customer” or the address “unknown”.
- The # indicates that the value was suppressed because there is a more complete value that better represents the entity. For instance, suppose “Patrick Smith” and “Patricia Smith” both have an aka of “P Smith”. The best name match is between Patrick and Patricia, not between the shorter P Smith values! Please note this happens more often than you might think on actual data sets!
Hint: Note that this help indicates you can ask why for any entities at any time; you don’t have to perform an audit first. The viewer can be used whenever a user wants to know why records matched or didn’t match, which makes it a nice back-end tool for more knowledgeable support staff to investigate users’ requests to understand why a match was or wasn’t made, and to help tune the engine if need be.
Reviewing the missed matches
This is what you came for! Why didn’t Senzing make the matches I expected it to?
So next, type “auditSummary split” and then “auditSummary split 1” to view the matches you did not get. You should see the following screens …
Here is the entity we changed to have the same cluster_id. Let’s ask the system why by typing “W” at the prompt. You should see the following screen …
At first glance, you might wonder why these records did not resolve. It’s all green! But look at the relationship row highlighted above: they only share a name and date of birth, and the name is only close, not exact.
However, the match was not missed! Senzing just classified it as a relationship. You can always ask Senzing to show you the possible matches to any entity whenever you like. We strive to put records together only when we are sure, but we make the possible matches, the ones that fell just short, available if the mission or the user wants to loosen the criteria to make the right decision.
There will likely be matches you wish were made but were not, as well as matches you didn’t think would match but did. Perfection is not easy to attain, especially with made-up data.
This is where the precision, recall, and F1 scores come in: they show how far from the expected result you are. Having these in the 90s is what you want to see.
But rest assured, if you are not getting good scores in your audit even after reviewing the why results, something can be done about it! The engine can be tuned: thresholds can be changed, additional keys can be created, rules can be added. If this is the case for you, please contact us and we will help you through it.
Additional reports and ad hoc research
Remember you can type “help” at any time in the poc_viewer to see all the things you can do.
The poc_snapshot utility also computes 3 additional reports, which are described in more detail here … Understanding the POC Snapshot Statistics
All of these reports follow the same paradigm: start with high-level stats, then drill into actual examples that you can browse with previous and next, asking why whenever you want to know more.
And always remember, you can access the poc_viewer at any time without ever taking a snapshot or performing an audit. Here are a few tips ...
- search joe smith (where "joe smith" is the name of an entity you want to lookup)
- get 123 (where 123 is one of the entity_ids returned by the search)
- why 123 (if entity 123 consists of multiple records and you want to know why they resolved)
- compare 123,145 (where 123 and 145 are two entity_ids you want to compare)
- why 123,145 (where 123 and 145 are two entity_ids you want to see why they did not resolve)
- Be sure to type "help why" to understand what the colors and symbols mean.
- Use "scroll" immediately after any table that is cut off, as screen wrapping has been turned off. This will allow you to see the entire table and pan left and right, up and down.
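Putting a few of these together, a quick ad hoc session might look like the following sketch (the entity IDs are illustrative):

```
(poc) search joe smith
(poc) get 123
(poc) why 123
(poc) compare 123,145
(poc) why 123,145
```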