Exploratory Data Analysis 2 - Basic exploration – Senzing®

Please follow the instructions below and/or watch this Video tutorial

For the next step ...

If you installed directly on Linux, navigate to the <your project>/g2/python directory.
If you installed the senzing-up Docker image, execute the <install directory>/docker-bin/senzing-console.sh shell script. From there, navigate to /opt/senzing/g2/python.

If you are not sure where either of these is, please review Exploratory Data Analysis 1 - Loading the truth set demo.

From the python directory type ...

./G2Explorer.py

Then at the (g2) prompt, type help. Your screen should look like this ...

From here you can ...

Search for entities
Compare them
Get their records
Ask why they merged or did not merge

Tip

Bear in mind that these are all API calls that you can access directly from your own software to offer this kind of functionality.
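
As a rough illustration of that point, the sketch below wraps a search call the way your own Python code might. It assumes the G2Engine object from the Senzing Python SDK, whose searchByAttributes method takes a JSON string and fills a bytearray with the JSON response. The function name search_entities is hypothetical, and engine initialization is omitted because it depends on your installation.

```python
import json


def search_entities(engine, attributes):
    """Search for entities matching a dict of attributes.

    `engine` is assumed to be an initialized G2Engine from the Senzing
    Python SDK (or any object with a compatible searchByAttributes method).
    """
    response = bytearray()  # the SDK writes the JSON response into this buffer
    engine.searchByAttributes(json.dumps(attributes), response)
    return json.loads(response.decode("utf-8"))
```

Calling search_entities(engine, {"NAME_FULL": "robert smith"}) would be the programmatic counterpart of the name search in the next section. Check the SDK documentation for your Senzing version before relying on the exact signature.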

Using search

At the (g2) prompt, type ...

search robert smith

Your screen should look like this ...

This tells you there are 3 entities that satisfy the search criteria ...

Entity #1 is a Robert Smith with 4 customer records.
Entity #1003 is a Robert Smith who is on the watch list.
Entity #5 is a Rob Smith Sr who is both a customer and on the watch list.

The match key column tells you which attribute(s) matched (+NAME) and which internal rule was hit. The match score ranks the returned entities so that the best matches are on top.

Info

Don't worry too much about which internal rule was hit for now. The most important things are the +NAME match key and the score. However, here is a link to an article that describes the rules: Principle-Based Entity Resolution

Using compare

At the (g2) prompt, type compare with the list of entity IDs you want to compare like so ...

compare 1, 1003, 5

Your screen should look like this ...

This will place the top 3 search records side by side so that you can see all their attributes. This makes it easy to glean a couple of facts ...

It looks like the two customers live at the same address. The DOBs (dates of birth) being 24 years apart suggests we might be looking at a father and son.
The watch list entity in the middle is not related to either customer.

Tip

Depending on your screen resolution, you may need to scroll left or right. If at any time the displayed table appears to be cut off, simply type "scroll" at the (g2) prompt to enable the arrow keys. This opens the table in the Linux "less" command, and all of its functionality applies. Type q to quit "less" and return control to G2Explorer.

Using get

Let's say you are most interested in entity #1. At the (g2) prompt type ...

get detail 1

The first column shows you which data source and record IDs were resolved to this entity as well as the match_key and rule that fired when the record was loaded.

The second column shows the data on each record that was used for resolution. This is the identifying data such as name, date of birth, address, identifiers, etc.

The third column shows all the other data for each record. From this you can see that Robert has one active and three inactive records. The earliest record is from 2015, the latest from 2018. If you add the amounts together, you might conclude that he has been worth $1000. One might also wonder why he keeps going inactive and then signing up as a new customer the next year, changing his identifying information each time. Could it be that his father and his spouse keep getting flagged and put on the watch list?

Tip

When mapping records into Senzing, remember that you are mapping them to help you make a decision, so include some key dates, statuses or amounts that will help you decide what to do when a match is made.

Using why and why not

Using why

Let's say you are a bit confused as to why Senzing matched those 4 customer records to entity #1 above. After all, customer record 1004 shares only a weak name match with customer 1001! At the (g2) prompt type ...

why 1

Your screen should look like this ...

The purpose of this screen is to explain why each record is in the entity. Remember we are doing entity resolution, not record matching. Following the numbered circles on the screen ...

The why result re-computes the match key. Customer 1001 shares a name, dob, phone and email with the rest of the records in the entity.
There are exactly [2] entities with the name Robert Smith. The name it matches best is Bob Smith, which happens to be on the Customer 1002 record to the right. The surname scored 100 and the given name scored 95, yielding a full name score of 97.
Customer 1001 has the same DOB as customer 1003. However, customer 1002's DOB (to the right) has the month and day swapped relative to the rest of the entity and scored a 95. It is still green and part of the match key because 95 is considered close.
Customer 1001's address is definitely different from the rest of the entity. However, it's colored yellow because, while different, it does not detract from the match. People do move.
The remaining _key rows show the internal keys we generate to find candidate matches. They are here mostly to help explain why a match was missed. For instance, if none of the keys are colored cyan, the records simply didn't find each other.
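
The swapped month/day case in the DOB row can be pictured with a small check. This is only an illustration of what "swapped" means, not Senzing's scoring logic. The function name and the sample dates are made up for the example.

```python
from datetime import date


def is_month_day_swap(first, second):
    """Return True when two dates differ only by month and day being transposed."""
    return (
        first != second
        and first.year == second.year
        and first.month == second.day
        and first.day == second.month
    )


# Hypothetical dates, not taken from the truth set
print(is_month_day_swap(date(1978, 12, 11), date(1978, 11, 12)))  # True
print(is_month_day_swap(date(1978, 12, 11), date(1954, 3, 31)))   # False
```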

At first glance, a why result looks pretty daunting! But you will get used to it as you look at your own real world examples. Here is a quick synopsis of what you are looking for ...

Green signifies that a value met the minimum scoring thresholds and red that it didn't. Yellow didn't either, but it matters less.
The entity counts in brackets, especially on the _key fields at the bottom, help prove whether the records were even able to find each other. They will be colored cyan if they matched.
Values that are dimmed matched, but were discounted because ...
- They are considered "generic" because too many entities use them (see the entity count in brackets)
- There was a more complete value to match (if Adam Smith and Andy Smith both have an aka of A Smith, the A Smith match will be discounted).
You can type "help why" at the (g2) prompt any time you need a refresher on what the colors and symbols mean.
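
The idea of a "generic" value can be sketched as a simple count check. This is a simplified picture, not Senzing's implementation: the function name is hypothetical and the threshold is a placeholder you supply, since the real thresholds are part of the configuration.

```python
def is_generic(entity_count, generic_threshold):
    """Treat a value as generic when more entities share it than the threshold allows."""
    return entity_count > generic_threshold


# Placeholder threshold chosen only for this example
print(is_generic(entity_count=2, generic_threshold=50))    # False
print(is_generic(entity_count=500, generic_threshold=50))  # True
```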

Senzing is a completely configurable system: scoring and generic thresholds can be adjusted, rules can be added, and additional keys can be created. The why screen helps determine which, if any, of these things should be done.

Using why not

Let's say you are wondering why entity 1 and entity 5 did not come together. At the (g2) prompt type ...

why 1, 5

Your screen should look like this ...

Note: be sure to type "scroll" as this will enable the arrow keys.

The purpose of this screen is to explain why the two entities did not resolve to each other. Following the numbered circles on the screen ...

When entities don't score high enough to resolve, they are often related. This shows that they are related by name and address.
The why result usually agrees with the relationship above it.
The DOB row shows the two dates of birth that were compared and that they scored 58, which is definitely considered different.
As a point of interest, the entities were able to find each other on the cyan-colored name keys.

Searching for more than just name

Sometimes name alone is not enough to find an entity. For instance, if there are 20 Robert Smiths, you would have to go into each one to see if it is the Robert Smith you are looking for. And if there are hundreds of them, you might even get this message ...

Searching ...
 No matches found or there were simply too many to return
 Please include additional search parameters if you feel this entity is in the database

In either case, try searching for more than just name. Switch to the JSON format as defined in the Generic Entity Specification and, at the (g2) prompt, type ...

search {"NAME_FULL": "robert smith", "ADDR_FULL": "123 Main Street, Las Vegas NV", "DATE_OF_BIRTH": "3/31/54"}

Your screen should look like this ...

The Rob Smith Sr record is now on top. The match key shows that it has a matching name, dob and address.
The entity previously on top is now second. The match key shows that it has a matching name and address, but the dob is different.
Finally, the watch-list-only entry is on the bottom as it only matches the name.

Also notice that the match score column adds the scores of your search attributes together. Three attributes were searched for, so the highest possible match score is 300.
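
The arithmetic behind that maximum is straightforward. The sketch below assumes each searched attribute contributes at most 100 points, which is what the 300 figure implies. The per-attribute scores are invented for illustration.

```python
def max_match_score(attribute_count, per_attribute_max=100):
    """Highest possible total when every searched attribute scores the maximum."""
    return attribute_count * per_attribute_max


def total_match_score(attribute_scores):
    """Add the individual attribute scores into one match score."""
    return sum(attribute_scores.values())


# Invented scores for a name + address + date of birth search
scores = {"NAME": 97, "ADDRESS": 100, "DOB": 95}
print(max_match_score(len(scores)))  # 300
print(total_match_score(scores))     # 292
```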

You can search for any attribute you have loaded. Remember we have a JSON in and JSON out format that applies to searching as well as loading. If your search parameter is not valid JSON, G2Explorer assumes you are searching by name.
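
The fallback described above can be sketched in a few lines. This is not G2Explorer's actual code: parse_search_argument is a hypothetical helper, and mapping a plain-text search to NAME_FULL is an assumption based on the JSON example earlier in this section.

```python
import json


def parse_search_argument(argument):
    """Turn the text typed after `search` into a dict of search attributes."""
    try:
        parsed = json.loads(argument)
    except json.JSONDecodeError:
        parsed = None
    if isinstance(parsed, dict):
        return parsed  # valid JSON object: use the attributes as given
    # Anything else is treated as a name search (assumption: NAME_FULL)
    return {"NAME_FULL": argument.strip()}


print(parse_search_argument("robert smith"))
# {'NAME_FULL': 'robert smith'}
print(parse_search_argument('{"NAME_FULL": "robert smith", "DATE_OF_BIRTH": "3/31/54"}'))
# {'NAME_FULL': 'robert smith', 'DATE_OF_BIRTH': '3/31/54'}
```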

Interesting searches in the truth set

Search for any attribute

Try this search at the (g2) prompt ...

search {"email_address": "Kusha123@hmail.com"}

Your screen should look like this ...

It looks like the whole family uses this email address. Two of the members have been flagged, but the other two have not.

Support for other character sets

Try this search at the (g2) prompt ...

search 张秀英

Your screen should look like this ...

Next ask why to see the scores ...

why 61

Your screen should look like this ...

Click here to learn about Senzing's language and globalization support

You are now ready to continue on to Exploratory Data Analysis 3 - Taking a snapshot
