Exploratory Data Analysis Overview (EDA Tools)

So you've just loaded a bunch of data. How many duplicate customers do you have?

Were any of them on a watch list?

What is that ambiguous match?

How does Senzing compare with other matching engines on a "truth" set?

As of version 2.0, Senzing ships with three Python scripts that help you explore your data:

G2Explorer.py searches and displays entities to see how and why they are resolved and related to each other.
G2Snapshot.py calculates reports that can be displayed in G2Explorer.py:
- Data Source Summary - tells you how many duplicates you have.
- Cross Source Summary - tells you how many records in one data source are also in another data source.
- Entity Size Breakdown - tells you which entities are your largest and whether they need to be reviewed.
G2Audit.py compares entity resolution results between Senzing and other entity resolution engines, or between runs of the same data in Senzing as you tune it to your preferences.
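
To make the three snapshot reports more concrete, here is a minimal sketch of the kind of counts they describe. It is not how G2Snapshot.py is implemented and it does not call any Senzing API. It assumes a hypothetical list of (data source, record ID, entity ID) rows, and the function names and sample rows are invented for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical resolved rows: (data_source, record_id, entity_id).
# Invented sample data, not real Senzing output.
resolved = [
    ("CUSTOMERS", "1001", 1),
    ("CUSTOMERS", "1002", 1),
    ("CUSTOMERS", "1003", 2),
    ("WATCHLIST", "W-1", 1),
    ("WATCHLIST", "W-2", 3),
]


def data_source_summary(rows):
    """Per data source: records, distinct entities, and duplicates (records minus entities)."""
    records = Counter()
    entities = defaultdict(set)
    for source, _record_id, entity_id in rows:
        records[source] += 1
        entities[source].add(entity_id)
    return {
        source: {
            "records": records[source],
            "entities": len(entities[source]),
            "duplicates": records[source] - len(entities[source]),
        }
        for source in records
    }


def cross_source_summary(rows, source_a, source_b):
    """Count records in source_a whose entity also holds a record from source_b."""
    entities_b = {entity_id for source, _rid, entity_id in rows if source == source_b}
    return sum(
        1 for source, _rid, entity_id in rows
        if source == source_a and entity_id in entities_b
    )


def entity_size_breakdown(rows):
    """Map each entity size (number of records) to how many entities have that size."""
    records_per_entity = Counter(entity_id for _source, _rid, entity_id in rows)
    return dict(Counter(records_per_entity.values()))


summary = data_source_summary(resolved)
matched = cross_source_summary(resolved, "CUSTOMERS", "WATCHLIST")
sizes = entity_size_breakdown(resolved)
```

In this sample, the CUSTOMERS source has three records that resolve to two entities, so it has one duplicate, and two customer records share an entity with a watch list record.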

These were known as the POC (Proof of Concept) utilities in prior Senzing versions. We renamed them to EDA (Exploratory Data Analysis) tools because, while they certainly help with POCs, they remain useful long afterward to help ensure your system is performing as expected.

Please continue to the next article in the series to load the demonstration truth set and see these tools in action: Exploratory Data Analysis 1 - Loading the truth set demo
