In our rapidly changing economy, real-time data analysis is crucial, and dependable data insights are essential to a business’s survival. Entity resolution (ER) software provides a critical advantage in generating real-time dependable data insights.
Senzing is entity resolution.
Getting to Insight Faster
Credible and efficient exploratory data analysis depends on accurate resolution and views of personal, organizational, and entity-relationship linkage. Senzing entity resolution is designed to natively handle dirty data sets to produce high-quality matching and deduplication, thereby reducing any data-cleansing overhead. Senzing automatically reduces the noise across data sets and finds non-obvious relationships.
CEO Jeff Jonas demonstrates Senzing on extremely messy data:
Minimal Data Preparation
Senzing is designed to be tolerant of messy and structurally inconsistent data. Data is mapped and ingested as-is from the source systems, and Senzing takes care of the rest. For example, inconsistent name and address standardization and varying DOB formats are handled natively by Senzing.
Senzing delivers significant advantages through its data standardization and parsing routines, which compare similar data attributes even if those attributes are structurally inconsistent within source systems. Beyond ingesting and storing data, Senzing considers and supports domain, cultural, and cross-script differences. Senzing uses UTF-8, which allows most of the world's languages to be captured properly.
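To make "structurally inconsistent" concrete, here is a minimal Python sketch of the kind of problem that standardization solves for a date of birth. It is not Senzing's implementation: the format list, the `normalize_dob` helper, and the sample values are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical list of formats to try; not Senzing's actual parsing rules.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def normalize_dob(raw):
    """Return an ISO date string, or None if no candidate format fits."""
    text = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Made-up sample values: the same date written four different ways.
samples = ["1978-12-11", "12/11/1978", "11 Dec 1978", "December 11, 1978"]
normalized = {normalize_dob(s) for s in samples}
print(normalized)
```

With the app, none of this preprocessing is required; the sketch only shows why records that look different on the surface can still carry the same value.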
Relationship Awareness
Senzing not only resolves entities, but also identifies, maintains, and manages multiple types of relationships between entities:
- Disclosed logical relationships are based on the provided information. For example, a guarantor on a credit application or an emergency contact on an employment application.
- Derived relationships are based on logical information. For example, family members sharing an email address.
- Possible matches are created when there is not enough information yet to establish match certainty.
- Ambiguous matches are an innovative type of possible match unique to Senzing, where records have more than one matching entity, but there isn’t enough information yet to establish certainty. For example, a record containing a name and address for Pat Jones when there is a Patrick Jones and a Patricia Jones at the same address. Patrick or Patricia could be the right answer, but picking one would be arbitrary.
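The Pat Jones example can be sketched as a toy decision rule in Python. This is an illustration only, not how Senzing scores candidates: the `is_candidate` rule (same address, same last name, first name treated as a prefix) and the sample address are assumptions.

```python
def is_candidate(record, entity):
    """Toy rule (an assumption): same address, same last name, compatible first name."""
    same_address = record["address"].lower() == entity["address"].lower()
    same_last = record["last"].lower() == entity["last"].lower()
    first_fits = entity["first"].lower().startswith(record["first"].lower())
    return same_address and same_last and first_fits


def classify(record, entities):
    """Return a status and the list of candidate entities."""
    candidates = [e for e in entities if is_candidate(record, e)]
    if len(candidates) == 1:
        return "match", candidates
    if len(candidates) > 1:
        # More than one entity fits equally well, so picking one would be arbitrary.
        return "ambiguous", candidates
    return "new", []


# Made-up sample data mirroring the Pat Jones example.
entities = [
    {"id": 1, "first": "Patrick", "last": "Jones", "address": "123 Main St"},
    {"id": 2, "first": "Patricia", "last": "Jones", "address": "123 Main St"},
]
record = {"first": "Pat", "last": "Jones", "address": "123 Main St"}

status, candidates = classify(record, entities)
print(status, [c["id"] for c in candidates])
```

The point of the sketch is the third outcome: the record stays attached to neither entity until more information arrives.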
Click here to learn more about the invisible false-positive problem caused by the misattribution of records to entities when they should remain ambiguous to avoid real-life harm.
What Senzing does to Aid Exploratory Data Analysis (EDA)
Senzing comes in two product flavors: the desktop app, which is a compact demo experience that contains our engine, and the deployable API. For the purpose of this educational document, we introduce the desktop app as a means of showing how dirty data is handled.
The API experience is far richer, and as of version 2.0 ships with the Senzing Exploratory Data Analysis (EDA) tools in the /python directory that help you explore your data:
- G2Explorer.py searches and displays entities to see how and why they are resolved and related to each other.
- G2Snapshot.py calculates reports that can be displayed in the G2Explorer ...
  - Data Source Summary: tells you how many duplicates you have.
  - Cross Source Summary: tells you how many records in one data source are also in another data source.
  - Entity Size Breakdown: tells you who your largest entities are and whether or not they need to be reviewed.
- G2Audit.py compares entity resolution results between Senzing and other entity resolution engines or even between runs of the same data in Senzing as you tune it to your preferences.
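To give a feel for what these reports measure, here is a rough Python sketch that computes similar summaries from a small in-memory list. It does not use or reproduce the G2Snapshot.py output format. The `resolved` rows, the data source names, and the three helper functions are hypothetical, and the cross-source figure counts shared entities, not records.

```python
from collections import Counter, defaultdict

# Hypothetical resolved output: (data_source, record_id, entity_id).
resolved = [
    ("CUSTOMERS", "1001", 1),
    ("CUSTOMERS", "1002", 1),
    ("CUSTOMERS", "1003", 2),
    ("WATCHLIST", "W-7", 1),
    ("WATCHLIST", "W-8", 3),
]


def data_source_summary(rows):
    """Per source, count records beyond the first that landed on the same entity."""
    per_source_entity = Counter((src, ent) for src, _, ent in rows)
    duplicates = Counter()
    for (src, _), n in per_source_entity.items():
        if n > 1:
            duplicates[src] += n - 1
    return dict(duplicates)


def cross_source_summary(rows, source_a, source_b):
    """Count entities that contain records from both sources."""
    sources_by_entity = defaultdict(set)
    for src, _, ent in rows:
        sources_by_entity[ent].add(src)
    return sum(1 for s in sources_by_entity.values() if {source_a, source_b} <= s)


def entity_size_breakdown(rows):
    """Map entity size (number of records) to the number of entities of that size."""
    sizes = Counter(ent for _, _, ent in rows)
    return dict(Counter(sizes.values()))


print(data_source_summary(resolved))
print(cross_source_summary(resolved, "CUSTOMERS", "WATCHLIST"))
print(entity_size_breakdown(resolved))
```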
Click here to learn more about the API product experience and the Senzing Exploratory Data Analysis (EDA) tools.
Seeing our Entity Resolution for Yourself: The Senzing Desktop App
PRIVACY AND SECURITY: Senzing is not a cloud computing provider. No personal data ever flows to Senzing. You download Senzing and deploy it locally, on-premise, or in the cloud.
Here we introduce to you the desktop app as a means of showing the handling of dirty data.
The desktop app does not yet support the same level of data exploration tools offered with the API.
Activity | Action | What to look for |
---|---|---|
Double-check that your computer meets the required specification | Windows: On your keyboard, press the Windows logo key and the Pause/Break key at the same time to open the System window. Mac | Supported Operating Systems. Machine specifications: recommended 4 cores, 16GB RAM, 250GB flash storage (solid-state drive); minimum 2 cores, 8GB RAM, 100GB flash storage |
Get the Senzing Desktop App | Navigate via browser to www.senzing.com and click the download button. | A download should begin for the app. It is a large application, so depending on your bandwidth this may take a bit of time. |
Install the app on your machine | | Installing applications on a Mac: Find Senzing in your downloads folder, click open, and drag the DMG to the applications folder. Installing applications on a PC: Find Senzing in your downloads folder, click to open the .exe, and follow the install instructions. |
Open the app | Locate the program and open it | The Senzing icon in your application menu |
Exploring Senzing Entity Resolution via our Synthetic Truth Set
Create your first project
VIDEO: https://youtu.be/uT2T7fcqMBA
Goal: To show you Senzing’s ability to work with data as it appears in records and match, find duplicates and link relationships.
We will walk through setting up your first project and then direct you to specific entities so you can see what caused their records to match, become possible matches, or become relationships.
- Download the truth set [Here]
- Click to add the data source
- Select the downloaded truth set
- Map the data set:
Recommended Mapping:
The app can pick up the correct mapping based on the column header name, and we recommend you verify and adjust as needed.
Include | Column | Mapped to | Feature | Label | Notes |
---|---|---|---|---|---|
✓ | Set Reference | None/Included | None/Included | | Include columns with data that you will want present in your resolved data set for other data exploration and analysis work. They don’t need to be mapped. |
✓ | Set Cluster ID | None/Included | None/Included | | |
✓ | Notes | None/Included | None/Included | | |
✓ | RECORD_TYPE | None/Included | None/Included | | |
✓ | RECORD_ID | RECORD_ID | RECORD_ID | | |
✓ | PRIMARY_NAME_LAST | NAME_LAST | NAME | | |
✓ | PRIMARY_NAME_FIRST | NAME_FIRST | NAME | | |
✓ | PRIMARY_NAME_MIDDLE | NAME_MIDDLE | NAME | | |
No data to include / uncheck | PRIMARY_NAME_PREFIX | | | | Useful for entity resolution if data is present. |
✓ | PRIMARY_NAME_SUFFIX | NAME_SUFFIX | NAME | | |
No data to include / uncheck | GENDER | | | | Useful for entity resolution if data is present. |
✓ | DATE_OF_BIRTH | DATE_OF_BIRTH | DOB | | |
✓ | DRIVERS_LICENSE_NUMBER | DRIVERS_LICENSE_NUMBER | DRLIC | | |
✓ | DRIVERS_LICENSE_STATE | DRIVERS_LICENSE_STATE | DRLIC | | |
✓ | SSN_NUMBER | SSN_NUMBER | SSN | | |
No data to include / uncheck | NATIONAL_ID_NUMBER | | | | Useful for entity resolution if data is present. |
No data to include / uncheck | NATIONAL_ID_COUNTRY | | | | Useful for entity resolution if data is present. |
✓ | HOME_ADDRESS_FULL | ADDR_FULL | ADDRESS | | |
No data to include / uncheck | MAIL_ADDRESS_FULL | ADDR_FULL | ADDRESS | | Useful for entity resolution if data is present. |
✓ | HOME_PHONE_NUMBER | PHONE_NUMBER | PHONE | HOME | Labels are used to keep features of the same type distinct. |
✓ | CELL_PHONE_NUMBER | PHONE_NUMBER | PHONE | MOBILE | Labels are used to keep features of the same type distinct. |
✓ | EMAIL_ADDRESS | EMAIL_ADDRESS | | | |
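If you want to reason about the mapping outside the app, it can be written down as a small lookup table. The Python sketch below covers only a subset of the columns, and it is not how the app applies the mapping internally. Treating a label as a prefix on the attribute name is an assumption made here to keep the two phone columns distinct, and the sample row is made up.

```python
import csv
import io

# source column -> (mapped attribute, label); a subset of the recommended mapping.
COLUMN_MAPPING = {
    "RECORD_ID": ("RECORD_ID", None),
    "PRIMARY_NAME_LAST": ("NAME_LAST", None),
    "PRIMARY_NAME_FIRST": ("NAME_FIRST", None),
    "DATE_OF_BIRTH": ("DATE_OF_BIRTH", None),
    "HOME_PHONE_NUMBER": ("PHONE_NUMBER", "HOME"),
    "CELL_PHONE_NUMBER": ("PHONE_NUMBER", "MOBILE"),
}


def map_row(row):
    """Rename mapped columns and drop empty values."""
    mapped = {}
    for column, (attribute, label) in COLUMN_MAPPING.items():
        value = (row.get(column) or "").strip()
        if not value:
            continue
        # Assumption: the label is applied as a prefix so same-type features stay distinct.
        key = f"{label}_{attribute}" if label else attribute
        mapped[key] = value
    return mapped


# Made-up sample row; the real truth set has more columns.
sample = (
    "RECORD_ID,PRIMARY_NAME_LAST,PRIMARY_NAME_FIRST,DATE_OF_BIRTH,"
    "HOME_PHONE_NUMBER,CELL_PHONE_NUMBER\n"
    "1001,Smith,Robert,12/11/1978,555-0100,\n"
)
rows = [map_row(r) for r in csv.DictReader(io.StringIO(sample))]
print(rows[0])
```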
Exploring the Results
Viewing Results
Click ‘REVIEW RESULTS’ on the data tile.
Duplicates and Matches
In our example project, we have only one data set. If there were multiple data sets, the identification of duplicates would extend to identifying matches across all data sets in the project.
Review Entity # 1:
Match Key:
- Take a look at the match keys. They describe which attributes Senzing used to consolidate the records on the entity.
Name data:
- Notice that Senzing has been able to consolidate the entity across the name permutations.
Attribute Data:
- Notice that despite what appears to be a transposition variation in the date of birth, Senzing still matched the records.
Click the entity ID for Entity 1:
- This will show you the Entity Resume.
The entity resume gives you a full summary of matched records, possible matches, possible relationships, and disclosed relationships.
Scroll down to look at possible matches:
Questions:
- Can you determine what match key was not used that caused this entity to be held separate as a possible relationship?
- What other attributes might contribute to needing to hold this entity out from the Robert Smith Entity?
ANSWERS:
- Date of birth didn’t match
- There is a ‘Sr’ suffix
Review Entity # 9:
Name data:
- Notice that Senzing has been able to consolidate the entity across the name permutations, as well as match up records where the name was not at all similar.
Attribute Data:
- Notice that despite what appears to be a transposition variation and date format variations, Senzing was able to discern they were all the same date.
Address Data:
- Take a look through the addresses on Entity 9’s records. You will notice variations in the addresses themselves and the presence of more than one address.
Questions:
- What attributes did Senzing use to merge Jedi Knight 1 into this entity?
- Which match key or keys allowed Beeny Kashu to merge along with Kusha Edward on this entity?
ANSWER:
- +DOB+ADDRESS+SSN
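A match key like `+DOB+ADDRESS+SSN` is easy to pull apart if you export results and want to count which features drive your matches. The Python sketch below is an informal reading of the string, not an official parser. The two helper names are hypothetical, and the handling of a `-` prefix (for a feature that disagreed) is an assumption, since only `+` keys appear in this walkthrough.

```python
import re


def parse_match_key(match_key):
    """Split a match key such as '+DOB+ADDRESS+SSN' into (sign, feature) pairs."""
    return re.findall(r"([+-])([A-Z_]+)", match_key)


def confirming_features(match_key):
    """Return only the features prefixed with '+'."""
    return [name for sign, name in parse_match_key(match_key) if sign == "+"]


features = confirming_features("+DOB+ADDRESS+SSN")
print(features)
```

Read this way, the answer above says that date of birth, address, and SSN agreed, and that name is absent from the key.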
Using Your Own Data
The Senzing desktop app consumes CSV data. If you would like to use your own data, extracted from data sources within your company or institution, you will need to locate a minimum of three of the following features and attributes. Note that the more features and attributes you can provide, the higher the quality of the results.
- First, last, or full name
- Date of birth
- Full address or the fields that comprise a full address
- Email address
- Phone number
- SSN (full or last 4 digits)
- Organization name
- Record ID
- Unique identifier
- More…
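Before loading a file, a quick check of the CSV header can tell you whether you have at least three of these features. The Python sketch below is one way to do that under stated assumptions: the `FEATURE_BY_COLUMN` lookup is hypothetical and covers only a few column names, so extend it to match your own headers.

```python
import csv
import io

# Hypothetical header-to-feature lookup; extend it to match your own column names.
FEATURE_BY_COLUMN = {
    "NAME_FIRST": "NAME",
    "NAME_LAST": "NAME",
    "NAME_FULL": "NAME",
    "DATE_OF_BIRTH": "DOB",
    "ADDR_FULL": "ADDRESS",
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "SSN_NUMBER": "SSN",
    "RECORD_ID": "RECORD_ID",
}


def features_in_header(csv_text):
    """Return the set of distinct features found in the first CSV row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    names = (column.strip().upper() for column in header)
    return {FEATURE_BY_COLUMN[name] for name in names if name in FEATURE_BY_COLUMN}


def has_minimum_features(csv_text, minimum=3):
    """True when the header covers at least `minimum` distinct features."""
    return len(features_in_header(csv_text)) >= minimum


print(has_minimum_features("NAME_FIRST,NAME_LAST,DATE_OF_BIRTH\n"))
print(has_minimum_features("NAME_FULL,DATE_OF_BIRTH,EMAIL_ADDRESS\n"))
```

Note that first and last name count as one feature in this sketch, so a file with only name parts and a date of birth falls short of three.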
This article provides the latest specification for presenting entities to the Senzing engine from your data sources. The CSV level information applies to the app as well as the API version of the product.
This document includes the data dictionary for inputting entities and sample CSV and JSON formats for persons and companies.
Mapping your Data in the App
Review the mapping tutorial and video here.
The Senzing Desktop App comes with the ability to auto map specific data sources, learn more here.
Deep Dive: How Senzing Works (Click Here)
Useful External Data Sets
Industry | Question Examples | Data Set Name | Data Set Location | Cost | Details |
---|---|---|---|---|---|
Health Care | | US HHS OIG | https://oig.hhs.gov/exclusions/exclusions_list.asp Layout: https://oig.hhs.gov/exclusions/files/leie_record_layout.pdf | Public/Free | All excluded entities and individuals convicted of Medicaid and Medicare fraud. It contains both people and business entities and their national 10-digit NPI number. |
Health Care | | CMS | | Public/Free | File 1: all 6 million USA medical entities with an assigned 10-digit NPI. File 2: all NPI other names only. |
Finance, Insurance, More | | OFAC | https://www.treasury.gov/ofac/downloads | | OFAC SDN List, no relationships, includes SSNs on some records |
Data Preparation & Exploratory Data Analysis using the Senzing API
Coming soon…