Prior References

Basic Terms

The terms resolve, duplicate and match are collectively used to refer to records Senzing has determined are the same entity. When we say two records resolve we mean they refer to the same entity. Within the same data source, this could be called a duplicate. Across data sources this is called a match.

The engine also keeps track of the possible matches and relationships found while looking for matches that refer to the same entity. Records are grouped into entities, and entities can be related at one of the following match levels:

Ambiguous Match – An ambiguous match occurs when an entity could in fact resolve to more than one entity that cannot be resolved to each other. For instance, we may have a Patrick Smith and a Patricia Smith at the same address. If we then got a Pat Smith at that same address, it could be either Patrick or Patricia. Since we can’t be sure which, it is held apart as an ambiguous match to either.
Possible Match – A possible match occurs when two entities share high-strength attributes, such as identifiers, yet still cannot be resolved together due to other differences. For instance, if the Patrick and Patricia records above had the same driver's license number, they would also be a possible match to each other. If they were married, one person's ID might have been mistakenly put on the other's account.
Possibly Related – A possibly related match occurs when two entities share only lower-strength attributes, such as addresses and phone numbers. For instance, the Patrick and Patricia records above would be possibly related because they share the same address (provided they did not also share an ID number).
Disclosed Relationship – A disclosed relationship occurs when two entities are explicitly known to be related such as by marriage, being joint account holders, etc.

Report Definitions

G2Snapshot.py computes the statistics for the following reports:

dataSourceSummary – a report that shows how many duplicates were detected within each data source, as well as the possible matches and relationships that were derived. For example, how many duplicate customers do I have, and are any of them related to each other?
crossSourceSummary – a report that shows how many matches were made across data sources. For example, how many employees are related to customers.
entitySizeBreakdown – a report that shows how many entities of what size were created. For instance, some entities are singletons, some might have connected 2 records, some 3, etc. This report is primarily used to ensure there are no instances of over matching. For instance, it’s ok for an entity to have hundreds of records as long as there are only a few name or address variations across them.

Database Level Stats

At the top level in the JSON file, there are database level stats for all data sources that were loaded:

TOTAL_RECORD_COUNT – the count of records across all data sources that were presented to the engine for resolution
TOTAL_ENTITY_COUNT – the count of distinct entities these records were resolved into
TOTAL_COMPRESSION – 1 - (TOTAL_ENTITY_COUNT / TOTAL_RECORD_COUNT) to yield a percentage of matches made. These only include matches that resolved as the same entity. This does not include ambiguous, possible or any other relationships made
TOTAL_AMBIGUOUS_MATCHES – the count of relationships between entities at the ambiguous match level
TOTAL_POSSIBLE_MATCHES – the count of relationships between entities at the possible match level
TOTAL_POSSIBLY_RELATEDS – the count of relationships between entities at the possibly related level
TOTAL_DISCLOSED_RELATIONSHIPS – the count of relationships between entities at the disclosed relationship level
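
As a worked example of the TOTAL_COMPRESSION formula above, the following minimal sketch computes compression from an entity count and a record count. The function name and the sample numbers are hypothetical, and the result is shown as a fraction formatted as a percentage; how the snapshot file itself stores the value is not assumed here.

```python
def compression(entity_count: int, record_count: int) -> float:
    """Return 1 - (entity_count / record_count) as a fraction between 0 and 1."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return 1 - (entity_count / record_count)


# Hypothetical numbers, not taken from a real snapshot:
# 1,000 records that resolved into 800 distinct entities.
total_compression = compression(entity_count=800, record_count=1000)
print(f"{total_compression:.1%}")  # prints 20.0%
```

A compression of 20% here means one in five records resolved into an entity that already held another record. Relationships such as possible matches do not change this number.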

Data Source Summary

Under the DATA_SOURCES section, all the above statistics and more are computed for each data source:

RECORD_COUNT – a count of records in this data source that were presented to the engine
ENTITY_COUNT – the count of distinct entities these records were resolved into
COMPRESSION – 1 - (ENTITY_COUNT / RECORD_COUNT) to yield a percentage of matches made. These only include matches that resolved as the same entity. It does not include ambiguous, possible or any other relationships made
SINGLE_COUNT – the count of entities that consist of a single record, also referred to as singletons. By its very nature, this is both a record count and an entity count
SINGLE_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
DUPLICATE_ENTITY_COUNT – the count of entities that contain more than one record for this data source
DUPLICATE_RECORD_COUNT – the count of records that are in duplicate entities. For instance, one duplicate entity may have 2 records, another 3 records and so on
DUPLICATE_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
AMBIGUOUS_MATCH_ENTITY_COUNT – the count of relationships between entities at the ambiguous match level
AMBIGUOUS_MATCH_RECORD_COUNT – the count of the records in those related entities
AMBIGUOUS_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the ambiguous match level
POSSIBLE_MATCH_ENTITY_COUNT – the count of relationships between entities at the possible match level
POSSIBLE_MATCH_RECORD_COUNT – the count of the records in those related entities
POSSIBLE_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the possible match level
POSSIBLY_RELATED_ENTITY_COUNT – the count of relationships between entities at the possibly related level
POSSIBLY_RELATED_RECORD_COUNT – the count of the records in those related entities
POSSIBLY_RELATED_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the possibly related level
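
To make the record and entity counts above concrete, here is a minimal sketch that derives them from a list of (data source, entity ID) pairs, one pair per record. This is not how G2Snapshot.py is implemented; the function name, the input shape, and the sample data are hypothetical. It also assumes that a singleton, from a data source's point of view, is an entity holding exactly one record from that data source.

```python
from collections import Counter, defaultdict


def summarize(records):
    """records: iterable of (data_source, entity_id) pairs, one per record."""
    # data_source -> entity_id -> number of records from that data source
    per_source = defaultdict(Counter)
    for data_source, entity_id in records:
        per_source[data_source][entity_id] += 1

    summary = {}
    for data_source, entity_sizes in per_source.items():
        record_count = sum(entity_sizes.values())
        entity_count = len(entity_sizes)
        # Entities holding more than one record from this data source.
        duplicate_sizes = [n for n in entity_sizes.values() if n > 1]
        summary[data_source] = {
            "RECORD_COUNT": record_count,
            "ENTITY_COUNT": entity_count,
            "COMPRESSION": 1 - entity_count / record_count,
            "SINGLE_COUNT": entity_count - len(duplicate_sizes),
            "DUPLICATE_ENTITY_COUNT": len(duplicate_sizes),
            "DUPLICATE_RECORD_COUNT": sum(duplicate_sizes),
        }
    return summary


# Hypothetical data: six CUSTOMERS records resolved into three entities.
sample = [
    ("CUSTOMERS", 1), ("CUSTOMERS", 1),                    # a duplicate of size 2
    ("CUSTOMERS", 2),                                      # a singleton
    ("CUSTOMERS", 3), ("CUSTOMERS", 3), ("CUSTOMERS", 3),  # a duplicate of size 3
]
print(summarize(sample)["CUSTOMERS"])
```

With this sample there is one singleton, two duplicate entities, and five records inside those duplicate entities, so SINGLE_COUNT plus DUPLICATE_RECORD_COUNT adds back up to RECORD_COUNT.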

Cross Match Summary

Under each data source is a CROSS_MATCH section, which contains a list of the other data sources that have matches to the parent data source. The following statistics are calculated for each of these other (child) data sources:

MATCH_ENTITY_COUNT – the count of entities that contain records in both the parent and child data sources
MATCH_RECORD_COUNT – the count of records from the parent data source only in those matched entities
MATCH_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
AMBIGUOUS_MATCH_ENTITY_COUNT – the count of relationships between entities at the ambiguous match level
AMBIGUOUS_MATCH_RECORD_COUNT – the count of the records from the parent data source only in those related entities
AMBIGUOUS_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the ambiguous match level
POSSIBLE_MATCH_ENTITY_COUNT – the count of relationships between entities at the possible match level
POSSIBLE_MATCH_RECORD_COUNT – the count of the records from the parent data source only in those related entities
POSSIBLE_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the possible match level
POSSIBLY_RELATED_ENTITY_COUNT – the count of relationships between entities at the possibly related level
POSSIBLY_RELATED_RECORD_COUNT – the count of the records from the parent data source only in those related entities
POSSIBLY_RELATED_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are related at the possibly related level

Each statistic is reported from the parent data source's point of view, and both sides are reported. For instance, the MATCH_ENTITY_COUNT under data source A (as the parent) for data source B (as the child) will be the same as the MATCH_ENTITY_COUNT under data source B (as the parent) for data source A.

However, the record counts at this level are only from the parent data source's point of view. For instance, if an entity contains two records from data source A and one record from data source B, the MATCH_RECORD_COUNT from data source A's point of view is two, while it is only one from data source B's point of view.
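
The parent and child points of view described above can be sketched as follows. This is an illustration of the counting rule only, not the G2Snapshot.py implementation; the function name, the input shape, and the data source names A and B are hypothetical.

```python
from collections import defaultdict


def cross_match(records, parent, child):
    """records: iterable of (data_source, entity_id) pairs, one per record."""
    # entity_id -> data_source -> number of records
    counts = defaultdict(lambda: defaultdict(int))
    for data_source, entity_id in records:
        counts[entity_id][data_source] += 1

    # Entities that contain records from both the parent and the child.
    matched = [c for c in counts.values() if c.get(parent) and c.get(child)]
    return {
        "MATCH_ENTITY_COUNT": len(matched),
        # Only the parent's records are counted.
        "MATCH_RECORD_COUNT": sum(c[parent] for c in matched),
    }


# Hypothetical data: entity 1 holds two records from A and one from B;
# entity 2 holds a single record from A and nothing from B.
records = [("A", 1), ("A", 1), ("B", 1), ("A", 2)]
print(cross_match(records, parent="A", child="B"))
print(cross_match(records, parent="B", child="A"))
```

Both calls report one matched entity, but the record count is two from A's point of view and one from B's, which mirrors the example in the text.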

Entity Size Breakdown

There is also an ENTITY_SIZE_BREAKDOWN section at the root level, which contains the following statistics computed for each entity size:

ENTITY_SIZE – the size of the entity these statistics are for. For instance, singletons have an entity size of one, whereas duplicates with only two records have an entity size of two, and so on
ENTITY_COUNT – the count of entities of this size
REVIEW_COUNT – a count of the entities of this size that were flagged for review
REVIEW_REASONS – a list of reasons why the entities should be reviewed

REVIEW items are suggestions of entities to look at because they contain multiple names, addresses, dates of birth, etc. They may be over matches, or they may just be large entities with lots of values.
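
As a rough illustration of how ENTITY_SIZE and ENTITY_COUNT relate, the sketch below builds a size breakdown from a list of entity IDs, one per record. The function name and sample data are hypothetical, and the review flags are left out because the criteria behind REVIEW_COUNT and REVIEW_REASONS are not described here.

```python
from collections import Counter


def entity_size_breakdown(entity_ids):
    """entity_ids: iterable with one resolved entity ID per record."""
    records_per_entity = Counter(entity_ids)                 # entity_id -> size
    entities_per_size = Counter(records_per_entity.values()) # size -> entity count
    return [
        {"ENTITY_SIZE": size, "ENTITY_COUNT": count}
        for size, count in sorted(entities_per_size.items())
    ]


# Hypothetical data: seven records resolved into four entities.
sample_entity_ids = [1, 1, 2, 3, 3, 3, 4]
for row in entity_size_breakdown(sample_entity_ids):
    print(row)
```

Here two entities are singletons, one has two records, and one has three. In a real snapshot, the largest sizes are the ones worth checking first for over matching.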
