Prior References
Basic Terms
The terms resolve, duplicate and match are collectively used to refer to records Senzing has determined are the same entity. When we say two records resolve we mean they refer to the same entity. Within the same data source, this could be called a duplicate. Across data sources this is called a match.
The engine also keeps track of the possible matches and relationships found while looking for matches that refer to the same entity. Records are grouped into entities, and entities can be related at one of the following match levels:
- Ambiguous Match – An ambiguous match occurs when an entity could in fact resolve to more than one entity that cannot be resolved to each other. For instance, we may have a Patrick Smith and a Patricia Smith at the same address. If we then got a Pat Smith at that same address, it could be either Patrick or Patricia. Since we can’t be sure which, it is held apart as an ambiguous match to either.
- Possible Match – A possible match occurs when two entities share high strength attributes, such as identifiers, yet still cannot be resolved together due to other differences. For instance, if the Patrick and Patricia records above had the same driver’s license number, they would be a possible match to each other. If they were married, one person’s ID might have been mistakenly put on the other’s account.
- Possibly Related – A possibly related match occurs when two entities share only lower strength attributes, such as addresses and phone numbers. For instance, the Patrick and Patricia records above would be possibly related because they share the same address (provided they did not also share an ID number).
- Disclosed Relationship – A disclosed relationship occurs when two entities are explicitly known to be related such as by marriage, being joint account holders, etc.
Report Definitions
G2Snapshot.py computes the statistics for the following reports:
- dataSourceSummary – a report that shows how many duplicates were detected within each data source, as well as the possible matches and relationships that were derived. For example, how many duplicate customers do I have, and are any of them related to each other.
- crossSourceSummary – a report that shows how many matches were made across data sources. For example, how many employees are related to customers.
- entitySizeBreakdown – a report that shows how many entities of what size were created. For instance, some entities are singletons, some might have connected 2 records, some 3, etc. This report is primarily used to ensure there are no instances of over matching. For instance, it’s ok for an entity to have hundreds of records as long as there are only a few name or address variations across them.
Database Level Stats
At the top level in the JSON file, there are database level stats for all data sources that were loaded:
- TOTAL_RECORD_COUNT – the count of records across all data sources that were presented to the engine for resolution
- TOTAL_ENTITY_COUNT – the count of distinct entities these records were resolved into
- TOTAL_COMPRESSION – 1 - (TOTAL_ENTITY_COUNT / TOTAL_RECORD_COUNT) to yield a percentage of matches made. These only include matches that resolved as the same entity. This does not include ambiguous, possible or any other relationships made
- TOTAL_AMBIGUOUS_MATCHES – the count of relationships between entities at this level
- TOTAL_POSSIBLE_MATCHES – the count of relationships between entities at this level
- TOTAL_POSSIBLY_RELATEDS – the count of relationships between entities at this level
- TOTAL_DISCLOSED_RELATIONSHIPS – the count of relationships between entities at this level
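As a worked example of the compression formula above, here is a minimal sketch in Python. The function name and the sample counts are hypothetical and are not taken from G2Snapshot.py. Returning 0.0 when no records were loaded is an assumption, since the formula is undefined in that case.

```python
def compression(entity_count: int, record_count: int) -> float:
    """Return 1 - (entity_count / record_count) as a fraction between 0 and 1."""
    if record_count == 0:
        # Assumption: report no compression when nothing was loaded.
        return 0.0
    return 1 - (entity_count / record_count)


# Hypothetical counts: 1,000 records that resolved into 800 entities.
ratio = compression(entity_count=800, record_count=1000)
print(f"TOTAL_COMPRESSION: {ratio:.1%}")
```

The result is a fraction. Multiply it by 100, or format it as shown, to read it as a percentage.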
Data Source Summary
Under the DATA_SOURCES section, all the above statistics and more are computed for each data source:
- RECORD_COUNT – a count of records in this data source that were presented to the engine
- ENTITY_COUNT – the count of distinct entities these records were resolved into
- COMPRESSION – 1 - (ENTITY_COUNT / RECORD_COUNT) to yield a percentage of matches made. These only include matches that resolved as the same entity. It does not include ambiguous, possible or any other relationships made
- SINGLE_COUNT – both a record count and an entity count by its very nature. These are also referred to as singletons
- SINGLE_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
- DUPLICATE_ENTITY_COUNT – the count of entities that contain more than one record for this data source
- DUPLICATE_RECORD_COUNT – the count of records that are in duplicate entities. For instance, one duplicate entity may have 2 records, another 3 records and so on
- DUPLICATE_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
- AMBIGUOUS_MATCH_ENTITY_COUNT – the count of relationships between entities at this level
- AMBIGUOUS_MATCH_RECORD_COUNT – the count of the records in those related entities
- AMBIGUOUS_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
- POSSIBLE_MATCH_ENTITY_COUNT – the count of relationships between entities at this level
- POSSIBLE_MATCH_RECORD_COUNT – the count of the records in those related entities
- POSSIBLE_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
- POSSIBLY_RELATED_ENTITY_COUNT – the count of relationships between entities at this level
- POSSIBLY_RELATED_RECORD_COUNT – the count of the records in those related entities
- POSSIBLY_RELATED_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
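To make the record and entity counts above concrete, the sketch below derives several of them from a small, made-up set of resolved entities. It only illustrates the definitions and is not how G2Snapshot.py computes them. The data layout and the `summarize` helper are hypothetical. Counting an entity as a singleton when it holds exactly one record from the data source is also an assumption.

```python
# Hypothetical resolved entities: entity ID -> list of (data source, record ID).
entities = {
    1: [("CUSTOMERS", "1001"), ("CUSTOMERS", "1002")],
    2: [("CUSTOMERS", "1003")],
    3: [("CUSTOMERS", "1004"), ("CUSTOMERS", "1005"), ("CUSTOMERS", "1006")],
    4: [("EMPLOYEES", "E1")],
}


def summarize(entities, data_source):
    """Compute a few per-data-source statistics from the toy data above."""
    # Number of records each entity holds from this data source.
    per_entity = [
        sum(1 for source, _ in records if source == data_source)
        for records in entities.values()
    ]
    # Ignore entities that contain no records from this data source.
    per_entity = [n for n in per_entity if n > 0]

    record_count = sum(per_entity)
    entity_count = len(per_entity)
    return {
        "RECORD_COUNT": record_count,
        "ENTITY_COUNT": entity_count,
        "COMPRESSION": 1 - entity_count / record_count if record_count else 0.0,
        # Assumption: a singleton has exactly one record from this data source.
        "SINGLE_COUNT": sum(1 for n in per_entity if n == 1),
        "DUPLICATE_ENTITY_COUNT": sum(1 for n in per_entity if n > 1),
        "DUPLICATE_RECORD_COUNT": sum(n for n in per_entity if n > 1),
    }


stats = summarize(entities, "CUSTOMERS")
for name, value in stats.items():
    print(name, value)
```

In this toy data, the six CUSTOMERS records resolve into three entities: one singleton and two duplicate entities that together hold five records.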
Cross Match Summary
Under each data source is a CROSS_MATCH section which contains a list of other data sources with matches to the parent data source. The following statistics are calculated for each other or child data source with matches to the parent:
- MATCH_ENTITY_COUNT – the count of entities that contain records in both the parent and child data sources
- MATCH_RECORD_COUNT – the count of records from the parent data source only in those matched entities
- MATCH_SAMPLE – a random sample of up to 10 resolved entity IDs in this category
- AMBIGUOUS_MATCH_ENTITY_COUNT – the count of relationships between entities at this level
- AMBIGUOUS_MATCH_RECORD_COUNT – the count of the records from the parent data source only in those related entities
- AMBIGUOUS_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
- POSSIBLE_MATCH_ENTITY_COUNT – the count of relationships between entities at this level
- POSSIBLE_MATCH_RECORD_COUNT – the count of the records from the parent data source only in those related entities
- POSSIBLE_MATCH_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
- POSSIBLY_RELATED_ENTITY_COUNT – the count of relationships between entities at this level
- POSSIBLY_RELATED_RECORD_COUNT – the count of the records from the parent data source only in those related entities
- POSSIBLY_RELATED_SAMPLE – a random sample of up to 10 resolved entity ID pairs that are at this level
Each statistic is reported from the parent data source’s point of view, and both sides are reported. For instance, the MATCH_ENTITY_COUNT under data source A (as the parent) for data source B (as the child) will be the same as the count found when drilling down from data source B as the parent.
However, the record counts at this level are only from the parent data source’s point of view. For instance, if an entity contains two records from data source A and one record from data source B, the MATCH_RECORD_COUNT from data source A’s point of view is two, while it is only one from data source B’s point of view.
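The parent and child points of view can be illustrated with a short sketch. The toy entities and the `cross_match` helper below are hypothetical. They show only how the two counts behave and are not the G2Snapshot.py implementation.

```python
# Hypothetical resolved entities: entity ID -> list of (data source, record ID).
entities = {
    1: [("A", "a1"), ("A", "a2"), ("B", "b1")],  # two A records, one B record
    2: [("A", "a3")],                            # no match to B
    3: [("A", "a4"), ("B", "b2")],               # one A record, one B record
}


def cross_match(entities, parent, child):
    """Count matched entities and the parent's records within them."""
    match_entity_count = 0
    match_record_count = 0
    for records in entities.values():
        sources = [source for source, _ in records]
        if parent in sources and child in sources:
            match_entity_count += 1
            # Only records from the parent data source are counted.
            match_record_count += sources.count(parent)
    return {
        "MATCH_ENTITY_COUNT": match_entity_count,
        "MATCH_RECORD_COUNT": match_record_count,
    }


print("A as parent:", cross_match(entities, "A", "B"))
print("B as parent:", cross_match(entities, "B", "A"))
```

Both directions report the same two matched entities, but the record count is three from A’s point of view and two from B’s.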
Entity Size Breakdown
There is also an ENTITY_SIZE_BREAKDOWN section at the root level, which contains the following statistics computed for each entity size:
- ENTITY_SIZE – the size of the entity these statistics are for. For instance, singletons have an entity size of one, whereas duplicates with only two records have an entity size of two, and so on
- ENTITY_COUNT – the count of entities of this size
- REVIEW_COUNT – a count of the entities of this size that were flagged for review
- REVIEW_REASONS – a list of reasons why the entities should be reviewed
REVIEW items are suggestions of entities to look at because they contain multiple names, addresses, dates of birth, etc. They may be over matches, or they may just be large entities with lots of values.
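As an illustration of the size breakdown, the sketch below groups a made-up list of per-entity record counts by size. The input list and variable names are hypothetical. REVIEW_COUNT and REVIEW_REASONS are left out because the criteria used to flag an entity for review are not described here.

```python
from collections import Counter

# Hypothetical number of records in each resolved entity.
records_per_entity = [1, 1, 1, 2, 2, 3, 7]

# Count how many entities exist at each size, smallest size first.
size_counts = Counter(records_per_entity)
breakdown = [
    {"ENTITY_SIZE": size, "ENTITY_COUNT": count}
    for size, count in sorted(size_counts.items())
]

for row in breakdown:
    print(row)
```

Here three entities are singletons, two contain two records each, and one each contains three and seven records.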