Overview
One of the key features of the Senzing API is the ability to pass changes to downstream systems for analytics and/or replication.
The fundamentals of how this is achieved are straightforward, not requiring any queuing or framework.
Fundamentals
WithInfo Functions
All functions in the Senzing API that may modify resolution have a "WithInfo" version. This version returns a JSON document with AFFECTED_ENTITIES like the following:
{
    "DATA_SOURCE": "TEST",
    "RECORD_ID": "1",
    "AFFECTED_ENTITIES": [
        {
            "ENTITY_ID": 1
        }
    ],
    "INTERESTING_ENTITIES": []
}
AFFECTED_ENTITIES is the list of entity IDs impacted by the API call. It allows you to build an eventually consistent view of the entities and relationships Senzing has constructed, even in a highly asynchronous, parallel processing system.
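For example, extracting the affected entity IDs from a withInfo response takes only a few lines. Here is a minimal Python sketch using the response shown above (the function name is just for illustration):

    import json

    def affected_entity_ids(with_info_json):
        """Return the set of entity IDs impacted by a withInfo call."""
        info = json.loads(with_info_json)
        return {e["ENTITY_ID"] for e in info.get("AFFECTED_ENTITIES", [])}

    response = ('{"DATA_SOURCE": "TEST", "RECORD_ID": "1", '
                '"AFFECTED_ENTITIES": [{"ENTITY_ID": 1}], "INTERESTING_ENTITIES": []}')
    print(affected_entity_ids(response))  # {1}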
What to do with the affected entities?
There are a few key steps in processing:
- You can have multiple threads or processes handling the affected entities, but you must ensure you aren't processing the same entity ID in parallel.
- Call getEntityByEntityID() for each entity ID to determine the current state of the entity and its relationships. Often all you need are the data sources and record IDs, depending on how you are using the information.
- If the entity doesn't exist, it was moved or deleted from Senzing, and you should reflect that in your processing.
- If the entity does exist, determine whether any changes need to be addressed in your processing. A minimal sketch of this loop follows the list.
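Here is one way that per-entity loop might look in Python. fetch_entity(), delete_downstream(), and upsert_downstream() are hypothetical stand-ins, not Senzing APIs; the entity JSON fields shown match typical getEntityByEntityID() output, but verify them against your version:

    import json

    def fetch_entity(entity_id):
        """Stand-in for getEntityByEntityID(); return the entity JSON,
        or None when the entity no longer exists in Senzing."""
        raise NotImplementedError("wire this to your Senzing SDK")

    def delete_downstream(entity_id):
        print(f"delete entity {entity_id} downstream")

    def upsert_downstream(entity_id, record_keys):
        print(f"upsert entity {entity_id} -> {record_keys}")

    def process_affected_entity(entity_id):
        # NOTE: ensure the same entity ID is never processed by two
        # workers at once (e.g., partition a work queue by entity ID).
        entity_json = fetch_entity(entity_id)
        if entity_json is None:
            delete_downstream(entity_id)  # merged away or deleted in Senzing
            return
        entity = json.loads(entity_json)
        # Often the data sources and record IDs are all you need.
        records = entity["RESOLVED_ENTITY"]["RECORDS"]
        record_keys = [(r["DATA_SOURCE"], r["RECORD_ID"]) for r in records]
        upsert_downstream(entity_id, record_keys)  # reconcile your downstream copy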
Performance
This can be an incredibly fast process if you:
1. Only ask for precisely the information you need from getEntityByEntityID(); the more information you request, the more intensive the workload will be (see the flags sketch after this list).
2. If you intend to replicate the data from the entities and records, land or keep the records in your downstream system, outside of Senzing, in parallel with loading them into Senzing. For instance, if you have a data warehouse with your person records, add an entity ID column and a relationship table so you can efficiently overlay the Senzing entity and relationship maps rather than having to push and transform all the data as well.
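As a hedged illustration of point 1, here is what a minimal-flags fetch might look like with the Senzing v3 Python SDK, filling in the fetch_entity() stand-in from earlier. The method and flag names are as I understand that SDK and have changed across releases, so verify them against your installed version; SENZING_ENGINE_CONFIG_JSON is assumed to hold your engine configuration:

    from senzing import G2Engine, G2EngineFlags, G2Exception

    g2 = G2Engine()
    g2.init("replicator", SENZING_ENGINE_CONFIG_JSON, False)  # config assumed in scope

    # Ask only for record keys and relationships, not full record JSON.
    MINIMAL_FLAGS = (G2EngineFlags.G2_ENTITY_INCLUDE_RECORD_DATA
                     | G2EngineFlags.G2_ENTITY_INCLUDE_ALL_RELATIONS)

    def fetch_entity(entity_id):
        # One possible implementation of the fetch_entity() stand-in above.
        response = bytearray()
        try:
            g2.getEntityByEntityID(entity_id, response, MINIMAL_FLAGS)
        except G2Exception:
            return None  # treat a not-found error as "entity gone"; refine for production
        return response.decode()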
If you want to further improve efficiency: there are often multiple changes to the same entities within a short period of time. Some customers, especially ones replicating to systems that prefer data in batches (e.g., ElasticSearch), build an app that collects the affected entity IDs, de-duplicates them over a window of time (5-10 minutes), and then processes them directly or passes them to a downstream system, as sketched below.
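One way to sketch that de-duplication window in Python; the 300-second window and queue wiring are illustrative choices, not part of any Senzing API:

    import queue
    import time

    affected = queue.Queue()  # producers put() every ENTITY_ID from withInfo responses

    def drain_window(work_queue, window_seconds=300):
        # Collect affected entity IDs for one window, de-duplicated,
        # and return them as a single batch for downstream processing.
        deadline = time.monotonic() + window_seconds
        batch = set()
        while time.monotonic() < deadline:
            try:
                batch.add(work_queue.get(timeout=1.0))
            except queue.Empty:
                continue
        return batch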
Examples
ElasticSearch
This is my favorite example as it is the simplest. The others hook into a "listener framework" that places a number of requirements on your Senzing solution. This one simply has classes that transform a Senzing getEntity* response into something useful to ElasticSearch (you will want to adapt this to your use case), plus an example Java function that calls the Senzing bulk export API, does bulk inserts into ElasticSearch, and offers some thoughts on using Kibana to search.
Find this on GitHub here: https://github.com/Senzing/elasticsearch
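The linked repo is Java; as a rough Python illustration of the same bulk pattern, here is a sketch using the elasticsearch client's helpers.bulk(). The index name, document shape, and connection details are hypothetical, not taken from the repo:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")  # connection details are illustrative

    def bulk_sync(entities):
        """entities: iterable of (entity_id, doc_or_none); None means deleted."""
        actions = []
        for entity_id, doc in entities:
            if doc is None:
                actions.append({"_op_type": "delete", "_index": "entities",
                                "_id": entity_id})
            else:
                actions.append({"_op_type": "index", "_index": "entities",
                                "_id": entity_id, "_source": doc})
        # raise_on_error=False so deletes of already-missing docs don't abort the batch
        helpers.bulk(es, actions, raise_on_error=False)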
Risk Score Calculator
This process does analytics on entities looking for interesting data quality characteristics, such as entities with more than one TAXID or a TAXID shared by more than one entity. The process manages a table with one row for every entity in Senzing.
1. When an entity is affected and no longer exists, it is removed from the table.
2. When an entity is affected and does exist, it is added/updated in the table.
Find this on GitHub here: https://github.com/Senzing/risk-score-calculator
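A minimal sketch of that table maintenance, using SQLite and a made-up TAX_ID count as the "interesting characteristic"; the schema and count_tax_ids() are hypothetical, not from the linked repo:

    import json
    import sqlite3

    conn = sqlite3.connect("risk.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS entity_risk (
                        entity_id INTEGER PRIMARY KEY,
                        tax_id_count INTEGER)""")

    def count_tax_ids(entity):
        # Hypothetical metric: count TAX_ID features on the entity.
        features = entity.get("RESOLVED_ENTITY", {}).get("FEATURES", {})
        return len(features.get("TAX_ID", []))

    def apply_affected(entity_id, entity_json):
        if entity_json is None:  # step 1: entity no longer exists
            conn.execute("DELETE FROM entity_risk WHERE entity_id = ?", (entity_id,))
        else:                    # step 2: add/update the row
            n = count_tax_ids(json.loads(entity_json))
            conn.execute("""INSERT INTO entity_risk (entity_id, tax_id_count)
                            VALUES (?, ?)
                            ON CONFLICT(entity_id)
                            DO UPDATE SET tax_id_count = excluded.tax_id_count""",
                         (entity_id, n))
        conn.commit()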
Neo4J Connector
This process assumes the records already exist in Neo4J and the connector is simply making entity nodes with relationships to the records in Neo4J and between the entities.
1. When an entity is affected and no longer exists, it is removed from Neo4J.
2. When an entity is affected and does exist, it is added/updated in Neo4J. Note that, for simplicity, the connector currently re-writes the entity completely rather than identifying deltas.
Find this on GitHub here: https://github.com/Senzing/connector-neo4j
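A rough Python sketch of that upsert/delete pattern using the official neo4j driver; the node labels, property names, and connection details are illustrative, not the linked connector's actual schema:

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    def sync_entity(entity_id, record_keys):
        """record_keys: list of (data_source, record_id), or None when the entity is gone."""
        with driver.session() as session:
            if record_keys is None:
                session.run("MATCH (e:Entity {entityId: $id}) DETACH DELETE e",
                            id=entity_id)
                return
            # Re-write the entity wholesale rather than computing deltas,
            # mirroring the simplification noted above.
            session.run("MERGE (e:Entity {entityId: $id})", id=entity_id)
            session.run("MATCH (:Entity {entityId: $id})-[r:RESOLVES]->() DELETE r",
                        id=entity_id)
            for data_source, record_id in record_keys:
                session.run(
                    "MATCH (e:Entity {entityId: $id}), "
                    "(rec:Record {dataSource: $ds, recordId: $rid}) "
                    "MERGE (e)-[:RESOLVES]->(rec)",
                    id=entity_id, ds=data_source, rid=record_id)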
Why doesn't WithInfo tell you the changes?
This is a common question, and the core answer is fairly simple.
1. The Senzing API operates completely in parallel and, based on how multi-processing works, you can't know that the operation that returns first actually impacted the system first.
For instance, if you called addRecordWithInfo(<record A>) and addRecordWithInfo(<record B>) at the same time, both could impact the same entities, but you could not reasonably derive the order in which they occurred. One could cause the creation of an entity and the other could cause a delete; processing them in the wrong order would leave you out of sync with the source.
2. You might ask why Senzing doesn't provide a sequential ID that would tell you the order of operations. Even if this could be done without dramatically harming the performance of the system, it would put the burden on you to ensure all the changes were applied in order and to figure out how to deal with missed deltas. That would make it even harder to parallelize downstream processing.
There are other reasons and nuances, but the paradigm in place has been proven to work and perform very well.
Comments
Thanks Brian, so if I understand the dataset must be preloaded and pre-processed (Resolved Entity Data) in Senzing, and you are using the API to "lookup" records with similar attributes. Is there any way to load data via the API, which are then resolved and could be queried as if it was pre-loaded?
I'm not sure what you mean. The point is to reflect changes from additions/changes to the records/resolution in Senzing to downstream systems.