For the purpose of this article, we will assume:
- You have the records processed with Senzing in your data warehouse and keyed off the same RECORD_ID
- You understand the replication capabilities of your RDBMS
- You understand SQL
Generally, Senzing prefers that you interact with the provided APIs, abstracting you from the internals of the underlying repository implementation. There is currently an exception: when you need real-time or periodic replication of the entities and relationships to join with the original data stored in your data warehouse. Doing this through the Senzing export API carries a lot of overhead, but it is generally easy using the underlying database.
Why not Export?
Export can be very useful, especially for small systems and PoCs. Senzing even ships a G2Export.py Python script to do just that. At some point, however, your system grows and exports take longer and longer. This is cumbersome, and it means those using the exports downstream are working with results that are further and further out of date.
With [near] real-time replication, those downstream users are always working with current information, and the cumbersome export process becomes seamless. This is important for all enterprise solutions and critically important for large ones.
Database Tables Involved
DSRC_RECORD
This table keeps track of the original source records processed with Senzing; it is used for de-duplication and for reporting unmapped attributes. It is also used for records with no mapping to RECORD_ID (called keyless); this article assumes you provided RECORD_IDs, allowing you to tie the records back to your data warehouse.
The key fields are:
- DSRC_ID: The identifier for the source system. This is seen in the CFG_DSRC section of the g2config.json file.
- RECORD_ID: This is the RECORD_ID provided when you processed a record (e.g. using the addRecord API). This RECORD_ID should be the same value that is in your data warehouse to identify the record.
- ENT_SRC_KEY: This is an internal identifier used to identify the record to the engine - possibly deduplicating the record.
OBS_ENT
This table is the internal keyed representation of unique records from DSRC_RECORD. Simply put, it maps your provided RECORD_ID to an internal OBS_ENT_ID.
The key fields are:
- OBS_ENT_ID: The unique identifier used internally to identify the deduplicated record
- ENT_SRC_KEY: The value that matches the same named value in DSRC_RECORD
- ETYPE_ID: The value that matches the same named value in DSRC_RECORD
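To make the mapping concrete, here is a minimal sketch of looking up the internal OBS_ENT_ID for one of your records. The literal values ('CUSTOMERS', '1001') are hypothetical examples, not values from this article:

```sql
-- Sketch: resolve a provided RECORD_ID to its internal OBS_ENT_ID.
-- The DSRC_ID and RECORD_ID literals below are hypothetical.
SELECT OE.OBS_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND DR.RECORD_ID = '1001';
```

If the record was deduplicated against an identical record, both will map to the same OBS_ENT_ID.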
RES_ENT_OKEY
This table describes which records are part of which resolved entity.
The key fields are:
- RES_ENT_ID: The grouping identifier that describes a resolved entity
- OBS_ENT_ID: The record that is part of the group
RES_RELATE
This table describes the relationships between resolved entities. Each row describes one relationship between two RES_ENT_IDs, creating one relationship record per edge.
The key fields in this table are:
- MIN_RES_ENT_ID: The resolved entity ID with the lower value
- MAX_RES_ENT_ID: The resolved entity ID with the higher value
- MATCHED_KEY: A description of which feature types supported or detracted from the relationship
Putting it together
By joining the DSRC_RECORD, OBS_ENT, and RES_ENT_OKEY tables we can provide a single entity map:
SELECT DR.DSRC_ID,
       DR.RECORD_ID,
       REO.RES_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE,
     RES_ENT_OKEY REO
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND OE.OBS_ENT_ID = REO.OBS_ENT_ID;
Indexing considerations: You would index this result on RECORD_ID to look up the resolved entity ID (RES_ENT_ID) assigned to a given record, and also on RES_ENT_ID to look up all the RECORD_IDs for a given resolved entity.
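If the entity map is stored as a table at the destination, the two indexes above might look like this. The table name ENTITY_MAP is a hypothetical name for the stored join result, not something Senzing creates:

```sql
-- ENTITY_MAP is a hypothetical name for the materialized entity map.
CREATE INDEX ENTITY_MAP_IX1 ON ENTITY_MAP (RECORD_ID);
CREATE INDEX ENTITY_MAP_IX2 ON ENTITY_MAP (RES_ENT_ID);
```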
By selecting the three key columns from the RES_RELATE table we can provide a simple entity relationship map:
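A minimal sketch of that query, using the three key columns described above:

```sql
SELECT MIN_RES_ENT_ID,
       MAX_RES_ENT_ID,
       MATCHED_KEY
FROM RES_RELATE;
```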
Indexing considerations: You would want two indexes on this table: one on MIN_RES_ENT_ID and one on MAX_RES_ENT_ID. Query for either field matching a given RES_ENT_ID to return all of its relationships.
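For illustration, assuming the relationship map is stored as a table named RELATION_MAP (a hypothetical name), the indexes and lookup could be sketched as:

```sql
-- RELATION_MAP is a hypothetical name for the stored relationship map.
CREATE INDEX RELATION_MAP_IX1 ON RELATION_MAP (MIN_RES_ENT_ID);
CREATE INDEX RELATION_MAP_IX2 ON RELATION_MAP (MAX_RES_ENT_ID);

-- All relationships for a given resolved entity (id 100 is hypothetical).
SELECT MIN_RES_ENT_ID, MAX_RES_ENT_ID, MATCHED_KEY
FROM RELATION_MAP
WHERE MIN_RES_ENT_ID = 100
   OR MAX_RES_ENT_ID = 100;
```

If your optimizer does not use both indexes for the OR predicate, a UNION of two single-sided queries achieves the same result.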
Thoughts on Replication
- Table replication: You can simply replicate the four involved tables and create a [materialized] view at the destination using the queries above. If possible/beneficial, you could replicate only the necessary columns from each table.
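On a destination RDBMS that supports materialized views, this could be sketched as follows (PostgreSQL syntax shown; the view name ENTITY_MAP is illustrative, and your RDBMS's syntax may differ):

```sql
CREATE MATERIALIZED VIEW ENTITY_MAP AS
SELECT DR.DSRC_ID,
       DR.RECORD_ID,
       REO.RES_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE,
     RES_ENT_OKEY REO
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND OE.OBS_ENT_ID = REO.OBS_ENT_ID;

-- Refresh on whatever schedule your replication allows:
REFRESH MATERIALIZED VIEW ENTITY_MAP;
```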
- Indexes: As noted above, you would want to add indexes to satisfy queries by your RECORD_ID or by RES_ENT_ID as needed.
- View replication: Some RDBMS may allow for [materialized] view replication or just creating a table based on the view may be fast enough.
- Replication impact: Do weigh the replication method against your 24x7 real-time operational requirements. If you only snapshot the information once a day rather than replicating in real time, one mechanism may be better than another. Involve your DBA.
- Clustering: We don't recommend putting these particular tables in separate instances, but if you do, the view provided will not work without first replicating the tables into a single database.