For the purpose of this article, we will assume:
- You have the records processed with Senzing in your data warehouse and keyed off the same RECORD_ID
- You understand the replication capabilities of your RDBMS
- You understand SQL
Generally, Senzing prefers that you interact with the provided APIs, abstracting you from the internals of the underlying repository implementation. There is currently an exception: when you need real-time or periodic replication of the entities and relationships to join with the original data stored in your data warehouse. Doing this through the Senzing export API carries a lot of overhead, but it is generally easy using the underlying database.
Why not Export?
Export can be very useful, especially for small systems and PoCs. Senzing even ships a G2Export.py Python script to do just that. At some point, however, your system grows and exports take longer and longer. This is cumbersome, and it means those using the exports downstream are working with results that are further and further out of date.
With [near] real-time replication, those downstream users are always working with current information, and the cumbersome export process becomes seamless. This is important for all enterprise solutions and critically important for large ones.
Database Tables Involved
DSRC_RECORD
This table keeps track of the original source records processed with Senzing; it is used for de-duplication and for reporting unmapped attributes. It is also used for records with no mapping to RECORD_ID (called keyless); this article assumes you provided RECORD_IDs, allowing you to tie the records back to your data warehouse.
The key fields are:
- DSRC_ID: The identifier for the source system. This is seen in the CFG_DSRC section of the g2config.json file.
- RECORD_ID: This is the RECORD_ID provided when you processed a record (e.g. using the addRecord API). This RECORD_ID should be the same value that is in your data warehouse to identify the record.
- ENT_SRC_KEY: This is an internal identifier used to identify the record to the engine - possibly deduplicating the record.
OBS_ENT
This table is the internal keyed representation of unique records from DSRC_RECORD. Simply put, it maps your provided RECORD_ID to an internal OBS_ENT_ID.
The key fields are:
- OBS_ENT_ID: The unique identifier used internally to identify the deduplicated record
- ENT_SRC_KEY: The value that matches the same named value in DSRC_RECORD
- ETYPE_ID: The value that matches the same named value in DSRC_RECORD
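To make the mapping concrete, here is a minimal sketch of looking up the internal OBS_ENT_ID for one of your records. The literal values ('CUSTOMERS', '1001') are hypothetical examples, not values from this article:

```sql
-- Sketch: resolve a provided RECORD_ID to its internal OBS_ENT_ID.
-- The DSRC_ID and RECORD_ID literals below are hypothetical.
SELECT OE.OBS_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND DR.RECORD_ID = '1001';
```

If the record was deduplicated against an identical record, both will map to the same OBS_ENT_ID.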
RES_ENT_OKEY
This table describes which records are part of which resolved entity.
The key fields are:
- RES_ENT_ID: The grouping identifier that describes a resolved entity
- OBS_ENT_ID: The record that is part of the group
RES_RELATE
This table describes the relationships between resolved entities. Each row describes one relationship between two RES_ENT_IDs, creating one relationship record per edge.
The key fields in this table are:
- MIN_RES_ENT_ID: The resolved entity ID with the lower value
- MAX_RES_ENT_ID: The resolved entity ID with the higher value
- MATCHED_KEY: A description of which feature types supported or detracted from the relationship
Putting it together
By joining the DSRC_RECORD, OBS_ENT, and RES_ENT_OKEY tables we can provide a single entity map:
SELECT DR.DSRC_ID,
       DR.RECORD_ID,
       REO.RES_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE,
     RES_ENT_OKEY REO
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND OE.OBS_ENT_ID = REO.OBS_ENT_ID;
Indexing considerations: You would index this result on RECORD_ID to look up the resolved entity ID (RES_ENT_ID) assigned to a given record, and also on RES_ENT_ID to look up all the RECORD_IDs for a given resolved entity.
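If the entity map is stored as a table at the destination, the two indexes above might look like this. The table name ENTITY_MAP is a hypothetical name for the stored join result, not something Senzing creates:

```sql
-- ENTITY_MAP is a hypothetical name for the materialized entity map.
CREATE INDEX ENTITY_MAP_IX1 ON ENTITY_MAP (RECORD_ID);
CREATE INDEX ENTITY_MAP_IX2 ON ENTITY_MAP (RES_ENT_ID);
```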
By selecting the three key columns from the RES_RELATE table we can provide a simple entity relationship map:
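A minimal sketch of that query, using the three key columns described above:

```sql
SELECT MIN_RES_ENT_ID,
       MAX_RES_ENT_ID,
       MATCHED_KEY
FROM RES_RELATE;
```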
Indexing considerations: You would want two indexes on this table: one on MIN_RES_ENT_ID and one on MAX_RES_ENT_ID. Query for either field matching a given RES_ENT_ID to return all of its relationships.
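For illustration, assuming the relationship map is stored as a table named RELATION_MAP (a hypothetical name), the indexes and lookup could be sketched as:

```sql
-- RELATION_MAP is a hypothetical name for the stored relationship map.
CREATE INDEX RELATION_MAP_IX1 ON RELATION_MAP (MIN_RES_ENT_ID);
CREATE INDEX RELATION_MAP_IX2 ON RELATION_MAP (MAX_RES_ENT_ID);

-- All relationships for a given resolved entity (id 100 is hypothetical).
SELECT MIN_RES_ENT_ID, MAX_RES_ENT_ID, MATCHED_KEY
FROM RELATION_MAP
WHERE MIN_RES_ENT_ID = 100
   OR MAX_RES_ENT_ID = 100;
```

If your optimizer does not use both indexes for the OR predicate, a UNION of two single-sided queries achieves the same result.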
Thoughts on Replication
- Table replication: You can simply replicate the four involved tables and create a [materialized] view at the destination using the queries above. If possible/beneficial, you could replicate only the necessary columns from each table.
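On a destination RDBMS that supports materialized views, this could be sketched as follows (PostgreSQL syntax shown; the view name ENTITY_MAP is illustrative, and your RDBMS's syntax may differ):

```sql
CREATE MATERIALIZED VIEW ENTITY_MAP AS
SELECT DR.DSRC_ID,
       DR.RECORD_ID,
       REO.RES_ENT_ID
FROM DSRC_RECORD DR,
     OBS_ENT OE,
     RES_ENT_OKEY REO
WHERE DR.ENT_SRC_KEY = OE.ENT_SRC_KEY
  AND DR.ETYPE_ID = OE.ETYPE_ID
  AND OE.OBS_ENT_ID = REO.OBS_ENT_ID;

-- Refresh on whatever schedule your replication allows:
REFRESH MATERIALIZED VIEW ENTITY_MAP;
```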
- Indexes: As noted above, you would want to add indexes to satisfy queries by your RECORD_ID or by RES_ENT_ID as needed.
- View replication: Some RDBMS may allow for [materialized] view replication or just creating a table based on the view may be fast enough.
- Replication impact: Do weigh the replication method against your 24x7 real-time operational requirements. If you only snapshot the information once a day rather than replicating in real time, one mechanism may be better than another. Involve your DBA.
- Clustering: We don't recommend putting these particular tables in separate instances, but if you do, the view provided will not work without first replicating the tables into a single database.