There are times the Senzing engine determines additional work needs to be performed on an entity. In some cases it will automatically decide this work should be done at a different time, for instance:
- Cleaning up decisions made based on attributes that are determined to no longer be important (most common)
- Records being loaded in parallel around the same cluster of entities causing conflicts
- Automatic corrections
When this happens, a special record is written to the SYS_EVAL_QUEUE table for future processing. These entries are known as REDOs or redo records. When using the G2Loader.py utility, G2Loader will periodically switch from ingesting records to processing redo records if any are available.
The SYS_EVAL_QUEUE table consists of the following columns:
- LENS_CODE - A key for an advanced feature which is not in use at the moment
- ETYPE_CODE - Entity type code, typically something like PERSON or COMPANY
- DSRC_CODE - The user provided data source identifying code
- ENT_SRC_KEY - A key, often internally generated, to identify a specific record
- MSG - The internally formatted message to be processed by the engine
The first four fields (LENS_CODE, ETYPE_CODE, DSRC_CODE and ENT_SRC_KEY) together form the unique key.
When building your own applications to ingest data, you will need to process the redo records periodically as well.
Logically the data is processed as follows:
- Call the getRedoRecord API
  - It is best to do this in blocks of records (e.g. 100) in case there are numerous redo records
  - It is recommended to do this periodically during ingestion (e.g. every 10000 records) and to cap the number of blocks processed at a time (e.g. max 10 blocks) to balance new data and redo records
- For each record returned by the getRedoRecord API
  - Call process(redo record)
  - Handle errors as you would with addRecord or similar API functions
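The steps above can be sketched in Python as follows. This is a minimal illustration, not the Senzing API itself: drain_redo, ingest_with_redo and the callables get_redo_record, process_record, add_record and on_error are hypothetical names standing in for your own wrappers around the engine's getRedoRecord, process and addRecord calls, whose exact signatures depend on the Senzing version in use. The sketch assumes get_redo_record returns an empty value when the queue has nothing left.

```python
from typing import Callable, Iterable, Optional


def drain_redo(
    get_redo_record: Callable[[], str],
    process_record: Callable[[str], None],
    block_size: int = 100,
    max_blocks: int = 10,
    on_error: Optional[Callable[[str, Exception], None]] = None,
) -> int:
    """Process up to max_blocks blocks of block_size redo records each.

    All callables are hypothetical wrappers around the engine calls.
    Returns the number of redo records processed successfully.
    """
    processed = 0
    for _ in range(max_blocks):
        # Fetch one block of redo records; stop early if the queue is empty.
        block = []
        for _ in range(block_size):
            record = get_redo_record()
            if not record:
                break
            block.append(record)

        # Process each fetched record, handling errors as for addRecord.
        for record in block:
            try:
                process_record(record)
                processed += 1
            except Exception as err:
                if on_error is None:
                    raise
                on_error(record, err)

        # A short block means the queue was empty, so stop for now.
        if len(block) < block_size:
            break
    return processed


def ingest_with_redo(
    records: Iterable[str],
    add_record: Callable[[str], None],
    get_redo_record: Callable[[], str],
    process_record: Callable[[str], None],
    redo_every: int = 10_000,
) -> int:
    """Load records, pausing every redo_every records to process redo."""
    loaded = 0
    for record in records:
        add_record(record)
        loaded += 1
        if loaded % redo_every == 0:
            drain_redo(get_redo_record, process_record)
    # Assumption of this sketch: finish by draining whatever is left.
    while drain_redo(get_redo_record, process_record):
        pass
    return loaded
```

Capping each pause at max_blocks keeps a large redo backlog from starving new data; the leftover redo records are picked up at the next pause or in the final drain.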
You can also construct an architecture with worker node(s) performing ingestion to Senzing and a separate node dedicated to processing only redo records.
By default, the G2Loader.py utility that ships with the Senzing API periodically pauses data ingestion to process any redo records that are available. You will notice output similar to the following:
198000 rows processed at 04:52pm, 236 records per second
199000 rows processed at 04:52pm, 172 records per second
200000 rows processed at 04:52pm, 242 records per second
Waiting for processing queue to empty to start redo...
Pausing loading to process redo records...
1000 redo records processed at 04:52pm, 902 records per second
2000 redo records processed at 04:52pm, 808 records per second
2771 reevaluations completed
Redo processing complete resuming loading...
201000 rows processed at 04:52pm, 110 records per second
202000 rows processed at 04:52pm, 260 records per second
203000 rows processed at 04:52pm, 237 records per second
G2Loader has an optional argument (--noRedo or -n) to disable redo record processing and only ingest source data records. When this argument is specified, redo processing is disabled for this instance of the G2Loader utility.
If redo processing is disabled, it can be performed after ingestion has completed or in parallel with the ingesting G2Loader. This is done with another optional argument, --redoMode or -R, which performs only redo record processing. Output similar to the following is shown:
Starting in redo only mode, processing redo queue (CTRL-C to quit)
1000 redo records processed at 05:35pm, 630 records per second
2000 redo records processed at 05:35pm, 937 records per second
Redo queue empty, 2653 total records processed. Waiting 60 seconds for next cycle at 05:35:21pm (CTRL-C to quit at anytime)...
Redo queue empty, 2653 total records processed. Waiting 60 seconds for next cycle at 05:36:20pm (CTRL-C to quit at anytime)...
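If you build your own dedicated redo processor instead of running G2Loader in redo-only mode, the cycle visible in this output (empty the queue, wait, check again) might look roughly like the sketch below. It is an illustration under assumptions, not G2Loader's actual implementation: run_redo_only, get_redo_record, process_record, max_cycles and sleep are hypothetical names, with the two callables standing in for your own wrappers around the engine's getRedoRecord and process calls.

```python
import time
from typing import Callable, Optional


def run_redo_only(
    get_redo_record: Callable[[], str],
    process_record: Callable[[str], None],
    wait_seconds: float = 60.0,
    max_cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Repeatedly empty the redo queue, waiting between cycles.

    get_redo_record is assumed to return an empty value when the queue
    is empty. max_cycles=None runs until interrupted (e.g. CTRL-C).
    Returns the total number of redo records processed.
    """
    total = 0
    cycle = 0
    while True:
        cycle += 1
        # Drain everything currently in the queue.
        while True:
            record = get_redo_record()
            if not record:
                break
            process_record(record)
            total += 1
        print(f"Redo queue empty, {total} total records processed.")
        # Stop after the requested number of cycles, if one was given.
        if max_cycles is not None and cycle >= max_cycles:
            return total
        sleep(wait_seconds)
```

The sleep and max_cycles parameters exist only so the loop can be exercised in a test without real waiting; a production worker would typically leave both at their defaults.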