You've mapped and loaded your data and are eager to analyze the entity resolution insights and outcomes. You may also want to import a high-level overview of this information into another system or analytical tool. One way to do that is with the G2Export.py utility.
The G2Export utility is located in the python directory of your Senzing project. G2Export extracts a high-level overview of the current state of the Senzing repository: resolved entities, possible matches, relationships and associated source record information.
Loading Sample Data
If you don't currently have any data loaded into Senzing, or you'd like to use the sample demo data used in this article, you can load the supplied sample data with the following commands.
If you already have data loaded into your system and do not wish to lose it, do not run the following commands. The -P argument purges the Senzing repository of all currently loaded data. To test the steps in this article without losing existing data, you could create a new project to experiment in.
cd <project_root>/python/
python3 G2Loader.py -P -p demo/sample/project.csv
G2Export Options
G2Export - like the other Python tools - accepts --help (-h) to list its arguments.
g2@debian:~/senzing/python$ python3 G2Export.py --help
usage: G2Export.py [-h] [-c INIFILE] -o OUTPUTFILE [OUTPUTFILE ...]
[-f {0,1,2,3,4,5}] [-F {CSV,JSON}] [-x]
[-of OUTPUTFREQUENCY] [-cf [{1,2,3,4,5,6,7,8,9}]] [-xcr]
optional arguments:
-h, --help show this help message and exit
-c INIFILE, --iniFile INIFILE
Path and file name of optional G2Module.ini to use.
-o OUTPUTFILE [OUTPUTFILE ...], --outputFile OUTPUTFILE [OUTPUTFILE ...]
Path and file name to send output to.
Use -o - to send to stdout. When using this option G2Export messages and
statistics are written to g2export.log.
-f {0,1,2,3,4,5}, --outputFilter {0,1,2,3,4,5}
Specify only one value for the output filter - 0 through 5.
Filter 0 requests all the match levels (1 through 5).
Filters 2 through 5 include the prior filter values. For example, 3 would
output all entities & possible matches & relationships.
1 = All entities, without any relationships
2 = All entities, also include possible matches for each entity
3 = All entities, also include relationships for each entity
4 = All entities, also include name-only relationships for each entity
5 = All entities, also include disclosed relationships for each entity
Note: Filter 4 for name-only is for Senzing internal use.
Default: 0
-F {CSV,JSON}, --outputFormat {CSV,JSON}
Data format to export to, JSON or CSV.
Default: CSV
-x, --extended
Return extended details, adds RESOLVED_ENTITY_NAME & JSON_DATA.
Adding JSON_DATA significantly increases the size of the output and execution time.
When used with CSV output, JSON_DATA isn't included for the related entities
(RELATED_ENTITY_ID) for each resolved entity (RESOLVED_ENTITY_ID). This reduces
the size of a CSV export by preventing repeating data for related entities. JSON_DATA
for the related entites is still included in the CSV export and is located in the
export record where the RELATED_ENTITY_ID = RESOLVED_ENTITY_ID.
WARNING: This is not recommended! To include the JSON_DATA for every CSV record see the
--extendCSVRelates (-xcr) argument.
-of OUTPUTFREQUENCY, --outputFrequency OUTPUTFREQUENCY
Frequency of export output statisitcs.
Default: 1000
-cf [{1,2,3,4,5,6,7,8,9}], --compressFile [{1,2,3,4,5,6,7,8,9}]
Compress output file with gzip. Compression level can be optionally specified.
If output file is specified as - (for stdout), use shell redirection instead to compress:
G2Export.py -o - | gzip -v > myExport.csv.gz
Default: 6
-xcr, --extendCSVRelates
WARNING: Use of this argument is not recommended!
Used in addition to --extend (-x), it will include JSON_DATA in CSV output for related entities.
Only valid for CSV output format.
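Because --compressFile (-cf) compresses the output with gzip, a compressed CSV export can be read directly from Python without decompressing it first. The following is a minimal sketch only: the helper name iter_export_rows is hypothetical (it is not part of Senzing), and it assumes the export is UTF-8 encoded CSV with a header row.

```python
import csv
import gzip


def iter_export_rows(path):
    """Yield each row of a gzip-compressed CSV export as a dict keyed by column name."""
    # "rt" opens the gzip stream in text mode; newline="" is what the csv
    # module expects so that quoted fields containing newlines are preserved.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        yield from csv.DictReader(handle)
```

Call it with the path of your own compressed export, for example: for row in iter_export_rows(path): print(row["RESOLVED_ENTITY_ID"]).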
Running G2Export
The basic usage of G2Export is:
python3 G2Export.py -o myExport.csv
This will create a CSV file in the current directory named myExport.csv.
The following example exports to /tmp/my-g2export.csv and includes both resolved entity data and possible matches:
- python3 G2Export.py -o /tmp/my-g2export.csv -f2
This example exports as JSON, includes extended details and compresses the file:
- python3 G2Export.py -o /tmp/my-g2export.json -F JSON -x -cf
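If exports are run from a script, for example on a schedule, the command line can be assembled in Python. This is a sketch under stated assumptions: build_export_command and run_export are hypothetical helper names, only arguments listed in the --help output above are used, and python_dir must point at the python directory of your own project.

```python
import subprocess


def build_export_command(output_file, output_filter=None, output_format="CSV",
                         extended=False, compress=False):
    """Return the argument list for one G2Export.py run."""
    command = ["python3", "G2Export.py", "-o", output_file, "-F", output_format]
    if output_filter is not None:
        command += ["-f", str(output_filter)]  # -f takes a single value, 0 through 5
    if extended:
        command.append("-x")
    if compress:
        command.append("-cf")  # gzip with the default compression level
    return command


def run_export(python_dir, **options):
    """Run G2Export.py from python_dir; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_export_command(**options), cwd=python_dir, check=True)
```

For example, build_export_command("/tmp/my-g2export.csv", output_filter=2) produces the same arguments as the first example above, with the default -F CSV made explicit.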
Interpreting G2Export Output
G2Export output contains the following columns:
- RESOLVED_ENTITY_ID - A unique ID assigned to one or more source records. All source records with the same ID have been determined to be the same and all contribute to form a single entity within Senzing.
- RESOLVED_ENTITY_NAME - Name used for display purposes for a single entity (RESOLVED_ENTITY_ID). The name is derived from the lowest record ID from the source records belonging to the entity.
- RELATED_ENTITY_ID - When an entity identified by RESOLVED_ENTITY_ID has possible matches or relationships this is the unique ID of the resolved entity (RESOLVED_ENTITY_ID) the relationship is to. To identify the source record(s) for the related entity see RECORD_ID.
- MATCH_LEVEL - For each row with the same RESOLVED_ENTITY_ID, indicates how the source record relates to that entity: whether it resolved together with other source records, is a possible match, or has a relationship to the entity.
  - 0 = Initial or source record for the entity
  - 1 = Duplicate/similar source record that resolved to the entity
  - 2 = Possible match
  - 3 = Relationship
  - 4 = Name only (internal, not enabled by default)
  - 11 = Disclosed relationship
- MATCH_KEY - A representation of which source record attributes matched. For example, +NAME+ADDRESS-DOB indicates name and address matched but the date of birth did not. + means contributed to the match and - means detracted from the match. - is only shown for exclusive features that could break a match.
- DATA_SOURCE - Identifier code assigned to the original JSON or CSV data source file. Examples might be CUSTOMERS, PROSPECTS, WATCHLIST, etc.
- RECORD_ID - The unique key/identifier of the record within the data source identified by DATA_SOURCE. Provides provenance back to a source system.
- JSON_DATA - The raw data for the source record in JSON format. JSON is used because resolved entities may contain a single value for an attribute, such as 1 name or 1 address, or several values, such as 3 names or 4 addresses. JSON provides a well-structured format to represent both singular and multiple attribute scenarios.
- LENS_CODE - Reserved for future use.
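The MATCH_LEVEL and MATCH_KEY columns are the two most often decoded when post-processing an export. As a rough sketch, the match levels listed above can be held in a lookup table and a simple match key split into its + and - parts. The names MATCH_LEVELS and parse_match_key are hypothetical, and the parser assumes keys in the plain +FEATURE-FEATURE form shown above; more elaborate keys may need extra handling.

```python
import re

# Labels taken from the MATCH_LEVEL list above.
MATCH_LEVELS = {
    0: "Initial or source record",
    1: "Resolved",
    2: "Possible match",
    3: "Relationship",
    4: "Name only",
    11: "Disclosed relationship",
}


def parse_match_key(match_key):
    """Split a match key into the features that supported and detracted from a match."""
    # Each token is a sign followed by a feature name, e.g. "+NAME" or "-DOB".
    tokens = re.findall(r"([+-])([^+-]+)", match_key)
    supported = [name for sign, name in tokens if sign == "+"]
    detracted = [name for sign, name in tokens if sign == "-"]
    return supported, detracted


print(parse_match_key("+NAME+ADDRESS-DOB"))
```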
Consider the following example export in CSV format from the sample demo data included with Senzing. G2Export was run with the extended argument (-x) but the JSON_DATA and LENS_CODE columns have been removed for brevity. Coloring has been added to aid visualization.
This snippet from the export informs us:
- There are 3 resolved entities, identified by the RESOLVED_ENTITY_IDs 1, 3 and 4
- The resolved entity identified as 1 - ROBERT M JONES JR
  - Consists of 2 records from the DATA_SOURCE PEOPLE
    - The unique keys/identifiers of these 2 records in the data source are 1001 and 1002
    - Record 1002 resolved to record 1001, as indicated by MATCH_LEVEL = 1
    - The original source record for this entity was 1001, as indicated by MATCH_LEVEL = 0
    - Resolution occurred due to the same or similar name, date of birth, gender and address
  - Has a possible match to RESOLVED_ENTITY_ID 3 - MARTIN JONZE - indicated in the RELATED_ENTITY_ID column, on a shared passport as shown in MATCH_KEY
  - Has a relationship to RESOLVED_ENTITY_ID 4 - ELIZABETH R JONES - indicated in the RELATED_ENTITY_ID column, on a shared or similar surname and address; gender, social security number, driving license and passport are different, as shown in MATCH_KEY
  - Is related to RESOLVED_ENTITY_ID 1001 - PRESTO COMPANY - (not shown in the example for brevity) from the COMPANIES data source via a phone number and address. Two records are shown because entity 1001 resolved together from the source records 2001 and 2002 in COMPANIES
- The resolved entities 3 and 4 reflect the same relationships outlined for RESOLVED_ENTITY_ID = 1 - ROBERT M JONES JR
- The resolved entities 3 and 4 each consist of only one source record from the PEOPLE data source: 1003 for resolved entity 3 and 1004 for resolved entity 4. This is indicated by the single same-colored row with MATCH_LEVEL = 0 and no other records with MATCH_LEVEL = 1
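The reading of the export described above can be reproduced in a few lines of Python by grouping rows on RESOLVED_ENTITY_ID and splitting them on MATCH_LEVEL. This is a sketch, not output from G2Export: the rows in SAMPLE are hand-written and only loosely modelled on the example, and both the MATCH_KEY values and the 0 used for RELATED_ENTITY_ID on the entity's own rows are illustrative assumptions. The function name summarize is hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hand-written, illustrative rows; not real G2Export output.
SAMPLE = """RESOLVED_ENTITY_ID,RESOLVED_ENTITY_NAME,RELATED_ENTITY_ID,MATCH_LEVEL,MATCH_KEY,DATA_SOURCE,RECORD_ID
1,ROBERT M JONES JR,0,0,,PEOPLE,1001
1,ROBERT M JONES JR,0,1,+NAME+DOB+GENDER+ADDRESS,PEOPLE,1002
1,ROBERT M JONES JR,3,2,+PASSPORT,PEOPLE,1003
1,ROBERT M JONES JR,4,3,+SURNAME+ADDRESS,PEOPLE,1004
"""


def summarize(rows):
    """Group export rows into member records and related entities per resolved entity."""
    entities = defaultdict(lambda: {"records": [], "related": {}})
    for row in rows:
        entity = entities[int(row["RESOLVED_ENTITY_ID"])]
        level = int(row["MATCH_LEVEL"])
        if level <= 1:
            # Levels 0 and 1 are source records that make up the entity itself.
            entity["records"].append((row["DATA_SOURCE"], row["RECORD_ID"]))
        else:
            # Higher levels describe a possible match or a relationship.
            entity["related"][int(row["RELATED_ENTITY_ID"])] = level
    return dict(entities)


summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(summary[1])
```

For the sample rows, entity 1 ends up with two member records (1001 and 1002) and two related entities (3 at match level 2 and 4 at match level 3).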
A visual representation of the export example:
Formatting of G2Export Data
The format of the export data was designed to facilitate a number of purposes, including:
- Immediately ready for use in analytical system scoring, for example
  - Risk Score = 30 if RESOLVED_ENTITY_ID has both a good data source record (CUSTOMER) and a bad data source record (WATCHLIST) at a MATCH_LEVEL of 2 or less
  - Risk Score += 10 if JSON_DATA.{BUSINESS_TYPE} = "Jewelry Store"
- Link information can be imported into other systems
  - If the records were extracted from a source database, simply update them with the RESOLVED_ENTITY_ID by joining the data source table(s) with the exported data on RECORD_ID
  - You may wish to create a related entity table as well with the following fields - RESOLVED_ENTITY_ID, RELATED_ENTITY_ID, MATCH_LEVEL (when > 1) and MATCH_KEY
  - If you are creating a new database, e.g. Elasticsearch, and don't want to use the JSON_DATA field to get the original field values for each record, simply do the same join on data source (original CSV file) and RECORD_ID
- Create sub-lists for import into a graph database:
  - Source record nodes: select distinct data_source, record_id, json_data (or join to original CSV file if desired)
  - Resolved entity nodes: select distinct data_source, resolved_id, resolved_name where match_level <= 1
  - Links between source records and resolved entities: select data_source, resolved_id, record_id where match_level <= 1
  - Links between resolved entities: select distinct resolved_id, related_id where resolved_id < related_id
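The first scoring rule and the graph sub-lists above can be sketched in plain Python over the rows of a CSV export. This is an illustration under assumptions, not a definitive implementation: graph_sublists and risk_score are hypothetical names, rows is expected to be a list of dicts keyed by the export column names (for example list(csv.DictReader(...))), and entity nodes are keyed by RESOLVED_ENTITY_ID only.

```python
def graph_sublists(rows):
    """Build the four node and link sets described above from export rows."""
    record_nodes, entity_nodes = set(), set()
    record_links, entity_links = set(), set()
    for row in rows:
        resolved = int(row["RESOLVED_ENTITY_ID"])
        related = int(row["RELATED_ENTITY_ID"])
        record = (row["DATA_SOURCE"], row["RECORD_ID"])
        record_nodes.add(record)
        if int(row["MATCH_LEVEL"]) <= 1:
            # Levels 0 and 1: the record belongs to the resolved entity.
            entity_nodes.add(resolved)
            record_links.add((resolved, record))
        elif resolved < related:
            # Same filter as above: keep one direction of each entity pair.
            entity_links.add((resolved, related))
    return record_nodes, entity_nodes, record_links, entity_links


def risk_score(rows, good="CUSTOMER", bad="WATCHLIST"):
    """Score 30 for each entity with good and bad source records at MATCH_LEVEL <= 2."""
    sources = {}
    for row in rows:
        if int(row["MATCH_LEVEL"]) <= 2:
            sources.setdefault(row["RESOLVED_ENTITY_ID"], set()).add(row["DATA_SOURCE"])
    # {good, bad} <= found is a subset test: both sources must be present.
    return {entity: 30 if {good, bad} <= found else 0 for entity, found in sources.items()}
```

Sets are used so that duplicates collapse in the same way select distinct does in the queries above.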
JSON Output Format
Use the --outputFormat (-F) argument to have G2Export write JSON instead of CSV. The --outputFile (-o) argument is still required:
python3 G2Export.py -o /tmp/default-export.json --outputFormat JSON
Without using the --extended (-x) argument the output of G2Export using JSON contains less data than CSV output. To obtain additional information when using the JSON output format be sure to use the --extended argument.
The example of the JSON output format below has been pretty-printed with:
jq . < /tmp/default-export.json
If you wish to use jq you may need to install it on your system.
Each line in the output file describes an entity. The RECORDS object under RESOLVED_ENTITY details the source records that have resolved to form a single entity. The RELATED_ENTITIES object details any relationships the same entity has.
{
  "RESOLVED_ENTITY": {
    "ENTITY_ID": 5,
    "LENS_CODE": "DEFAULT",
    "RECORDS": [
      {
        "DATA_SOURCE": "CUSTOMERS",
        "RECORD_ID": "1005",
        "ENTITY_TYPE": "GENERIC",
        "INTERNAL_ID": 5,
        "ENTITY_KEY": "39DAE53669DC4071D5AAADCEE37238A126F0C6AF",
        "ENTITY_DESC": "Rob E Smith",
        "MATCH_KEY": "",
        "MATCH_LEVEL": 0,
        "MATCH_LEVEL_CODE": "",
        "MATCH_SCORE": 0,
        "ERRULE_CODE": "",
        "REF_SCORE": 0,
        "LAST_SEEN_DT": "2021-09-09 14:15:47.150"
      },
      {
        "DATA_SOURCE": "WATCHLIST",
        "RECORD_ID": "1006",
        "ENTITY_TYPE": "GENERIC",
        "INTERNAL_ID": 100001,
        "ENTITY_KEY": "9F111B0689C209355CF433FA1D446427614BC246",
        "ENTITY_DESC": "Rob Smith Sr",
        "MATCH_KEY": "+NAME+DRLIC",
        "MATCH_LEVEL": 1,
        "MATCH_LEVEL_CODE": "RESOLVED",
        "MATCH_SCORE": 13,
        "ERRULE_CODE": "SF1_CNAME",
        "REF_SCORE": 8,
        "LAST_SEEN_DT": "2021-09-09 14:16:13.285"
      }
    ]
  },
  "RELATED_ENTITIES": [
    {
      "ENTITY_ID": 1,
      "LENS_CODE": "DEFAULT",
      "MATCH_LEVEL": 2,
      "MATCH_LEVEL_CODE": "POSSIBLY_SAME",
      "MATCH_KEY": "+NAME+ADDRESS-DOB",
      "MATCH_SCORE": 12,
      "ERRULE_CODE": "CNAME_CFF_DEXCL",
      "REF_SCORE": 5,
      "IS_DISCLOSED": 0,
      "IS_AMBIGUOUS": 0,
      "RECORDS": [
        {
          "DATA_SOURCE": "CUSTOMERS",
          "RECORD_ID": "1001"
        },
        {
          "DATA_SOURCE": "CUSTOMERS",
          "RECORD_ID": "1002"
        },
        {
          "DATA_SOURCE": "CUSTOMERS",
          "RECORD_ID": "1003"
        },
        {
          "DATA_SOURCE": "CUSTOMERS",
          "RECORD_ID": "1004"
        }
      ]
    }
  ]
}
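Since each line of the JSON export describes one entity, the file can be processed line by line with Python's json module. The sketch below is illustrative only: summarize_entity is a hypothetical helper, the sample line is a trimmed copy of the example above, and it reads only fields that appear in that example (which fields are present may depend on whether --extended was used).

```python
import json


def summarize_entity(line):
    """Return (entity id, member records, related entity ids mapped to match level code)."""
    doc = json.loads(line)
    resolved = doc["RESOLVED_ENTITY"]
    records = [(r["DATA_SOURCE"], r["RECORD_ID"]) for r in resolved["RECORDS"]]
    # RELATED_ENTITIES is treated as optional in case an entity has no relationships.
    related = {e["ENTITY_ID"]: e["MATCH_LEVEL_CODE"] for e in doc.get("RELATED_ENTITIES", [])}
    return resolved["ENTITY_ID"], records, related


# One trimmed line in the shape of the example above.
line = ('{"RESOLVED_ENTITY": {"ENTITY_ID": 5, "RECORDS": ['
        '{"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1005"}, '
        '{"DATA_SOURCE": "WATCHLIST", "RECORD_ID": "1006"}]}, '
        '"RELATED_ENTITIES": [{"ENTITY_ID": 1, "MATCH_LEVEL_CODE": "POSSIBLY_SAME"}]}')
print(summarize_entity(line))
```

To process a whole export, open the file and call summarize_entity on each non-empty line.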