Overview
This article covers installing the Senzing APIs on Linux, loading data and performing entity resolution, analyzing and exploring the entity resolution outcomes, and preparing and loading your own data into Senzing. Use this quickstart if you want to run a proof of concept (POC) in a standard Linux environment without using the example Docker containers.
Senzing provides 100k source records for ingestion and evaluation for free. If you require additional records for an evaluation, or any assistance when following this guide, please contact support.
Installation
The installation steps add the Senzing software repository to your Linux distribution; these steps only need to be completed once. During installation you will be asked to accept the End User License Agreement (EULA). On Red Hat based distributions you will also be prompted to accept the Senzing public key.
For an air-gapped installation, use our Air-Gapped guide to install the packages, then return here to complete the remaining steps.
Debian Based Distributions
sudo apt install apt-transport-https
The new APT senzingrepo v2 repository package works only for Senzing versions >= 3.10.0. It detects architecture and platform. If a prior Senzing version is required, you must install the older senzingrepo v1 repository package (https://senzing-production-apt.s3.amazonaws.com/senzingrepo_1.0.1-1_amd64.deb). Please contact Senzing Support if you have any questions.
wget https://senzing-production-apt.s3.amazonaws.com/senzingrepo_2.0.0-1_all.deb
sudo apt install ./senzingrepo_2.0.0-1_all.deb
sudo apt update
sudo apt install senzingapi
Continue with Creating a Project and Configuration.
Red Hat Based Distributions
The new YUM senzingrepo v2 repository package works only for Senzing versions >= 3.10.0. It detects architecture and platform. If a prior Senzing version is required, you must install the older senzingrepo v1 repository package (https://senzing-production-yum.s3.amazonaws.com/senzingrepo-1.0.0-2.x86_64.rpm). Please contact Senzing Support if you have any questions.
sudo yum install https://senzing-production-yum.s3.amazonaws.com/senzingrepo-2.0.0-1.noarch.rpm
sudo yum install senzingapi
Creating a Project and Configuration
To begin using Senzing, first create a project. A project is a self-contained instance of Senzing, deployed into a specified path. This path must not already exist; it will be created for you.
Creating and using projects provides independent and isolated instances of Senzing. Multiple projects can be created, for example, to test a newer version of Senzing or experiment with new data and scenarios.
The following command creates a project in a new directory named senzing in your home directory.
python3 /opt/senzing/g2/python/G2CreateProject.py ~/senzing
To help you get started quickly, an embedded SQLite database is configured when a Senzing project is created. SQLite is convenient for evaluation; production systems would use an enterprise-level RDBMS such as Postgres. For additional information see Technical - Database.
Environment Configuration
cd ~/senzing
source setupEnv
Add Senzing Configuration to the Database
A Senzing instance is configured with a JSON document. On a fresh installation this document needs to be registered in the Senzing database. This step only needs to be performed once for a new project. From the root of your project directory, run the following command and enter 'y' when prompted.
./python/G2SetupConfig.py
Loading the Sample Truth Set Data
You can now load some sample demo data into Senzing using the G2Loader utility. G2Loader is a sample application for loading data. It calls the Senzing APIs, the same APIs you would call when building your own applications or embedding Senzing into other systems or processes.
Add Data Source Codes
The three sample files to load represent three different data sources: customers, a watchlist and reference data. Records loaded into Senzing have an identifier attribute called DATA_SOURCE. This is an arbitrary value that describes and identifies where source records originated, and it is a useful designation when analyzing and reporting on entities.
Each record in the three files uses one of the DATA_SOURCE codes CUSTOMERS, REFERENCE or WATCHLIST. Before data can be loaded using these values, they need to be added to the Senzing configuration. This only needs to be completed once for each DATA_SOURCE value. The G2ConfigTool utility performs this configuration change. To start G2ConfigTool:
./python/G2ConfigTool.py
Once at the (g2cfg) prompt enter the following commands:
addDataSource CUSTOMERS
addDataSource REFERENCE
addDataSource WATCHLIST
save
y
quit
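If you are unsure which DATA_SOURCE codes a file uses, a short script can list them before you open G2ConfigTool. The following is a minimal sketch, not a Senzing utility. It assumes the file holds one JSON record per line, and the helper names data_source_codes and config_tool_commands are hypothetical.

```python
import json
from typing import Iterable, List


def data_source_codes(lines: Iterable[str]) -> List[str]:
    """Return the sorted, distinct DATA_SOURCE values found in the given lines.

    Assumption: each non-blank line is one JSON record.
    """
    codes = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        code = record.get("DATA_SOURCE")
        if code:
            codes.add(str(code))
    return sorted(codes)


def config_tool_commands(codes: Iterable[str]) -> List[str]:
    """Build one addDataSource command per code, in the form shown above."""
    return [f"addDataSource {code}" for code in codes]


# In-memory sample records; real input would come from an open file.
sample = [
    '{"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1001"}',
    '{"DATA_SOURCE": "WATCHLIST", "RECORD_ID": "2001"}',
    '{"DATA_SOURCE": "CUSTOMERS", "RECORD_ID": "1002"}',
]
for command in config_tool_commands(data_source_codes(sample)):
    print(command)
```

To scan a real file, pass an open file object instead of the sample list, for example the result of open("python/demo/truth/customers.json"). The printed commands still need to be entered at the (g2cfg) prompt, followed by save, y and quit.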
Loading
With the data source codes added, load each file with the following commands:
./python/G2Loader.py -f python/demo/truth/customers.json
./python/G2Loader.py -f python/demo/truth/reference.json
./python/G2Loader.py -f python/demo/truth/watchlist.json
Senzing operates in real time: as each record is loaded, it goes through the entity resolution process. As a result, every record within and across the files has been entity resolved against all other data, and the outcomes are persisted in the Senzing database.
To learn more about the entity resolution process check out these white papers.
Exploring Entity Resolution Outcomes
- G2Explorer for understanding how and why entities are resolved and related
- G2Snapshot for calculating reports to be viewed with G2Explorer
- G2Audit for comparing results between Senzing and other technologies or comparing Senzing results between configurations
To begin exploring the EDA tools, review the Exploratory Data Analysis (EDA) tools articles. Once you have an overview of the EDA tools and their functionality, we recommend exploring G2Explorer and G2Snapshot on the previously loaded truth set data.
The EDA tools articles describe loading the truth set data. You can skip that step because it was completed in the prior section.
To get started with G2Explorer try the following.
./python/G2Explorer.py
help
help get
The EDA tools have built-in help.
get customers 1070
The get command displays details for an entity, in this instance looked up by the data source code and record id.
search {"name_full": "robert smith", "date_of_birth": "11/12/1978"}
Perform a search for entities. You'll learn about the JSON structure in the next section - Mapping and Loading Your Own Data.
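When search attributes come from another program, hand-typing the JSON invites quoting mistakes. The following is a minimal sketch that builds the search line from a dictionary so it can be pasted at the G2Explorer prompt. The helper name search_command is hypothetical, and the sketch only formats text; it does not call Senzing.

```python
import json
from typing import Dict


def search_command(attributes: Dict[str, str]) -> str:
    """Format a G2Explorer search line from a dictionary of attributes."""
    # json.dumps takes care of quoting and escaping the values.
    return "search " + json.dumps(attributes)


print(search_command({"name_full": "robert smith", "date_of_birth": "11/12/1978"}))
```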
Try out the other examples in the G2Explorer article and explore the commands and their options using help.
Mapping and Loading Your Own Data
Mapping
At this point you are ready to map and load your own data. Mapping is the process of converting your source data into a structure that Senzing understands and can load.
Consider these examples. In your data, an attribute describing a person's full name is stored in a database table column named fullname. In Senzing a full name is represented by the term NAME_FULL. Similarly, your database column for address line 1 is named addressline1, and in Senzing this is represented by the term ADDR_LINE1.
Your task in mapping is to determine which attributes in your data source(s) are appropriate for entity resolution, extract those attributes, and construct the structure describing them to send to Senzing. The following is an example of a Senzing mapped JSON structure for an entry from a data source.
{
"DATA_SOURCE": "CUSTOMERS",
"RECORD_ID": "1001",
"RECORD_TYPE": "PERSON",
"PRIMARY_NAME_LAST": "Smith",
"PRIMARY_NAME_FIRST": "Robert",
"DATE_OF_BIRTH": "12/11/1978",
"ADDR_TYPE": "MAILING",
"ADDR_LINE1": "123 Main Street, Las Vegas NV 89132",
"PHONE_TYPE": "HOME",
"PHONE_NUMBER": "702-919-1300",
"EMAIL_ADDRESS": "bsmith@work.com"
}
To learn more about mapping, including the dictionary of terms and samples that help you prepare your own data sources for loading and entity resolution, review the Generic Entity Specification. You can also view the sample truth set files under the python/demo/truth path in your project: customers.json, reference.json and watchlist.json.
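The column-to-term mapping described in this section can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a Senzing tool. The column names fullname and addressline1 come from the example above, while the id column, the COLUMN_TO_SENZING table and the map_row helper are hypothetical.

```python
import json
from typing import Any, Dict

# Hypothetical mapping from source column names to Senzing terms.
COLUMN_TO_SENZING = {
    "fullname": "NAME_FULL",
    "addressline1": "ADDR_LINE1",
}


def map_row(row: Dict[str, Any], data_source: str, id_column: str = "id") -> Dict[str, str]:
    """Convert one source row into a Senzing mapped record."""
    record = {"DATA_SOURCE": data_source, "RECORD_ID": str(row[id_column])}
    for column, term in COLUMN_TO_SENZING.items():
        value = row.get(column)
        # Leave out empty values rather than sending blank attributes.
        if value is not None and str(value).strip():
            record[term] = str(value).strip()
    return record


row = {"id": 1001, "fullname": "Robert Smith", "addressline1": "123 Main Street"}
print(json.dumps(map_row(row, "CUSTOMERS")))
```

Extending COLUMN_TO_SENZING with further columns keeps the mapping in one place. The valid terms are listed in the Generic Entity Specification.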
Loading
Once you have mapped your own data source(s), it's time to load them. Before loading your own data, you'll want to purge the Senzing database, which still contains the sample truth set data. Purging the Senzing database completely removes all previously loaded data and entity resolution outcomes, so use it with caution!
The G2Command utility is one way to purge the Senzing database. To start G2Command:
./python/G2Command.py
Once at the (g2cmd) prompt enter the following commands:
purgeRepository
y
quit
When you mapped your data, you provided a DATA_SOURCE attribute. Just as with the truth set data, any new DATA_SOURCE value must first be added to the Senzing configuration with G2ConfigTool.
./python/G2ConfigTool.py
Once at the (g2cfg) prompt, enter the following commands, where datasourcecode is the value you used for DATA_SOURCE during mapping:
addDataSource datasourcecode
save
y
quit
You are now ready to load your data with the G2Loader utility, as you did for the sample truth set data. For example, if you have a file of mapped data describing prospects, the following command loads it:
./python/G2Loader.py -f prospects.json
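Before loading a large file, it can help to confirm that every record is valid JSON and carries a DATA_SOURCE. The following is a minimal sketch of such a check. It assumes one JSON record per line, and the check_records helper and the PROSPECTS code are hypothetical. It does not replace the validation performed during loading.

```python
import json
from typing import Iterable, List, Tuple


def check_records(lines: Iterable[str], required: Tuple[str, ...] = ("DATA_SOURCE",)) -> List[str]:
    """Return a list of problems found; an empty list means no issues were detected."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as error:
            problems.append(f"line {number}: invalid JSON ({error.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        for key in required:
            if not record.get(key):
                problems.append(f"line {number}: missing {key}")
    return problems


# In-memory sample: the second record lacks DATA_SOURCE and the third has a trailing comma.
lines = [
    '{"DATA_SOURCE": "PROSPECTS", "RECORD_ID": "1"}',
    '{"RECORD_ID": "2"}',
    '{"DATA_SOURCE": "PROSPECTS",}',
]
for problem in check_records(lines):
    print(problem)
```

To check a real file, pass an open file object such as the result of open("prospects.json") in place of the sample list.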
Once loading completes, revisit the EDA tools to explore and analyze the outcomes of entity resolution on your data.
Don't forget you can reach out to support if you need any assistance with getting started with Senzing.