Senzing's data repository scalability on very large datasets is limited by existing database technologies, due to the cross-talk required to provide real-time, entity-centric resolution. We have long seen the need to scale out the database layer both horizontally and vertically to meet the unique requirements of an entity resolution workload.
The most intensive entity resolution processing occurs during the initial loading of historical data or during the addition of new data source(s) into an existing repository. The largest impact and benefit of scaling will be seen during these operations.
To address these scaling requirements, Senzing has implemented sharding to provide broad, application-aware horizontal scaling. Coupled with the removal of database transactions (contention is handled directly by Senzing), this eliminates the need for expensive cross-talk between database servers, allowing individual database instances to be used for scaling.
Sharding in its full form provides for the distribution of tables across multiple servers and requires operational and hardware overhead. Sharding is so ingrained into Senzing that the API data layer automatically optimizes joins and secondary index usage on the fly to best utilize tables co-located in the same database instance. Due to the complexity of management and the immaturity of tooling, full sharding capability is considered experimental. That said, the ability to move groups of tables to different database instances for scaling is mainstream; the Senzing App ships configured with a three-node SQLite database cluster out of the box.
The improvement in throughput and performance from sharding varies with the deployment hardware and the data being processed. As an example, for the Senzing App performing an initial 10 million record ingestion, it is typical to see a two- to three-fold increase using the default three SQLite nodes versus a single one.
Now we'll explore setting up a three-node database cluster for the Senzing APIs to balance simplicity and efficiency.
- Senzing APIs installed
- Understanding of configuring Senzing database connections
- 3 available database instances
Prepare the Database Servers
To simplify the install, create the same database schema on all three servers (nodes). This will result in 3 separate nodes, referred to as CORE, RES, and LIBFEAT.
Copy the G2C.db file to 3 different file names, e.g. G2C.db, G2_RES.db and G2_LIBFEAT.db. A base G2C.db file can be found in <project_path>/resources/templates/G2C.db.template. For SQLite, the 3 nodes are represented by the 3 different database files.
Follow the setup instructions for your database platform (for example, the Red Hat or Debian articles) through to 'Configure and Test ODBC', or 'Configure G2Module.ini' where ODBC is not used, to connect your G2Loader.py client to each database node. For the CORE node, complete the entire set of instructions to verify it is working.
On each Senzing client (where you usually run G2Loader.py or your own applications), configure the G2Module.ini you are using to specify a HYBRID back end with the 2 auxiliary database nodes forming the cluster.
- Edit G2Module.ini and add the BACKEND keyword to the SQL section. Note that the CONNECTION string points to the CORE node.
- Add a new section defining each auxiliary cluster, using the database connection URI for your database system. This example is for SQLite.
- Add another section describing which tables sit in each cluster
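Assembled, a minimal G2Module.ini for the SQLite case might look like the following sketch. The file paths are illustrative (adjust them to your project layout), and the table-to-cluster mapping shown covers the three tables discussed in this article:

```ini
[SQL]
; CORE node - the main repository
CONNECTION=sqlite3://na:na@/project/var/sqlite/G2C.db
; Enable the clustered (hybrid) back end
BACKEND=HYBRID

; Auxiliary cluster C1 - the RES node
[C1]
CLUSTER_SIZE=1
DB_1=sqlite3://na:na@/project/var/sqlite/G2_RES.db

; Auxiliary cluster C2 - the LIBFEAT node
[C2]
CLUSTER_SIZE=1
DB_1=sqlite3://na:na@/project/var/sqlite/G2_LIBFEAT.db

; Which tables live in which cluster
[HYBRID]
RES_FEAT_STAT=C1
RES_FEAT_EKEY=C1
LIB_FEAT=C2
```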
The connection strings above are for a simple SQLite example. Modify the connection strings for your database system; details can be found in the applicable articles in the technical database sub-section. In this SQLite example, each connection string points to a separate SQLite database file.
In an enterprise-level RDBMS, the connection string for each of the 3 nodes would point to each distinct RDBMS instance and associated schema.
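For instance, with PostgreSQL the auxiliary cluster definitions might look like the following sketch, where the host names, credentials, and database names are all illustrative:

```ini
; Each cluster points at a distinct PostgreSQL instance
[C1]
CLUSTER_SIZE=1
DB_1=postgresql://senzing:password@dbhost1:5432:g2_res

[C2]
CLUSTER_SIZE=1
DB_1=postgresql://senzing:password@dbhost2:5432:g2_libfeat
```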
At this point, it is wise to go back to the CORE database and rename the LIB_FEAT, RES_FEAT_STAT, and RES_FEAT_EKEY tables so that no one accidentally runs against the system without the clustered configuration.
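With SQLite, one way to do this is with plain ALTER TABLE renames against the CORE database file; the _UNUSED suffix here is just an illustrative convention:

```sql
-- Run against the CORE database only; these tables now live on the auxiliary nodes.
ALTER TABLE LIB_FEAT      RENAME TO LIB_FEAT_UNUSED;
ALTER TABLE RES_FEAT_STAT RENAME TO RES_FEAT_STAT_UNUSED;
ALTER TABLE RES_FEAT_EKEY RENAME TO RES_FEAT_EKEY_UNUSED;
```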
That's it! You can now run G2Loader.py or your own applications as normal, with Senzing utilizing all three database nodes.
Note: If you have been provided SQL queries to run for Senzing health analysis, be aware that the tables above have been moved, so you will need to run each query against the correct database node.