Senzing's data repository scalability on very large datasets is limited by existing database technologies, due to the cross-talk required to provide real-time, entity-centric resolution. We have long seen the need to scale out the database layer both horizontally and vertically to meet the unique requirements of an entity resolution workload.
The most intensive entity resolution processing occurs during the initial loading of historical data or during the addition of new data source(s) into an existing repository. The largest impact and benefit of scaling will be seen during these operations.
To address these scaling requirements, Senzing has implemented sharding to provide broad, application-aware horizontal scaling. Coupled with the removal of database transactions (contention is handled directly by Senzing), this eliminates the need for expensive cross-talk between database servers, allowing individual database instances to be used for scaling.
Sharding in its full form provides for the distribution of tables across multiple servers and requires operational and hardware overhead. Sharding is so ingrained into Senzing that the API data layer automatically optimizes joins and secondary index usage on the fly to best utilize tables co-located in the same database instance. Due to the complexity of management and the immaturity of tooling, full sharding capability is considered experimental. That said, the ability to move groups of tables to different database instances for scaling is mainstream; the Senzing App ships configured with a three-node SQLite database cluster out of the box.
The improvement in throughput and performance from sharding varies with the deployment hardware and the data being processed. As an example, for the Senzing App performing an initial 10 million record ingestion, it is typical to see a two- to three-fold increase using the default three SQLite nodes versus a single one.
Now we'll explore setting up a three-node database cluster for the Senzing APIs to balance simplicity and efficiency.
- Senzing APIs installed
- Understanding of configuring Senzing database connections
- 3 available database instances
Prepare the Database Servers
To simplify the install, create the same database schema on all three servers (nodes). This will result in 3 separate nodes, referred to as CORE, RES, and LIBFEAT.
Copy the G2C.db file to 3 different file names, e.g. G2C.db, G2_RES.db and G2_LIBFEAT.db. A base G2C.db file can be found in <project_path>/resources/templates/G2C.db.template. For SQLite, the 3 nodes are represented by the 3 different database files.
Follow the setup instructions for your database platform (for example, the Red Hat or Debian articles) through to 'Configure and Test ODBC', or 'Configure G2Module.ini' where ODBC is not used, to connect your G2Loader.py client to each database node. For the CORE node, complete the entire set of instructions to verify it is working.
On each Senzing client (where you usually run G2Loader.py or your own applications), configure the G2Module.ini you are using to specify a HYBRID back end with the 2 auxiliary database nodes forming the cluster.
- Edit G2Module.ini and add the BACKEND keyword to the SQL section. Note that the CONNECTION string points to the CORE node.
- Add a new section defining each auxiliary cluster, using the database connection URI for your database system. This example is for SQLite.
- Add another section describing which tables sit in each cluster
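Assembled, a minimal G2Module.ini for the SQLite case might look like the following sketch. The file paths are illustrative (adjust them to your project layout), and the table-to-cluster mapping shown covers the three tables discussed in this article:

```ini
[SQL]
; CORE node - the main repository
CONNECTION=sqlite3://na:na@/project/var/sqlite/G2C.db
; Enable the clustered (hybrid) back end
BACKEND=HYBRID

; Auxiliary cluster C1 - the RES node
[C1]
CLUSTER_SIZE=1
DB_1=sqlite3://na:na@/project/var/sqlite/G2_RES.db

; Auxiliary cluster C2 - the LIBFEAT node
[C2]
CLUSTER_SIZE=1
DB_1=sqlite3://na:na@/project/var/sqlite/G2_LIBFEAT.db

; Which tables live in which cluster
[HYBRID]
RES_FEAT_STAT=C1
RES_FEAT_EKEY=C1
LIB_FEAT=C2
```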
The connection strings above are for a simple SQLite example. Modify the connection strings for your database system; details can be found in the applicable articles in the technical database sub-section. In this SQLite example, each connection string points to a separate SQLite database file.
In an enterprise-level RDBMS, the connection string for each of the 3 nodes would point to each distinct RDBMS instance and associated schema.
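For instance, with PostgreSQL the auxiliary cluster definitions might look like the following sketch, where the host names, credentials, and database names are all illustrative:

```ini
; Each cluster points at a distinct PostgreSQL instance
[C1]
CLUSTER_SIZE=1
DB_1=postgresql://senzing:password@dbhost1:5432:g2_res

[C2]
CLUSTER_SIZE=1
DB_1=postgresql://senzing:password@dbhost2:5432:g2_libfeat
```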
At this point, it is wise to go back to the CORE database and rename the LIB_FEAT, RES_FEAT_STAT, and RES_FEAT_EKEY tables so that no one accidentally runs against the system without the clustered configuration.
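With SQLite, one way to do this is with plain ALTER TABLE renames against the CORE database file; the _UNUSED suffix here is just an illustrative convention:

```sql
-- Run against the CORE database only; these tables now live on the auxiliary nodes.
ALTER TABLE LIB_FEAT      RENAME TO LIB_FEAT_UNUSED;
ALTER TABLE RES_FEAT_STAT RENAME TO RES_FEAT_STAT_UNUSED;
ALTER TABLE RES_FEAT_EKEY RENAME TO RES_FEAT_EKEY_UNUSED;
```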
That's it! You can now run G2Loader.py or your own applications as normal, with Senzing utilizing all three database nodes.
Note: If you have been provided SQL queries to run for Senzing health analysis, be aware that the tables above have been moved, so you will need to run each query against the correct database node.