True Entity Resolution (ER) engines may need to access any piece of loaded data at any time. ER performance therefore depends on the hosting system providing:-
- Enough memory to cache at minimum the indexes to the data, and preferably hot table data too
- Disk storage with high random read performance. In real-world tests of non-cacheable random reads, traditional HDDs or SANs may achieve only 100 IOPS. In contrast, a good quality SSD/NVMe device may achieve 5000 real-world IOPS.
In other words, a single SSD/NVMe device is as fast as 50 rotational HDDs. This directly impacts the performance achieved by G2 when loading and evaluating data.
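To make the 50x figure concrete, the short sketch below works out how long a batch of uncached random reads would take at each IOPS rate. The read count is an illustrative assumption, not a measured G2 workload, and the function name read_seconds is hypothetical.

```shell
# Seconds needed to serve a number of uncached random reads at a given IOPS rate.
read_seconds() {
    echo $(( $1 / $2 ))
}

# Illustrative read count (an assumption, not a measured G2 workload).
total_reads=1000000

echo "HDD/SAN  at  100 IOPS: $(read_seconds "$total_reads" 100) s"
echo "SSD/NVMe at 5000 IOPS: $(read_seconds "$total_reads" 5000) s"
echo "Ratio: $(( 5000 / 100 ))x"
```

With these numbers, one million random reads take about 10,000 seconds on the rotational disk and about 200 seconds on the SSD/NVMe device.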
What if I don't have SSD?
If you are using SQLite3 with G2, Linux tmpfs may be an option - tmpfs creates a file system backed by the computer's memory, rather than traditional disk.
This can be enabled for /tmp in RedHat 7 and CentOS 7 with the following command:-
sudo systemctl enable tmp.mount
Once enabled, reconfigure G2 to use the tmpfs by moving the G2 SQLite database:-
1. View the G2Module.ini located in /opt/senzing/g2/python. Take note of the SQLite DB file referred to by the CONNECTION parameter - e.g. /opt/senzing/g2/sqldb/G2C.db
2. Copy this file to /tmp
cp /opt/senzing/g2/sqldb/G2C.db /tmp
3. Edit the G2Module.ini file, and change the CONNECTION parameter so that the file path refers to the copy in /tmp (e.g. /tmp/G2C.db)
4. Edit the G2Project.ini (/opt/senzing/python/), and change the G2Connection parameter to match step 3
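The copy step above can be made a little safer by confirming that /tmp really is a tmpfs and that the copy is intact before the .ini files are edited. The following is a minimal sketch, assuming GNU coreutils (as shipped with RedHat 7 and CentOS 7); the function names fs_type and copy_db are hypothetical helpers, not part of G2.

```shell
# Print the file system type of a path (GNU coreutils stat).
fs_type() {
    stat -f -c %T "$1"
}

# Copy a database file into a directory, verify the copy byte for byte,
# and print the path of the new copy.
copy_db() {
    src_file=$1
    dest_dir=$2
    dest_file="$dest_dir/$(basename "$src_file")"
    cp -p "$src_file" "$dest_file" || return 1
    cmp -s "$src_file" "$dest_file" || return 1
    echo "$dest_file"
}

# Example use (commented out so that nothing is copied by accident):
# [ "$(fs_type /tmp)" = "tmpfs" ] && copy_db /opt/senzing/g2/sqldb/G2C.db /tmp
```

The path printed by copy_db is the one to use when updating the connection parameters in the .ini files.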
You can now run G2Loader.py as normal. The main limitation of this solution is the amount of memory in your system: the G2 repository uses approximately 8 GB per 1 million source records loaded.
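As a rough sizing aid, the 8 GB per 1 million records figure can be turned into a quick check before loading. This is a sketch only: the record count and memory size are hypothetical values, and the function names estimate_gb and fits_in_memory are illustrative.

```shell
# Approximate memory (whole GB, rounded up) for a given number of source records,
# using the figure of roughly 8 GB per 1 million records.
estimate_gb() {
    echo $(( ($1 * 8 + 999999) / 1000000 ))
}

# Succeeds when the estimate fits within the given amount of memory in GB.
fits_in_memory() {
    [ "$(estimate_gb "$1")" -le "$2" ]
}

# Hypothetical example: 2.5 million records on a host with 16 GB available for tmpfs.
records=2500000
memory_gb=16
if fits_in_memory "$records" "$memory_gb"; then
    echo "$records records: about $(estimate_gb "$records") GB, fits in $memory_gb GB"
else
    echo "$records records: about $(estimate_gb "$records") GB, exceeds $memory_gb GB"
fi
```

In this example the estimate is about 20 GB, so the load would not fit in 16 GB of memory.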
What should I know about SQLite3?
G2 uses SQLite3 for its evaluation versions. SQLite3 is a nice, simple database with no extra processes or maintenance to be concerned with. It comes with specific limitations that more robust databases like IBM DB2 or MySQL do not suffer from:-
- SQLite3 only allows one writing process at a time. This means rich data or slow storage can disproportionately impact performance; even more of a reason to get an SSD device!
- Because of the serialized writes, performance typically starts to degrade if you increase the number of threads in the G2Project.ini past 8-12.
- SQLite3 performance drops off quickly once the database can no longer be effectively cached by the file system cache. Increasing memory in the system can help, but under significant use the database will eventually grow larger than memory. This is another reason to get an SSD device: when the file system cache can't satisfy a request, the resulting disk reads are roughly 50x faster than on rotational disks.
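One way to see whether the database is approaching the point where it no longer fits in the file system cache is to compare its size on disk with the memory the kernel reports as available. The sketch below assumes a Linux host whose /proc/meminfo exposes a MemAvailable line; the helper names are hypothetical, and the comparison is only a rough indicator.

```shell
# Disk space used by a file, in KB (du -k is POSIX).
file_kb() {
    du -k "$1" | cut -f1
}

# Memory the kernel estimates is available, in KB (Linux /proc/meminfo).
mem_available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}

# Succeeds when a database of $1 KB fits within $2 KB of memory.
fits_in_cache() {
    [ "$1" -le "$2" ]
}

# Example use (path taken from the CONNECTION example earlier in this article):
# fits_in_cache "$(file_kb /opt/senzing/g2/sqldb/G2C.db)" "$(mem_available_kb)" \
#     || echo "Database is larger than available memory; expect uncached reads"
```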