Why SSD / NVMe?
True Entity Resolution (ER) engines have the potential requirement to access any piece of data that has been loaded, at any time. This means ER performance is dependent on the hosting system providing:
- Enough memory to cache at minimum the indexes to the data, and preferably hot table data too.
- Disk storage with high random read performance. In real-world tests of non-cacheable random reads, traditional HDDs or SANs may achieve only about 100 IOPS. In contrast, a good-quality SSD/NVMe device may achieve 5000 real-world IOPS.
In other words, a single SSD/NVMe device is as fast as 50 rotational HDDs. This directly impacts the performance achieved by Senzing when loading and evaluating data.
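As a rough illustration of what that ratio means in practice, the sketch below converts the two IOPS figures quoted above into elapsed time for a batch of uncached random reads. The workload size of one million reads is a hypothetical number chosen only to make the arithmetic concrete; it is not a Senzing measurement.

```python
def seconds_for_random_reads(num_reads: int, iops: float) -> float:
    """Time needed to serve num_reads uncached random reads at a sustained IOPS rate."""
    return num_reads / iops


HDD_IOPS = 100    # figure quoted above for traditional HDDs or SANs
SSD_IOPS = 5000   # figure quoted above for a good-quality SSD/NVMe device

reads = 1_000_000  # hypothetical workload size, for illustration only

hdd_seconds = seconds_for_random_reads(reads, HDD_IOPS)
ssd_seconds = seconds_for_random_reads(reads, SSD_IOPS)
speedup = hdd_seconds / ssd_seconds

print(f"HDD: {hdd_seconds / 3600:.1f} h")
print(f"SSD: {ssd_seconds / 60:.1f} min")
print(f"Ratio: {speedup:.0f}x")
```

Under these assumptions the same batch of reads takes hours on rotational disk and minutes on SSD/NVMe.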
What if I don't have SSD?
Option 1: tmpfs
If you are using SQLite3, Linux tmpfs may be an option. tmpfs creates a file system backed by the computer's memory and swap space, rather than traditional disk.
This can be enabled for /tmp in RedHat 7 and CentOS 7 with the following command:
sudo systemctl enable tmp.mount
Once enabled, re-configure the system to use the tmpfs and move the G2 SQLite database:
- View the G2Module.ini located in /opt/senzing/g2/python. Take note of the SQLite DB file referred to by the CONNECTION parameter - e.g. /opt/senzing/g2/sqldb/G2C.db
- Copy this file to /tmp
cp /opt/senzing/g2/sqldb/G2C.db /tmp
- Edit the G2Module.ini file, and change the CONNECTION parameter so that it refers to the copy of the database in /tmp (e.g. /tmp/G2C.db) instead of the original location
- Edit the G2Project.ini (/opt/senzing/python/), and change the G2Connection parameter to match the CONNECTION value set in G2Module.ini in the previous step
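If you would rather script the two edits than make them by hand, something along these lines may work. It is a minimal sketch, not a Senzing utility: the function name repoint_connection is hypothetical, and the [SQL] section name and the sqlite3://na:na@ prefix in the sample file are assumptions made so the example can run on its own. Check them against your own G2Module.ini. Note that Python's configparser drops comments when it rewrites a file, so back up the original first.

```python
import configparser
import tempfile
from pathlib import Path


def repoint_connection(ini_path: Path, old_db: str, new_db: str,
                       option: str = "CONNECTION") -> int:
    """Replace old_db with new_db in the given option of every section.

    Returns the number of sections that were changed.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names in their original case
    parser.read(ini_path)

    changed = 0
    for section in parser.sections():
        value = parser.get(section, option, fallback=None)
        if value is not None and old_db in value:
            parser.set(section, option, value.replace(old_db, new_db))
            changed += 1

    with open(ini_path, "w") as handle:
        parser.write(handle, space_around_delimiters=False)
    return changed


# Demonstration against a throwaway file; the contents are an assumed layout.
with tempfile.TemporaryDirectory() as workdir:
    sample = Path(workdir) / "G2Module.ini"
    sample.write_text(
        "[SQL]\nCONNECTION=sqlite3://na:na@/opt/senzing/g2/sqldb/G2C.db\n"
    )
    changed = repoint_connection(
        sample, "/opt/senzing/g2/sqldb/G2C.db", "/tmp/G2C.db"
    )
    result_text = sample.read_text()

print(changed)
print(result_text)
```

For G2Project.ini the same function could be called with option="G2Connection".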
You can now run G2Loader.py as normal. The main limitation to this solution is the amount of memory in your system. The G2 repository will use approximately 8GB per 1 million source records loaded.
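To get a feel for whether a load will fit, the rule of thumb above can be turned into a quick check. This is a sketch only: the 8 GB per million records figure is the approximation quoted above, the function names are hypothetical, and the free-space check simply asks the operating system how much room the file system at a given path reports right now.

```python
import shutil
import tempfile

GB_PER_MILLION_RECORDS = 8  # approximate figure quoted above


def estimated_repository_gb(source_records: int) -> float:
    """Approximate G2 repository size in GB for a number of source records."""
    return source_records / 1_000_000 * GB_PER_MILLION_RECORDS


def fits_in(path: str, source_records: int) -> bool:
    """True if the file system at path reports enough free space for the estimate."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return estimated_repository_gb(source_records) <= free_gb


for records in (500_000, 2_500_000, 10_000_000):
    print(f"{records:>10,} records: about {estimated_repository_gb(records):.0f} GB")

# Check the system's temporary directory (usually /tmp on Linux).
print(fits_in(tempfile.gettempdir(), 2_500_000))
```

Because tmpfs is backed by memory and swap, treat the result as a lower bound on what the system needs, not a guarantee that the load will perform well.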
Option 2: ramdisk
If you don't want any possibility of the data being swapped out to disk you can instead use a ramfs mount like:
sudo mkdir -p /mnt/ram
sudo mount -t ramfs -o size=20g ramfs /mnt/ram
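Then follow the rest of the instructions from Option 1, using /mnt/ram instead of /tmp. Be aware that ramfs, unlike tmpfs, does not enforce a size limit (the size option is ignored), so the database can keep growing until the system runs out of memory.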
But I have SSD!
Increasingly we are finding customers that have SSD, but their IT department has attached it via a NAS (Network Attached Storage). This configuration is almost certainly slower than regular spinning disk, and most database servers won't function well on SMB (Samba), AFP, or NFS-mounted devices. From user testimonials, a SQLite DB running on NFS-mounted SSD will run 200-400x slower than on local SSD.
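If you are not sure how the storage under your database is attached, the file system type of the mount that holds the file is a quick indicator. The sketch below looks up that type from a Linux mount table such as the contents of /proc/mounts. The function name, the list of network file system types, and the sample mount table (including the host name nas01) are all hypothetical and only illustrate the idea.

```python
import os

# Illustrative, not exhaustive, list of network file system types.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "smbfs", "afpfs"}


def filesystem_type(path: str, mounts_text: str) -> str:
    """Return the file system type of the longest mount point containing path.

    mounts_text uses the /proc/mounts layout: device, mount point, type, ...
    Symbolic links in path are not resolved by this sketch.
    """
    target = os.path.abspath(path) + "/"
    best_mount, best_type = "", "unknown"
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same mount point win, as it does on Linux.
        if target.startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


# Hypothetical mount table; on a real system read /proc/mounts instead.
SAMPLE_MOUNTS = """\
/dev/nvme0n1p2 / ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
nas01:/export/senzing /mnt/nas nfs4 rw,relatime 0 0
"""

for db in ("/opt/senzing/g2/sqldb/G2C.db", "/tmp/G2C.db", "/mnt/nas/sqldb/G2C.db"):
    fs_type = filesystem_type(db, SAMPLE_MOUNTS)
    note = "network mount" if fs_type in NETWORK_FS_TYPES else "local"
    print(f"{db}: {fs_type} ({note})")
```

On a live Linux host the second argument could be the text read from /proc/mounts.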
What should I know about SQLite3?
Senzing uses SQLite3 for its evaluation software versions. SQLite3 is a nice, simple database with no extra processes or maintenance to be concerned with. It comes with specific limitations that more robust databases like IBM Db2 or MySQL do not suffer from:
- SQLite3 only allows one writing process at a time. This means rich data or slow storage can disproportionately impact performance; even more of a reason to get an SSD device!
- Because of the serialized writes, performance typically starts to degrade if you increase the number of threads in the G2Project.ini past 8-12.
- SQLite3 performance drops off quickly once the database can no longer be effectively cached by the file system cache. Increasing memory in the system can help but eventually the database will grow larger than memory when under significant use. Another reason to get an SSD device - when the file system cache can’t satisfy the requirements, the disk read requests are 50x slower.
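The single-writer behaviour described in the list above is easy to observe with Python's built-in sqlite3 module. The sketch below is independent of Senzing: it opens two connections to a throwaway database, lets the first one start a write transaction, and shows that the second cannot start its own until the first finishes. The table name and function name are hypothetical.

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path: str) -> bool:
    """Return True if a second connection cannot begin a write while one is open."""
    # isolation_level=None means transactions are controlled explicitly below.
    first = sqlite3.connect(db_path, isolation_level=None)
    # timeout=0 makes the second connection fail at once instead of waiting.
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER)")
        first.execute("BEGIN IMMEDIATE")  # first connection takes the write lock
        first.execute("INSERT INTO demo VALUES (1)")
        try:
            second.execute("BEGIN IMMEDIATE")  # second writer asks for the same lock
        except sqlite3.OperationalError:
            return True  # SQLite reports that the database is locked
        second.execute("ROLLBACK")
        return False
    finally:
        if first.in_transaction:
            first.execute("ROLLBACK")
        first.close()
        second.close()


with tempfile.TemporaryDirectory() as workdir:
    blocked = second_writer_is_blocked(os.path.join(workdir, "demo.db"))

print("second writer blocked:", blocked)
```

With a non-zero timeout the second connection would wait for the lock rather than fail, which is why additional writer threads end up queuing behind each other instead of adding throughput.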