Why SSD / NVMe?
A true entity resolution (ER) engine may need to access any piece of loaded data at any time. ER performance therefore depends on the hosting system providing:
- Enough memory to cache at minimum the indexes to the data, and preferably hot table data too
- Disk storage with high random read performance. In real-world tests of non-cacheable random reads, traditional HDDs or SANs may achieve only 100 IOPS. In contrast, a good quality SSD/NVMe device may achieve 5,000 real-world IOPS
In other words, a single SSD/NVMe device is as fast as 50 rotational HDDs. This directly impacts the performance achieved by Senzing when loading and evaluating data.
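To make the comparison concrete, the sketch below converts the IOPS figures quoted above into wall-clock time for a batch of uncached random reads. The helper name and the one-million-read workload are illustrative assumptions, not measurements.

```shell
# Seconds needed to serve a number of uncached random reads at a given IOPS rate.
# Usage: random_read_seconds <reads> <iops>   (hypothetical helper, integer math)
random_read_seconds() {
    echo $(( $1 / $2 ))
}

# One million random reads (an illustrative workload) at the IOPS figures above:
echo "HDD/SAN:  $(random_read_seconds 1000000 100) s"
echo "SSD/NVMe: $(random_read_seconds 1000000 5000) s"
```

With these inputs the HDD/SAN case works out to 10000 seconds (roughly 2.8 hours) and the SSD/NVMe case to 200 seconds (a little over 3 minutes).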
What if I don't have SSD/NVMe storage?
Option 1: tmpfs
If you are using SQLite3, Linux tmpfs may be an option. tmpfs creates a file system backed by the computer's memory and swap space rather than traditional disk.
This can be enabled for /tmp on Red Hat 7 and CentOS 7 with the following command (the mount becomes active at the next boot):
sudo systemctl enable tmp.mount
Once enabled, re-configure the system to use the tmpfs and move the Senzing SQLite database:
- Open G2Module.ini, located in <project_path>/etc, and note the SQLite DB file referred to by the CONNECTION parameter (e.g., <project_path>/var/sqlite/G2C.db). Copy this file to /tmp
- Edit G2Module.ini and change the CONNECTION parameter so that the G2C.db path points to the copy in /tmp
- Edit G2Project.ini and change the G2Connection parameter to match
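The steps above can be scripted. The following is a minimal sketch, not a Senzing-provided tool: relocate_sqlite_db is a hypothetical helper, it assumes GNU sed, and it assumes the old database path appears verbatim in each INI file. It keeps a .bak copy of every file it edits.

```shell
# Copy a SQLite database into another directory and repoint INI files at the copy.
# Usage: relocate_sqlite_db <db_file> <target_dir> <ini_file>...
# (hypothetical helper; assumes GNU sed and that <db_file> appears verbatim in the INI files)
relocate_sqlite_db() {
    db_file=$1
    target_dir=$2
    shift 2
    new_db="$target_dir/$(basename "$db_file")"

    cp -p "$db_file" "$new_db" || return 1

    for ini in "$@"; do
        cp -p "$ini" "$ini.bak" || return 1   # keep a backup of each INI file
        # Replace every occurrence of the old path with the new one.
        # '|' is the sed delimiter, so the paths must not contain '|' or '&';
        # '.' in the old path is treated as a regex wildcard, which still matches itself.
        sed -i "s|$db_file|$new_db|g" "$ini" || return 1
    done
}
```

Call it with the database path noted from the CONNECTION parameter, the target directory (/tmp here), and the paths to G2Module.ini and G2Project.ini. Review the edited files before running a load.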
You can now run G2Loader.py as normal. The main limitation of this solution is the amount of memory in your system. The Senzing repository will use approximately 8 GB per 1 million source records loaded.
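As a rough sizing aid, the sketch below applies the 8 GB per 1 million records figure and compares the result with the space a mount currently reports as available. The helper names are hypothetical, and the estimate is only as good as that rule of thumb.

```shell
# Estimated repository size in whole GB (rounded up), at ~8 GB per 1 million records.
estimated_repo_gb() {
    echo $(( ($1 * 8 + 999999) / 1000000 ))
}

# Whole GB currently available on the file system that holds a directory.
available_gb() {
    df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

# Succeeds if the estimate for <records> fits in the space available at <dir>.
# Usage: fits_in_mount <records> <dir>
fits_in_mount() {
    [ "$(estimated_repo_gb "$1")" -le "$(available_gb "$2")" ]
}

echo "2.5M records need about $(estimated_repo_gb 2500000) GB"
```

For a tmpfs mount, the figure df reports reflects the mount's size limit, which is not necessarily the same as free physical memory.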
Option 2: ramdisk
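If you don't want any possibility of the data being swapped out to disk, you can instead use a ramfs mount. Unlike tmpfs, ramfs does not enforce a size limit, so make sure the database cannot grow beyond available memory. For example: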
mkdir -p /mnt/ram
mount -t ramfs -o size=20g ramfs /mnt/ram
Next, follow the previous instructions for tmpfs, using '/mnt/ram' instead of '/tmp'.
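Before moving the database, it can be worth confirming that the target directory really is memory-backed. The sketch below uses GNU coreutils stat to report the file system type. The helper names are hypothetical, and /tmp is used only as an example path.

```shell
# Print the file system type of the mount holding a path (GNU coreutils stat).
fs_type() {
    stat -f -c %T "$1"
}

# Succeeds only if the path lives on tmpfs or ramfs.
is_memory_backed() {
    case "$(fs_type "$1")" in
        tmpfs|ramfs) return 0 ;;
        *)           return 1 ;;
    esac
}

if is_memory_backed /tmp; then
    echo "/tmp is memory-backed"
else
    echo "/tmp is on $(fs_type /tmp); the mount may not be active yet"
fi
```

The same check applies to a ramfs mount point by passing '/mnt/ram' instead of '/tmp'.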
But I have SSD!
Increasingly, we find that customers have SSD but their IT department has attached it via a NAS (Network Attached Storage). This configuration is almost certainly slower than a regular local spinning disk, and most database servers won't function well on Samba, AFP, or NFS mounted devices. According to user reports, a SQLite DB running on NFS mounted SSD will run 200-400x slower than on local SSD.
What should I know about SQLite3?
Senzing uses an embedded SQLite3 database out-of-the-box to accelerate getting started. SQLite3 is a nice, simple database with no extra processes or maintenance to be concerned with. It comes with specific limitations that more robust, enterprise-level databases like IBM Db2 do not have:
- SQLite3 only allows one writing process at a time. This means rich data or slow storage can disproportionately impact performance; even more of a reason to get an SSD/NVMe device!
- Because of serialized writes, performance typically starts to degrade once you try to increase the number of threads in a Senzing process past 8-12
- SQLite3 performance drops off quickly once the database can no longer be effectively cached by the file system cache. Increasing memory in the system can help, but eventually the database will grow larger than memory when under significant use. This is another reason to get an SSD/NVMe device: when the file system cache can’t satisfy a request, disk reads on an HDD are about 50x slower than on SSD/NVMe
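A quick way to see whether a database is approaching the point where caching stops helping is to compare its file size with system memory. The sketch below is a coarse check under stated assumptions: db_fits_in_ram is a hypothetical helper, it reads MemTotal from /proc/meminfo on Linux, and total RAM is only an upper bound on what the file system cache can actually hold.

```shell
# Succeeds if a file is smaller than total RAM as reported in a meminfo file.
# Usage: db_fits_in_ram <db_file> [meminfo_file]
# (hypothetical helper; meminfo_file defaults to /proc/meminfo)
db_fits_in_ram() {
    db_kb=$(( $(wc -c < "$1") / 1024 ))
    mem_kb=$(awk '/^MemTotal:/ { print $2 }' "${2:-/proc/meminfo}")
    [ "$db_kb" -lt "$mem_kb" ]
}

# Example: db_fits_in_ram /path/to/G2C.db || echo "database is larger than RAM"
```

A failing check means reads will regularly fall through to disk, which is where storage random read performance dominates.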