If you arrived here looking for sizing information on the Senzing App: it is limited to 4 cores and uses less than 8GB of RAM. If you are looking to scale Senzing with the Senzing API, you are in the right place.
Small: Up to 10 million records.
8 cores, 48GB of RAM, 100GB of SSD or NVMe storage
AWS: i3.2xlarge ~$0.63/hr
This should load typical data at 100 records per second, so 10 million records would load in about a day.
Medium: Up to 50 million records.
16 cores, 96GB of RAM, 500GB of SSD or NVMe storage
AWS: i3.4xlarge ~$1.25/hr
This should load typical data at 200 records per second, so 50 million records would load in under 3 days.
Large: Up to 100 million records.
32 cores, 192GB of RAM, 1TB of SSD or NVMe storage
AWS: i3.8xlarge ~$2.50/hr
This should load typical data at 200 records per second, so 100 million records would load in under a week.
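These load-time figures are straightforward arithmetic on the sustained rate. A quick sketch to sanity-check them, using the estimated rates above (not measured numbers):

    def load_time_days(records: int, records_per_second: float) -> float:
        """Wall-clock days to load `records` at a sustained rate."""
        return records / records_per_second / 86_400  # 86,400 seconds per day

    # The Small, Medium, and Large tiers above:
    for records, rate in [(10_000_000, 100), (50_000_000, 200), (100_000_000, 200)]:
        days = load_time_days(records, rate)
        print(f"{records:>11,} records at {rate}/sec ~ {days:.1f} days")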
The general guideline for storage planning is to allocate 10KB of Flash storage per input record, which comes to roughly 1TB of Flash storage per 100M records. You will also want to account for additional system software, logs, source record files, etc., but those can be placed on general-purpose storage.
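The same back-of-the-envelope math covers the 10KB-per-record guideline (a sketch, using decimal units):

    def flash_storage_gb(records: int, kb_per_record: float = 10) -> float:
        """Flash storage estimate from the 10KB-per-record guideline."""
        return records * kb_per_record / 1_000_000  # KB -> GB, decimal units

    print(f"{flash_storage_gb(100_000_000):,.0f} GB")  # ~1,000 GB, i.e. 1TB per 100M records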
These estimates are for single-node systems on which you install both the database engine and Senzing. Of course, there are hardware configurations that can achieve significantly faster performance and handle even larger data sets, into the billions of records: both Senzing and the database can be scaled horizontally.
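To make the shared-database pattern concrete, here is a minimal sketch using the Senzing v3 Python SDK. Every processing node runs the same initialization against the one shared database; the host, credentials, and paths below are placeholders, and the exact SDK surface may differ by version:

    import json
    from senzing import G2Engine  # Senzing v3 Python SDK

    # Placeholder paths and connection string -- substitute your own.
    engine_config = {
        "PIPELINE": {
            "CONFIGPATH": "/etc/opt/senzing",
            "RESOURCEPATH": "/opt/senzing/g2/resources",
            "SUPPORTPATH": "/opt/senzing/data",
        },
        # Every node points at this one database; the engine nodes
        # themselves hold no shared state.
        "SQL": {"CONNECTION": "postgresql://user:password@db-host:5432:G2/"},
    }

    g2_engine = G2Engine()
    g2_engine.init("sizing-example", json.dumps(engine_config), False)
    # ... load or search records on this node, then release resources:
    g2_engine.destroy()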
What to think about in cloud environments
Latency, latency, and less latency
Cloud environments are great in that their elasticity and ease of resource allocation decrease the time to bring up new environments. Senzing is a perfect fit here: its processing nodes scale horizontally and share only the database, so you can bring nodes up and down based on the load at the time. At the same time, this ease means certain details are harder to control. Senzing performance is sensitive to latency and to IOPS, so there are a few things to watch out for.
Co-located: If your systems are sitting far away from each other in the data center, or in different data centers, the network latencies are going to be much higher than if they are co-located on the same switch.
Local Flash on the DB: A single locally attached NVMe drive can achieve more than 100k IOPS, whereas a remote SAN may achieve only 2k IOPS. In cloud environments, be particularly aware of the random read/write IOPS capabilities of your database nodes.
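A proper measurement calls for a dedicated benchmark tool, but as a rough illustration of why random-read capability matters, here is a sketch that times random 8KB reads against a pre-created multi-GB test file (the file name is a placeholder, and the OS page cache will inflate the numbers):

    import os
    import random
    import time

    PATH = "testfile.bin"  # hypothetical pre-created multi-GB file
    BLOCK = 8192           # 8KB, a typical database page size
    READS = 5_000

    size = os.path.getsize(PATH)
    fd = os.open(PATH, os.O_RDONLY)
    start = time.perf_counter()
    for _ in range(READS):
        # Block-aligned random offset, then one 8KB read.
        offset = random.randrange(0, size - BLOCK) // BLOCK * BLOCK
        os.pread(fd, BLOCK, offset)
    elapsed = time.perf_counter() - start
    os.close(fd)
    print(f"~{READS / elapsed:,.0f} random reads/sec")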
Why Flash Storage is required
To perform real-time entity resolution, you must read from the database more than you write to it. Flash is much faster than traditional spinning disks and has become so affordable that it is now standard equipment. You can still use spinning disks, but loading your data may take 10x longer.
The performance expectations above are based upon typical person or company data sets such as master customer lists, prospect lists, employee lists, watch lists, national registries, etc.
You can run into data sets that have extra-large records or highly related data, meaning everybody is related to everybody else. The nice thing is that you can increase performance by adding more cores and RAM at 6GB per core. If you want to think about sizing per thread, you will need 1.5GB of RAM per thread, with a minimum of 6GB per node.
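As a sketch of that arithmetic (note the two rules agree if you assume roughly 4 threads per core):

    def ram_gb_for_threads(threads: int) -> float:
        """RAM per node: 1.5GB per thread, with a 6GB-per-node minimum."""
        return max(threads * 1.5, 6.0)

    print(ram_gb_for_threads(2))   # 6.0  -> the per-node floor applies
    print(ram_gb_for_threads(32))  # 48.0 -> matches the Small tier (8 cores x 6GB)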
If you run into slow data sets, please feel free to contact us, as this often means the data was mis-mapped or could be mapped differently to achieve your performance needs. We are constantly improving our guidance toward proper mapping, as well as our ability to automatically tolerate ineffective mapping or data.