The guidelines herein are approximations of the hardware required to ingest the outlined number of input records from your data source(s). The exact hardware specifications and load time will vary depending on the characteristics of your data.
Note: these figures are guidelines only and represent estimates for the initial historical ingestion of your source data. Senzing performs entity resolution in real time and constantly reevaluates prior analytical assertions and outcomes as data is ingested. Once the initial ingestion of your historical input records is complete, the entity resolution processing is also complete; there are no subsequent analytical processes to run.
To increase the ingestion rate of a large historical data set, a Senzing system can initially be deployed on more substantial hardware to complete the ingestion faster. If ongoing production demands (e.g., additions, delta changes, searches) don't require such substantial hardware, the provisioning can be reduced to match. For additional details, please contact us.
Single-Node Deployments
Up to 10 million records
Recommended: 8 cores, 48GB of RAM, 200GB of direct-attached SSD or NVMe storage
At an approximate ingestion rate of 100 records per second for typical data, 10 million records would load in about 1 day.
Up to 50 million records
Recommended: 16 cores, 96GB of RAM, 1TB of direct-attached SSD or NVMe storage
At an approximate ingestion rate of 200 records per second for typical data, 50 million records would load in under 3 days.
Up to 100 million records
Recommended: 32 cores, 192GB of RAM, 2TB of direct-attached SSD or NVMe storage
At an approximate ingestion rate of 400 records per second for typical data, 100 million records would load in under a week.
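As a rough sanity check of these figures, the sketch below (a minimal Python example; the per-second rates and record counts are the approximations from the tiers above, and it assumes the rate holds steady for the entire load) shows how the load-time estimates are derived:

    # Rough load-time estimate from the approximate ingestion rates above.
    # Assumes the per-second rate holds steady for the entire historical load.
    def estimated_load_days(record_count, records_per_second):
        """Approximate number of days to ingest record_count records."""
        seconds = record_count / records_per_second
        return seconds / 86_400  # seconds per day

    # The single-node tiers above:
    for records, rate in [(10_000_000, 100), (50_000_000, 200), (100_000_000, 400)]:
        days = estimated_load_days(records, rate)
        print(f"{records:,} records at {rate} rec/s: ~{days:.1f} days")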
Multi-Node Deployments
Multi-node deployments easily support billions of input records. Please contact us for further details and sizing guidance to meet your requirements.
Storage Guidelines
The general guideline on storage planning is to allocate 20kB of flash-based storage per input record. This equates to approximately 1TB of storage per 50M records.
20kB is the baseline estimate, but it can increase if your data is very feature-rich and varied. We suggest that you base your storage requirements on an analysis performed after loading a large sample of your source data.
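As an illustration of the arithmetic behind this guideline, here is a minimal sketch (the 20kB figure is the baseline above; substitute the per-record size you observe from your own sample load):

    # Rough storage estimate from the 20kB-per-record baseline above.
    # Replace bytes_per_record with the figure observed after loading a large sample.
    def estimated_storage_tb(record_count, bytes_per_record=20_000):
        """Approximate flash storage needed, in TB (10^12 bytes)."""
        return record_count * bytes_per_record / 1e12

    print(estimated_storage_tb(50_000_000))  # ~1.0 TB per 50M records, matching the guideline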
You will also need to account for additional system software, logs, source record files, etc. These can be placed on general-purpose storage.
Scaling Systems
A Senzing deployment includes both a database server (or servers) and Senzing itself; the single-node deployments above assume that both the database and Senzing are installed on a single machine.
In addition to hardware configurations that achieve significantly faster performance and handle even larger data sets (into the billions of records), both the database and Senzing can be scaled horizontally.
Cloud Environment Considerations
Latency, latency, and less latency!
Cloud environments are great at providing elasticity, ease of resource allocation, and faster provisioning of new environments. Senzing is a perfect fit for this model: it scales horizontally and shares only the database engines, so Senzing nodes can be added or removed to meet current or expected demand. At the same time, this ease means certain details are harder to control. Because Senzing performance is sensitive to latency and IOPS, there are a few things to watch out for:
Co-located: If your systems are sitting far away from each other in the data center, or in different data centers, the network latencies are going to be much higher than if they are co-located on the same switch.
Local flash on the database: A single locally attached NVMe will achieve more than 100k IOPS, whereas a remote SAN may only achieve 2k IOPS. In cloud environments, be particularly aware of the random read/write IOPS capabilities on your database nodes.
Burstable and tier limits: Understand carefully - especially with IO systems - any burstable or tier limits that may apply to resources you provision. Are those guaranteed 10k IOPS available only for a burstable peak period? Do you only get them up to a set amount of throughput, after which they drop substantially?
Why Flash Storage Is Required
To perform real-time entity resolution, you must read from the database more than you write to it. Flash storage is much faster than traditional spinning disks and has become so affordable that it is now standard equipment. You can still use spinning disks, but it may take 10x longer to load your data.
Performance Considerations
The performance expectations above are based upon typical person or company data sets such as master customer lists, prospect lists, employee lists, watch lists, national registries, etc.
You can run into data sets that have extra-large records or highly related data, meaning everybody is related to everybody else. The nice thing is that you can increase performance by adding more cores and RAM at 6GB per core. If you prefer to think about sizing per thread, you will need 1.5GB of RAM per thread, with a minimum of 6GB per node.
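As a simple illustration of those numbers, the sketch below applies the 6GB-per-core, 1.5GB-per-thread, and 6GB-per-node-minimum figures from the paragraph above (the helper function itself is hypothetical, not a Senzing API):

    # RAM sizing from the guideline above: 6GB per core, or 1.5GB per thread,
    # with a minimum of 6GB per node.
    def recommended_ram_gb(cores=0, threads=0):
        """Suggested RAM for a node, sized by cores or by threads."""
        by_cores = cores * 6
        by_threads = threads * 1.5
        return max(by_cores, by_threads, 6)  # never below the 6GB-per-node minimum

    print(recommended_ram_gb(cores=8))    # 48GB, matching the 8-core tier above
    print(recommended_ram_gb(cores=32))   # 192GB, matching the 32-core tier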
If you run into slow data sets, please feel free to contact us, as this often means the data was mis-mapped or could be mapped differently to achieve your performance needs. We are constantly improving how we guide you to the proper mapping, as well as how Senzing automatically tolerates ineffective mapping or data.