Senzing performs entity resolution in real time and constantly re-evaluates prior analytical assertions and outcomes as data is ingested. Once the initial ingestion of your historical input records is complete, the entity resolution processing is complete as well; there are no subsequent analytical processes to run.
To put this another way:
- You do the historical load only once: Senzing is transactional, sequence neutral, and employs Entity Centric Learning™. The next record after the billion-plus historical load goes in just like any other transaction, alongside all the other searches, finds, gets, deletes, and adds/updates... no outage, no boiling the ocean. It is a one-time cost.
- You are getting a fully persisted entity graph: Senzing maintains the entity graph as it processes, so any time you ask it a question, it already has the answer. This makes it fully capable of real-time workloads (with real-time updates) and enables resolved/related entities to be constantly streamed and replicated to other systems (Elasticsearch, graph databases, RAG, real-time analysis, etc.). There is never a state where the data has simply been "landed" and every question must be computed from scratch to obtain the answer.
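To make the transactional pattern concrete, here is a minimal sketch assuming the Senzing v3 Python SDK (the `senzing` package); the configuration values, the `CUSTOMERS` data source code, and the exact method signatures are illustrative assumptions, so consult the SDK reference for your installed version:

```python
import json
from senzing import G2Engine  # assumes the Senzing v3 Python SDK

# Placeholder engine configuration: paths follow a typical Linux install,
# and the database connection string is an example value.
engine_config = json.dumps({
    "PIPELINE": {"CONFIGPATH": "/etc/opt/senzing",
                 "RESOURCEPATH": "/opt/senzing/g2/resources",
                 "SUPPORTPATH": "/opt/senzing/data"},
    "SQL": {"CONNECTION": "postgresql://user:password@db-host:5432:G2"}
})

engine = G2Engine()
engine.init("example", engine_config, False)

# Add one more record after the historical load: it is just another
# transaction, resolved in real time, with no batch re-processing.
record = {"RECORD_TYPE": "PERSON", "PRIMARY_NAME_FULL": "Pat Smith",
          "PHONE_NUMBER": "702-555-1212"}
engine.addRecord("CUSTOMERS", "1001", json.dumps(record))

# Because the entity graph is fully persisted, the resolved entity is
# available immediately; no separate analytical pass is required.
response = bytearray()
engine.getEntityByRecordID("CUSTOMERS", "1001", response)
print(json.loads(response.decode()))

engine.destroy()
```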
To ingest a large set of historical input records faster, a Senzing system can initially be deployed on more substantial hardware. If ongoing production demands - e.g., additions, delta changes, searches - don't require hardware that substantial, the provisioning can be reduced to match. For additional details, please contact us.
Deployments easily support billions of input records. Hardware specifications and load times will vary depending on the characteristics of your data sources. Please contact us for sizing guidance to meet your requirements.
Storage Guidelines
The general guideline for storage planning is to allocate 20 KB of flash-based storage per input record. This equates to approximately 1 TB of storage per 50 million records.
20 KB is the baseline estimate, but it can increase if your data is very feature-rich and varied. We suggest basing your storage requirements on an analysis performed after loading a large sample data source.
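For planning purposes, the guideline reduces to simple arithmetic; the sketch below shows the calculation (the record counts are illustrative inputs):

```python
def estimated_flash_storage_tb(record_count: int, kb_per_record: float = 20.0) -> float:
    """Estimate flash storage for the Senzing repository using the
    20 KB-per-record baseline guideline (decimal units: 1 TB = 10**9 KB)."""
    return record_count * kb_per_record / 1e9

# 50 million records at 20 KB each is about 1 TB, matching the guideline.
print(estimated_flash_storage_tb(50_000_000))    # -> 1.0 TB
print(estimated_flash_storage_tb(1_000_000_000)) # -> 20.0 TB for a billion records
```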
You will also need to account for additional system software, logs, source record files, and other related items. These can be placed on general-purpose storage.
Cloud Environment Considerations
Latency, latency, and less latency!
Cloud environments are great at providing elasticity, ease of resource allocation, and faster provisioning of new environments. Senzing is a natural fit: it scales horizontally, sharing only the database engines, so Senzing nodes can be added or removed to meet current or expected demand. At the same time, this ease means certain details are harder to control. Because Senzing performance is sensitive to latency and IOPS, there are a few things to watch out for:
- Co-location: If your systems sit far from each other in the data center, or in different data centers, network latencies will be much higher than if they are co-located on the same switch (a quick check is sketched after this list).
- Local flash on the database: A single locally attached NVMe drive can achieve more than 100k IOPS, where a remote SAN may only achieve 2k IOPS. In cloud environments, be particularly aware of the random read/write IOPS capabilities of your database nodes.
- Burstable and tier limits: Understand carefully - especially with I/O systems - any burstable or tier limits that apply to the resources you provision. Are those guaranteed 10k IOPS available only for a burst period? Do you only get them up to a set throughput, after which they drop substantially?
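As a quick sanity check on co-location, the sketch below times TCP round trips from a Senzing node to the database host (the host and port are placeholders for your environment):

```python
import socket
import statistics
import time

def tcp_round_trip_ms(host: str, port: int, samples: int = 20) -> float:
    """Median time (ms) to complete a TCP handshake with the database host."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Sub-millisecond medians suggest co-location; several milliseconds per
# round trip will noticeably slow a read-heavy, chatty database workload.
print(f"median round trip: {tcp_round_trip_ms('db-host.example.com', 5432):.2f} ms")
```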
Why Flash Storage Is Required
To perform real-time entity resolution, you must read from the database more than you write to it. Flash storage is much faster at random reads than traditional spinning disks and has become so affordable that it is now standard equipment. You can still use spinning disks, but it may take 10x longer to load your data.
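You can measure the random-read gap directly; the sketch below approximates random-read IOPS against a large existing file (the path is a placeholder, and OS page caching will flatter the numbers unless the file is much larger than RAM):

```python
import os
import random
import time

def random_read_iops(path: str, block_size: int = 8192, reads: int = 5000) -> float:
    """Approximate random-read IOPS by reading block_size bytes at
    random offsets within an existing large file (Unix-only: os.pread)."""
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        start = time.perf_counter()
        for _ in range(reads):
            os.pread(fd, block_size, random.randrange(0, size - block_size))
        return reads / (time.perf_counter() - start)
    finally:
        os.close(fd)

# Flash typically sustains tens of thousands of these; spinning disks only hundreds.
print(f"{random_read_iops('/path/to/large/testfile'):.0f} IOPS")
```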
Performance Considerations
The performance expectations above are based upon typical person or company data sets such as master customer lists, prospect lists, employee lists, watch lists, national registries, etc.
You can run into data sets that have extra-large records or highly related data, meaning everybody is related to everybody else. The nice thing is that you can increase performance by adding more cores, with 6 GB of RAM per core. If you prefer to size per thread, plan on 1.5 GB of RAM per thread, with a minimum of 6 GB per node.
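Sizing the RAM described above is straightforward arithmetic; a small sketch (the core and thread counts are illustrative):

```python
import math

def ram_gb_for_cores(cores: int) -> int:
    """6 GB of RAM per core, per the guideline above."""
    return cores * 6

def ram_gb_for_threads(threads: int) -> int:
    """1.5 GB of RAM per thread, rounded up, with a 6 GB-per-node minimum."""
    return max(6, math.ceil(threads * 1.5))

print(ram_gb_for_cores(16))    # -> 96 GB for a 16-core node
print(ram_gb_for_threads(3))   # -> 6 GB (the per-node minimum applies)
print(ram_gb_for_threads(32))  # -> 48 GB
```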
If you run into slow data sets, please feel free to contact us, as this often means the data was mis-mapped or could be mapped differently to meet your performance needs. We are constantly improving our guidance toward proper mapping, as well as our ability to automatically tolerate ineffective mapping or data.