Overview
The general guidance for completing a successful Senzing Proof of Concept (PoC) is in line with nearly any other PoC you undertake: understand the required technical and business goals, define the scope and the success criteria to measure against those goals, ensure the right human and system resources are available, hold regular progress checks, and so on.
This article will highlight key items of consideration for a successful PoC. It isn't a PoC project plan.
The Senzing MCP server puts a Senzing expert directly inside your LLM — Claude Code, Claude Desktop, Cursor, or any MCP-compatible tool. It can help you map your data sources, generate SDK code, troubleshoot errors, search Senzing documentation, and guide you through every phase of a PoC. Think of it as a coworker who knows Senzing inside and out, available on demand throughout your evaluation.
Scoping Success
If you are considering evaluating Senzing, it's almost certain you have at least one use case and various goals you'd like to explore and aim to achieve. Some of the common goals and aspects we hear from our customers include the low entry cost of using Senzing, quality of results and outcomes, rapid return on investment (ROI), rapid deployment and time to value, ease of adding new data, the availability of the embeddable SDK, and that Senzing never 'calls home' or needs to send data externally. With consideration to these aspects, nearly all evaluations of Senzing focus on the ease of adding data and the quality of the results. We will focus on those.
Senzing does not sell services. Part of the value of a Senzing evaluation, and of any subsequent deployment, is that your own team leads it, with Senzing support available as required. Your team will learn Senzing. If they are experienced with entity resolution systems, they will rapidly realize that Senzing removes an order of magnitude of complexity from performing entity resolution.
Rightsizing for Success
Generally, the most important considerations to start with and understand are:
What data do you need? What data is required to demonstrate success to the business? Where is the data, and which sources are required? Who owns the data, and can it be used for the evaluation? How many records are required to satisfy the success outcomes?
What systems are available? What system resources are quickly and easily available to support the evaluation? Review the System Requirements, API Hardware Sizing Guide and Disk I/O Performance articles.
Who will run the evaluation? What human resources and skills are available to support the evaluation and iterative analysis of the results?
If your evaluation use case is to ingest, entity resolve, and analyze 1 billion records across multiple data sources, but all you have is one part-time person and a Windows VM, then there is a mismatch between expectations and resources. A part-time resource and a Windows VM would be great for 1 million records using the Senzing Desktop App, but not for a large-scale Senzing API evaluation.
Selecting the right data
Typically you will evaluate a subset of your overall data, both in the number of records and in the number of sources. When selecting that subset, it is important that the data will actually support matching.
Include multiple data sources. Entity resolution value comes from cross-source matching — connecting records from different systems that represent the same real-world entity. A single-source PoC dramatically under-represents the real benefit. Aim for at least 2-3 sources with expected entity overlap.
Take a vertical slice, not a random sample. Don't randomly select data. Instead, pick everyone with a last name that starts with 'A', from a specific state/city/postal code, or some similar approach. This way there is a general expectation that you'll get multiple records for the same person or organization. If you are selecting from multiple sources, slice by the same dimension (e.g., the same geographic region from each source) so the same real-world entities appear across your selected data.
Pick sources that demonstrate your business cases. Choose data sources that reflect the matching scenarios you care about — e.g., matching claims to customers — and that have at least a couple of overlapping attributes — e.g., name and address.
Don't limit the features you send to Senzing. If one source has name, address, phone, etc. and the other has name, phone, and date of birth, make sure to send all the features even though they aren't in common across the sources. More features means more ways to match and more confidence in the results.
Don't sanitize away identity features. If your data has date of birth, SSN, email, or phone number — send them. These are the features that drive high-confidence matching. If you must mask sensitive data for a PoC, discuss with Senzing support first — naive masking (e.g., truncating SSN to last 4 digits) destroys match value.
Include messy data, not just clean records. If your production data has abbreviations, misspellings, missing fields, or mixed formats, your PoC data should too. Clean-only subsets produce unrealistically optimistic results that won't reflect production performance. Senzing is specifically designed to handle real-world data quality issues — let it demonstrate that.
Consider including a truth set. If you have a small set of records where you know which ones represent the same entity, include them. This lets you measure precision and recall objectively. Even 50-100 labeled pairs are useful. See How to create an entity resolution truth set for ideas.
Mock up specific test cases if needed. If your organization has specific data scenarios that are important to evaluate, don't be afraid to create records to demonstrate them in case the selected data doesn't happen to have examples.
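For the truth set suggestion above, one common way to score results is pairwise precision and recall: every pair of records placed in the same entity counts as a predicted match, and every pair that the truth set says belongs together counts as a true match. The sketch below is a minimal illustration, not a Senzing tool. The record keys, cluster names, and helper functions are hypothetical, and it assumes the predicted clusters have already been restricted to the records in the truth set.

```python
from itertools import combinations


def pairs_from_clusters(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for records in clusters.values():
        # Sorting gives each pair one canonical ordering.
        for a, b in combinations(sorted(records), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(truth_clusters, predicted_clusters):
    """Compare predicted clusters with truth clusters at the pair level."""
    truth_pairs = pairs_from_clusters(truth_clusters)
    predicted_pairs = pairs_from_clusters(predicted_clusters)
    true_positives = len(truth_pairs & predicted_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 1.0
    recall = true_positives / len(truth_pairs) if truth_pairs else 1.0
    return precision, recall


# Hypothetical record keys, written as "SOURCE:RECORD_ID".
truth = {
    "person-1": ["CRM:1", "CRM:2", "CLAIMS:9"],
    "person-2": ["CRM:3", "CLAIMS:4"],
}
# Hypothetical resolved entities covering the same five records.
predicted = {
    "entity-100": ["CRM:1", "CRM:2"],
    "entity-101": ["CLAIMS:9"],
    "entity-102": ["CRM:3", "CLAIMS:4"],
}

precision, recall = pairwise_precision_recall(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every predicted pair is correct, so precision is 1.0, but the cross-source link to CLAIMS:9 was missed, so only two of the four true pairs are found and recall is 0.5.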
Deployment Platform
It is important that the environment you choose for your PoC fits with your team's skills and capabilities. Some options to consider:
Desktop App. This is limited to the default Senzing configuration (Senzing on 'full autopilot') and 1 or 2 million records. It will run well on many Windows or macOS laptops and requires minimal IT skills. The Desktop App can be downloaded from the senzing.com website.
Bare metal Linux. The most common method of deployment, Senzing is natively installed on Red Hat or Debian-based systems. Getting started instructions can be found in the Quickstart Guide.
Docker. If your team prefers working with containers, see the Quickstart for Docker.
Mapping Data
Mapping data is the process of informing Senzing what the fields in your data sources represent. Consider a CSV file where an individual's full name is stored in a column called NAME. To inform Senzing this field describes all the tokens comprising an individual name, you would modify the header row of the CSV file and change NAME to NAME_FULL. NAME_FULL is the term informing Senzing what to expect in this column and how to use it for all functions of entity resolution.
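The header change described above can be scripted rather than edited by hand. The following is a minimal sketch using only the Python standard library. The `COLUMN_MAP` dictionary and the `remap_header` helper are hypothetical names introduced for illustration, and only the NAME to NAME_FULL mapping comes from this article; any other column would need its own mapping taken from the Generic Entity Specification.

```python
import csv
import io

# Hypothetical mapping from source column names to Senzing attribute names.
# Only NAME -> NAME_FULL comes from the article text.
COLUMN_MAP = {"NAME": "NAME_FULL"}


def remap_header(source, target, column_map):
    """Copy CSV rows from source to target, renaming mapped header columns."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    header = next(reader)
    # Columns without a mapping are passed through unchanged.
    writer.writerow([column_map.get(col, col) for col in header])
    writer.writerows(reader)


# In-memory stand-ins for real files so the sketch runs as written.
source = io.StringIO("ID,NAME\n1,Ann Smith\n2,Robert Jones\n")
target = io.StringIO()
remap_header(source, target, COLUMN_MAP)
mapped_csv = target.getvalue()
print(mapped_csv)
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `newline=""`, as the `csv` module documentation recommends.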
Unlimited support is included with an evaluation license; we will help you map your data sources. Typically, mapping data for processing in Senzing is straightforward with the out-of-the-box configuration covering most scenarios. The initial mapping process usually takes less than 30 minutes per data source.
During a PoC, it's important to have people available who understand the desired outcomes, know the schema of the available data, and have access to use it.
There are a few things to be cognizant of:
Structured data is required. Senzing requires structured data but is very flexible in how each attribute can be provided for ingestion. You do need to field the data appropriately.
One entity per record. Each record must include attributes that identify one and only one entity. For instance, if you are mapping a contact from your address book, the name, home address, email, phone number, etc. belong to the person entity. The company name, company address, company phone number, and company website of their employer belong to a separate and distinct entity. In such a scenario, the person entity and the company entity would each be extracted from the data source and mapped accordingly, with each sent to Senzing as a distinct record.
Tell Senzing the name type. Senzing uses highly sophisticated domain-aware name processing which includes culturally aware person name matching and organizational name domain knowledge. Senzing does need to be told that a name is a personal name (NAME_FIRST, NAME_MIDDLE, NAME_LAST or NAME_FULL) or an organizational name (NAME_ORG).
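To make the last two points concrete, the sketch below splits one hypothetical address-book row into a person record and a company record, using NAME_FULL for the personal name and NAME_ORG for the organizational name. The input field names, the `split_contact` helper, and the record ID scheme are invented for illustration. Attribute names other than NAME_FULL and NAME_ORG are assumptions that should be confirmed against the Generic Entity Specification before use.

```python
import json


def split_contact(contact, data_source):
    """Turn one address-book row into a person record and a company record."""
    person = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": f"{contact['id']}-PERSON",  # hypothetical ID scheme
        "NAME_FULL": contact["name"],  # tells Senzing this is a personal name
        "ADDR_FULL": contact["home_address"],
        "EMAIL_ADDRESS": contact["email"],
        "PHONE_NUMBER": contact["phone"],
    }
    company = {
        "DATA_SOURCE": data_source,
        "RECORD_ID": f"{contact['id']}-COMPANY",  # hypothetical ID scheme
        "NAME_ORG": contact["company_name"],  # organizational name
        "ADDR_FULL": contact["company_address"],
        "PHONE_NUMBER": contact["company_phone"],
        "WEBSITE_ADDRESS": contact["company_website"],
    }
    return [person, company]


# Hypothetical source row; the field names are not from any real system.
contact = {
    "id": "1001",
    "name": "Ann Smith",
    "home_address": "12 Elm St, Springfield, IL 62704",
    "email": "ann.smith@example.com",
    "phone": "555-0100",
    "company_name": "Acme Corp",
    "company_address": "400 Main St, Springfield, IL 62701",
    "company_phone": "555-0199",
    "company_website": "acme.example.com",
}

records = split_contact(contact, "ADDRESS_BOOK")
for record in records:
    print(json.dumps(record))
```

Each printed line is one JSON record describing exactly one entity, which is the shape the one-entity-per-record rule asks for.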
Data mapping is beyond the scope of this article. For additional information on getting started with data mapping using both CSV and JSON, see the Generic Entity Specification.
Evaluating Results and Outcomes
During a PoC you may only want, for example, a CSV output of the results that shows how records connect into entities and how those entities are related. Alternatively, you may want to export or replicate similar information to a warehouse for further analysis and join it back to the original data sources. Both of these, and similar scenarios, are easily accomplished. Additionally, see the Exploratory Data Analysis Tools, which can be used to dynamically explore, compare, and analyze results.
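As a rough illustration of joining results back to the original data, the sketch below groups source rows by resolved entity. It is not the actual Senzing export format: the column names `entity_id`, `data_source`, and `record_id`, along with the sample rows, are hypothetical and would need to be replaced with the columns of whatever export you produce.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export: one row per record, tagged with its resolved entity.
export_csv = io.StringIO(
    "entity_id,data_source,record_id\n"
    "100,CRM,1\n"
    "100,CLAIMS,9\n"
    "101,CRM,3\n"
)

# Hypothetical original data, keyed by (data source, record ID).
source_rows = {
    ("CRM", "1"): {"name": "Ann Smith"},
    ("CLAIMS", "9"): {"name": "Anne Smith"},
    ("CRM", "3"): {"name": "Robert Jones"},
}

# Group the original rows under the entity each record resolved to.
entities = defaultdict(list)
for row in csv.DictReader(export_csv):
    key = (row["data_source"], row["record_id"])
    entities[row["entity_id"]].append({**row, **source_rows.get(key, {})})

for entity_id, records in entities.items():
    names = [record.get("name") for record in records]
    print(entity_id, names)
```

The same join can be done in SQL once the export is loaded into a warehouse table alongside the source tables.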
We're ready and waiting to help accelerate your Senzing evaluation and to discuss any of these topics further. If you'd like to reach out to us, please email support@senzing.com or open a support ticket.
Comments
This method would not catch anyone who may have changed their name via marriage or the like
That is just one idea; another is to pick a region. Nothing is perfect, but it gives a higher likelihood of getting data that demonstrates matching.