Exploring the Details of How Senzing Works
Defining the term Entity Resolution
Entity resolution is the process of recognizing when two records relate to the same entity despite having been described differently and, conversely, recognizing when two records do not relate to the same entity despite having been described similarly.
Purpose-Built AI for Entity Resolution, Senzing is entity resolution
The Senzing team created a purpose-built AI for entity resolution that includes two unique properties:
The ability to make human-intelligent decisions on extremely small and extremely large data sets, without any pretraining or pretuning
The ability to get smarter over time, autonomously learning and adapting in real-time, without reloading
The AI in Senzing is composed of two tightly coupled classes of algorithms: common sense and real-time machine learning. A 2018 story in Wired, “How to Teach Artificial Intelligence Common Sense,” discusses the importance of this duality.
Common Sense in the Senzing AI
Unlike many AI machine learning techniques that must be initially trained using extremely large data sets, Senzing comes pre-built with common sense that includes principle-based entity resolution and advanced knowledge.
Common sense allows Senzing to be smart on day one, even with data sets as small as two records. Common sense also helps Senzing ensure its real-time learning is not fooled by newly introduced anomalies e.g., mismapped fields or other errors.
Principle-Based Entity Resolution
Principles are a special form of generalized knowledge that draws on common attribute behaviors.
The use of principles is a key reason Senzing does not need training, tuning or specialized knowledge to deploy into new domains or to add new data sets, new languages, etc. The principles in Senzing differ from the rules used in some other entity resolution systems. Imagine telling your child to quit throwing rocks at cars, only to realize the next day you have to tell him to quit throwing baseballs at SUVs. Then, a few days later, you have to tell him not to throw golf balls at trucks, fire engines and ambulances. Instead of all these rules, why not one simple principle: “Don’t throw things at other people’s stuff.”
The principles in Senzing are based on expected attribute behaviors. For example, a given SSN should belong to only one person, while many people can share the same date of birth (DOB), even though each person should only have one. Senzing assigns these common-sense behaviors to attributes based on the following three expected behavior settings:
Frequency – do one, few or many entities generally share the same value e.g., an SSN is commonly used by one entity, an address is shared by a few, and date of birth (DOB) is shared by many
Exclusivity – does an entity typically have only one such value e.g., an entity should only have one SSN or DOB, but an entity could rightfully have more than one credit card number
Stability – is the value typically stable over the lifetime of an entity or not e.g., an SSN and DOB are typically stable over a lifetime, but a home address is usually not
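The three behavior settings above can be pictured as a small lookup table. The following is a minimal sketch only, not Senzing's actual configuration format: the names `Frequency`, `FeatureBehavior` and `EXPECTED_BEHAVIOR` are hypothetical, and the exclusivity value for addresses is an assumption, since the text does not state it.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    """How many entities are expected to share one value."""
    ONE = "one"
    FEW = "few"
    MANY = "many"


@dataclass(frozen=True)
class FeatureBehavior:
    frequency: Frequency  # one, few or many entities share a value
    exclusive: bool       # an entity should have only one such value
    stable: bool          # the value rarely changes over a lifetime


# Hypothetical table built from the examples in the text.
EXPECTED_BEHAVIOR = {
    "SSN": FeatureBehavior(Frequency.ONE, exclusive=True, stable=True),
    "DOB": FeatureBehavior(Frequency.MANY, exclusive=True, stable=True),
    # Frequency and stability come from the text; exclusive=False is an assumption.
    "ADDRESS": FeatureBehavior(Frequency.FEW, exclusive=False, stable=False),
}
```

Keeping the behaviors as data rather than as per-attribute rules is what lets one set of principles apply to any attribute that declares the same behavior.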
Notably, Senzing recognizes that messy, real-world data may not always behave as expected. For example, if multiple people reportedly have the same SSN or one person has multiple DOBs, Senzing automatically detects these anomalies and adjusts accordingly.
Senzing ships with approximately 30 default entity resolution principles it uses to determine when entities are the same, possibly the same or related. Each principle considers the three feature behaviors described above plus names. Here are two examples of the types of principles built into Senzing:
If entities have a close name and the same frequency one feature e.g., an SSN, they are likely the same entity
If entities have a close name and a frequency few feature e.g., address, but have a contradictory exclusive feature e.g., different DOBs, they are considered related (not the same)
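The two example principles can be sketched as plain functions. This is an illustration under stated assumptions, not how Senzing evaluates principles: `close_name` uses a generic string-similarity ratio from Python's standard library with an arbitrary 0.8 threshold, and the record layout and the `apply_principles` name are hypothetical.

```python
from difflib import SequenceMatcher


def close_name(a, b, threshold=0.8):
    """Assumed stand-in for a real name comparator: generic string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def apply_principles(rec_a, rec_b):
    """Return 'same', 'related' or 'no decision' for two records (dicts)."""
    if not close_name(rec_a["name"], rec_b["name"]):
        return "no decision"

    # Principle 1: close name + same frequency-one feature (SSN) -> same entity.
    if rec_a.get("ssn") and rec_a.get("ssn") == rec_b.get("ssn"):
        return "same"

    # Principle 2: close name + same frequency-few feature (address), but a
    # contradictory exclusive feature (different DOBs) -> related, not the same.
    same_address = bool(rec_a.get("address")) and rec_a.get("address") == rec_b.get("address")
    dob_conflict = bool(rec_a.get("dob") and rec_b.get("dob")) and rec_a["dob"] != rec_b["dob"]
    if same_address and dob_conflict:
        return "related"

    return "no decision"
```

Note that neither function mentions a specific entity type; the logic depends only on the behavior of the features involved.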
In a radical departure from other entity resolution methods, the single set of default principles Senzing provides automatically works as delivered for a wide range of entity types e.g., people, companies, vessels and planes.
Go Deeper: Principle-Based Entity Resolution
The AI in Senzing includes more than 10 pre-built comparison routines containing deep knowledge about specific attributes such as phone numbers, SSNs, dates, etc. Since culturally-aware name recognition and global address matching are most critical for achieving high-quality entity resolution (ER), the comparators Senzing uses for these attributes are particularly advanced.
- Global name recognition – IBM Global Name Management comparison technology is built into Senzing. This culturally-aware name library, pretrained on 800M global names, was created over decades by a team of linguists at a cost of tens of millions of dollars. https://www.ibm.com/products/ibm-infosphere-global-name-management
Senzing AI immediately understands synonyms e.g., Bob and Robert or Elizabeth and Liz, and transliterations e.g., Mohamed, Mohammed, Mhd and dozens of other spellings. It also resolves names across different alphabets and scripts e.g., Arabic, Mandarin and Roman.
- Global address comparison – Senzing uses Libpostal, an open-source library for global address parsing and normalization, to assess address similarity with uncanny precision. This library for programmers was trained, using machine learning, on the hundreds of millions of global addresses in the OpenStreetMap database.
Libpostal is embedded into Senzing and wrapped with custom logic that provides exceptional matching accuracy and eliminates the need to pre-parse address data prior to loading.
The Senzing architecture allows advanced users to add attributes, such as height, weight, hair color, eye color, voice, fingerprints, etc., by writing custom plug-ins that standardize, express and compare new attribute data. For example, one source system may store height data in inches and another in centimeters. A set of custom plug-ins could automatically standardize height data into centimeters and create an expression of the data to the nearest tenth to help with matching.
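The height example can be sketched as three small functions, one per plug-in role (standardize, express, compare). This is a hypothetical illustration in Python, not the Senzing plug-in interface: the function names, the unit codes and the 1 cm comparison tolerance are all assumptions.

```python
CM_PER_INCH = 2.54


def standardize_height_cm(value, unit):
    """Standardize: convert a height in inches ('in') or centimeters ('cm') to cm."""
    if unit == "cm":
        return float(value)
    if unit == "in":
        return float(value) * CM_PER_INCH
    raise ValueError(f"unsupported height unit: {unit!r}")


def height_expression(height_cm):
    """Express: round to the nearest tenth so near-equal heights share a key."""
    return round(height_cm, 1)


def compare_height(a_cm, b_cm, tolerance_cm=1.0):
    """Compare: treat heights within an assumed tolerance as consistent."""
    return abs(a_cm - b_cm) <= tolerance_cm
```

With these in place, a record storing 70 inches and another storing 178 cm would both be expressed in centimeters before any comparison is made.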
Real-time Machine Learning in the Senzing AI
The AI in Senzing uses real-time machine learning (ML) to get smarter over time. The real-time algorithms deliver entity-centric learning, anomaly detection, and sequence-neutral processing. https://senzing.com/sequence-neutrality/
Senzing retains the history and attribute variations for each entity as it resolves new records against existing entities e.g., learning every name, address and phone variation. Over time, based on the accumulated variations, Senzing learns nicknames, alternative email addresses, common typographical errors, etc., including intentionally fabricated information.
Senzing uses its entity-centric learning when comparing records during ER. Entity-centric learning is what allows it to make higher quality entity resolution decisions than most other systems that use the more popular, but very basic, record-to-record matching. This is critical for catching clever criminals. https://senzing.com/channel-separation-the-primary-tradecraft-of-clever-bad-guys/
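The difference between record-to-record matching and entity-centric matching can be shown with a toy example. This is a simplified sketch, not Senzing's algorithm: the matching rule (share at least one phone or email value) and the names `records_match` and `Entity` are assumptions made for illustration.

```python
from collections import defaultdict

IDENTIFIER_KEYS = ("phone", "email")  # assumed matching features for this toy


def records_match(a, b):
    """Record-to-record matching: compares only the two records in hand."""
    return any(k in a and k in b and a[k] == b[k] for k in IDENTIFIER_KEYS)


class Entity:
    """Accumulates every feature variation seen for one resolved entity."""

    def __init__(self):
        self.features = defaultdict(set)

    def absorb(self, record):
        for key, value in record.items():
            self.features[key].add(value)

    def matches(self, record):
        """Entity-centric matching: compares against all accumulated values."""
        return any(
            record[k] in self.features.get(k, set())
            for k in IDENTIFIER_KEYS
            if k in record
        )


first = {"name": "Elizabeth Jones", "phone": "555-0100"}
second = {"name": "Liz Jones", "phone": "555-0100", "email": "liz@example.com"}
third = {"name": "E. Jones", "email": "liz@example.com"}

entity = Entity()
entity.absorb(first)
entity.absorb(second)  # the entity has now learned the email and the nickname
```

Here `records_match(first, third)` is false because the two records share no value, while `entity.matches(third)` is true because the entity learned the email address from the second record.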
Go Deeper: Entity Resolution Processes
Senzing actively tracks feature statistics in real-time as it resolves and relates entities. Based on the information it has seen to date, Senzing keeps detailed statistics about its entity repository, e.g., it contains approximately 150M males, 500 people with the same DOB, and exactly seven people who have lived at 123 Main Street.
By comparing actual statistics to expected feature behaviors, Senzing detects anomalies such as garbage values. For example, if the SSN value 121212121 is used by hundreds of entities, Senzing recognizes this as an exception, since SSNs generally belong to one person. When such anomalies are detected, Senzing automatically self-tunes to account for them going forward by either assigning them less value or disregarding them altogether.
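A minimal sketch of this idea compares the observed number of entities per value against the expected frequency and down-weights values that exceed it. The class name `FeatureStats`, the threshold and the weighting formula are assumptions for illustration and are not taken from Senzing.

```python
from collections import defaultdict


class FeatureStats:
    """Tracks how many distinct entities use each value of one feature."""

    def __init__(self, expected_max_entities):
        # e.g. 1 for a frequency-one feature such as SSN (assumed threshold)
        self.expected_max_entities = expected_max_entities
        self.entities_by_value = defaultdict(set)

    def observe(self, value, entity_id):
        self.entities_by_value[value].add(entity_id)

    def entity_count(self, value):
        return len(self.entities_by_value.get(value, ()))

    def is_anomalous(self, value):
        return self.entity_count(value) > self.expected_max_entities

    def weight(self, value):
        """Assumed formula: full weight when behaving as expected, else scaled down."""
        count = self.entity_count(value)
        if count <= self.expected_max_entities:
            return 1.0
        return self.expected_max_entities / count


ssn_stats = FeatureStats(expected_max_entities=1)
for entity_id in range(300):
    ssn_stats.observe("121212121", entity_id)  # garbage value used by hundreds
ssn_stats.observe("900112222", "entity-x")     # made-up value used by one entity
```

A value shared by 300 entities ends up with a weight near zero, so it contributes almost nothing to later matching decisions, while the value used by a single entity keeps its full weight.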
Sequence Neutrality (Self-Correcting the Past)
Based on what it learns about entities and anomalies, Senzing continuously evaluates its earlier assertions to determine if they need to be corrected. Sequence neutrality allows Senzing to self-correct the past in real-time, whether it received record A first then B, or record B first then A. Humans self-correct all the time e.g., you think you know what someone means, but as they keep talking you realize they meant something else.
The ability to fix the past in real-time at scale, as new data streams in, is extremely difficult to achieve. Without sequence neutrality, the error rates of entity resolution systems increase between the periodic reloads required to bring them up to date. With the sequence-neutral processing in Senzing, your system is always up to date, overall error rates decrease over time as new information reverses earlier assertions, and reloading is never required.
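The order-independence property can be illustrated with a toy incremental resolver. This sketch only covers merging entities when a later record connects them; a real sequence-neutral engine must also split entities when new data contradicts an earlier assertion, which is not shown. The matching rule (any shared identifier value) and the `resolve` function are assumptions.

```python
from itertools import permutations


def resolve(records):
    """Incrementally resolve (record_id, identifier_values) pairs.

    Each arriving record is merged with every existing entity that shares
    at least one identifier value, so earlier assertions are revisited.
    Returns the final entities as a set of frozensets of record ids.
    """
    entities = []  # each entity: {"ids": set of record ids, "values": set of values}
    for record_id, values in records:
        values = set(values)
        matched = [e for e in entities if e["values"] & values]
        entities = [e for e in entities if not (e["values"] & values)]
        merged = {"ids": {record_id}, "values": values}
        for entity in matched:
            merged["ids"] |= entity["ids"]
            merged["values"] |= entity["values"]
        entities.append(merged)
    return {frozenset(e["ids"]) for e in entities}


records = [
    ("A", {"phone:555-0100"}),
    ("B", {"phone:555-0100", "email:liz@example.com"}),
    ("C", {"email:liz@example.com"}),
]

# Collect the distinct outcomes across every possible arrival order.
outcomes = {frozenset(resolve(order)) for order in permutations(records)}
```

When A and C arrive before B, they start as two separate entities; B then connects them and the earlier assertion is corrected. Every arrival order produces the same single entity, so `outcomes` contains exactly one result.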
Senzing is designed to natively support real-time operations, including the following:
Real-time adds and changes – immediately resolves new data as it is received
Real-time queries – instantly delivers resolved entity data resulting from user queries
Real-time decision systems – supports real-time business transactions by instantaneously providing systems with resolved entity data
Real-time deletion – immediately removes data and the consequences of that data e.g., right to be forgotten requests required by emerging privacy regulations
Real-time maintenance – performs scheduled or emergency maintenance on live systems to eliminate downtime or the need to run two instances of an operational system
Real-time replication – replicates its database of resolved entities to data marts or data warehouses
Real-time publishing (coming soon)
Senzing is a true online transaction processing (OLTP) engine that reduces operational risks and helps ensure organizations never have to make decisions based on outdated entity information.
Speed and Scalability
The Senzing team spent 12 months, over 2009 and 2010, designing and proving out a database schema to ensure Senzing supports unprecedented speed, scalability and flexibility. Because of this work, Senzing handles up to 400M records on a $5,000 commodity server and performs millions of new entity resolutions a day in real-time with sub-second response times, without ever reloading. Senzing is specifically designed to scale vertically and horizontally in cloud computing infrastructures. As record volumes grow to billions, it is easy to scale out across heterogeneous clusters.