Introduction
Privacy by Design (PbD) is a core principle here at Senzing. Much of the data ingested into Senzing is Personally Identifiable Information (PII) or other attributes that help uniquely identify an entity. Securing this data within Senzing is crucial, in addition to your organizational security measures.
If you have data that you do not wish to load as cleartext, or data that has already been hashed to protect it, you can load it in a pre-hashed state.
Consider the following:
The table outlines two attributes, showing first their unhashed cleartext values and then their pre-hashed, unreadable values. In addition to loading the unhashed cleartext values, Senzing can load the pre-hashed values of these two attributes (and many others).
The attributes that can currently be mapped as pre-hashed are:
In addition to the usual mapping terms above (Attribute column), there is an additional PREHASHED attribute (Pre-hashed column) that tells Senzing the corresponding attribute contains a pre-hashed value on ingestion. See Data Mapping for more information on how to map your source data to Senzing terms.
Example JSON and CSV mappings for the above attributes:
JSON
- Cleartext un-hashed
{
  "SSN_NUMBER": "311-11-1111",
  "PHONE_NUMBER": "702-222-2222"
}
- Pre-hashed
{
  "SSN_NUMBER": "a1e89c6b142f25e26ea1de32f91a1958",
  "SSN_PREHASHED": "1",
  "PHONE_NUMBER": "1365094f8a2d60d772665c6809bebb3a",
  "PHONE_PREHASHED": "1"
}
CSV
- Cleartext un-hashed
SSN_NUMBER,CELL_PHONE_NUMBER
311-11-1111,702-222-2222
- Pre-hashed
SSN_NUMBER,SSN_PREHASHED,CELL_PHONE_NUMBER,CELL_PHONE_PREHASHED
a1e89c6b142f25e26ea1de32f91a1958,1,1365094f8a2d60d772665c6809bebb3a,1
Note that the PREHASHED flag is set to 1 to indicate that the corresponding attribute is pre-hashed. The system remembers that this data is in hashed form and processes it correctly for entity resolution. Note: You can load attributes in both cleartext and pre-hashed form; Senzing keeps them separate and processes each appropriately.
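As a rough illustration of how a cleartext record might be turned into the pre-hashed form shown above, here is a minimal Python sketch. It is not part of Senzing: the helper names `hash_value` and `to_prehashed` are hypothetical, and SHA-256 is only an assumed algorithm, since this article does not state which hash to use. The flag names are the ones from the JSON example.

```python
import hashlib
import json

# Attribute -> flag name, copied from the JSON example in this article.
PREHASHED_FLAGS = {
    "SSN_NUMBER": "SSN_PREHASHED",
    "PHONE_NUMBER": "PHONE_PREHASHED",
}


def hash_value(value: str) -> str:
    """Hypothetical helper. SHA-256 is an assumption, not a Senzing requirement."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def to_prehashed(record: dict) -> dict:
    """Return a copy of the record with known attributes hashed and flagged."""
    out = {}
    for key, value in record.items():
        flag = PREHASHED_FLAGS.get(key)
        if flag is None or value in (None, ""):
            # Attributes without a flag, and blank values, pass through unchanged.
            out[key] = value
        else:
            out[key] = hash_value(value)
            out[flag] = "1"  # 1 marks the corresponding attribute as pre-hashed
    return out


clear = {"SSN_NUMBER": "311-11-1111", "PHONE_NUMBER": "702-222-2222"}
hashed = to_prehashed(clear)
print(json.dumps(hashed, indent=2))
```

Whichever hash you choose, apply the same algorithm and the same input formatting to every data source. Two hashes can only match when the bytes that were hashed are identical.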
Considerations for Fuzzy Matching
To find matches between entities accurately, one method used is fuzzy matching. For example, the phone numbers 345-743-6436, 345-643-6436 and 743-6436 are nearly identical, but one has a typo in a digit and another is missing its area code.
Fuzzy matching is more challenging with hashed data. The hashes will be different even when the data is virtually identical. For this reason, when loading pre-hashed data it is best to also include attributes which provide partial data.
For example, in addition to PHONE_NUMBER you can map to PHONE_LAST_5 and PHONE_LAST_10, which are partial versions of the phone number. Including these partial hashes allows matching against similar data even when it is hashed.
Consider the following phone numbers, where only the hashes are loaded. The hashes of the whole phone numbers do not match, yet mapping the last column as PHONE_LAST_5 still allows a match on it to be considered during entity resolution.
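The effect can be sketched in Python using the three phone numbers from the fuzzy-matching example above. This is an illustration under assumptions, not Senzing's own processing: the helpers `digest` and `phone_features` are hypothetical, SHA-256 is an assumed algorithm, and stripping the number down to digits before hashing is an assumed normalization step.

```python
import hashlib
import re


def digest(value: str) -> str:
    """Hypothetical helper. SHA-256 is an assumed algorithm."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def phone_features(phone: str) -> dict:
    """Hash the full number plus its last 5 and last 10 digits, where available."""
    digits = re.sub(r"\D", "", phone)  # assumed normalization: keep digits only
    features = {"PHONE_NUMBER": digest(digits)}
    for n in (5, 10):
        if len(digits) >= n:  # a 7-digit number has no last-10 value
            features[f"PHONE_LAST_{n}"] = digest(digits[-n:])
    return features


a = phone_features("345-743-6436")
b = phone_features("345-643-6436")  # one digit differs from a
c = phone_features("743-6436")      # area code missing

# The full-number hashes differ even though the numbers are nearly identical.
print(a["PHONE_NUMBER"] == b["PHONE_NUMBER"])  # False
# All three numbers end in 36436, so their last-5 hashes are identical.
print(a["PHONE_LAST_5"] == b["PHONE_LAST_5"] == c["PHONE_LAST_5"])  # True
```

The partial value has to be derived from the cleartext before hashing. Once only the full-number hash exists, the last five digits can no longer be recovered from it.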
Comments
Does Senzing gracefully handle an input file containing pre-hashed values in only some of its records, but with the corresponding pre-hashed attribute always set to 1 on all records?
Nigel - thanks for clarity via email. If you send in a mapping with a blank value for the feature but it does have its accompanying _PREHASHED feature mapped and set to 1 this will load and be handled just like a normal unhashed feature and seen as a blank value.