From the very first day in 2008, when we started reinventing entity resolution from the ground up, we set out to bake into Senzing as many privacy-enhancing features as we could conceive. After 28 months of skunkworks development, we announced G2 (then the code name for what is Senzing today) on Friday, January 28, 2011, International Data Privacy Day. The announcement came at the Privacy by Design: Time to Take Control conference in Toronto, Canada, where internationally recognized privacy commissioner Ann Cavoukian hosted a few hundred privacy executives and practitioners from around the world.
During my keynote, entitled “Confessions of an Architect,” I highlighted seven exciting features that we baked into G2 and that persist today within Senzing:
- Full Attribution
- Data Tethering
- Analytics in the Anonymized Data Space
- Tamper-Resistant Audit Logs
- False Negative Favoring Methods
- Self-Correcting False Positives
- Information Transfer Accounting
Here is a summary of the above seven PbD features:
Full Attribution
Every observation (record) needs to know from where it came and when. There cannot be merge/purge data survivorship processing whereby some observations or fields are discarded. Why is this so important?
- If received data does not contain its data source and transaction pedigree, then system-to-system reconciliation and audits are virtually impossible, especially in large information sharing environments.
- If the system merges and purges observations, only to discover later that the wrong observations were merged or purged, then without full attribution, correcting these earlier mistakes can be difficult if not impossible. The typical alternative would be periodic batch reprocessing.
- The Universal Declaration of Human Rights has four articles containing the word “arbitrary” or “arbitrarily”; for example, Article 9 reads, “No one shall be subjected to arbitrary arrest, detention or exile.” If you don’t know where the data came from or when, how can any resulting action be anything but arbitrary?
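To make full attribution concrete, here is a minimal sketch of an append-only entity. It is an illustration only, not Senzing's data model: the `Observation` and `Entity` classes, the field names, and the sample sources are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    """One received record, kept verbatim together with its pedigree."""
    data_source: str       # which system of record supplied it (hypothetical name)
    record_id: str         # the record's key inside that source
    received_at: datetime  # when it arrived
    features: dict         # the fields exactly as received


class Entity:
    """A resolved entity is a collection of observations, not a merged survivor record."""

    def __init__(self):
        self.observations = []

    def add(self, obs):
        # Append only: nothing is overwritten or discarded.
        self.observations.append(obs)

    def sources_for(self, feature, value):
        """Answer: where did this value come from, and when?"""
        return [(o.data_source, o.record_id, o.received_at)
                for o in self.observations
                if o.features.get(feature) == value]


entity = Entity()
entity.add(Observation("CRM", "1001", datetime(2011, 1, 28, tzinfo=timezone.utc),
                       {"name": "Pat Smith", "phone": "555-0100"}))
entity.add(Observation("BILLING", "B-77", datetime(2011, 2, 3, tzinfo=timezone.utc),
                       {"name": "Patricia Smith", "phone": "555-0100"}))

phone_pedigree = entity.sources_for("phone", "555-0100")
```

Because both observations survive intact, a mistaken merge can later be undone by separating them again, and every value can be traced to its source and arrival time.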
Data Tethering
Additions, updates and deletions occurring in systems of record must be accounted for in real time. Why is this so important?
- Data currency in information sharing environments is important, especially if one is using data to make important, difficult-to-reverse decisions that affect people’s freedoms or privileges.
- When derogatory data is removed or corrected in a system of record, it is vital to reflect such corrections immediately. For example, if someone is removed from a watch list, how long should they have to wait before their name is cleared?
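As a rough sketch of the idea, assume each change in a system of record arrives as an add, update or delete event keyed by data source and record ID. The `TetheredStore` class and its method names are hypothetical and stand in for whatever change feed a real deployment uses.

```python
class TetheredStore:
    """Downstream copy that applies every source change as soon as it arrives."""

    def __init__(self):
        self.records = {}  # (data_source, record_id) -> features

    def apply(self, op, data_source, record_id, features=None):
        key = (data_source, record_id)
        if op in ("add", "update"):
            # An update replaces the earlier version of the same source record.
            self.records[key] = dict(features)
        elif op == "delete":
            # A deletion in the system of record is reflected immediately.
            self.records.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")

    def is_listed(self, name):
        return any(f.get("name") == name for f in self.records.values())


store = TetheredStore()
store.apply("add", "WATCHLIST", "W-1", {"name": "Pat Smith"})
listed_before = store.is_listed("Pat Smith")

# The source removes the entry; the downstream copy follows at once.
store.apply("delete", "WATCHLIST", "W-1")
listed_after = store.is_listed("Pat Smith")
```

The point of the design is that the downstream copy never waits for a batch reload: the answer to “is this person listed?” changes at the moment the system of record changes.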
Analytics in the Anonymized Data Space
The ability to perform advanced analytics (including some fuzzy matching) over cryptographically altered data means organizations can anonymize more data before information sharing. Why is this so important?
- With every copy of data, there is an increased risk of unintended disclosure.
- Data anonymized before transfer and anonymized at rest reduces the risk of unintended disclosure.
- If organizations can now share information in an anonymized form and still get a materially similar result, why would organizations want to share information any other way?
Technical Note: Because every anonymized value maintains full attribution, re-identification is possible by design, to support Data Tethering as well as reconciliation and audit. For further details, see the article on Selective Feature Hashing.
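One common way to compare values without exchanging them in the clear is a keyed one-way hash. The sketch below uses HMAC-SHA256 from the Python standard library; it is an assumption made for illustration and not a description of Selective Feature Hashing. The shared key, the `normalize` rule and the sample values are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret that the sharing parties agree on out of band.
SHARED_KEY = b"example-key-agreed-out-of-band"


def normalize(value):
    """Reduce formatting noise so equivalent values hash identically."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def anonymize(value, key=SHARED_KEY):
    """Keyed one-way hash of the normalized value."""
    return hmac.new(key, normalize(value).encode("utf-8"), hashlib.sha256).hexdigest()


# Each party anonymizes its own data before anything is transferred.
party_a = {anonymize("555-0100"), anonymize("pat.smith@example.com")}
party_b = {anonymize("(555) 0100"), anonymize("someone.else@example.com")}

# Matching happens entirely on the anonymized tokens.
overlap = party_a & party_b
```

Normalizing before hashing is what allows a limited amount of fuzziness: the two differently formatted phone numbers produce the same token, so the overlap is found without either party revealing the number itself.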
Tamper-Resistant Audit Logs
Each record of who searches for what should be logged in a tamper-resistant manner; even the database administrator should not be able to alter the evidence contained in this audit log. Why is this so important?
- Every now and then, people with access and privilege look at records without a legitimate business purpose, e.g., an employee at a financial services institution taking a peek into their roommate’s file.
- Tamper-resistant logs make it possible to audit user behavior.
- And when word gets out to the workforce that such accountability exists, this can deter misuse.
Important Note: Senzing is not a tamper-resistant audit log. If this capability is needed, one acquires a tamper-resistant audit logging system or device, e.g., blockchain technologies.
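For readers curious how such external logging systems can detect alteration, one widely used building block is a hash chain, in which every entry's hash covers the entry before it. The sketch below is a simplified illustration with hypothetical function names. It makes tampering detectable rather than impossible, and it is not part of Senzing.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with this entry's content."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + body).hexdigest()


def append(log, user, query):
    prev_hash = log[-1]["hash"] if log else GENESIS
    payload = {"user": user, "query": query}
    log.append({"payload": payload, "hash": entry_hash(prev_hash, payload)})


def verify(log):
    """Recompute the chain; any edited entry breaks every hash after it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["hash"] != entry_hash(prev_hash, entry["payload"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append(audit_log, "alice", "name=Pat Smith")
append(audit_log, "bob", "account=12345")
intact = verify(audit_log)

# Someone with database access rewrites who ran the first search.
audit_log[0]["payload"]["user"] = "mallory"
tampered_detected = not verify(audit_log)
```

A real system would also need to protect the most recent hash, for example by publishing it or storing it on separate hardware, since an administrator who can rewrite the whole chain could otherwise recompute every hash.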
False Negative Favoring Methods
The ability to more strongly favor false negatives is of critical importance in systems that could be used to affect someone’s civil liberties. Why is this so important?
- In many business scenarios, it is better to miss a few things (false negatives) than inadvertently make claims that are not true (false positives). False positives can feed into decisions that adversely affect people’s lives – e.g., the police find themselves knocking down the wrong door or an innocent passenger is denied the ability to board a plane.
Technical Note: Sometimes a new observation can lead to multiple conclusions. Systems that are not false negative favoring may select the strongest conclusion and ignore the remaining conclusions. But had the strongest conclusion not existed, the second strongest conclusion would be asserted. One false negative favoring method remedies such a condition, for example by reversing an earlier conclusion should a future observation bring to light the fact that multiple possible conclusions now exist.
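A minimal sketch of the first half of that idea follows: when an observation fits more than one existing entity, decline to pick a winner. The matching rule, record layout and function names are hypothetical, and the sketch does not cover reversing an earlier conclusion.

```python
def same_name_and_dob(observation, entity):
    """Hypothetical matching rule used only for this illustration."""
    return (observation["name"] == entity["name"]
            and observation["dob"] == entity["dob"])


def resolve(observation, entities, is_match=same_name_and_dob):
    """Assert a match only when exactly one candidate qualifies."""
    candidates = [e for e in entities if is_match(observation, e)]
    if len(candidates) == 1:
        return ("resolved", candidates[0]["id"])
    if len(candidates) > 1:
        # Several plausible conclusions: accept a possible false negative
        # rather than risk asserting the wrong one.
        return ("ambiguous", [e["id"] for e in candidates])
    return ("new", None)


entities = [
    {"id": 1, "name": "Pat Smith", "dob": "1970-01-01", "address": "1 Main St"},
    {"id": 2, "name": "Pat Smith", "dob": "1970-01-01", "address": "9 Oak Ave"},
    {"id": 3, "name": "Lee Wong", "dob": "1985-05-05", "address": "4 Elm Rd"},
]

unique = resolve({"name": "Lee Wong", "dob": "1985-05-05"}, entities)
unclear = resolve({"name": "Pat Smith", "dob": "1970-01-01"}, entities)
unknown = resolve({"name": "Sam Reyes", "dob": "1990-09-09"}, entities)
```

In this sketch the ambiguous observation is held apart with its candidates recorded, so a later, more discriminating observation can settle the question.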
Self-Correcting False Positives
With every new observation presented, prior assertions are re-evaluated to ensure they are still correct, and if no longer correct, these earlier assertions can often be repaired – in real-time. Why is this so important?
- False positives occur when an assertion (claim) is made but is not true. If relied upon to make a decision, false positives can adversely affect people’s lives, e.g., consider someone who cannot board a plane because he or she shares a similar name and date of birth with someone else on a watch list.
- Without self-correcting false positives, databases start to drift from the truth and become provably wrong (even to the naked eye) – necessitating periodic (batch) reloading to true-up the database.
- Periodic reloading to correct for false positives means wrong decisions are possible for the entire period between reloads, even though the system had the useful data at its disposal.
Technical Note: Reversing earlier assertions in real-time at scale, as new observations present themselves, is computationally non-trivial. Imagine making an assertion that two people are the same because they share exactly the same name, address and home phone number – only later to learn through another series of observations that these are really two different people (a junior and a senior). Our “self-correcting false positives” feature self-corrects for these rare cases, in real-time. We consider our ability to perform sequence neutrality at scale one of several breakthrough aspects of our work.
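The junior and senior case can be illustrated with a toy re-evaluation step. This is only a sketch under simplifying assumptions and says nothing about how Senzing achieves this at scale: the record layout, the `conflicts` rule based on date of birth, and the `reevaluate` function are all hypothetical.

```python
def conflicts(a, b):
    """Two records conflict when both state a date of birth and the values differ."""
    return bool(a.get("dob") and b.get("dob") and a["dob"] != b["dob"])


def reevaluate(entity):
    """Return the entity unchanged if it is still consistent, otherwise split it."""
    records = list(entity.values())
    if not any(conflicts(a, b) for a in records for b in records):
        return [entity]
    # Regroup by date of birth; records with no date of birth share the None group.
    groups = {}
    for record_id, record in entity.items():
        groups.setdefault(record.get("dob"), {})[record_id] = record
    return list(groups.values())


shared = {"name": "John Doe", "address": "1 Main St", "phone": "555-0100"}

# Earlier assertion: two records resolved as one person on name, address and phone.
entity = {"CRM:1": dict(shared), "BANK:9": dict(shared)}
before = reevaluate(entity)

# Later observations reveal two different dates of birth (a senior and a junior).
entity["CRM:1"] = {**shared, "dob": "1950-03-02"}
entity["BANK:9"] = {**shared, "dob": "1980-07-15"}
after = reevaluate(entity)
```

The split is only possible because the original records were retained rather than merged away, which is why this feature depends on full attribution.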
Information Transfer Accounting
Every secondary transfer of data, whether to human eyeball or tertiary system, can be recorded to allow stakeholders (e.g., data custodians or the consumers themselves) to determine how their data is flowing. Why is this so important?
- It is often cumbersome to learn who has seen what records, or what records have been shared with tertiary systems.
- Much like a US credit report that contains an inquiries section exposing the list of recent inquiring parties, now so can your medical or financial file.
- Users can now be easily provided with such disclosures, increasing transparency and control, e.g., enabling a consumer in some cases to request an information recall.
- When there is a series of leaks, information transfer accounting makes discovery of who accessed all records in the series quite trivial. This can narrow an investigation when looking for criminals within.
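The two uses above, showing a person who has seen their record and narrowing a leak investigation, can be sketched with a simple ledger. The `TransferLedger` class, its method names and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


class TransferLedger:
    """Records every secondary transfer of a record to a person or system."""

    def __init__(self):
        self._by_record = defaultdict(list)  # record_id -> [(recipient, when)]

    def record_transfer(self, record_id, recipient, when=None):
        when = when or datetime.now(timezone.utc)
        self._by_record[record_id].append((recipient, when))

    def disclosures(self, record_id):
        """The 'inquiries section' for one record."""
        return list(self._by_record.get(record_id, []))

    def accessed_all(self, record_ids):
        """Recipients who received every record in the given series."""
        recipient_sets = [{r for r, _ in self._by_record.get(rid, [])}
                          for rid in record_ids]
        return set.intersection(*recipient_sets) if recipient_sets else set()


ledger = TransferLedger()
ledger.record_transfer("R1", "alice")
ledger.record_transfer("R1", "carol")
ledger.record_transfer("R2", "carol")
ledger.record_transfer("R2", "bob")
ledger.record_transfer("R3", "carol")

# Three records appear in a series of leaks: who received all of them?
suspects = ledger.accessed_all(["R1", "R2", "R3"])
```

Intersecting the recipient sets is what makes the investigation quick: only parties who received every leaked record remain.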
For more information about Privacy by Design (PbD) and the unique privacy-enhancing features of Senzing: Ann Cavoukian (who at the time of publishing was Information and Privacy Commissioner, Ontario, Canada) and I released a joint paper entitled “Privacy by Design in the Era of Big Data,” available here.