From the very first day in 2008, when we started reinventing Entity Resolution from the ground up, we intentionally set the goal of baking into G2 as many privacy-enhancing features as we could conceive. After 28 months of skunk works development, we announced G2 on Friday, January 28th, 2011, International Data Privacy Day, alongside internationally recognized privacy commissioner Ann Cavoukian, as she hosted a few hundred privacy executives and practitioners from around the world in Toronto, Canada at her Privacy by Design: Time to Take Control conference.
During my keynote, entitled “Confessions of an Architect,” I highlighted seven exciting features that we baked into G2, specifically:
- Full Attribution
- Data Tethering
- Analytics in the Anonymized Data Space
- Tamper-Resistant Audit Logs
- False Negative Favoring Methods
- Self-Correcting False Positives
- Information Transfer Accounting
Here is a summary of the above seven PbD features:
Full Attribution
Every observation (record) needs to know from where it came and when. There cannot be merge/purge data survivorship processing whereby some observations or fields are discarded. Why is this so important?
- If received data does not contain its data source and transaction pedigree, then system-to-system reconciliation and audit are virtually impossible, especially in large information sharing environments.
- If the system merges and purges observations, only later to discover the wrong observations were merged or purged, then without full attribution correcting these earlier mistakes can be difficult if not impossible. The typical alternative is periodic batch reprocessing.
- The Universal Declaration of Human Rights has four articles containing the word “arbitrary” or “arbitrarily”, e.g., Article 9 reads “No one shall be subjected to arbitrary arrest, detention or exile.” If you don’t know where the data came from or when, how can any resulting action be anything but arbitrary?
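To make the idea concrete, here is a minimal sketch of full attribution, assuming a simple in-memory model. The `Observation` and `Entity` names and their fields are hypothetical illustrations, not G2's actual data model. The point it shows: an entity keeps every contributing observation with its source and arrival time, so any value can be traced back to where it came from.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Observation:
    source: str            # system of record the data came from (hypothetical field)
    record_id: str         # key of the record inside that source
    received_at: datetime  # when the observation arrived
    fields: dict


class Entity:
    """A resolved entity that keeps every contributing observation."""

    def __init__(self):
        self.observations = []

    def add(self, obs):
        # Append only: nothing is merged away or purged.
        self.observations.append(obs)

    def provenance(self, field, value):
        """List (source, record_id, received_at) for each observation asserting field == value."""
        return [
            (o.source, o.record_id, o.received_at)
            for o in self.observations
            if o.fields.get(field) == value
        ]


entity = Entity()
entity.add(Observation("crm", "C-17", datetime(2011, 1, 28, tzinfo=timezone.utc),
                       {"name": "Pat Smith", "phone": "555-0100"}))
entity.add(Observation("billing", "B-9", datetime(2011, 2, 3, tzinfo=timezone.utc),
                       {"name": "Patricia Smith", "phone": "555-0100"}))

# Both sources assert the phone number; only the CRM asserts the short name.
phone_sources = entity.provenance("phone", "555-0100")
```

Because no observation is discarded, an incorrect merge can later be undone by moving observations rather than by reloading the data.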
Data Tethering
Adds, changes and deletes occurring in systems of record must be accounted for, in real-time, in sub-seconds. Why is this so important?
- Data currency in information sharing environments is important, especially if one is using data to make important, difficult-to-reverse decisions that affect people’s freedoms or privileges.
- When derogatory data is removed or corrected in a system of record, it is vital to reflect such corrections immediately. For example, if someone is removed from a watch list, how long should they have to wait before their name is cleared?
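As a rough sketch of what tethering implies, assume the system of record emits one change event per add, change or delete, and a downstream copy applies each event as it arrives. The `TetheredIndex` class and the event layout below are hypothetical, chosen only to illustrate the watch list example.

```python
class TetheredIndex:
    """Downstream copy kept in step with a system of record, one change event at a time."""

    def __init__(self):
        self.records = {}  # (source, record_id) -> fields

    def apply(self, event):
        key = (event["source"], event["record_id"])
        op = event["op"]
        if op in ("add", "change"):
            self.records[key] = dict(event["fields"])
        elif op == "delete":
            # Removal takes effect as soon as the event is applied, not at the next batch.
            self.records.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")

    def on_list(self, name):
        return any(r.get("name") == name for r in self.records.values())


index = TetheredIndex()
index.apply({"op": "add", "source": "watchlist", "record_id": "W-1",
             "fields": {"name": "J. Doe"}})
listed_before = index.on_list("J. Doe")

# The system of record removes the person; the downstream copy follows at once.
index.apply({"op": "delete", "source": "watchlist", "record_id": "W-1"})
listed_after = index.on_list("J. Doe")
```

The sketch depends on full attribution: a delete can only be applied downstream if each record still carries its source and record key.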
Analytics in the Anonymized Data Space
The ability to perform advanced analytics (including some fuzzy matching) over cryptographically altered data means organizations can anonymize more data before information sharing. Why is this so important?
- With every copy of data, there is an increased risk of unintended disclosure.
- Data anonymized before transfer and anonymized at rest reduces the risk of unintended disclosure.
- If organizations can now share information in an anonymized form and still get a materially similar result, why would organizations want to share information any other way?
Technical Note: As every anonymized value maintains full attribution, re-identification is possible by design, to support Data Tethering as well as reconciliation and audit. For further details see the article on Selective Feature Hashing.
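One common way to compare values without exposing them is a keyed one-way hash. The sketch below uses HMAC-SHA256 from the Python standard library and assumes both parties share a secret key and the same normalization rule. It is an illustration of the general idea, not a description of Selective Feature Hashing, and the key, record layout, and function names are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key agreed out of band; a real deployment would not hard-code it.
SHARED_KEY = b"example-key-agreed-out-of-band"


def normalize(value):
    # Lowercase and collapse whitespace so trivial variations hash identically.
    return " ".join(value.lower().split())


def anonymize(record, key=SHARED_KEY):
    """Replace each feature value with a keyed hash while keeping attribution fields."""
    hashed = {
        name: hmac.new(key, f"{name}:{normalize(value)}".encode(),
                       hashlib.sha256).hexdigest()
        for name, value in record["features"].items()
    }
    return {"source": record["source"], "record_id": record["record_id"],
            "features": hashed}


def shared_features(a, b):
    """Names of features whose anonymized values agree."""
    return {n for n in a["features"] if a["features"][n] == b["features"].get(n)}


left = anonymize({"source": "bank_a", "record_id": "1",
                  "features": {"name": "Pat  Smith", "phone": "555-0100"}})
right = anonymize({"source": "bank_b", "record_id": "77",
                   "features": {"name": "pat smith", "phone": "555-0199"}})

# The names agree after normalization; the phone numbers do not.
matched = shared_features(left, right)
```

Because the source and record key travel with the hashed values, the original custodian can still locate the underlying record when reconciliation or audit requires it.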
Tamper-Resistant Audit Logs
Each record of who searches for what should be logged in a tamper-resistant manner – even the database administrator should not be able to alter the evidence contained in this audit log. Why is this so important?
- Every now and then people with access and privilege take a look at records without a legitimate business purpose, e.g., an employee at a financial services institution taking a peek into a roommate’s file.
- Tamper-resistant logs make it possible to audit user behavior.
- And, when word gets out to the workforce that such accountability exists, this can have a chilling effect on misuse.
Important Note: G2 is not a tamper-resistant audit log. If this capability is needed, one acquires a tamper-resistant audit logging system or device, e.g., blockchain technologies.
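For readers unfamiliar with how such logs work, here is a minimal hash-chain sketch, the basic building block behind many tamper-evident logs. It is not part of G2, and the function names and entry layout are hypothetical. On its own a hash chain only makes edits detectable; real resistance also requires anchoring the latest hash somewhere the administrator cannot rewrite.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(prev_hash, payload):
    # Each hash covers the previous hash plus this entry's content.
    body = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append(log, payload):
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)})


def verify(log):
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


log = []
append(log, {"user": "alice", "query": "name=J. Doe"})
append(log, {"user": "bob", "query": "phone=555-0100"})
intact = verify(log)

# Simulate an after-the-fact edit by someone with database access.
log[0]["payload"]["user"] = "mallory"
tamper_detected = not verify(log)
```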
False Negative Favoring Methods
The ability to more strongly favor false negatives is of critical importance in systems that could be used to affect someone’s civil liberties. Why is this so important?
- In many business scenarios, it is better to miss a few things (false negatives) than inadvertently make claims that are not true (false positives). False positives can feed into decisions that adversely affect people’s lives – e.g., the police find themselves knocking down the wrong door or an innocent passenger is denied the ability to board a plane.
Technical Note: Sometimes a new observation can lead to multiple conclusions. Systems that are not false negative favoring may select the strongest conclusion and ignore the remaining conclusions. But had the strongest candidate not existed, the second strongest conclusion would have been asserted. One false negative favoring method involves remedying such a condition, for example by reversing an earlier conclusion should a future observation bring to light the fact that multiple possible conclusions now exist.
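The condition described in the note can be sketched as a decision rule: assert a match only when exactly one candidate is plausible, and decline when several are. The `resolve` function, its scores, and the 0.8 threshold are hypothetical, and this is a simplified illustration rather than G2's method.

```python
def resolve(candidates, threshold=0.8):
    """Return the matched entity id, or None when the evidence is weak or ambiguous.

    `candidates` maps entity id -> match score in [0, 1].
    """
    plausible = [eid for eid, score in candidates.items() if score >= threshold]
    # Picking the strongest of several plausible candidates risks a false positive,
    # so a match is asserted only when exactly one candidate qualifies.
    return plausible[0] if len(plausible) == 1 else None


# At first only E1 is plausible, so the match is asserted.
first = resolve({"E1": 0.95, "E2": 0.40})

# A later observation makes E2 plausible too; re-running the rule
# withdraws the earlier conclusion instead of keeping the strongest.
second = resolve({"E1": 0.95, "E2": 0.90})
```

Declining in the second case costs a possible missed match, which is the trade the section argues for when civil liberties are at stake.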
Self-Correcting False Positives
With every new observation presented, prior assertions are re-evaluated to ensure they are still correct, and if no longer correct, these earlier assertions can often be repaired – in real-time, not end of month. Why is this so important?
- False positives occur when an assertion (claim) is made, but is not true. If relied upon to make a decision, false positives can adversely affect people’s lives, e.g., consider someone who cannot board a plane because he or she shares a similar name and date of birth with someone else on a watch list.
- Without self-correcting false positives, databases start to drift from the truth and become provably wrong (even to the naked eye) – necessitating periodic (batch) reloading to true-up the database.
- Periodic monthly reloading to correct for false positives means wrong decisions are possible all month until the next reload, even though the system had everything it needed to know beforehand.
Technical Note: Reversing earlier assertions in real-time at scale, as new observations present themselves, is computationally non-trivial. Imagine making an assertion that two people are the same because they share exactly the same name, address and home phone number – only later to learn through another series of observations that these are really two different people (a junior and a senior). Our “self-correcting false positives” feature self-corrects for these rare cases, in real-time. We consider our ability to perform sequence neutrality at scale one of several breakthrough aspects of our work.
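The junior and senior case can be illustrated with a toy sketch. It assumes observations are retained in full and that a conflicting date of birth is the evidence that splits an earlier match. The `Resolver` class and its rules are hypothetical and far simpler than what real-time correction at scale requires.

```python
from collections import defaultdict


class Resolver:
    def __init__(self):
        self.groups = defaultdict(list)  # (name, address, phone) -> retained observations

    def add(self, obs):
        key = (obs["name"], obs["address"], obs["phone"])
        self.groups[key].append(obs)
        # Re-evaluate only the assertions this observation touches.
        return self.entities(key)

    def entities(self, key):
        group = self.groups[key]
        dobs = {o["dob"] for o in group if o.get("dob")}
        if len(dobs) <= 1:
            # No conflicting evidence: everything in the group is one entity.
            return [[o["id"] for o in group]]
        # Conflicting birth dates: the earlier "same person" assertion is reversed.
        by_dob = defaultdict(list)
        undated = []
        for o in group:
            if o.get("dob"):
                by_dob[o["dob"]].append(o["id"])
            else:
                # Ambiguous observations stand alone rather than being guessed into a side.
                undated.append([o["id"]])
        return [by_dob[d] for d in sorted(by_dob)] + undated


resolver = Resolver()
base = {"name": "John Smith", "address": "1 Main St", "phone": "555-0100"}
resolver.add({"id": "A", "dob": "1950-02-01", **base})
before = resolver.add({"id": "B", **base})                      # A and B asserted as one person
after = resolver.add({"id": "C", "dob": "1980-06-15", **base})  # new evidence splits them
```

The correction is only possible because observations A and B were kept intact rather than merged into a single surviving record.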
Information Transfer Accounting
Every secondary transfer of data, whether to human eyeball or tertiary system, can be recorded to allow stakeholders (e.g., data custodians or the consumers themselves) to determine how their data is flowing. Why is this so important?
- It is often cumbersome to learn who has seen what records, or what records have been shared with tertiary systems.
- Much like a US credit report that contains an inquiries section exposing the list of recent inquiring parties, now so can your medical or financial file.
- Users can now be easily provided with such disclosures, increasing transparency and control, e.g., enabling a consumer in some cases to request an information recall.
- When there is a series of leaks, information transfer accounting makes discovery of who accessed all records in the series quite trivial. This can narrow an investigation when looking for criminals within.
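A minimal sketch of such a ledger follows, assuming every transfer is recorded with the record, the recipient, and a purpose. The `TransferLedger` class and its field names are hypothetical. It shows both uses described above: listing the disclosures for one record, and finding who received every record in a leaked series.

```python
from datetime import datetime, timezone


class TransferLedger:
    def __init__(self):
        self.entries = []

    def record(self, record_id, recipient, purpose, when=None):
        self.entries.append({
            "record_id": record_id,
            "recipient": recipient,  # a person or a downstream system
            "purpose": purpose,
            "when": when or datetime.now(timezone.utc),
        })

    def disclosures_for(self, record_id):
        """Everyone who received this record, like the inquiries section of a credit report."""
        return [e for e in self.entries if e["record_id"] == record_id]

    def accessed_all(self, record_ids):
        """Recipients who received every record in a leaked series."""
        wanted = set(record_ids)
        seen = {}
        for e in self.entries:
            if e["record_id"] in wanted:
                seen.setdefault(e["recipient"], set()).add(e["record_id"])
        return sorted(r for r, ids in seen.items() if ids == wanted)


ledger = TransferLedger()
ledger.record("R1", "analyst_a", "fraud review")
ledger.record("R2", "analyst_a", "fraud review")
ledger.record("R1", "partner_system", "nightly feed")
ledger.record("R2", "analyst_b", "customer call")

# Only one recipient received both leaked records.
suspects = ledger.accessed_all(["R1", "R2"])
r1_recipients = [e["recipient"] for e in ledger.disclosures_for("R1")]
```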
For more information about Privacy by Design (PbD) and the unique privacy-enhancing features of G2, see “Privacy by Design in the Era of Big Data” (June 8, 2012), a joint paper I released with Ann Cavoukian, at the time the Information and Privacy Commissioner of Ontario, Canada, available here.