Data privacy is no longer a “nice to have”; it is imperative to proper business operation.
In this article we lay out the top data privacy misconceptions that need to be dispelled right now to prevent invaluable corporate intellectual property (IP) or customer data from being compromised.
1. “We anonymize the data, so we can safely utilize it.”
Data anonymization is the technique of removing personally identifiable information (PII) from a dataset in the hope of retaining enough useful information to uncover hidden links within the data and make data-informed decisions without invading privacy. In practice, such data can be re-identified astonishingly easily and traced back to the individuals it describes. This is known as a linkage attack: the “anonymized” information is combined with auxiliary information to single out a specific individual.
Some well-publicized examples are:
- The 1997 re-identification of Massachusetts Governor William Weld's medical data within an insurance data set which had been stripped of direct identifiers.
- Latanya Sweeney's research (2000) showed that 87 percent of the United States population could be uniquely identified by just their five-digit ZIP code, combined with their gender and date of birth (among many other combinations).
- The “How To Break Anonymity of the Netflix Prize Dataset” paper (2006), in which two researchers demonstrated how little information is required to de-anonymize the published Netflix dataset.
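The linkage attacks above all follow the same recipe: join the “anonymized” records with a public auxiliary dataset on the quasi-identifiers that survived anonymization. Here is a minimal sketch in Python; all records, names, and diagnoses are fabricated purely for illustration.

```python
# "Anonymized" medical records: direct identifiers (names) stripped,
# but quasi-identifiers (ZIP, gender, date of birth) kept.
medical = [
    {"zip": "02138", "gender": "F", "dob": "1961-07-31", "diagnosis": "flu"},
    {"zip": "02139", "gender": "M", "dob": "1945-02-14", "diagnosis": "hypertension"},
]

# Public auxiliary data (e.g., a voter registration list) that carries names.
voters = [
    {"name": "Alice Smith", "zip": "02138", "gender": "F", "dob": "1961-07-31"},
    {"name": "Bob Jones", "zip": "02139", "gender": "M", "dob": "1945-02-14"},
]

def link(medical, voters):
    """Re-identify 'anonymized' records by joining on quasi-identifiers."""
    index = {(v["zip"], v["gender"], v["dob"]): v["name"] for v in voters}
    return {
        index[(m["zip"], m["gender"], m["dob"])]: m["diagnosis"]
        for m in medical
        if (m["zip"], m["gender"], m["dob"]) in index
    }

# Each "anonymous" medical record now carries a name again.
print(link(medical, voters))
```

The join key here is exactly the (ZIP, gender, date of birth) triple from Sweeney's research; no direct identifier is needed at any point.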
Moreover, from a practical perspective, removing or ignoring some of the features in the data might hinder performance, as rich data is crucial to extracting accurate insights and training accurate AI models: “data cannot be fully anonymized and remain useful” (Cynthia Dwork).
Some may think that a simple solution is to avoid sharing raw data; however, analyzing raw data comes with its own privacy risks, as we demonstrate next.
2. “We ask aggregated queries over encrypted data; therefore, no one can reveal any sensitive information about individuals.”
Outputs from eyes-off analysis of data can still reveal sensitive information. We have already established that if PII is excluded from the computation, the result might not be as useful or accurate, since we are leaving crucial value on the table; but if PII is included in the computation, we are at risk of re-identification even without access to individual raw records, as demonstrated by the following simple differencing attack:
- Query 1: “How many people in the dataset have trait X?”
- Query 2: “How many people in the dataset, excluding Bob, have trait X?”
The difference between the two outputs reveals whether Bob has trait X.
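The two-query attack can be sketched in a few lines of Python; the dataset and trait are fabricated for illustration. Note that each query on its own is a harmless aggregate.

```python
# A toy dataset; no individual record is ever shown to the analyst.
dataset = [
    {"name": "Alice", "has_trait_x": True},
    {"name": "Bob", "has_trait_x": True},
    {"name": "Carol", "has_trait_x": False},
]

def count_with_trait(rows, exclude_name=None):
    """Aggregated query: how many rows (optionally excluding one name) have trait X?"""
    return sum(1 for r in rows if r["has_trait_x"] and r["name"] != exclude_name)

q1 = count_with_trait(dataset)                      # everyone with trait X
q2 = count_with_trait(dataset, exclude_name="Bob")  # everyone but Bob

# Differencing: subtracting the two aggregates exposes Bob's record.
bob_has_trait_x = (q1 - q2) == 1
print(bob_has_trait_x)  # True
```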
A possible mitigation is Online Query Auditing: denying queries that could potentially cause a breach of privacy. The fundamental problem is that the denials themselves might inadvertently leak information, and, given a rich query language, query auditing can simply be computationally infeasible.
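To see how a denial can leak, consider a naive auditor (a hypothetical sketch, with fabricated data) that refuses any count query which, combined with a previous answer, would isolate a single person. Because the refusal decision depends on the data itself, the refusal becomes a signal:

```python
# Toy data: name -> has trait X. Fabricated for illustration.
dataset = {"Alice": True, "Bob": True, "Carol": False}

answered = []  # answers already released to the analyst

def audited_query(rows, exclude_name=None):
    """Answer a count, but deny it if, together with a previous answer,
    it would pin down one individual's value."""
    result = sum(1 for name, v in rows.items() if v and name != exclude_name)
    for prev in answered:
        if abs(prev - result) == 1:  # would isolate exactly one person
            return "DENIED"
    answered.append(result)
    return result

q1 = audited_query(dataset)                      # answered: 2
q2 = audited_query(dataset, exclude_name="Bob")  # denied...
# ...but the denial only happens when Bob's record changed the count,
# so the analyst learns that Bob has trait X anyway.
print(q1, q2)
```

A data-independent denial policy would avoid this particular leak, but deciding safety for arbitrary queries is where the computational infeasibility bites.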
3. “We analyze data internally by trusted employees, so there is practically no risk involved.”
In most cases, employees are not malicious and do not seek to harm the company. However, this does not make privacy incidents caused by human mistakes any less harmful to a business. The 2022 Cost of Insider Threats Global Report reveals that insider threats caused by careless or negligent employees are the most prevalent (56% of incidents), with an average annual cost of $6.6M to remediate.
But even in an imaginary setting where no privacy regulations apply and companies are free to utilize their data assets as they see fit, organizations should still be wary of putting their valuable IP into the wrong hands, since losing IP means losing a competitive advantage. Moreover, consumers are becoming increasingly aware of their privacy and expect companies to take active measures to protect it, irrespective of any specific privacy laws. According to a PCI Pal report, 21% of consumers will never return to a business after a data breach, so it’s safe to say that customers will hold companies accountable even if the law won’t.
4. “We use the data to train machine learning models, so private data can't leak.”
Companies increasingly incorporate machine learning models into their systems. Such models are exposed to reconstruction attacks, in which an attacker with access to the model's outputs reconstructs the private training data behind them.
The topic of privacy in machine learning seems to be a niche compared to general data privacy, despite the fact that breaking the privacy of a supervised neural network has proven to be rather easy.
Recent examples include:
- Researchers directly extracting images on which a facial recognition model was trained.
- Researchers determining whether an individual was part of a model's training set or not ("the membership test").
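The membership test exploits the gap between a model's behavior on data it memorized and data it never saw. Below is a deliberately simplified, fabricated illustration (not a real attack on a neural network): a “model” that memorized its training points exactly, and an attacker who thresholds its prediction error to guess membership.

```python
# Fabricated (x, y) pairs: what the model trained on vs. what it never saw.
train_set = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
outsiders = [(1.5, 4.2), (2.5, 5.5)]

def overfit_predict(x):
    """A 'model' that memorized its training data exactly (worst case),
    falling back to a rough linear guess for unseen inputs."""
    memorized = dict(train_set)
    return memorized.get(x, 2 * x + 1.5)

def is_member(x, y, threshold=0.05):
    """Membership guess: tiny prediction error => probably a training point."""
    return abs(overfit_predict(x) - y) < threshold

print([is_member(x, y) for x, y in train_set])  # [True, True, True]
print([is_member(x, y) for x, y in outsiders])  # [False, False]
```

Real attacks use model confidence or loss instead of exact memorization, but the principle is the same: overfitting turns the model itself into a leak.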
In some cases, drilling down to an individual's data record is unavoidable: if the company has a finance team responsible for tracking how much each customer owes for different services, or a shipping department that needs access to personal addresses to deliver products, then clearly privacy cannot be fully preserved. However, most of the risk lies where larger groups of employees are granted access to sensitive data for what is actually high-level analysis. In that setting, the misconceptions above are the riskiest, both for the sensitive data and for the company’s IP, and, most importantly, they are avoidable. But neither removing PII nor eyes-off analysis will do the trick, so what’s left?
To cover the blind spots left by anonymization and encryption techniques, Differential Privacy was invented: it lets analysts learn useful information about a population while learning nothing about any specific individual, a guarantee that allows companies to safely expose data teams to sensitive data. We give an introduction to Differential Privacy in our blog post "What's Differential Privacy?".
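As a taste of how Differential Privacy defeats the differencing attack from misconception 2, here is a minimal sketch of the Laplace mechanism, the canonical way to privatize a counting query: add noise calibrated to the query's sensitivity, so that any single individual's presence barely shifts the output distribution. The data and epsilon value are illustrative.

```python
import random

def dp_count(rows, predicate, epsilon=1.0):
    """Differentially private count: true count + Laplace(sensitivity/epsilon) noise.
    A counting query has sensitivity 1: one person changes it by at most 1."""
    true_count = sum(1 for r in rows if predicate(r))
    # A Laplace(0, 1/epsilon) sample is the difference of two exponentials.
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

people = [{"has_trait_x": v} for v in (True, True, False, True)]
print(dp_count(people, lambda r: r["has_trait_x"]))  # roughly 3, plus noise
```

Because each answer is noisy, subtracting two counts no longer pins down one person's record, yet aggregate statistics over the whole population stay accurate.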
Stay tuned for more privacy-related posts!
Rina Galperin on Aug. 28, 2022