
Top 4 Data Privacy Misconceptions by Corporates

Data privacy is no longer a “nice to have”; it is imperative for proper business operation.

In this article, we lay out the top data privacy misconceptions that need refutation right now to avoid compromising invaluable corporate Intellectual Property (IP) or customer data.

1. “We anonymize the data, so we can safely utilize it.”

Data anonymization is the technique of removing personally identifiable information (PII) from a dataset in the hope of retaining enough useful information to uncover hidden links within the data and make data-informed decisions without invading privacy. In practice, re-identifying such data and tracing it back to the individuals it describes is astonishingly easy. This process, known as a linkage attack, combines the ‘anonymized’ information with auxiliary data to reveal a specific individual.

Some well-publicized examples are:

  • The 1997 re-identification of Massachusetts Governor William Weld’s medical records within an insurance dataset that had been stripped of direct identifiers.
  • Latanya Sweeney’s research (2000) showed that 87 percent of the United States population could be uniquely identified by just their five-digit ZIP code, combined with their gender and date of birth (among many other combinations).
  • The 2006 paper “How To Break Anonymity of the Netflix Prize Dataset”, in which two researchers demonstrated how little auxiliary information is required to de-anonymize the published Netflix dataset.
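
To make the linkage attack concrete, here is a minimal sketch assuming two hypothetical tables: an ‘anonymized’ medical dataset and a publicly available voter roll that happen to share the quasi-identifiers from Sweeney’s study (ZIP code, gender, and date of birth). All names and records below are invented for illustration.

```python
import pandas as pd

# Hypothetical "anonymized" medical records: names removed, but
# quasi-identifiers (ZIP, gender, date of birth) left intact.
medical = pd.DataFrame({
    "zip": ["02138", "02139", "02141"],
    "gender": ["M", "F", "F"],
    "dob": ["1945-07-31", "1962-03-14", "1990-11-02"],
    "diagnosis": ["cardiac condition", "diabetes", "asthma"],
})

# Publicly available auxiliary data (e.g., a purchased voter roll)
# containing names alongside the same quasi-identifiers.
voter_roll = pd.DataFrame({
    "name": ["Alice Adams", "Bella Brown", "Carol Clark"],
    "zip": ["02138", "02139", "02141"],
    "gender": ["M", "F", "F"],
    "dob": ["1945-07-31", "1962-03-14", "1990-11-02"],
})

# The linkage attack: join the two tables on the quasi-identifiers.
# Every unique (zip, gender, dob) combination re-attaches a name
# to an "anonymized" diagnosis.
reidentified = medical.merge(voter_roll, on=["zip", "gender", "dob"])
print(reidentified[["name", "diagnosis"]])
```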

Furthermore, in practice, removing or ignoring some of the features in the data can impede performance, as rich data is crucial for extracting accurate insights and training AI models; as Cynthia Dwork put it, “data cannot be fully anonymized and remain useful.”

Some may think that a simple solution is to avoid sharing raw data; however, analyzing raw data without ever exposing it comes with its own privacy risks, as we demonstrate next.

2. “We ask aggregated queries over encrypted data; therefore, no one can reveal any sensitive information about individuals.”

Outputs from eyes-off analysis of data can still reveal sensitive information. We already established that excluding PII from the computation makes the result less useful and accurate, since crucial value is left on the table. But if PII is included in the computation, we are at risk of re-identification even without access to the raw records, as the following simple differencing attack demonstrates:

  • Query 1: “How many people in the dataset have trait X?”
  • Query 2: “How many people in the dataset not named Bob have trait X?”

Subtracting the two answers reveals Bob’s X status, even though each query on its own looks like a harmless aggregate.
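
To make the attack concrete, here is a minimal sketch assuming a hypothetical interface that only ever returns aggregate counts (names and traits below are invented for illustration):

```python
# Hypothetical dataset; the analyst never sees these raw records,
# only the aggregate counts returned by count_with_trait().
people = [
    {"name": "Alice", "has_trait_x": True},
    {"name": "Bob",   "has_trait_x": True},
    {"name": "Carol", "has_trait_x": False},
]

def count_with_trait(exclude_name=None):
    """Aggregate-only query: how many people have trait X?"""
    return sum(
        p["has_trait_x"] for p in people if p["name"] != exclude_name
    )

# Query 1: everyone with trait X.
q1 = count_with_trait()
# Query 2: everyone except Bob with trait X.
q2 = count_with_trait(exclude_name="Bob")

# The difference of two "harmless" aggregates pins down Bob exactly.
print("Bob has trait X:", (q1 - q2) == 1)
```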

A possible mitigation is online query auditing: denying queries that could cause a breach of privacy. The fundamental problems are that query denials can themselves leak information (the decision to deny may depend on the sensitive data), and that, given a rich query language, auditing can simply be computationally infeasible.

3. “We analyze data internally by trusted employees, so there is practically no risk involved.”

In most cases, employees are not malicious and do not seek to harm the company. However, this does not make privacy incidents caused by human mistakes any less harmful to a business. The 2022 Cost of Insider Threats Global Report reveals that insider threats caused by careless or negligent employees are the most prevalent (56% of incidents), with an average annualized remediation cost of $6.6M.

But even in an imaginary setting where no privacy regulations apply and companies are free to utilize their data assets as they see fit, organizations should still be wary of letting their valuable IP fall into the wrong hands, since losing IP means losing a competitive advantage. Moreover, consumers are becoming increasingly aware of their privacy and expect companies to take active measures to protect it, irrespective of any specific privacy laws. According to a PCI Pal report, 21% of consumers will never return to a business after a data breach, so it’s safe to say that customers will hold companies accountable even if the law won’t.

4. “We use the data to train machine learning models, so private data can’t leak.”

Companies increasingly incorporate machine learning algorithms as part of their systems. Such models are exposed to reconstruction attacks, in which an attacker with access only to the model’s outputs reconstructs the private training data behind those outputs.

Privacy in machine learning seems to receive less attention than general data privacy, even though breaking the privacy of a supervised neural network has proven to be rather easy.
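
To illustrate how easily model outputs can leak training data, here is a toy membership-inference sketch (a close cousin of reconstruction attacks). It assumes a deliberately overfit nearest-neighbour model exposed as a prediction API; the data is synthetic and the whole setup is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive training data: 5 features per person plus a
# private numeric label (say, salary). The attacker never sees this table.
X_train = rng.normal(size=(50, 5))
y_train = rng.normal(size=50)

def model_predict(x):
    """A deliberately overfit 1-nearest-neighbour model exposed as an API:
    it returns the label of the closest training record."""
    nearest = np.argmin(np.linalg.norm(X_train - x, axis=1))
    return y_train[nearest]

def membership_score(record, claimed_label):
    """Attacker side: only model outputs are used. A near-zero error strongly
    suggests the (record, label) pair was memorised during training."""
    return abs(model_predict(record) - claimed_label)

print(membership_score(X_train[0], y_train[0]))   # exactly 0.0 -> was in training
print(membership_score(rng.normal(size=5), 0.0))  # typically far from 0 -> was not
```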

Takeaway

In some cases, drilling down to an individual’s data record is inevitable: if the company has a finance team responsible for tracking how much each customer owes for different services, or a shipping department that needs personal addresses to deliver products, then clearly privacy cannot be fully preserved. Most of the risk, however, lies where larger groups of employees are granted access to sensitive data for what is actually high-level analysis. That is where the misconceptions above are riskiest, both for the sensitive data and for the company’s intellectual property, and, most importantly, where they are avoidable. But neither removing PII nor eyes-off analysis will do the trick, so what’s left?

To cover the blind spots of anonymization and encryption techniques, Differential Privacy was invented: it allows learning useful information about a population while learning nothing about any individual, and it is the concept companies should adopt to safely expose data teams to sensitive data. We give an introduction to Differential Privacy in our blog post “What’s Differential Privacy?“.
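
As a minimal sketch of the idea, here is the counting query from the differencing attack above answered with the classic Laplace mechanism; the epsilon value is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(true_count, epsilon=0.5):
    """Differentially private count: a counting query has sensitivity 1
    (one person changes the count by at most 1), so adding Laplace noise
    with scale 1/epsilon satisfies epsilon-differential privacy."""
    return true_count + rng.laplace(scale=1.0 / epsilon)

# The two queries from the differencing attack now return noisy answers,
# so their difference no longer pins down Bob's status with certainty,
# while large-population counts stay accurate up to noise on the order
# of 1/epsilon.
print(dp_count(1042))   # e.g. ~1040.7
print(dp_count(1041))   # e.g. ~1043.2
```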

Stay tuned for more privacy-related posts!

