
How Can Companies Prevent AI-based Re-Identification Attacks?

A powerful new form of privacy breach is on the rise, leveraging AI to re-identify individuals based on behavior patterns that can be inferred from data. These AI-based re-identification attacks can unravel anonymized data, exposing individual identities. This growing risk underscores the urgent need for privacy-preserving technologies – particularly differential privacy – as a secure and forward-looking solution, even in the face of such sophisticated attacks. In light of this real threat, what actionable steps can bolster protection, and how can differential privacy help prevent these attacks?

In this article, we look at how these attacks work, which anonymization techniques help defend against them, and what prevention strategies – including differential privacy – companies can put in place.

Understanding AI-Based Re-Identification Attacks

What is an AI-based attack?

An AI-based re-identification attack uses advanced artificial intelligence techniques to uncover the identities of individuals in datasets that have been anonymized.

Consider a scenario where a company collects customer data, including demographics and purchasing behavior. To safeguard privacy, the company anonymizes the data by removing direct identifiers such as names and replacing them with random identifiers. However, despite these measures, it remains possible for individuals to be linked to their anonymized data points through additional information.

For instance, if an attacker gains access to the anonymized dataset alongside social media data, they can employ AI algorithms to analyze patterns and correlations between the two datasets. By identifying similarities between a user’s purchasing behavior and their social media activity, the attacker can potentially unveil their identity.

As mentioned in our previous article, “The Most Common Data Anonymization Techniques,” one example of a re-identification attack is the 2007 Netflix Prize attack, in which two data scientists re-identified individuals in an anonymized database of over 480,000 Netflix subscribers. They showed that by cross-referencing the anonymized Netflix dataset with publicly available information from the Internet Movie Database (IMDb) – matching movie ratings and timestamps between the two datasets – they could link anonymized Netflix records to specific IMDb user accounts, thereby identifying individuals.1

Another example could involve healthcare data (as described below). Suppose a hospital releases a dataset for research purposes, anonymizing patient information by removing names and other direct identifiers. However, by cross-referencing the anonymized healthcare data with publicly available information on medical conditions or treatments, AI algorithms can potentially re-identify patients (see the case of the Mayo Clinic experiment mentioned in our previous article: “The Impact of Privacy Preserving Technologies on Data Privacy”).

In essence, AI-based re-identification attacks exploit patterns and correlations in anonymized datasets, combined with auxiliary information, to unveil individuals’ identities. These attacks highlight the importance of robust data anonymization techniques and the need for proactive measures to protect sensitive information.

Types of attacks

Privacy regulations such as GDPR and CCPA define anonymity in a dataset as the inability to reasonably re-identify any record with a natural person or household. In the past, simply masking direct identifiers like names or social security numbers was considered sufficient for anonymization. However, it is now understood that the combination of remaining attributes can still allow individuals to be re-identified. While masking may increase the effort required for manual re-identification, it does not prevent AI-based attacks.

Understanding how re-identification attacks work and how they evolve due to technological advancements is crucial. These attacks can be driven by various motives, such as financial gain, malicious intent, or research curiosity. Re-identification attacks can be broadly classified into three types: linkage attacks, inference attacks, and reconstruction attacks.2

  • Linkage attacks are the most prevalent type of re-identification attack. They work by linking a de-identified dataset with auxiliary information about specific individuals, such as names or addresses, and finding overlapping matches between the attributes the two datasets share, so that direct identifiers can be attributed to the supposedly anonymous records (a minimal sketch of such a join follows after this list). In 2018, researchers from Imperial College London and the Belgian research group Data & Society conducted an experiment to demonstrate the vulnerability of anonymized data.3 They obtained an anonymized dataset from the UK government’s NHS Digital, containing information on diagnoses and treatments for nearly 1.1 million patients. Despite the removal of direct identifiers to comply with privacy regulations, the researchers linked this anonymized healthcare data with publicly available voter registration records from the UK’s electoral roll. Using advanced data analysis techniques and machine learning algorithms, they successfully re-identified individuals in the healthcare dataset. For instance, they identified specific individuals based on unique combinations of demographic characteristics like age, gender, and postcode present in both datasets. By cross-referencing these shared attributes, they matched healthcare records with voter registration information, thereby revealing individuals’ identities.
  • Inference attacks rely on statistical or machine learning methods to extract sensitive information from anonymized data, such as gender, age, or health status. A notable example of an inference attack is the SafeGraph dataset incident.4 In 2018, SafeGraph, a data company, released a dataset containing anonymized location information from mobile devices. The aim was to assist researchers in understanding human mobility patterns while preserving individual privacy. However, researchers from the University of Washington and Stanford University demonstrated that the anonymized location data in the SafeGraph dataset was susceptible to both linkage and inference attacks. By merging the SafeGraph dataset with publicly available information, such as social media posts and online reviews, the researchers were able to infer individuals’ identities and sensitive details. In fact, through sophisticated analysis and correlation techniques, the researchers could link anonymized location data to specific individuals or groups of individuals. Furthermore, by uncovering patterns and associations between the anonymized data and auxiliary information, they inferred sensitive details about individuals’ routines, habits, and preferences.
  • Reconstruction attacks represent the most sophisticated form of re-identification attacks, utilizing multiple anonymized datasets or queries to reconstruct the original data or a close approximation of it. In 2018, researchers from the University of Melbourne and the CSIRO’s Data61 conducted a study demonstrating the re-identification of individuals from supposedly anonymized medical records. Despite removing direct identifiers such as names and addresses, the dataset still contained indirect identifiers like age, gender, and medical procedures, which the researchers leveraged to infer individuals’ identities. By merging the anonymized medical records with publicly available information, such as voter registration records and online social media profiles, the researchers successfully reconstructed the identities of individuals in the dataset. This reconstruction attack exposed the vulnerability of ostensibly anonymized medical datasets and raised significant concerns regarding the privacy implications of releasing such data for research purposes.5
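
To make the linkage idea concrete, here is a minimal sketch of how such an attack reduces to a simple join on shared quasi-identifiers. It assumes Python with pandas, and the datasets, column names, and records are purely illustrative:

```python
import pandas as pd

# Hypothetical "anonymized" dataset: direct identifiers removed,
# but quasi-identifiers (age, gender, postcode) remain
anonymized = pd.DataFrame({
    "record_id": [101, 102, 103],
    "age":       [34, 47, 34],
    "gender":    ["F", "M", "F"],
    "postcode":  ["SW1A", "EC1A", "NW1"],
    "diagnosis": ["asthma", "diabetes", "flu"],
})

# Hypothetical public auxiliary data (e.g. a voter roll) that includes names
auxiliary = pd.DataFrame({
    "name":     ["Alice Smith", "Bob Jones"],
    "age":      [34, 47],
    "gender":   ["F", "M"],
    "postcode": ["SW1A", "EC1A"],
})

# The attack is essentially a join on the shared quasi-identifiers:
# any unique match re-attaches a name to an "anonymous" record
linked = anonymized.merge(auxiliary, on=["age", "gender", "postcode"], how="inner")
print(linked[["name", "diagnosis"]])
```

Real attacks use far larger datasets and fuzzier matching (often with machine learning), but the underlying principle is the same: overlapping attributes act as a fingerprint.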

The Importance of Data Anonymization Techniques

Three types of data anonymization techniques

Data anonymization techniques are pivotal in safeguarding sensitive data while retaining its usefulness for analysis and AI applications. Through anonymization, organizations can mitigate the threat of re-identification attacks and ensure compliance with data privacy regulations. Three foundational techniques are:

  • Pseudonymization involves substituting identifiable information with pseudonyms or aliases. By replacing direct identifiers like names or social security numbers with pseudonyms, organizations can obscure individuals’ identities while maintaining data integrity and usability.
  • Generalization entails aggregating data to a higher level of abstraction, thereby reducing its granularity. For instance, data can be grouped into age ranges instead of recording precise ages. By generalizing data, companies can prevent the disclosure of sensitive information while preserving overall data trends and patterns.
  • Noise addition involves introducing random noise to datasets to mask sensitive details and prevent the inference of individual attributes. Adding random perturbations to data points makes it harder for attackers to discern meaningful information, and organizations can enhance privacy protections without significantly impacting data utility (a short sketch follows below).
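
As a minimal sketch of noise addition (assuming Python with NumPy; the values and noise scale are illustrative, not drawn from any particular system):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Hypothetical example: exact ages of five individuals
ages = np.array([34, 29, 41, 56, 23], dtype=float)

# Add zero-mean Laplace noise; the scale controls the privacy/utility trade-off
noise_scale = 2.0
noisy_ages = ages + rng.laplace(loc=0.0, scale=noise_scale, size=ages.shape)

print(noisy_ages)         # perturbed values are released instead of the originals
print(noisy_ages.mean())  # aggregate statistics stay close to the true mean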

The most common techniques

The following techniques help protect against AI-based re-identification attacks (see also our article: “The Most Common Data Anonymization Techniques”):

  • Differential privacy: this technique adds noise to query responses or datasets to ensure individual data points cannot be distinguished, thus safeguarding privacy while allowing for accurate analysis.
  • Generalization and suppression: generalization involves replacing specific data values with more generalized categories to reduce the risk of identification. Suppression removes or masks certain attributes altogether from datasets to prevent disclosure.
  • Data perturbation: perturbing data by introducing random noise or altering values makes it more difficult for attackers to re-identify individuals while preserving the overall integrity of the dataset.
  • K-anonymity ensures that each record in a dataset is indistinguishable from at least k-1 other records with respect to certain quasi-identifiers, preventing the identification of individuals (a short sketch of generalization and a k-anonymity check follows after this list).
  • L-diversity extends k-anonymity by ensuring that each group of k-anonymous records contains at least l distinct sensitive attribute values, reducing the risk of attribute disclosure.
  • T-closeness requires that the distribution of sensitive attribute values within each k-anonymous group is similar to the distribution in the overall dataset, minimizing the risk of attribute disclosure.
  • Data masking and tokenization: data masking removes or hides values (for example, an entire column is made inaccessible or its values are replaced with NULL or *), while tokenization replaces values with unique tokens, protecting privacy while still allowing analysis and processing.
  • Hashing: identifiers are replaced with unique hash codes, so that records belonging to the same individual can be matched without disclosing their PII.
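
To illustrate how generalization and a k-anonymity check might look in practice, here is a minimal sketch assuming Python with pandas; the table, column names, and value of k are hypothetical:

```python
import pandas as pd

# Illustrative records with quasi-identifiers (age, zip_code) and a sensitive attribute
df = pd.DataFrame({
    "age":       [34, 36, 29, 27, 52, 55],
    "zip_code":  ["10001", "10002", "10001", "10003", "20001", "20004"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes", "asthma", "flu"],
})

# Generalization: coarsen the quasi-identifiers into age bands and 3-digit zip prefixes
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["<30", "30-50", "50+"])
df["zip_prefix"] = df["zip_code"].str[:3]

# k-anonymity check: every (age_band, zip_prefix) group must contain at least k records
k = 2
group_sizes = df.groupby(["age_band", "zip_prefix"], observed=True).size()
print(group_sizes)
print("k-anonymous for k =", k, ":", bool((group_sizes >= k).all()))
```

If any group is smaller than k, further generalization or suppression of those records is needed before release.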

How data anonymization methods are employed

By employing a combination of these data anonymization techniques, organizations can effectively mitigate the risk of AI-based re-identification attacks and safeguard sensitive data from unauthorized disclosure. Here’s how each technique can be utilized to counter specific types of attacks:

To prevent linkage attacks:

  • Employ robust anonymization techniques such as k-anonymity, l-diversity, or t-closeness. These ensure that each anonymized record is indistinguishable from at least k-1 other records and that each group of records exhibits sufficient diversity and closeness of sensitive values.
  • Avoid using unique or quasi-identifiers like social security numbers or zip codes; instead, replace them with random or synthetic values to prevent easy identification.

To prevent inference attacks:

  • Implement noise injection, which adds random or controlled errors to the data, reducing its accuracy and utility for attackers.
  • Utilize differential privacy to ensure that the presence or absence of an individual in the data does not significantly influence the outcome of any analysis (see the sketch after this list).
  • Limit the amount and granularity of data released or shared and employ access control and encryption mechanisms to protect sensitive information.
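
As an illustration of how noise injection and differential privacy can limit what an attacker infers from query results, here is a minimal sketch of a differentially private count query using the Laplace mechanism (Python with NumPy; the data and epsilon are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon):
    """Return a differentially private count of records satisfying `predicate`.

    A counting query has sensitivity 1 (adding or removing one person changes
    the result by at most 1), so Laplace noise of scale 1/epsilon yields
    epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical dataset: patient ages
ages = [23, 34, 45, 52, 61, 29, 38, 70]
print(dp_count(ages, lambda a: a >= 50, epsilon=0.5))  # noisy answer to "how many are 50+?"
```

Smaller epsilon values add more noise and give stronger privacy; repeated queries consume the privacy budget and must be accounted for.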

To prevent reconstruction attacks:

  • Implement secure multiparty computation, enabling multiple parties to perform computations on their data without revealing it to each other or a third party.
  • Utilize homomorphic encryption, allowing operations to be performed on encrypted data without decryption.
  • Adopt federated learning, enabling distributed learning from local data without centralizing it, thereby reducing the risk of data exposure. Differential privacy should still be incorporated to provide an additional layer of protection for individual data points in the training process (a conceptual sketch follows after this list).
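
As a conceptual sketch of that last point – aggregating client updates with clipping and noise, loosely in the spirit of differentially private federated averaging – the following assumes NumPy, with illustrative parameter values; it is not a production implementation:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_federated_average(client_updates, clip_norm, noise_multiplier):
    """Aggregate client model updates with clipping and Gaussian noise.

    Clipping bounds each client's influence on the aggregate; the added noise,
    calibrated to the clip norm, protects individual updates. A real deployment
    would also track the cumulative privacy budget with a DP accountant.
    """
    clipped = []
    for update in client_updates:
        norm = max(np.linalg.norm(update), 1e-12)
        clipped.append(update * min(1.0, clip_norm / norm))
    aggregate = np.mean(clipped, axis=0)
    noise = rng.normal(loc=0.0,
                       scale=noise_multiplier * clip_norm / len(client_updates),
                       size=aggregate.shape)
    return aggregate + noise

# Hypothetical updates from three clients (e.g. gradients of a tiny model)
updates = [rng.normal(size=4) for _ in range(3)]
print(noisy_federated_average(updates, clip_norm=1.0, noise_multiplier=0.5))
```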

By incorporating these strategies into their data protection practices, organizations can significantly enhance their resilience against AI-based re-identification attacks and uphold the privacy and security of sensitive data.


Implementing Differential Privacy

How can differential privacy help?

Indeed, we find ourselves in an era where vast amounts of data points are collected for everyone, forming unique digital fingerprints that make it increasingly effortless for AI to detect matching behavioral patterns. The more attributes captured for an individual, the more conspicuous they become. Consequently, legacy anonymization methods often prove inadequate against linkage and inference attacks, unless a significant portion of the information is sacrificed.

To effectively mitigate the risk of re-identification by AI, organizations must embrace technologies that enable them to achieve commercial objectives without compromising data privacy and security. Advancements in privacy-preserving techniques and differential privacy offer promising solutions. Differential privacy serves as a potent tool for safeguarding sensitive data while enabling meaningful analysis:

“Differential Privacy ensures that the presence or absence of an individual’s information does not significantly influence the outcome of queries.”
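
Formally, this is the standard definition of ε-differential privacy: a randomized mechanism M satisfies it if, for every pair of datasets D and D′ that differ in a single individual’s record and every set of possible outputs S,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

The smaller ε is, the less any one person’s data can change what an observer sees, and the stronger the privacy guarantee.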

By integrating differential privacy mechanisms into their data processing workflows, companies can bolster their data protection and anonymization efforts and reduce the likelihood of AI-based re-identification attacks.

A case study: Implementing differential privacy in genomics6

Machine learning holds significant implications for genomics applications, particularly in precision medicine, where treatment is tailored to a patient’s clinical and genetic features.7 The rapid proliferation of genomics datasets to support statistical analyses and machine learning research raises genuine privacy concerns. As seen above, linkage attacks exploit overlaps between information in public databases and sensitive datasets.

Illustrative examples of linkage attacks include instances where de-identified hospital records were linked with a voter registration database, leading to the successful identification of the Governor of Massachusetts’s patient profile.8

Traditional solutions to mitigate this risk include de-identification and k-anonymization. However, these approaches have limitations: de-identification may lead to the loss of meaningful information crucial for analyses, while k-anonymization lacks formal privacy guarantees and remains vulnerable to linkage attacks.

Differential privacy offers notable benefits for addressing these challenges. It protects against linkage attacks and operates in interactive and non-interactive settings. In the interactive setting, queries to non-public databases are either injected with noise or only summary statistics are released, while in the non-interactive setting, noise is injected into public data. However, differential privacy also presents drawbacks, including the challenge of balancing privacy with utility and restrictions on query types, limiting flexibility to preset queries such as returning p-values or the location of the top K SNPs. These considerations underscore the complex trade-offs involved in adopting differential privacy approaches for genomic applications (see also the “Privacy Gradient” in our article: “The Most Common Data Anonymization Techniques”).
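
As a rough sketch of how a preset query such as “the top K SNPs” might be answered under differential privacy (Python with NumPy; the scores, the sensitivity assumption, and the way the budget is split across the K selections are simplified for illustration):

```python
import numpy as np

rng = np.random.default_rng()

def noisy_top_k(scores, k, epsilon):
    """Return the indices of the k largest scores after adding Laplace noise.

    The epsilon budget is split evenly across the k selections, and the scores
    are assumed to have sensitivity 1; a production mechanism would calibrate
    the noise to the exact sensitivity and use formal composition accounting.
    """
    scores = np.asarray(scores, dtype=float)
    noisy = scores + rng.laplace(scale=k / epsilon, size=scores.shape)
    return np.argsort(noisy)[-k:][::-1]

# Hypothetical per-SNP association scores
snp_scores = [0.9, 3.2, 1.1, 4.8, 0.4, 2.7]
print(noisy_top_k(snp_scores, k=2, epsilon=1.0))  # indices of the (noisily) top 2 SNPs
```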

Prevention strategies for companies:

AI-based re-identification attacks pose significant risks for companies. Below, we outline three key steps companies can take to proactively address these risks:

1. Conduct privacy assessments:

  • Regularly assess privacy risks associated with datasets and AI systems to identify vulnerabilities.
  • Evaluate the effectiveness of current anonymization techniques, employ differential privacy, if possible, and implement additional safeguards as necessary.

2. Implement access controls:

  • Restrict access to sensitive datasets and AI models to authorized personnel.
  • Employ robust access controls and encryption mechanisms to prevent unauthorized data access or manipulation.

3. Educate employees:

  • Provide comprehensive training on data privacy best practices and the risks of AI-based re-identification attacks.
  • Cultivate a culture of privacy awareness within the organization to ensure employees understand their roles in safeguarding sensitive data.

Conclusion

As we have seen, the risks associated with AI-based re-identification attacks are multifaceted and can have severe consequences for individuals and organizations alike, such as:

  • Breach of confidentiality
  • Violation of privacy regulations
  • Loss of trust and reputation
  • Legal implications and financial penalties

Preventing AI-based re-identification attacks requires a multi-faceted approach that combines robust privacy-preserving techniques with proactive measures and ongoing vigilance. By implementing differential privacy and other data anonymization methods, companies can effectively protect sensitive data and mitigate the risks posed by re-identification attacks. Safeguarding data isn’t just a legal requirement; it’s essential for maintaining trust and integrity in an increasingly data-driven world.

Analyze Structured Data with AI using PVML

PVML helps you unlock access to sensitive data using free text without compromising data privacy by combining our privacy-enhancing technology with AI. Our platform enforces permissions when data is accessed with AI, without the need to tag personally identifiable information (PII) or mask the data and without risking customers’ privacy.


1 Kris Kuo, “You can be identified by your Netflix watching history”, 31 July 2020, Artificial Intelligence in Plain English, https://ai.plainenglish.io/ahh-the-computer-algorithm-still-can-find-you-even-there-is-no-personal-identifiable-information-6e077d17381f
2 https://www.linkedin.com/advice/3/what-most-effective-techniques-minimizing-da
3 https://www.grcworldforums.com/privacy-and-technology/why-anonymisation-does-not-work-for-big-data/337.article
4 Bennett Cyphers, “SafeGraph’s Disingenuous Claims About Location Data Mask a Dangerous Industry”, 6 May 2022, Electronic Frontier Foundation, https://www.eff.org/deeplinks/2022/05/safegraphs-disingenuous-claims-about-location-data-mask-dangerous-industry
5 Cameron Abbott, “De-identification of Data Privacy”, 28 February 2018, The National Law Review, https://www.natlawreview.com/article/de-identification-data-and-privacy
6 This case is drawn from a study carried out by OpenMined: https://blog.openmined.org/use-cases-of-differential-privacy/#f2
7 Bonnie Berger, “Emerging technologies towards enhancing privacy in genomic data sharing”, 2 July 2019, Genome Biology, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1741-0
8 Ibid.
