
Preserving Privacy in AI: Advanced Strategies and Solutions

‘Primum non nocere!’ This well-known Latin phrase from medicine means ‘first, do no harm.’ The principle matters because harming one part of a patient while treating another undermines the fundamental goal of medicine: to promote healing and improve overall health. Although this golden rule was originally intended for medicine, it extends naturally to other domains. It is highly relevant to the revolution we are witnessing in AI today, particularly concerning the protection of individuals’ privacy alongside technological progress. Simply put, technological advancement in AI must proceed without harming people’s privacy.

Risks brought by new AI-enabled products

From a business perspective, respecting privacy is imperative to building and maintaining a profitable business, because privacy concerns have a significant impact on a product’s success in the market. People’s worries about potential violations of their privacy must be taken seriously when releasing new products. Consider, for example, the newly emerging smart home appliances that interact with people through voice, cameras, or other sensors, such as robot vacuum cleaners, air conditioners, and refrigerators. These products collect data about users’ preferences, decisions, and routines, perceive their surroundings through their sensors, and act autonomously to make daily life easier by processing all this data with AI-based algorithms.

However, people are concerned about what might happen to the data these smart devices collect, seeing it as a potential threat to their privacy, and this concern poses a significant obstacle to the market acceptance of such products. It is therefore crucial for businesses to address these privacy concerns, and privacy-enhancing technologies (PETs) can be extremely useful in this regard, for example by transforming data so that privacy is preserved while the data remains usable for AI algorithms. Below, we explain how this can be achieved using advanced privacy-preserving techniques.

Differential Privacy

Differential Privacy (DP) is a robust mathematical framework for protecting privacy while sharing information, and it has become increasingly popular with the rise of data-driven techniques. In essence, DP perturbs data points or query results by adding carefully calibrated noise while preserving the statistical features of the data as much as possible. This prevents the deduction of individual data points from a published distribution or result. The amount of perturbation is controlled by the privacy parameter epsilon (ε): smaller values of epsilon mean more noise and a stronger privacy guarantee, while larger values allow more accurate results at the cost of weaker protection.
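
As a minimal sketch of the idea, the classic Laplace mechanism answers a counting query with ε-differential privacy by adding noise scaled to the query’s sensitivity (1 for a count) divided by epsilon. The dataset and query below are hypothetical examples:

```python
import numpy as np

def laplace_count(records, predicate, epsilon):
    """Answer a counting query with epsilon-differential privacy
    via the Laplace mechanism (a count has sensitivity 1)."""
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many users are older than 40?
ages = [23, 45, 31, 67, 52, 29, 41, 38]
print(laplace_count(ages, lambda age: age > 40, epsilon=0.5))  # more noise, stronger privacy
print(laplace_count(ages, lambda age: age > 40, epsilon=5.0))  # less noise, weaker privacy
```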

With the rise of machine learning and AI applications, new privacy threats have emerged, such as model inversion and membership inference attacks. A model inversion attack attempts to reconstruct input data from a model’s outputs by reverse engineering the model, often through iterative optimization. In a membership inference attack, by contrast, an adversary aims to infer whether a particular record was part of a model’s training dataset, which can pose a severe threat when that dataset contains sensitive information. DP provides strong protection against these types of attacks: the added random noise limits how much any single data point can influence the model’s behavior, making such inference attempts far less reliable.

In a machine learning setting, DP can be implemented in a few different ways. One approach applies DP to the original input data, perturbing it with noise before training. Another applies DP to the model’s outputs while using the input data in its original form. A third approach, specific to machine learning and neural networks, applies DP to the model weights: the weights are deliberately perturbed, either by adding noise to gradient updates during training or by perturbing the trained weights afterwards, to thwart the privacy attacks described above. Regardless of which approach is adopted, DP can provide effective protection against privacy threats to ML models. It has recently been shown that DP can also be applied to unstructured data such as images to protect users’ personal information [1], which is beneficial for camera-equipped products like the smart home devices mentioned above.
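
As a rough sketch of the training-time variant of the third approach (in the spirit of DP-SGD, assuming a simple logistic-regression model and hypothetical hyperparameters), each per-example gradient is clipped to bound its influence and Gaussian noise is added before the update:

```python
import numpy as np

def dp_sgd_step(weights, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD-style update for logistic regression: clip each
    per-example gradient, then add Gaussian noise to the summed gradient."""
    clipped = []
    for x, y in zip(X_batch, y_batch):
        pred = 1.0 / (1.0 + np.exp(-np.dot(weights, x)))             # sigmoid prediction
        grad = (pred - y) * x                                        # per-example gradient
        norm = np.linalg.norm(grad)
        clipped.append(grad * min(1.0, clip_norm / (norm + 1e-12)))  # bound sensitivity
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)
    noisy_grad = (np.sum(clipped, axis=0) + noise) / len(X_batch)
    return weights - lr * noisy_grad
```

Libraries such as Opacus (for PyTorch) and TensorFlow Privacy package this pattern together with accounting of the cumulative privacy budget.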

One challenge in implementing DP is tuning the amount of added noise appropriately. Adding more noise than necessary degrades the usefulness of the data, while adding too little leaves privacy vulnerabilities. Determining the right level of DP requires some analysis and examination of the data; there is a delicate balance between privacy protection and utility.


Federated Learning

Suppose you are engaged in a collaboration involving multiple parties, each possessing its own data. The objective is to develop a joint machine-learning model using the collective data of all participants, but owing to privacy concerns, no party wishes to share or disclose its individual data. Federated Learning (FL) addresses this situation by enabling model training in a decentralized manner: it is the model, rather than the data, that is shared and transferred. For example, Apple uses federated learning for its keyboard suggestions, where personalized language models are trained on-device from user interactions and only encrypted model updates are sent back to improve the overall system. This way, Apple can enhance the user experience while maintaining strict privacy standards.

Specifically, the FL process involves a central coordinator node and the participating parties. The central node initiates the process by creating a joint model, which may be initialized with random weights, and transfers it to all the involved parties. Upon receiving the shared joint model, each party updates it by training on its own local data for several epochs (full passes over that data). Each party then returns its updated model to the central node, which re-creates the common joint model by averaging the weights of all the returned models. This constitutes one round, and rounds are repeated until the joint model converges or the overall loss is sufficiently low. In this way, a joint model is trained over distributed data without sharing the raw data itself.
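
A minimal sketch of one such round, assuming the parties’ models are represented as NumPy weight arrays and `local_train` is a placeholder for each party’s local training step, might look like this:

```python
import numpy as np

def local_train(weights, local_data, epochs=1):
    """Placeholder for a party's local training: in practice, run
    several epochs of gradient descent on the party's own data."""
    # ... update `weights` using local_data ...
    return weights

def federated_round(global_weights, client_datasets):
    """One round of federated averaging: broadcast, train locally, average."""
    returned_models = []
    for local_data in client_datasets:
        local_weights = local_train(np.copy(global_weights), local_data)
        returned_models.append(local_weights)
    # The coordinator rebuilds the joint model by averaging the returned weights.
    return np.mean(returned_models, axis=0)
```

In practice, the average is usually weighted by each party’s number of training examples, as in the standard FedAvg algorithm.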

Although FL provides significant privacy protection by eliminating the need for data sharing during joint model training, privacy threats against the joint model and the individual data points behind it may remain. For example, an adversary with access to the joint model may attempt a membership inference attack targeting the parties involved in FL. To protect against this type of threat, DP can be integrated into the FL pipeline, for instance by applying it to the original data, to the joint model’s outputs, or to the model weights.
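
For illustration, one hedged way to combine the two is to clip each party’s model update and add noise at the coordinator before averaging (in the spirit of DP-FedAvg; the clip norm and noise scale below are arbitrary placeholders):

```python
import numpy as np

def dp_federated_average(global_weights, client_weights, clip_norm=1.0, noise_std=0.1):
    """Aggregate client models from clipped, noised updates (a DP-FedAvg-style sketch)."""
    clipped_updates = []
    for w in client_weights:
        update = w - global_weights                    # each party's model delta
        norm = np.linalg.norm(update)
        clipped_updates.append(update * min(1.0, clip_norm / (norm + 1e-12)))
    avg_update = np.mean(clipped_updates, axis=0)
    # Gaussian noise calibrated to the clip norm masks any single party's contribution.
    avg_update += np.random.normal(0.0, noise_std * clip_norm / len(client_weights),
                                   size=np.shape(global_weights))
    return global_weights + avg_update
```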

Several tools and libraries have emerged to facilitate the implementation of federated learning. One notable option is Flower [2], a framework-agnostic library that simplifies the development of FL systems by providing abstractions for client-server communication and aggregation, and that works with popular ML frameworks such as PyTorch and TensorFlow. TensorFlow Federated [3] (TFF) is another powerful library that extends TensorFlow to enable federated learning across various devices. These libraries provide a flexible and scalable platform for defining federated computations and orchestrating communication between clients and a central server, offering essential functionality, such as secure aggregation, to help preserve privacy during model training across decentralized devices.
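
As a hedged illustration of Flower’s (1.x-style) client API, a party can wrap its local training in a NumPyClient; the dummy weights and “training” step below are placeholders for a real model and dataset:

```python
import flwr as fl
import numpy as np

class SketchClient(fl.client.NumPyClient):
    """Minimal Flower client; a real client would train an actual model."""

    def get_parameters(self, config):
        return [np.zeros(10)]                     # hypothetical initial weights

    def fit(self, parameters, config):
        updated = [w + 0.01 for w in parameters]  # stand-in for local training
        return updated, 1, {}                     # weights, num_examples, metrics

    def evaluate(self, parameters, config):
        return 0.0, 1, {}                         # loss, num_examples, metrics

# Each party runs a client process; the coordinator runs fl.server.start_server(
#     config=fl.server.ServerConfig(num_rounds=3)) to orchestrate the rounds.
fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=SketchClient())
```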

Homomorphic Encryption

Homomorphic encryption (HE) is a game-changing approach to data security. Conventional encryption is a robust safeguard for protecting sensitive information and ensuring privacy in digital communications, but it does not allow mathematical operations to be performed on encrypted data: the data must be decrypted first, potentially leaving sensitive information vulnerable to privacy breaches. HE removes this limitation by enabling computations directly on encrypted data, so data can be shared in encrypted form while still supporting specific computational tasks. For example, IBM researchers used homomorphic encryption to run machine learning analysis on fully encrypted banking data; the fully homomorphic encryption scheme enabled predictions whose accuracy was comparable to that of models trained on unencrypted data [4].

The three types of homomorphic encryption are as follows:

  • Partially Homomorphic Encryption (PHE) restricts operations to a single mathematical function when performed on encrypted values. In PHE, either addition or multiplication can be executed an unlimited number of times on the encrypted data (see the sketch after this list).
  • Somewhat Homomorphic Encryption (SHE) offers a broader scope than PHE, supporting homomorphic operations involving both addition and multiplication. However, SHE has a constraint on the total number of operations that can be carried out on the encrypted data.
  • Fully Homomorphic Encryption (FHE) surpasses the limitations of PHE and SHE by allowing addition and multiplication operations without any restrictions on the number of times these operations can be performed on the encrypted data.
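
As a small illustration of the PHE case, the python-paillier (`phe`) library implements the Paillier cryptosystem, which is additively homomorphic: ciphertexts can be added to each other (and multiplied by plaintext scalars) without ever being decrypted. The numbers below are arbitrary examples:

```python
from phe import paillier

# Generate a keypair: the public key encrypts, the private key decrypts.
public_key, private_key = paillier.generate_paillier_keypair()

enc_a = public_key.encrypt(15)
enc_b = public_key.encrypt(27)

# Computations happen directly on the encrypted values.
enc_sum = enc_a + enc_b        # homomorphic addition of two ciphertexts
enc_scaled = enc_a * 3         # multiplication by a plaintext scalar

print(private_key.decrypt(enc_sum))     # 42
print(private_key.decrypt(enc_scaled))  # 45
```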

Homomorphic encryption has the potential to pave the way for conducting AI model training on encrypted data, which would be a ground-breaking revolution in AI, satisfying both the imperative need for enhanced data security and the growing demand for collaborative and privacy-preserving machine learning practices.

Conclusion

The rapid advancements in AI have sparked concerns regarding data collection, sharing, and handling, emphasizing the need to safeguard privacy. Addressing these concerns is crucial for the formation of a healthy, profitable, and sustainable ecosystem. It is imperative that AI advancements prioritize the protection of individuals’ privacy. Differential privacy, federated learning, and homomorphic encryption are three prominent and distinct methods in this regard, paving the way for collecting, sharing, and processing data in a privacy-preserving manner. Each of these methods may be more feasible or advantageous for different use scenarios; therefore, they can be evaluated and implemented separately or in combination based on the specific use case.

 

