Introduction to Privacy-Preserving AI
We live in extraordinary times where AI systems can do remarkable things: we can unlock a phone with our face, a doctor can detect disease at earlier stages than ever before, and text can be translated from one language to another automatically. It is essential to keep in mind that AI systems are often built on machine learning, and these machine-learning systems rely on, and are shaped by, increasingly private and sensitive data. So there is a need to find a way to unlock all of this power of artificial intelligence while still respecting and protecting data privacy.
Privacy-preserving AI is a broad term. To explore it, we need to understand the current approaches to privacy in the context of AI and machine learning.
What are the Current Approaches to Privacy and Machine Learning?
There are two significant pillars: one is user control, and the other is data protection.
Suppose an industry like retail wants insight into user data to learn what its users want. For this to be acceptable, the user may want to know who is collecting the data, for how long, and for what purpose; this is user control.
For example, a user surfs the web and clicks on an ad for red shoes. After that, every website the user visits shows more ads for red shoes, behaviour the user may not want. Data protection is the other pillar of privacy-preserving AI and machine learning.
Data protection has two components: anonymized data and encrypted data. Both currently have gaps with respect to machine learning that need to be addressed.
- Anonymized Data: Anonymization on its own is insufficient; it is not enough to cross out names and addresses and assume the data cannot be tied back to its original owners. In fact, it is easier to de-anonymize data than ever before, as the linkage sketch below illustrates.
- Encrypted Data: Similarly, encrypting data at rest or in transit is routine, but machine learning changes the picture.

Because of the nature of machine learning, we need to operate on the data. That typically means decrypting it at some point to work on it, which creates a new vulnerability, so additional protection is needed.
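To make the anonymization gap concrete, here is a small, hypothetical linkage attack: a "de-identified" dataset is joined with a public dataset on quasi-identifiers such as ZIP code, birth year, and sex, re-identifying the individuals. All records below are invented for illustration.

```python
# Hypothetical "anonymized" records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "47677", "birth_year": 1962, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "47602", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
]

# Public records (e.g. a voter roll) containing names and the same quasi-identifiers.
voters = [
    {"name": "A. Smith", "zip": "47677", "birth_year": 1962, "sex": "F"},
    {"name": "B. Jones", "zip": "47602", "birth_year": 1985, "sex": "M"},
]

# Joining on the quasi-identifiers re-identifies the "anonymous" patients.
for record in medical:
    for voter in voters:
        if all(record[k] == voter[k] for k in ("zip", "birth_year", "sex")):
            print(voter["name"], "->", record["diagnosis"])
```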
Why do Current Approaches to AI Require Complex Webs of Trust?
Another gap between privacy and machine learning is trust. The data we deal with and the models we build are digital assets. Whenever a user shares a digital asset with someone, they are effectively sharing information and trusting that the recipient will not misuse it.
An additional fact is that machine learning is fundamentally a multi-stakeholder computation: multiple entities need to work together. One entity owns the training data, another set of entities may own the inference data, a third entity may provide the machine-learning server on which inference runs, and the model itself may be owned by someone else. Furthermore, all of this runs on infrastructure built from a very long supply chain, so many parties are involved. Because of the nature of digital data, these entities must trust each other in a complex chain, and this web of trust is becoming increasingly unmanageable.
What if Untrusted Parties Could Do Machine Learning Together?
Imagine a world where parties who don't necessarily trust each other can still come together to do machine learning and unlock all of these benefits of AI. If we could, banks that are otherwise rivals in the marketplace could decide to work together on private customer data: without sharing that data, they could jointly build a model that detects money laundering.
In healthcare, a local radiologist makes a diagnosis at one hospital, but the hospital may want a second or third opinion from a highly trained AI system. It needs to do that without revealing the patient's identity and while respecting the patient's data.
In retail, otherwise untrusted parties could come together to do machine learning and monetize their retail data while guaranteeing the privacy of the users who contributed it.
What are the Building Blocks of Privacy-Preserving Machine Learning?
Privacy-preserving machine-learning techniques are emerging techniques that help preserve the user's privacy. The building blocks of privacy-preserving machine learning are federated learning, homomorphic encryption, and differential privacy, which borrow tools from cryptography and statistics.
For example, a group of separate entities can collectively train a machine-learning model by pooling their data without explicitly sharing it, which can be accomplished using federated learning or multiparty computation. Another technique is to perform machine learning on encrypted data that stays encrypted throughout, known as homomorphic encryption.
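As a minimal sketch of the federated idea, assume three parties each hold a small private dataset and train the same simple logistic-regression model locally; only the learned weights are shared and averaged, never the raw data. The datasets, model, and single averaging round below are illustrative, and real federated-learning systems run many rounds and add protections such as secure aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_locally(X, y, epochs=100, lr=0.1):
    """Fit a tiny logistic-regression model on one party's private data."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1 / (1 + np.exp(-X @ w))          # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient step
    return w

# Hypothetical private datasets held by three separate parties.
parties = [(rng.normal(size=(50, 3)), rng.integers(0, 2, 50)) for _ in range(3)]

# Each party trains on its own data; only the weights leave the premises.
local_weights = [train_locally(X, y) for X, y in parties]

# A coordinator averages the weights without ever seeing the raw data.
global_weights = np.mean(local_weights, axis=0)
print(global_weights)
```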
Another technique is to compute statistics on datasets in such a way that the output of the calculation cannot be tied to the presence or absence of any individual in the dataset; this is known as differential privacy. These techniques are further amplified by hardware- and software-based protections known as trusted execution environments.
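Here is a minimal sketch of the differential-privacy idea for a simple counting query: noise drawn from a Laplace distribution is added so the released count barely depends on whether any one person's record is included. The dataset and the epsilon value are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(42)

def private_count(records, predicate, epsilon=0.5):
    """Release a count with Laplace noise; a count query has sensitivity 1."""
    true_count = sum(predicate(r) for r in records)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical records: ages of users in a dataset.
ages = [23, 35, 41, 29, 52, 38, 61, 27]
print(private_count(ages, lambda age: age > 30))  # noisy count of users over 30
```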
A Case Study on Monetising Private Data and Insights
Suppose a bank needs a model that detects fraud when somebody comes in with a transaction. The bank goes to an AI company to build the fraud-detection model. The company has data scientists, but it does not have the data required to build this model. The bank provides a list of retailers with pools of data that could be used to address this problem.
First, the AI company and the retailers use federated learning to jointly build a model out of the retailers' data without the company ever seeing that data. The AI company releases an initial version of the model to all the federation members. Each member uses its private data to improve the model and sends its progress back to the company, which aggregates the improvements, produces a new version of the model, and sends it out again for further improvement. These improvements to the model might leak some information about the underlying data, so the aggregation process needs to be done securely.
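One common way to aggregate securely is for the federation members to add pairwise random masks to their updates that cancel out in the sum, so the coordinator only ever sees masked updates. The sketch below is a simplified version of that idea with hypothetical update vectors; production protocols also handle member dropouts and derive the masks with cryptographic key agreement.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical model updates from three federation members.
updates = [rng.normal(size=4) for _ in range(3)]
n = len(updates)

# Each pair (i, j) with i < j agrees on a random mask; member i adds it, member j subtracts it.
masks = {(i, j): rng.normal(size=4) for i in range(n) for j in range(i + 1, n)}

masked = []
for k, u in enumerate(updates):
    m = u.copy()
    for (i, j), mask in masks.items():
        if k == i:
            m += mask
        elif k == j:
            m -= mask
    masked.append(m)

# The coordinator sums the masked updates; the masks cancel, revealing only the total.
aggregate = np.sum(masked, axis=0)
assert np.allclose(aggregate, np.sum(updates, axis=0))
print(aggregate / n)  # averaged update, without exposing any individual member's update
```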
One way to do that is by using a trusted execution environment at the AI company. In addition, models can learn and memorize certain aspects of the data on which they were trained, which is a problem because somebody with access to the model could later extract that information. To prevent that from happening, differential-privacy techniques are needed during the training process: adding a bit of noise prevents the model from overfitting to, and memorizing, individual records.
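A minimal sketch of adding noise during training, in the spirit of differentially private SGD: each example's gradient is clipped to bound its influence, and Gaussian noise is added before the update. The clipping norm, noise multiplier, and linear-regression task below are illustrative and not calibrated to a formal privacy budget.

```python
import numpy as np

rng = np.random.default_rng(7)

def noisy_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0):
    """One least-squares SGD step with per-example gradient clipping and added noise."""
    grads = []
    for xi, yi in zip(X, y):
        g = 2 * (xi @ w - yi) * xi                  # per-example gradient
        norm = np.linalg.norm(g)
        g = g * min(1.0, clip / (norm + 1e-12))     # clip to limit any one example's influence
        grads.append(g)
    noise = rng.normal(0.0, noise_multiplier * clip, size=w.shape)
    noisy_mean = (np.sum(grads, axis=0) + noise) / len(X)
    return w - lr * noisy_mean

# Hypothetical training data.
X = rng.normal(size=(32, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=32)

w = np.zeros(3)
for _ in range(200):
    w = noisy_sgd_step(w, X, y)
print(w)  # learned weights; clipping and noise limit what any single example contributes
```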
Now the bank wants to host the model and test it by sending in transactions to check whether they are fraudulent. However, an individual transaction may contain very sensitive data, such as somebody's credit card number or purchase history. To enable this, homomorphic encryption can be used to encrypt the transaction and send it over to be processed purely in encrypted form. When the answer comes back, it is an encrypted version of the response that only the bank can unlock.
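The sketch below illustrates the encrypted-scoring idea with a toy Paillier cryptosystem, which is only additively homomorphic and uses demonstration-sized primes; real encrypted inference relies on hardened libraries and richer homomorphic-encryption schemes. The transaction features and fraud-score weights are hypothetical.

```python
import math
import secrets

# Toy Paillier keypair (tiny primes for demonstration only -- not secure).
p, q = 4999, 4993
n, n_sq, g = p * q, (p * q) ** 2, p * q + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)

def encrypt(m):
    """Bank-side encryption of an integer message under the public key (n, g)."""
    r = secrets.randbelow(n - 2) + 2
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Bank-side decryption with the private key (lam, mu)."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

# Bank side: encrypt the transaction features (hypothetical integer features).
features = [120, 3, 45]            # e.g. amount, merchant category, time bucket
ciphertexts = [encrypt(x) for x in features]

# AI-company side: compute an encrypted weighted sum without decrypting anything.
weights = [2, 10, 1]               # hypothetical plaintext fraud-score weights
enc_score = 1
for c, w in zip(ciphertexts, weights):
    enc_score = (enc_score * pow(c, w, n_sq)) % n_sq   # homomorphically adds w * feature

# Back at the bank: only the private-key holder can read the fraud score.
print(decrypt(enc_score))          # 2*120 + 10*3 + 1*45 = 315
```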
Another challenge is that the bank might not want the federation members to see the model itself. Here, the AI company can use multiparty-computation techniques to keep the model separate from the parties that help train it. The company can also use homomorphic encryption: the data scientists send encrypted data to a machine that may or may not be trustworthy, and when the answer comes back, it is an encrypted version of the solution.
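A minimal sketch of the multiparty-computation idea, using additive secret sharing: a confidential model weight is split into random shares so no single party learns it, yet the shares can still be recombined, and even added share-by-share, to compute with the hidden values. The weight, the number of parties, and the modulus are illustrative.

```python
import secrets

MODULUS = 2 ** 61 - 1   # a large prime modulus for arithmetic secret sharing

def share(secret, n_parties=3):
    """Split an integer into additive shares that sum to the secret mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares

def reconstruct(shares):
    return sum(shares) % MODULUS

# Hypothetical confidential model weight, scaled to an integer.
weight = 123456
weight_shares = share(weight)      # each party holds one share
print(weight_shares)               # individually, the shares look like random numbers
print(reconstruct(weight_shares))  # together they recover the weight: 123456

# Additive sharing also lets parties add two hidden values share-by-share.
other_shares = share(1000)
summed = [(a + b) % MODULUS for a, b in zip(weight_shares, other_shares)]
print(reconstruct(summed))         # 124456, computed without revealing either value
```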
- Explore more on Machine Learning Observability and Monitoring
- Learn more about the Data Preparation Roadmap