Before we dive deeper into Differential Privacy (DP) and answer the 4 W’s and an H (What, Where, When, Why and How), a few of the most important questions you must ask yourself are… What is privacy? Should we really care about it? How does it matter?
As per Wikipedia’s definition,
Privacy is the ability of an individual or group to seclude themselves or information about themselves and thereby express themselves selectively.
To put it simply, privacy is an individual’s right to withhold the data they deem private and to share only what they are comfortable with.
Coming to… Should we really care about it? How does it matter?
In this digital age, data privacy has always been a concern for some of us, to the extent that people are paranoid enough to use a made-up name instead of their own! Even if you are not that paranoid, say you are comfortable with sharing your real name with an unknown person as part of your introduction, you might not feel comfortable sharing some of your other details like your birth date, the places you love to hang out, your hobbies, etc. This is where privacy comes in: the data YOU would want to keep PRIVATE!
Thus, privacy can also be described as the right to control how information about you is used, processed, stored, or shared.
For a better understanding of why privacy matters, or of privacy vs. security, I would recommend a short read of the blog posts below:
Coming to the main question…
Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
Note: Differential Privacy is not an algorithm but a System or Framework described for better data privacy!
One of the easiest examples for understanding DP with respect to the above definition is the one given by Abhishek Bhowmick (Lead, ML Privacy, Core ML | Apple) in his interview in Udacity’s Secure and Private AI course:
(Note: I will be using this course material as a reference throughout the post)
Suppose we want to know the average amount of money individuals hold in their pockets, say to judge whether they could buy an online course. Chances are, many might not want to give out the exact amount! So, what do we do?!
This is where DP comes in: instead of asking for the exact amount, we ask each individual to add a random value (noise) in the range of -100 to +100 to the amount they hold and give us only the resulting sum. That is, if ‘X’ has $30 in their pocket, they add a random number, say -10, to it (30 + (-10)) and report just the result, which is $20 in this case. Thus, their individual privacy is preserved.
To protect the privacy of the data obtained from the individuals, we add noise (the random number in the above example) to make it more private and secure! DP works by adding statistical noise to the data, either to the inputs or to the output.
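To make this concrete, here is a minimal sketch in Python of the reporting step described above. The function name and the ±100 range mirror the example; they are illustrative, not a fixed API:

```python
import random

def noisy_report(true_amount, noise_range=100):
    """Add uniform random noise to an individual's true amount.

    The individual reports only the noisy result, never the
    true amount itself.
    """
    noise = random.randint(-noise_range, noise_range)
    return true_amount + noise

# 'X' holds $30 and reports a value somewhere in [-70, 130]
report = noisy_report(30)
```

Notice that any single report tells us almost nothing: a reported $20 could come from someone holding $30, $100, or even nothing at all.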
But this brings us to another question: how is it useful to us if all we get are some random numbers?
The answer to this is Law of Large Numbers:
The law of large numbers, in probability and statistics, states that as a sample size grows, its mean gets closer to the average of the whole population.
When a sufficiently large number of individuals report their noisy values, the noise cancels out on averaging, and the average obtained is close to the true average (the average of the data without the added noise). Thus, we now have data on the “average amount an individual holds in their pocket” while still preserving each individual’s privacy.
- The law of large numbers states that an observed sample average from a large sample will be close to the true population average and that it will get closer, the larger the sample.
- The law of large numbers does not guarantee that a given sample, especially a small sample, will reflect the true population characteristics or that a sample which does not reflect the true population will be balanced by a subsequent sample.
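The law of large numbers can be checked with a quick simulation. Below is a sketch (the population values are made up purely for illustration) showing that the mean of the noisy reports lands close to the true mean once enough individuals participate:

```python
import random

random.seed(0)  # reproducible run

# Hypothetical true amounts held by 100,000 individuals
population = [random.randint(10, 90) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# Each individual adds uniform noise in [-100, 100] and reports the sum
reports = [x + random.randint(-100, 100) for x in population]
noisy_mean = sum(reports) / len(reports)

# With this many participants the noise largely cancels out
error = abs(true_mean - noisy_mean)
print(round(error, 3))  # typically well under 1
```

Try shrinking the population to 100 and the error grows noticeably, which is exactly the small-sample caveat in the second bullet above.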
For better understanding of the Law of Large Numbers, refer to the following:
Another way of looking at DP is the definition by Cynthia Dwork in her book The Algorithmic Foundations of Differential Privacy:
Differential Privacy describes a promise, made by a data holder, or curator, to a data subject (owner), and the promise is like this: “You will not be affected adversely or otherwise, by allowing your data to be used in any study or analysis, no matter what other studies, datasets or information sources are available”.
It sounds like a well-thought-out definition but maybe more of a fantasy, as de-anonymization of datasets may happen!
This may lead to a question: how do we know whether the privacy of a person in the dataset is protected or not? For example, a database with (1) patients, their cancer status, and their information, OR (2) individuals and their coin-flip (heads or tails) responses!
The key idea for DP in such cases would be: “If we remove a person from the database and the query result does not change, then that person’s privacy is fully protected.” To put it simply, when querying a database, if I remove someone from the database, would the output of the query be any different?
How do we check this? By creating a parallel database with one less entry (N-1) than the original database (N entries).
Let’s take a simple example of coin flips: if the first coin flip is heads, say Yes (1), and if it’s tails, answer as per the second coin flip. So, our database will be made of 0’s and 1’s, i.e., a binary database. The easiest query that can be thought of on this binary database is the “Sum Query”, which adds up all the 1’s in the database and returns the result.
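A sketch of building that binary database and running the sum query, following the coin-flip rule just described (the function names are illustrative):

```python
import random

def coin_flip_entry():
    """One database entry per the rule above.

    First flip heads -> record Yes (1); tails -> the answer
    follows a second flip (heads = 1, tails = 0).
    """
    if random.randint(0, 1) == 1:  # first flip: heads
        return 1
    return random.randint(0, 1)    # tails: second flip decides

db = [coin_flip_entry() for _ in range(100)]

def sum_query(db):
    """The 'Sum Query': adds up all the 1's in the database."""
    return sum(db)

result = sum_query(db)
```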
Assuming D is the original database with N entries and D’ is a parallel database with N-1 entries: on running the sum query on each of them, if sum(D) != sum(D’), it means the query’s output is conditioned directly on the information of the individual who was removed from database D! It shows non-zero sensitivity, as the outputs are different.
Sensitivity is the maximum amount that a query changes when removing an individual from the database.
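Putting the pieces together, here is a sketch that builds every parallel database (each with one entry removed) and measures the sensitivity of the sum query. On a binary database the maximum change from removing one entry is 1, since at most a single 1 disappears:

```python
def sum_query(db):
    return sum(db)

def parallel_dbs(db):
    """Every database with exactly one entry removed (N-1 entries each)."""
    return [db[:i] + db[i + 1:] for i in range(len(db))]

def sensitivity(query, db):
    """Maximum change in the query output over all parallel databases."""
    full = query(db)
    return max(abs(full - query(pdb)) for pdb in parallel_dbs(db))

db = [1, 0, 1, 1, 0]
print(sensitivity(sum_query, db))  # 1: removing a 1 changes the sum by 1
```

Knowing a query’s sensitivity tells us how much noise is needed to mask any single individual’s contribution.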