In the 1980s and 1990s, governments held the keys to the world's data: data that could inform policy decisions, shape programmatic responses, and help researchers understand the world. Today, private companies hold the keys to most of the world's data. At the same time, social networking platforms face a credibility crisis with the external research and policy community in the aftermath of events such as Cambridge Analytica. Although these platforms may have a profound impact on society, they lack the means to measure that impact. By sharing their data in a privacy-protective and secure way, they can demonstrate transparency and a commitment to doing good in the world. To share data that way, they need to build the right products, processes, and narratives.
Platforms need to balance the competing goals of user privacy and research utility in order to turn their data into a public asset and meet the needs of external stakeholders. This is a difficult balance to strike, and differential privacy is one of the many tools that can help strike it.
Differential privacy (DP) is a formal, mathematical definition of privacy, widely recognized by privacy experts as supporting robust and specific claims about the privacy assurances afforded to individuals.
DP gives a meaningful way to calculate measures of privacy loss for individuals, but that privacy loss needs to be tracked across datasets and translated into risk.
The core premise of DP is that it provides mathematical guarantees of privacy. Those guarantees are tied directly to how accurate the data is (or, put another way, how much noise was injected to make the data differentially private). They are also affected by which users appear in the datasets, how many datasets are released, to whom, how often they are updated, and so on. Tracking and managing that information is a substantial challenge, yet it is a requirement for making valid and consistent claims about the formal privacy protections afforded to the users in those released datasets. In the differential privacy literature, the accumulated privacy loss associated with these releases is captured by the mathematical parameter epsilon.
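The accuracy-for-privacy tradeoff can be sketched with the Laplace mechanism, the canonical way to make a numeric query differentially private. This is a minimal illustration rather than production code; the sampler uses the standard inverse-CDF construction of a Laplace variate.

```python
import math
import random

def laplace_sample(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    # Noise scale = sensitivity / epsilon: a smaller epsilon (stronger
    # privacy guarantee) means more noise and less accurate data.
    return true_value + laplace_sample(sensitivity / epsilon, rng)

# A count query has sensitivity 1: adding or removing one user
# changes the true count by at most 1.
rng = random.Random(42)
private_count = laplace_mechanism(12_345, sensitivity=1.0, epsilon=0.5, rng=rng)
```

Note that epsilon here protects a single release of a single statistic; the budget-tracking problem described above arises because every additional release spends more of it.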
There are good frameworks for reasoning about epsilon, the privacy-loss parameter of DP, for a single dataset, but the privacy community offers far less guidance on how to manage and reason about a privacy budget that is "consumed" globally. That guidance does not exist because setting a global budget is not a mathematical calculation but a policy decision informed by mathematical calculations. To give policy makers the information such a decision requires, a privacy budget management and tracking system is necessary at multiple levels of data access (e.g. user, team, organization, global).
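What such tracking could look like at one level of access can be sketched as follows, assuming basic sequential composition (the total loss for a scope is the sum of the epsilons of every release drawn from the same underlying data). The class and scope names are illustrative, not an existing system.

```python
from collections import defaultdict

class BudgetLedger:
    """Track epsilon consumed per scope (a dataset, team, or org)
    against a cap, under basic sequential composition."""

    def __init__(self, cap):
        self.cap = cap
        self.spent = defaultdict(float)

    def charge(self, scope, epsilon):
        # Refuse any release that would push the scope over its cap.
        if self.spent[scope] + epsilon > self.cap:
            raise RuntimeError(f"release would exceed the budget for {scope!r}")
        self.spent[scope] += epsilon
        return self.spent[scope]

ledger = BudgetLedger(cap=1.0)
ledger.charge("mobility/us/team-a", 0.3)
remaining = ledger.cap - ledger.charge("mobility/us/team-a", 0.5)
```

The hard part is not this arithmetic but the policy question it feeds: what the cap should be, and at which of the levels (user, team, organization, global) it should bind.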
Epsilon is a statistical measure describing how much the probability of learning something about someone in a dataset can change relative to the prior probability. To understand what a given epsilon actually "means", then, software engineers need a good estimate of the prior probabilities involved, for example, the prior probability of re-identifying a user included in a differentially private dataset release.
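One standard way to turn a prior into a worst-case posterior uses the fact that pure epsilon-DP bounds the factor by which any adversary's odds can change at e raised to epsilon. The sketch below applies that bound under the usual binary-hypothesis framing of re-identification; the numbers are illustrative.

```python
import math

def posterior_bound(prior, epsilon):
    """Worst-case posterior probability of re-identification under
    epsilon-DP, given a prior: odds grow by at most exp(epsilon)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = math.exp(epsilon) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)

# With a 1% prior, epsilon = 1 caps the adversary's belief near 2.7%,
# while epsilon = 5 lets it climb to roughly 60%.
low = posterior_bound(0.01, 1.0)
high = posterior_bound(0.01, 5.0)
```

The same epsilon is therefore far more or less reassuring depending on the prior, which is exactly why engineers need good estimates of those priors.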
Any DP dataset is trading off privacy and utility of the data, but while differential privacy defines privacy loss in a generalized way, there is no equivalent generalized definition of utility.
How much noise should be injected to achieve a certain level of privacy is easily calculated under differential privacy, but how much noise can be injected while the data remains "useful" is a much harder question to answer. Once a concrete utility measure is selected, it is straightforward to develop a production function that mathematically captures the privacy vs utility tradeoff of a dataset. However, there is no "right answer" to what the utility measure should be, though there are frameworks for making that determination. Having a robust set of pre-calculated utility measurement options to draw from when determining "how private" to make a differentially private dataset would substantially shorten the long timelines currently required to define, build, and vet such a dataset.
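For instance, if the chosen utility measure is the expected absolute error of a Laplace-noised count (a common but by no means canonical choice), the production function collapses to a one-line relationship, and the minimum epsilon needed to hit a utility target follows directly. This is a sketch under that assumed measure only.

```python
def expected_abs_error(sensitivity, epsilon):
    # For the Laplace mechanism, the expected absolute error equals
    # the noise scale, sensitivity / epsilon.
    return sensitivity / epsilon

def min_epsilon(sensitivity, true_magnitude, max_relative_error):
    # Smallest epsilon whose expected error stays within the target
    # relative error for a count of this magnitude.
    return sensitivity / (max_relative_error * true_magnitude)

# Sweep the privacy axis of the production function for a count query.
tradeoff = {eps: expected_abs_error(1.0, eps) for eps in (0.1, 0.5, 1.0, 2.0)}
eps_needed = min_epsilon(1.0, true_magnitude=10_000, max_relative_error=0.01)
```

A different utility measure (say, rank preservation of top-k results) would yield a completely different production function, which is the crux of the "no right answer" problem.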
Various complications exist even in this approach, however. For data aggregated by country, for example, the utility of the data is significantly affected by that country's population and platform penetration. A noise level that still allows for meaningful analysis in the US or India might wipe out all relevant signal for the same analysis conducted on the same dataset in the Netherlands.
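The country effect falls out of the same expected-error logic. The per-country counts below are invented for illustration and stand in for a fine-grained slice (e.g. a per-topic, per-day breakdown) where small countries have very few events.

```python
EPSILON = 0.5
NOISE_SCALE = 1.0 / EPSILON   # expected absolute error of a Laplace count

# Hypothetical counts for one fine-grained slice; real values would
# come from the platform and vary with population and penetration.
slice_counts = {"US": 5_000_000, "IN": 8_000_000, "NL": 40}

relative_error = {c: NOISE_SCALE / n for c, n in slice_counts.items()}
# The same noise is negligible against the US and India counts but is
# 5% of the entire Dutch count, so per-country utility diverges sharply.
```

A single global epsilon thus produces wildly different utility per country, which argues for country-aware utility measures (or partitioned budgets) rather than one global target.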
An interim solution that can help inform requirements around generalized measures of utility is the use of validation servers, through which a specific set of top-line raw answers, rather than differentially private answers, is returned to researchers. This approach obviously complicates formal privacy claims.
DP datasets that are released or updated over time necessarily involve additional privacy loss for the individuals in the dataset, yet informed and intentional choices about data architecture can help reduce that loss.
Furthermore, certain data architectures (specifically, those that anticipate required joins and data partitions) lend themselves to calculating privacy guarantees at various levels (such as the action, user, or geographic level). Ensuring mutual exclusivity in dataset aggregations, where possible, can allow privacy budgets to be minimized through parallel composition. Developing aggregates that are accumulated over time bins, rather than as a total accumulation over all time, also helps minimize privacy budget consumption. Releases over time still degrade the overall privacy budget, but they do so in a more optimal and predictable way, future-proofing releases against inadvertently consuming a large amount of budget as the dataset is updated.
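Parallel composition can be made concrete: when partitions are mutually exclusive (each user falls in exactly one bucket), publishing a noisy count for every bucket costs only the single per-bucket epsilon, not epsilon times the number of buckets. A minimal illustration with invented counts:

```python
import math
import random

def laplace_sample(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_partition_counts(counts, epsilon, rng):
    # Each user contributes to exactly one partition, so releasing all
    # noisy buckets consumes epsilon once (parallel composition) rather
    # than epsilon * len(counts) (sequential composition).
    return {k: v + laplace_sample(1.0 / epsilon, rng)
            for k, v in counts.items()}

rng = random.Random(7)
noisy = dp_partition_counts({"US": 1200, "NL": 90, "IN": 3400},
                            epsilon=0.5, rng=rng)
epsilon_spent = 0.5   # the max over buckets, not the sum
```

The same logic is what makes time-binned aggregates attractive: each bin is a disjoint partition of events in time, so a new bin spends fresh budget only on itself instead of re-touching history.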
Needless to say, these sorts of best practices for building underlying datasets that may be exposed, released, or stored with differential privacy in place can help maximize the utility of the data without substantially increasing overall privacy budget consumption.
Moreover, platforms will eventually need to release user-facing dynamic querying and modeling systems with differentially private API endpoints. To get there, software engineers would first have to reach an "end-state" of differential privacy tooling. They would need to build and deliver a system that:
- allows for dynamic SQL query writing by end users, which
- returns a differentially private result, while
- informing that user and their organization about the privacy budget implications.
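The three bullets above can be wired together in a deliberately naive sketch. Every name here (DPQueryEndpoint, the events table, the org scopes) is an illustrative assumption, and a real deployment would add the query parsing, authentication, and composition accounting this toy omits.

```python
import math
import random
import sqlite3

class DPQueryEndpoint:
    """Answer restricted COUNT queries with Laplace noise and charge
    the caller's organization-level privacy budget."""

    def __init__(self, db_path, org_budgets, rng=None):
        self.conn = sqlite3.connect(db_path)
        self.remaining = dict(org_budgets)   # org -> epsilon left
        self.rng = rng or random.Random()

    def count(self, org, where_clause, epsilon):
        if self.remaining.get(org, 0.0) < epsilon:
            raise PermissionError(f"privacy budget exhausted for {org!r}")
        # NOTE: a real system must parse and validate the query rather
        # than interpolating user input into SQL.
        (true_count,) = self.conn.execute(
            f"SELECT COUNT(*) FROM events WHERE {where_clause}"
        ).fetchone()
        self.remaining[org] -= epsilon
        u = self.rng.random() - 0.5
        noise = (-(1.0 / epsilon) * math.copysign(1.0, u)
                 * math.log(1.0 - 2.0 * abs(u)))
        # Return the noisy answer plus the budget implication for the org.
        return true_count + noise, self.remaining[org]

endpoint = DPQueryEndpoint(":memory:", {"research-lab": 1.0}, random.Random(0))
endpoint.conn.execute("CREATE TABLE events (user_id INTEGER, country TEXT)")
endpoint.conn.executemany(
    "INSERT INTO events VALUES (?, ?)", [(i, "US") for i in range(500)]
)
noisy, left = endpoint.count("research-lab", "country = 'US'", epsilon=0.5)
```

Even this toy exposes the hard design questions: who sets the per-org caps, how joins and repeated queries compose, and what the user sees when the budget runs out.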
This sounds simple in principle, yet examining the details reveals it to be quite complex.
One clear and actionable engineering item: while a platform may have generated utility measures for a given dataset accessible through a differentially private query system, users may measure utility differently, placing completely different value on specific calculations depending on their research topic. Optimally consuming a given privacy budget would then require providing each user with their own production function, which would be prohibitively computationally intensive to produce for an ever-growing set of end users of such a system.
DP has been the domain of mathematicians for years. As software engineers start to explore industry solutions for DP, they will need to apply standard industry methods to democratize the field and open up its many engineering opportunities.
To address the functional shortcomings of DP highlighted above, it will be necessary to build systems that enable rapid prototyping of DP methods. Such systems can help bridge the gap between what is currently available and the analytic or ML functionality that users of DP systems need.
Last but not least, building an end-to-end DP solution for a particular platform will take a great deal of work, given the variety of use cases around privacy technology in general and differential privacy in particular. Needless to say, such a long and complex process will require the collaboration of multiple parties; it is not a one-time, single task for software engineers alone.