The second part of our brief survey of differentially private statistics. This time, we show how to privately estimate the CDF of a distribution (i.e., estimate the distribution in Kolmogorov distance), and conclude with pointers to some other work in the space.
1. CDF Estimation for Discrete, Univariate Distributions
Suppose we have a distribution over the ordered, discrete domain and let be the family of all such distributions. The CDF of the distribution is the function given by
A natural measure of distance between CDFs is the distance, as this is the sort of convergence guarantee that the empirical CDF satisfies. That is, in the non-private setting, the empirical CDF will achieve the minimax rate, which it known by [DKW56, Mas90] to be
1.1. Private CDF Estimation
Proof: Assume without loss of generality that for an integer . Let be a sample. By the triangle inequality, we have
so we will focus on constructing to approximate .
For any and , consider the statistics
Let be the function whose output consists of all such counts. To decipher this notation, for a given , the counts form a histogram of using consecutive bins of width , and we consider the histograms of geometrically increasing width . First, we claim that the function has low sensitivity—for adjacent samples and ,
Thus, we can use the Gaussian mechanism:
As we will argue, there exists a matrix such that . We will let . Since differential privacy is closed under post-processing, inherits the privacy of .
We will now show how to construct the matrix and analyze the error of . For any , we can form the interval as the union of at most disjoint intervals of the form we’ve computed, and therefore we can obtain as the sum of at most of the entries of . For example, if then we can write
See the following diagram for a visual representation of the decomposition.
This shows hierarchical decomposition of the domain using 14 intervals. The highlighted squares represent the interval and the highlighted circles show the decomposition of this interval into a union of intervals in the tree.
Thus we can construct the matrix using this information. Note that each entry of is the sum of at most entries of . Thus, if we use the output of in place of , for every we obtain
Applying standard bounds on the expected supremum of a Gaussian process, we have
1.2. Why Restrict the Domain?
A drawback of the estimator we constructed is that it only applies to distributions of finite support , albeit with a relatively mild dependence on the support size. If privacy isn’t a concern, then no such restriction is necessary, as the bound (2) applies equally well to any distribution over . Can we construct a differentially private estimator for distributions with infinite support?
Perhaps surprisingly, the answer to this question is no! Any differentially private estimator for the CDF of the distribution has to have a rate that depends on the support size, and cannot give non-trivial rates for distributions with infinite support.
Theorem 2 ([BNSV15]) If consists of all distributions on , then
The notation refers to the iterated logarithm.
We emphasize that this theorem shouldn’t meet with too much alarm, as grows remarkably slowly with . There are differentially private CDF estimators that achieve very mild dependence on [BNS13, BNSV15], including one nearly matching the lower bound in Theorem 2. Moreover, if we want to estimate a distribution over , and are willing to make some mild regularity conditions on the distribution, then we can approximate it by a distribution with finite support and only increase the rate slightly. However, what Theorem 2 shows is that there is no “one-size-fits-all” solution to private CDF estimation that achieves similar guarantees to the empirical CDF. That is, the right algorithm has to be tailored somewhat to the application and the assumptions we can make about the distribution.
2. More Private Statistics
Of course, the story doesn’t end here! There’s a whole wide world of differentially private statistics beyond what we’ve mentioned already. We proceed to survey just a few other directions of study in private statistics.
2.1. Parameter and Distribution Estimation
A number of the early works in differential privacy give methods for differentially private statistical estimation for i.i.d. data. The earliest works [DN03, DN04, BDMN05, DMNS06], which introduced the Gaussian mechanism, among other foundational results, can be thought of as methods for estimating the mean of a distribution over the hypercube in the norm. Tight lower bounds for this problem follow from the tracing attacks introduced in [BUV14, DSSUV15, BSU17, SU17a, SU17b]. A very recent work of Acharya, Sun, and Zhang [ASZ20] adapts classical tools for proving estimation and testing lower bounds (lemmata of Assouad, Fano, and Le Cam) to the differentially private setting. Steinke and Ullman [SU17b] give tight minimax lower bounds for the weaker guarantee of selecting the largest coordinates of the mean, which were refined by Cai, Wang, and Zhang [CWZ19] to give lower bounds for sparse mean-estimation problems.
Nissim, Raskhodnikova, and Smith introduced the highly general sample-and-aggregate paradigm, which they apply to several learning problems (e.g., learning mixtures of Gaussians) [NRS07]. Later, Smith [Smi11] showed that this paradigm can be used to transform any estimator for any asymptotically normal, univariate statistic over a bounded data domain into a differentially private one with the same asymptotic convergence rate.
Subsequent work has focused on both relaxing the assumptions in [Smi11], particularly boundedness, and on giving finite-sample guarantees. Karwa and Vadhan investigated the problem of Gaussian mean estimation, proving the first near-optimal bounds for this setting [KV18]. In particular, exploiting concentration properties of Gaussian data allows us to achieve non-trivial results even with unbounded data, which is impossible in general. Following this, Kamath, Li, Singhal, and Ullman moved to the multivariate setting, investigating the estimation of Gaussians and binary product distributions in total variation distance [KLSU19]. In certain cases (i.e., Gaussians with identity covariance), this is equivalent to mean estimation in -distance, though not always. For example, for binary product distribution, one must estimate the mean in a type of -distance instead. The perspective of distribution estimation rather than parameter estimation can be valuable. Bun, Kamath, Steinke, and Wu [BKSW19] develop a primitive for private hypothesis selection, which they apply to learn any coverable class of distributions under pure differential privacy. Through the lens of distribution estimation, their work implies an upper bound for mean estimation of binary product distributions that bypasses lower bounds for the same problem in the empirical setting. In addition to work on mean estimation in the sub-Gaussian setting, such as the results discussed earlier, mean estimation has also been studied under weaker moment conditions [BS19, KSU20]. Beyond these settings, there has also been study of estimation of discrete multinomials, including estimation in Kolmogorov distance [BNSV15] and in total variation distance for structured distributions [DHS15], and parameter estimation of Markov Random Fields [ZKKW20].
A different approach to constructing differentially private estimators is based on robust statistics. This approah begins with the influential work of Dwork and Lei [DL09], which introduced the propose-test-release framework, and applied to estimating robust statistics such as the median and interquartile range. While the definitions in robust statistics and differential privacy are semantically similar, formal connections between the two remain relatively scant, which suggests a productive area for future study.
2.2. Hypothesis Testing
An influential work of Homer et al. [HSRDTMPSNC08] demonstrated the vulnerability of classical statistics in a genomic setting, showing that certain -statistics on many different variables could allow an attacker to determine the presence of an individual in a genome-wide association study (GWAS). Motivated by these concerns, an early line of work from the statistics community focused on addressing these issues [VS09, USF13, YFSU14].
More recently, work on private hypothesis testing can be divided roughly into two lines. The first focuses on the minimax sample complexity, in a line initiated by Cai, Daskalakis, and Kamath [CDK17], who give an algorithm for privately testing goodness-of-fit (more precisely, a statistician might refer to this problem as one-sample testing of multinomial data). A number of subsequent works have essentially settled the complexity of this problem [ASZ18, ADR18], giving tight upper and lower bounds. Other papers in this line study related problems, including the two-sample version of the problem, independence testing, and goodness-of-fit testing for multivariate product distributions [ASZ18, ADR18, ADKR19, CKMUZ19]. A related paper studies the minimax sample complexity of property estimation, rather than testing of discrete distributions, including support size and entropy [AKSZ18]. Other recent works in this vein focus on testing of simple hypotheses [CKMTZ18, CKMSU19]. In particular [CKMSU19] proves an analogue of the Neyman-Pearson Lemma for differentially private testing of simple hypotheses. A paper of Awan and Slavkovic [AS18] gives a universally optimal test when the domain size is two, however Brenner and Nissim [BN14] shows that such universally optimal tests cannot exist when the domain has more than two elements. A related problem in this space is private change-point detection [CKMTZ18, CKMSU19, CKLZ19] — in this setting, we are given a time series of datapoints which are sampled from a distribution, which at some point, changes to a different distribution. The goal is to (privately) determine when this point occurs.
Complementary to minimax hypothesis testing, a line of work [WLK15, GLRV16, KR17, KSF17, CBRG18, SGGRGB19, CKSBG19] designs differentially private versions of popular test statistics for testing goodness-of-fit, closeness, and independence, as well as private ANOVA, focusing on the performance at small sample sizes. Work by Wang et al. [WKLK18] focuses on generating statistical approximating distributions for differentially private statistics, which they apply to hypothesis testing problems.
2.3. Differential Privacy on Graphs
There is a significant amount of work on differentially private analysis of graphs. We remark that these algorithms can satisfy either edge or node differential privacy. The former (easier) guarantee defines a neighboring graph to be one obtained by adding or removing a single edge, while in the latter (harder) setting, a neighboring graph is one that can be obtained by modifying the set of edges connected to a single node. The main challenge in this area is that most graph statistics can have high sensitivity in the worst-case.
The initial works in this area focused on the empirical setting, and goals range from counting subgraphs [KRSY11, BBDS13, KNRS13, CZ13, RS16] to outputting a privatized graph which approximates the original [GRU12, BBDS12, Upa13, AU19, EKKL20]. In contrast to the setting discussed in most of this series, it seems that there are larger qualitative differences between the study of empirical and population statistics due to the fact that many graph statistics have high worst-case sensitivity, but may have smaller sensitivity on typical graphs from many natural models.
In the population statistics setting, recent work has focused on parameter estimation of the underlying random graph model. So far this work has given estimators for the -model [KS16] and graphons [BCS15,BCSZ18]. Graphons are a generalization of the stochastic block model, which is, in turn, a generalization of the Erdös-Rényi model. Interestingly, the methods of Lipschitz-extensions introduced in the empirical setting by [BBDS13, KNRS13] are the main tool used in the statistical setting as well. While the first works on private graphon estimation were not computationally efficient, a recent focus has been on obviating these issues for certain important cases, such as the Erdös-Rényi setting [SU19].
[DKW56] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3), 1956.
[HSRDTMPSNC08] Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V. Pearson, Dietrich A. Stephan, Stanley F. Nelson, and David W. Craig. PLoS Genetics, 4(8), 2008.
[YFSU14] Fei Yu, Stephen E. Fienberg, Aleksandra B. Slavković, and Caroline Uhler. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50, 2014.