## Virtual STOC 2020 – Behind the Screens

In order to assist organizers of other virtual conferences, the general chairs of STOC 2020 (myself, Konstantin Makarychev, Yury Makarychev and Madhur Tulsiani, with input from PC chair Julia Chuzhoy) wrote a detailed document describing the design and execution of the conference. I personally felt the conference went about as well as it could have gone, and despite many moving parts, there were minimal technical difficulties.

The guide is available here: Virtual STOC 2020 – Behind the Screens.

If you have any questions or comments, feel free to comment below, or join in the conversation on Twitter.

## STOC 2020 Goes Virtual!

Starting on Monday, STOC is joining the trend of conferences going online, I believe the biggest theory conference to do so thus far. Given my experience with TCS+, I volunteered to lend a hand with the organization and logistics. It’s been a journey (with some unusual technical challenges), but I think we have something which I hope will be engaging and generally a lot of fun. In addition to the typical academic component, we also have a social component planned as well. We learnt from the work of others, including the ACM virtual conferences guide, ICLR 2020, and WAGON. I may make some version of our logistics docs available to others after the conference, so others can learn from our experience as well. Anyway, read on for an announcement from me and the other General Chairs, Konstantin Makarychev, Yury Makarychev, and Madhur Tulsiani. See also the main STOC page for a more complete list of credits.

Dear fellow theorists,

As you already know, STOC 2020 this year will be a virtual conference. If you are interested in attending the conference, but haven’t registered yet, please do so soon (students: $25, regular:$50). This will help us ensure we have capacity for various online events.

Upon registration, you should receive a confirmation email from CVENT, also containing access information for various conference events. Also, if you are a student looking to register for STOC but the cost is a burden, please email us at stoc2020@ttic.edu.

How will the conference work?

• Videos: The videos for all conference talks are now available on YouTube, and can be accessed through the links in the conference program. Registration is not required to view the talks on Youtube.
• Slack: The conference has a Slack workspace, with one channel for every paper and workshop, and additional channels for information, announcements, social events, help, etc. The invitations for the Slack workspace will be sent to registered participants. Authors are also encouraged to monitor the channels for their papers. All access information for the conference will also be available here. The workspace is currently active, and will remain active for at least one week after the conference.
• Zoom sessions: The conference will feature Zoom sessions with short presentations by the speakers. The total time for each paper is 10 minutes. Given that participants have access to the full talks by the speakers on Youtube, these can be thought of as being analogues of poster sessions. The workshops will also be held as separate sessions. The links for the Zoom session via information in the confirmation email.
• Social events: The conference will include junior/senior “lunches”, breakout tables for impromptu and scheduled hangouts, and a group event using gather.town. The timings for the events can be found in the conference program. Sign-up links for various events will be sent to all registered participants – please do sign-up soon!

See you all at (virtual) STOC 2020. Please do let us know if you have any questions or suggestions.

## A Primer on Private Statistics – Part II

The second part of our brief survey of differentially private statistics. This time, we show how to privately estimate the CDF of a distribution (i.e., estimate the distribution in Kolmogorov distance), and conclude with pointers to some other work in the space.

The first part of this series is here, and you can download both parts in PDF form here.

1. CDF Estimation for Discrete, Univariate Distributions

Suppose we have a distribution ${P}$ over the ordered, discrete domain ${\{1,\dots,D\}}$ and let ${\mathcal{P}}$ be the family of all such distributions. The CDF of the distribution is the function ${\Phi_{P} : \{1,\dots,D\} \rightarrow [0,1]}$ given by

$\displaystyle \Phi_{P}(j) = \mathop{\mathbb P}(P \leq j). \ \ \ \ \ (1)$

A natural measure of distance between CDFs is the ${\ell_\infty}$ distance, as this is the sort of convergence guarantee that the empirical CDF satisfies. That is, in the non-private setting, the empirical CDF will achieve the minimax rate, which it known by [DKW56, Mas90] to be

$\displaystyle \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| \Phi_{X} - \Phi_{P} \|_{\infty}) = O\left(\sqrt{\frac{1}{n}} \right). \ \ \ \ \ (2)$

1.1. Private CDF Estimation

Theorem 1 For every ${n \in {\mathbb N}}$ and every ${\epsilon,\delta > 0}$, there exists an ${(\epsilon,\delta)}$-differentially private mechanism ${M}$ such that

$\displaystyle \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| M(X_{1 \cdots n}) - \Phi_{P} \|_{\infty}) = O\left(\sqrt{\frac{1}{n}} + \frac{\log^{3/2}(D) \log^{1/2}(1/\delta)}{\epsilon n} \right). \ \ \ \ \ (3)$

Proof: Assume without loss of generality that ${D = 2^{d}}$ for an integer ${d \geq 1}$. Let ${X_{1 \cdots n} \sim P}$ be a sample. By the triangle inequality, we have

$\displaystyle \begin{array}{rll} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}{\| M(X_{1 \cdots n}) - \Phi_{P} \|_{\infty}} &\leq{} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| \Phi_{X} - \Phi_{P} \|_{\infty} + \| M(X_{1 \cdots n}) - \Phi_{X} \|_{\infty}) \\ &\leq{} O(\sqrt{1/n}) + \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| M(X_{1 \cdots n}) - \Phi_{X} \|_{\infty}), \end{array}$

so we will focus on constructing ${M}$ to approximate ${\Phi_{X}}$.

For any ${\ell = 0,\dots,d-1}$ and ${j = 1,\dots,2^{d - \ell}}$, consider the statistics

$\displaystyle f_{\ell,j}(X_{1 \cdots n}) = \frac{1}{n} \sum_{i=1}^{n} {\bf 1}\{ (j-1)2^{\ell} + 1 \leq X_i \leq j 2^{\ell} \}. \ \ \ \ \ (4)$

Let ${f : \{1,\dots,D\}^n \rightarrow [0,1]^{2D - 2}}$ be the function whose output consists of all ${2D-2}$ such counts. To decipher this notation, for a given ${\ell}$, the counts ${f_{\ell,\cdot}}$ form a histogram of ${X_{1 \cdots n}}$ using consecutive bins of width ${2^{\ell}}$, and we consider the ${\log(D)}$ histograms of geometrically increasing width ${1,2,4,\dots,D}$. First, we claim that the function ${f}$ has low sensitivity—for adjacent samples ${X}$ and ${X'}$,

$\displaystyle \| f(X) - f(X') \|_2^2 \leq \frac{2 \log(D)}{n^2}. \ \ \ \ \ (5)$

Thus, we can use the Gaussian mechanism:

$\displaystyle M'(X_{1 \cdots n}) = f(X_{1 \cdots n}) + \mathcal{N}\left(0, \frac{2 \log(D) \log(1/\delta)}{\epsilon^2 n^2} \cdot \mathbb{I}_{2D \times 2D}\right). \ \ \ \ \ (6)$

As we will argue, there exists a matrix ${A \in {\mathbb R}^{2D \times 2D}}$ such that ${\Phi_{X} = A \cdot f(X_{1 \cdots n})}$. We will let ${M(X_{1 \cdots n}) = A \cdot M'(X_{1 \cdots n})}$. Since differential privacy is closed under post-processing, ${M}$ inherits the privacy of ${M'}$.

We will now show how to construct the matrix ${A}$ and analyze the error of ${M}$. For any ${j = 1,\dots,D}$, we can form the interval ${\{1,\dots,j\}}$ as the union of at most ${\log D}$ disjoint intervals of the form we’ve computed, and therefore we can obtain ${\Phi_{X}(j)}$ as the sum of at most ${\log D}$ of the entries of ${f(X)}$. For example, if ${j = 5}$ then we can write

$\displaystyle \{1,\dots,7\} = \{1,\dots,4\} \cup \{5,6\} \cup \{7\} \ \ \ \ \ (7)$

and

$\displaystyle \Phi_{X}(5) = f_{2,1} + f_{1,3} + f_{0,7}. \ \ \ \ \ (8)$

See the following diagram for a visual representation of the decomposition.

This shows hierarchical decomposition of the domain ${\{1,\dots,8\}}$ using 14 intervals. The highlighted squares represent the interval ${\{1,\dots,7\}}$ and the highlighted circles show the decomposition of this interval into a union of ${3}$ intervals in the tree.

Thus we can construct the matrix ${A}$ using this information. Note that each entry of ${A f(X)}$ is the sum of at most ${\log(D)}$ entries of ${f(X)}$. Thus, if we use the output of ${M'(X_{1 \cdots n})}$ in place of ${f(X_{1 \cdots n})}$, for every ${j}$ we obtain

$\displaystyle \Phi_{X}(j) + \mathcal{N}(0, \sigma^2) \quad \textrm{for} \quad \sigma^2 = \frac{ 2 \log^2(D) \log(1/\delta)}{\epsilon^2 n^2}. \ \ \ \ \ (9)$

Applying standard bounds on the expected supremum of a Gaussian process, we have

$\displaystyle \mathop{\mathbb E}(\| M(X_{1 \cdots n}) - \Phi_{X} \|_{\infty}) = O( \sigma \sqrt{\log D}) = O\left(\frac{\log^{3/2}(D) \log^{1/2}(1/\delta)}{\epsilon n} \right). \ \ \ \ \ (10)$

$\Box$

1.2. Why Restrict the Domain?

A drawback of the estimator we constructed is that it only applies to distributions of finite support ${\{1,2,\dots,D\}}$, albeit with a relatively mild dependence on the support size. If privacy isn’t a concern, then no such restriction is necessary, as the bound (2) applies equally well to any distribution over ${{\mathbb R}}$. Can we construct a differentially private estimator for distributions with infinite support?

Perhaps surprisingly, the answer to this question is no! Any differentially private estimator for the CDF of the distribution has to have a rate that depends on the support size, and cannot give non-trivial rates for distributions with infinite support.

Theorem 2 ([BNSV15]) If ${\mathcal{P}}$ consists of all distributions on ${\{1,\dots,D\}}$, then

$\displaystyle \min_{M \in \mathcal{M}_{1, \frac{1}{n}}} \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| M(X_{1 \cdots n}) - \Phi_{P} \|_{\infty}) = \Omega\left(\frac{\log^* D}{n} \right). \ \ \ \ \ (11)$

The notation ${\log^* D}$ refers to the iterated logarithm.

We emphasize that this theorem shouldn’t meet with too much alarm, as ${\log^* D}$ grows remarkably slowly with ${D}$. There are differentially private CDF estimators that achieve very mild dependence on ${D}$ [BNS13, BNSV15], including one nearly matching the lower bound in Theorem 2. Moreover, if we want to estimate a distribution over ${{\mathbb R}}$, and are willing to make some mild regularity conditions on the distribution, then we can approximate it by a distribution with finite support and only increase the rate slightly. However, what Theorem 2 shows is that there is no “one-size-fits-all” solution to private CDF estimation that achieves similar guarantees to the empirical CDF. That is, the right algorithm has to be tailored somewhat to the application and the assumptions we can make about the distribution.

2. More Private Statistics

Of course, the story doesn’t end here! There’s a whole wide world of differentially private statistics beyond what we’ve mentioned already. We proceed to survey just a few other directions of study in private statistics.

2.1. Parameter and Distribution Estimation

A number of the early works in differential privacy give methods for differentially private statistical estimation for i.i.d. data. The earliest works [DN03, DN04, BDMN05, DMNS06], which introduced the Gaussian mechanism, among other foundational results, can be thought of as methods for estimating the mean of a distribution over the hypercube ${\{0,1\}^d}$ in the ${\ell_\infty}$ norm. Tight lower bounds for this problem follow from the tracing attacks introduced in [BUV14, DSSUV15, BSU17, SU17a, SU17b]. A very recent work of Acharya, Sun, and Zhang [ASZ20] adapts classical tools for proving estimation and testing lower bounds (lemmata of Assouad, Fano, and Le Cam) to the differentially private setting. Steinke and Ullman [SU17b] give tight minimax lower bounds for the weaker guarantee of selecting the largest coordinates of the mean, which were refined by Cai, Wang, and Zhang [CWZ19] to give lower bounds for sparse mean-estimation problems.

Nissim, Raskhodnikova, and Smith introduced the highly general sample-and-aggregate paradigm, which they apply to several learning problems (e.g., learning mixtures of Gaussians) [NRS07]. Later, Smith [Smi11] showed that this paradigm can be used to transform any estimator for any asymptotically normal, univariate statistic over a bounded data domain into a differentially private one with the same asymptotic convergence rate.

Subsequent work has focused on both relaxing the assumptions in [Smi11], particularly boundedness, and on giving finite-sample guarantees. Karwa and Vadhan investigated the problem of Gaussian mean estimation, proving the first near-optimal bounds for this setting [KV18]. In particular, exploiting concentration properties of Gaussian data allows us to achieve non-trivial results even with unbounded data, which is impossible in general. Following this, Kamath, Li, Singhal, and Ullman moved to the multivariate setting, investigating the estimation of Gaussians and binary product distributions in total variation distance [KLSU19]. In certain cases (i.e., Gaussians with identity covariance), this is equivalent to mean estimation in ${\ell_2}$-distance, though not always. For example, for binary product distribution, one must estimate the mean in a type of ${\chi^2}$-distance instead. The perspective of distribution estimation rather than parameter estimation can be valuable. Bun, Kamath, Steinke, and Wu [BKSW19] develop a primitive for private hypothesis selection, which they apply to learn any coverable class of distributions under pure differential privacy. Through the lens of distribution estimation, their work implies an upper bound for mean estimation of binary product distributions that bypasses lower bounds for the same problem in the empirical setting. In addition to work on mean estimation in the sub-Gaussian setting, such as the results discussed earlier, mean estimation has also been studied under weaker moment conditions [BS19, KSU20]. Beyond these settings, there has also been study of estimation of discrete multinomials, including estimation in Kolmogorov distance [BNSV15] and in total variation distance for structured distributions [DHS15], and parameter estimation of Markov Random Fields [ZKKW20].

A different approach to constructing differentially private estimators is based on robust statistics. This approah begins with the influential work of Dwork and Lei [DL09], which introduced the propose-test-release framework, and applied to estimating robust statistics such as the median and interquartile range. While the definitions in robust statistics and differential privacy are semantically similar, formal connections between the two remain relatively scant, which suggests a productive area for future study.

2.2. Hypothesis Testing

An influential work of Homer et al. [HSRDTMPSNC08] demonstrated the vulnerability of classical statistics in a genomic setting, showing that certain ${\chi^2}$-statistics on many different variables could allow an attacker to determine the presence of an individual in a genome-wide association study (GWAS). Motivated by these concerns, an early line of work from the statistics community focused on addressing these issues [VS09, USF13, YFSU14].

More recently, work on private hypothesis testing can be divided roughly into two lines. The first focuses on the minimax sample complexity, in a line initiated by Cai, Daskalakis, and Kamath [CDK17], who give an algorithm for privately testing goodness-of-fit (more precisely, a statistician might refer to this problem as one-sample testing of multinomial data). A number of subsequent works have essentially settled the complexity of this problem [ASZ18, ADR18], giving tight upper and lower bounds. Other papers in this line study related problems, including the two-sample version of the problem, independence testing, and goodness-of-fit testing for multivariate product distributions [ASZ18, ADR18, ADKR19, CKMUZ19]. A related paper studies the minimax sample complexity of property estimation, rather than testing of discrete distributions, including support size and entropy [AKSZ18]. Other recent works in this vein focus on testing of simple hypotheses [CKMTZ18, CKMSU19]. In particular [CKMSU19] proves an analogue of the Neyman-Pearson Lemma for differentially private testing of simple hypotheses. A paper of Awan and Slavkovic [AS18] gives a universally optimal test when the domain size is two, however Brenner and Nissim [BN14] shows that such universally optimal tests cannot exist when the domain has more than two elements. A related problem in this space is private change-point detection [CKMTZ18, CKMSU19, CKLZ19] — in this setting, we are given a time series of datapoints which are sampled from a distribution, which at some point, changes to a different distribution. The goal is to (privately) determine when this point occurs.

Complementary to minimax hypothesis testing, a line of work [WLK15, GLRV16, KR17, KSF17, CBRG18, SGGRGB19, CKSBG19] designs differentially private versions of popular test statistics for testing goodness-of-fit, closeness, and independence, as well as private ANOVA, focusing on the performance at small sample sizes. Work by Wang et al. [WKLK18] focuses on generating statistical approximating distributions for differentially private statistics, which they apply to hypothesis testing problems.

2.3. Differential Privacy on Graphs

There is a significant amount of work on differentially private analysis of graphs. We remark that these algorithms can satisfy either edge or node differential privacy. The former (easier) guarantee defines a neighboring graph to be one obtained by adding or removing a single edge, while in the latter (harder) setting, a neighboring graph is one that can be obtained by modifying the set of edges connected to a single node. The main challenge in this area is that most graph statistics can have high sensitivity in the worst-case.

The initial works in this area focused on the empirical setting, and goals range from counting subgraphs [KRSY11, BBDS13, KNRS13, CZ13, RS16] to outputting a privatized graph which approximates the original [GRU12, BBDS12, Upa13, AU19, EKKL20]. In contrast to the setting discussed in most of this series, it seems that there are larger qualitative differences between the study of empirical and population statistics due to the fact that many graph statistics have high worst-case sensitivity, but may have smaller sensitivity on typical graphs from many natural models.

In the population statistics setting, recent work has focused on parameter estimation of the underlying random graph model. So far this work has given estimators for the ${\beta}$-model [KS16] and graphons [BCS15,BCSZ18]. Graphons are a generalization of the stochastic block model, which is, in turn, a generalization of the Erdös-Rényi model. Interestingly, the methods of Lipschitz-extensions introduced in the empirical setting by [BBDS13, KNRS13] are the main tool used in the statistical setting as well. While the first works on private graphon estimation were not computationally efficient, a recent focus has been on obviating these issues for certain important cases, such as the Erdös-Rényi setting [SU19].

Bibliography

[ADKR19] Maryam Aliakbarpour, Ilias Diakonikolas, Daniel M. Kane, and Ronitt Rubinfeld. Private testing of distributions via sample permutations. NeurIPS ’19.

[ADR18] Maryam Aliakbarpour, Ilias Diakonikolas, and Ronitt Rubinfeld. Differentially private identity and closeness testing of discrete distributions. ICML ’18.

[AKSZ18] Jayadev Acharya, Gautam Kamath, Ziteng Sun, and Huanyu Zhang. Inspectre: Privately estimating the unseen. ICML ’18.

[AS18] Jordan Awan and Aleksandra Slavković. Differentially private uniformly most powerful tests for binomial data. NeurIPS ’18.

[ASZ18] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private testing of identity and closeness of discrete distributions. NeurIPS ’18.

[ASZ20] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private Assouad, Fano, and Le Cam. arXiv, 2004.06830, 2020.

[AU19] Raman Arora and Jalaj Upadhyay. On differentially private graph sparsification and applications. NeurIPS ’19.

[BBDS12] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. The Johnson-Lindenstrauss transform itself preserves differential privacy. FOCS ’12.

[BBDS13] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. Differentially private data analysis of social networks via restricted sensitivity. ITCS ’13.

[BCS15] Christian Borgs, Jennifer Chayes, and Adam Smith. Private graphon estimation for sparse graphs. NIPS ’15.

[BCSZ18] Christian Borgs, Jennifer Chayes, Adam Smith, and Ilias Zadik. Revealing network structure, confidentially: Improved rates for node-private graphon estimation. FOCS ’18.

[BDMN05] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: The SuLQ framework. PODS ’05.

[BKSW19] Mark Bun, Gautam Kamath, Thomas Steinke, and Zhiwei Steven Wu. Private hypothesis selection. NeurIPS ’19.

[BN14] Hai Brenner and Kobbi Nissim. Impossibility of differentially private universally optimal mechanisms. SIAM Journal on Computing, 43(5), 2014.

[BNS13] Amos Beimel, Kobbi Nissim, and Uri Stemmer. Private learning and sanitization: Pure vs. approximate differential privacy. APPROX-RANDOM ’13.

[BNSV15] Mark Bun, Kobbi Nissim, Uri Stemmer, and Salil Vadhan. Differentially private release and learning of threshold functions. FOCS ’15.

[BS19] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. NeurIPS ’19.

[BSU17] Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. SODA ’17.

[BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. STOC ’14.

[CBRG18] Zachary Campbell, Andrew Bray, Anna Ritz, and Adam Groce. Differentially private ANOVA testing. ICDIS ’18.

[CDK17] Bryan Cai, Constantinos Daskalakis, and Gautam Kamath. Priv’it: Private and sample efficient identity testing. ICML ’17.

[CKLZ19] Rachel Cummings, Sara Krehbiel, Yuliia Lut, and Wanrong Zhang. Privately detecting changes in unknown distributions. arXiv, 1910.01327, 2019.

[CKMSU19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Adam Smith, and Jonathan Ullman. The structure of optimal private tests for simple hypotheses. STOC ’19.

[CKMTZ18] Rachel Cummings, Sara Krehbiel, Yajun Mei, Rui Tuo, and Wanrong Zhang. Differentially private change-point detection. NeurIPS ’18.

[CKMUZ19] Clément L. Canonne, Gautam Kamath, Audra McMillan, Jonathan Ullman, and Lydia Zakynthinou. Private identity testing for high-dimensional distributions. arXiv, 1905.11947, 2019.

[CKSBG19] Simon Couch, Zeki Kazan, Kaiyan Shi, Andrew Bray, and Adam Groce. Differentially private nonparametric hypothesis testing. CCS ’19.

[CWZ19] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv, 1902.04495, 2019.

[CZ13] Shixi Chen and Shuigeng Zhou. Recursive mechanism: Towards node differential privacy and unrestricted joins. SIGMOD ’13.

[DHS15] Ilias Diakonikolas, Moritz Hardt, and Ludwig Schmidt. Differentially private learning of structured discrete distributions. NIPS ’15.

[DKW56] Aryeh Dvoretzky, Jack Kiefer, and Jacob Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. The Annals of Mathematical Statistics, 27(3), 1956.

[DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. STOC ’09.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. TCC ’06.

[DN03] Irit Dinur and Kobbi Nissim. Revealing information while preserving privacy. PODS ’03.

[DN04] Cynthia Dwork and Kobbi Nissim. Privacy-preserving datamining on vertically partitioned databases. CRYPTO ’04.

[DSSUV15] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. FOCS ’15.

[EKKL20] Marek Eliáš, Michael Kapralov, Janardhan Kulkarni, and Yin Tat Lee. Differentially private release of synthetic graphs. SODA ’20.

[GLRV16] Marco Gaboardi, Hyun-Woo Lim, Ryan M. Rogers, and Salil P. Vadhan. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. ICML ’16.

[GRU12] Anupam Gupta, Aaron Roth, and Jonathan Ullman. Iterative constructions and private data release. TCC ’12.

[HSRDTMPSNC08] Nils Homer, Szabolcs Szelinger, Margot Redman, David Duggan, Waibhav Tembe, Jill Muehling, John V. Pearson, Dietrich A. Stephan, Stanley F. Nelson, and David W. Craig. PLoS Genetics, 4(8), 2008.

[KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. COLT ’19.

[KNRS13] Shiva Prasad Kasiviswanathan, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Analyzing graphs with node differential privacy. TCC ’13.

[KR17] Daniel Kifer and Ryan M. Rogers. A new class of private chi-square tests. AISTATS ’17.

[KRSY11] Vishesh Karwa, Sofya Raskhodnikova, Adam Smith, and Grigory Yaroslavtsev. Private analysis of graph structure. VLDB ’11.

[KS16] Vishesh Karwa and Aleksandra Slavković. Inference using noisy degrees: Differentially private β-model and synthetic graphs. The Annals of Statistics, 44(1), 2016.

[KSF17] Kazuya Kakizaki, Jun Sakuma, and Kazuto Fukuchi. Differentially private chi-squared test by unit circle mechanism. ICML ’17.

[KSU20] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. arXiv, 2002.09464, 2020.

[KV18] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. ITCS ’18.

[Mas90] Pascal Massart. The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality. The Annals of Probability, 18(3), 1990.

[NRS07] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. STOC ’07.

[RS16] Sofya Raskhodnikova and Adam D. Smith. Lipschitz extensions for node-private graph statistics and the generalized exponential mechanism. FOCS ’16.

[Smi11] Adam Smith. Privacy-preserving statistical estimation with optimal convergence rates. STOC ’11.

[SGGRGB19] Marika Swanberg, Ira Globus-Harris, Iris Griffith, Anna Ritz, Adam Groce, and Andrew Bray. Improved differentially private analysis of variance. PETS ’19.

[SU17a] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. Journal of Privacy and Confidentiality, 7(2), 2017.

[SU17b] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. FOCS ’17.

[SU19] Adam Sealfon and Jonathan Ullman. Efficiently estimating Erdos-Renyi graphs with node differential privacy. NeurIPS ’19.

[Upa13] Jalaj Upadhyay. Random projections, graph sparsification, and differential privacy. ASIACRYPT ’13.

[USF13] Caroline Uhler, Aleksandra Slavković, and Stephen E. Fienberg. Privacy-preserving data sharing for genome-wide association studies. The Journal of Privacy and Confidentiality, 5(1), 2013.

[VS09] Duy Vu and Aleksandra Slavković. Differential privacy for clinical trial data: Preliminary evaluations. ICDMW ’09.

[WKLK18] Yue Wang, Daniel Kifer, Jaewoo Lee, and Vishesh Karwa. Statistical approximating distributions under differential privacy. The Journal of Privacy and Confidentiality, 8(1), 2018.

[WLK15] Yue Wang, Jaewoo Lee, and Daniel Kifer. Revisiting differentially private hypothesis tests for categorical data. arXiv, 1511.03376, 2015.

[YFSU14] Fei Yu, Stephen E. Fienberg, Aleksandra B. Slavković, and Caroline Uhler. Scalable privacy-preserving data sharing methodology for genome-wide association studies. Journal of Biomedical Informatics, 50, 2014.

[ZKKW20] Huanyu Zhang, Gautam Kamath, Janardhan Kulkarni, and Zhiwei Steven Wu. Privately learning Markov random fields. arXiv, 2002.09463, 2020.

## A Primer on Private Statistics – Part I

Differentially private statistics is a very lively research area, and has seen a lot of activity in the last couple years. While the phrasing is a slight departure from previous work which focused on estimation with worst-case datasets, it turns out that the differences are often superficial. In a short series of blog posts, we hope to educate readers on some of the recent advancements in this area, as well as shed light on some of the connections between the old and the new. We’ll describe the settings, cover a couple of technical examples, and give pointers to some other directions in the area. Thanks to Adam Smith for helping kick off this project, Clément Canonne, Aaron Roth, and Thomas Steinke for helpful comments, and Luca Trevisan for his LaTeX2WP script.

1. Introduction

Statistics and machine learning are now ubiquitous in data analysis. Given a dataset, one immediately wonders what it allows us to infer about the underlying population. However, modern datasets don’t exist in a vacuum: they often contain sensitive information about the individuals they represent. Without proper care, statistical procedures will result in gross violations of privacy. Motivated by the shortcomings of ad hoc methods for data anonymization, Dwork, McSherry, Nissim, and Smith introduced the celebrated notion of differential privacy [DMNS06].

From its inception, some of the driving motivations for differential privacy were applications in statistics and the social sciences, notably disclosure limitation for the US Census. And yet, the lion’s share of differential privacy research has taken place within the computer science community. As a result, the specific applications being studied are often not formulated using statistical terminology, or even as statistical problems. Perhaps most significantly, much of the early work in computer science (though definitely not all) focus on estimating some property of a dataset rather than estimating some property of an underlying population.

Although the earliest works exploring the interaction between differential privacy and classical statistics go back to at least 2009 [VS09,FRY10], the emphasis on differentially private statistical inference in the computer science literature is somewhat more recent. However, while earlier results on differential privacy did not always formulate problems in a statistical language, statistical inference was a key motivation for most of this work. As a result many of the techniques that were developed have direct applications in statistics, for example establishing minimax rates for estimation problems.

The purpose of this series of blog posts is to highlight some of those results in the computer science literature, and present them in a more statistical language. Specifically, we will discuss:

• Tight minimax lower bounds for privately estimating the mean of a multivariate distribution over ${{\mathbb R}^d}$, using the technique of tracing attacks developed in [BUV14,DSSUV15, BSU17, SU17a, SU17b, KLSU19].

• Upper bounds for estimating a distribution in Kolmogorov distance, using the ubiquitous binary-tree mechanism introduced in [DNPR10,CSS11].

In particular, we hope to encourage computer scientists working on differential privacy to pay more attention to the applications of their methods in statistics, and share with statisticians many of the powerful techniques that have been developed in the computer science literature.

1.1. Formulating Private Statistical Inference

Essentially every differentially private statistical estimation task can be phrased using the following setup. We are given a dataset ${X = (X_1, \dots, X_n)}$ of size ${n}$, and we wish to design an algorithm ${M \in \mathcal{M}}$ where ${\mathcal{M}}$ is the class of mechanisms that are both:

1. differentially private, and
2. accurate, either in expectation or with high probability, according to some task-specific measure.

A few comments about this framework are in order. First, although the accuracy requirement is stochastic in nature (i.e., an algorithm might not be accurate depending on the randomness of the algorithm and the data generation process), the privacy requirement is worst-case in nature. That is, the algorithm must protect privacy for every dataset ${X}$, even those we believe are very unlikely.

Second, the accuracy requirement is stated rather vaguely. This is because the notion of accuracy of an algorithm is slightly more nuanced, depending on whether we are concerned with empirical or population statistics. A particular emphasis of these blog posts is to explore the difference (or, as we will see, the lack of a difference) between these two notions of accuracy. The former estimates a quantity of the observed dataset, while the latter estimates a quantity of an unobserved distribution which is assumed to have generated the dataset.

More precisely, the former can be phrased in terms of empirical loss, of the form:

$\displaystyle \min_{M \in \mathcal{M}}~\max_{X \in \mathcal{X}}~\mathop{\mathbb E}_M(\ell(M(X), f(X))),$

where ${\mathcal{M}}$ is some class of randomized estimators (e.g., differentially private estimators), ${\mathcal{X}}$ is some class of datasets, ${f}$ is some quantity of interest, and ${\ell}$ is some loss function. That is, we’re looking to find an estimator that has small expected loss on any dataset in some class.

In contrast, statistical minimax theory looks at statements about population loss, of the form:

$\displaystyle \min_{M \in \mathcal{M}}~\max_{P \in \mathcal{P}}~\mathop{\mathbb E}_{X \sim P, M}(\ell(M(X),f(P))),$

where ${\mathcal{P}}$ is some family of distributions over datasets (typically consisting of i.i.d. samples). That is, we’re looking to find an estimator that has small expected loss on random data from any distribution in some class. In particular, note that the randomness in this objective additionally includes the data generating procedure ${X \sim P}$.

These two formulations are formally very different in several ways. First, the empirical formulation requires an estimator to have small loss on worst-case datasets, whereas the statistical formulation only requires the estimator to have small loss on average over datasets drawn from certain distributions. Second, the statistical formulation requires that we estimate the unknown quantity ${f(P)}$, and thus necessitates a solution to the non-private estimation problem. On the other hand, the empirical formulation only asks us to estimate the known quantity ${f(X)}$, and thus if there were no privacy constraint it would always be possible to compute ${f(X)}$ exactly. Third, typically in the statistical formulation, we require that the dataset is drawn i.i.d., which means that we are more constrained when proving lower bounds for estimation than we are in the empirical problem.

However, in practice (more precisely, in the practice of doing theoretical research), these two formulations are more alike than they are different, and results about one formulation often imply results about the other formulation. On the algorithmic side, classical statistical results will often tell us that ${\ell(f(X),f(P))}$ is small, in which case algorithms that guarantee ${\ell(M(X),f(X))}$ is small also guarantee ${\ell(M(X),f(P))}$ is small.

Moreover, typical lower bound arguments for empirical quantities are often statistical in nature. These typically involving constructing some simple “hard distribution” over datasets such that no private algorithm can estimate well on average for this distribution, and thus these lower bound arguments also apply to estimating population statistics for some simple family of distributions. We will proceed to give some examples of estimation problems that were originally studied by computer scientists with the empirical formulation in mind. These results either implicitly or explicitly provide solutions to the corresponding population versions of the same problems—our goal is to spell out and illustrate these connections.

2. DP Background

Let ${X = (X_1,X_2,\dots,X_n) \in \mathcal{X}^n}$ be a collection of ${n}$ samples where each individual sample comes from the domain ${\mathcal{X}}$. We say that two samples ${X,X' \in \mathcal{X}^*}$ are adjacent, denoted ${X \sim X'}$, if they differ on at most one individual sample. Intuitively, a randomized algorithm ${M}$, which is often called a mechanism for historical reasons, is differentially private if the distribution of ${M(X)}$ and ${M(X')}$ are similar for every pair of adjacent samples ${X,X'}$.

Definition 1 ([DMNS06]) A mechanism ${M \colon \mathcal{X}^n \rightarrow \mathcal{R}}$ is ${(\epsilon,\delta)}$-differentially private if for every pair of adjacent datasets ${X \sim X'}$, and every (measurable) ${R \subseteq R}$

$\displaystyle \mathop{\mathbb P}(M(X) \in R) \leq e^{\epsilon} \cdot \mathop{\mathbb P}(M(X') \in R) + \delta.$

We let ${\mathcal{M}_{\epsilon,\delta}}$ denote the set of mechanisms that satisfy ${(\epsilon,\delta)}$-differential privacy.

Remark 1 To simplify notation, and to maintain consistency with the literature, we adopt the convention of defining the mechanism only for a fixed sample size ${n}$. What this means in practice is that the mechanisms we describe treat the sample size ${n}$ is public information that need not be kept private. While one could define a more general model where ${n}$ is not fixed, it wouldn’t add anything to this discussion other than additional complexity.

Remark 2 In these blog posts, we stick to the most general formulation of differential privacy, so-called approximate differential privacy, i.e. ${(\epsilon,\delta)}$-differential privacy for ${\delta > 0}$ essentially because this is the notion that captures the widest variety of private mechanisms. Almost all of what follows would apply equally well, with minor technical modifications, to slightly stricter notions of concentrated differential privacy [DR16, BS16, BDRS18], Rényi differential privacy [Mir17], or Gaussian differential privacy [DRS19]. While so-called pure differential privacy, i.e. ${(\epsilon,0)}$-differential privacy has also been studied extensively, this notion is artificially restrictive and excludes many differentially private mechanisms.

A key property of differential privacy that helps when desinging efficient estimators is closure under postprocessing:

Lemma 2 (Post-Processing [DMNS06])  If ${M \colon \mathcal{X}^n \rightarrow \mathcal{R}}$ is ${(\epsilon,\delta)}$-differentially private and ${M' \colon \mathcal{R} \rightarrow \mathcal{R}'}$ is any randomized algorithm, then ${M' \circ M}$ is ${(\epsilon,\delta)}$-differentially private.

The estimators we present in this work will use only one tool for achieving differential privacy, the Gaussian Mechanism.

Lemma 3 (Gaussian Mechanism) Let ${f \colon \mathcal{X}^n \rightarrow {\mathbb R}^d}$ be a function and let

$\displaystyle \Delta_{f} = \sup_{X\sim X'} \| f(X) - f(X') \|_2$

denote its ${\ell_2}$-sensitivity. The Gaussian mechanism

$\displaystyle M(X) = f(X) + \mathcal{N}\left(0 , \frac{2 \log(2/\delta)}{\epsilon^2} \cdot \Delta_{f}^2 \cdot {\mathbb I}_{d \times d} \right)$

satisfies ${(\epsilon,\delta)}$-differential privacy.

3. Mean Estimation in ${{\mathbb R}^d}$

Let’s take a dive into the problem of private mean estimation for some family ${\mathcal{P}}$ of multivariate distributions over ${{\mathbb R}^d}$. This problem has been studied for various families ${\mathcal{P}}$ and various choices of loss function. Here we focus on perhaps the simplest variant of the problem, in which ${\mathcal{P}}$ contains distributions of bounded support ${[\pm 1]^d}$ and the loss is the ${\ell_2^2}$ error. We emphasize, however, that the methods we discuss here are quite versatile and can be used to derive minimax bounds for other variants of the mean-estimation problem.

Note that, by a simple argument, the non-private minimax rate for this class is achieved by the empirical mean, and is

$\displaystyle \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| \overline{X} - \mu\|_2^2) = \frac{d}{n}. \ \ \ \ \ (1)$

The main goal of this section is to derive the minimax bound

$\displaystyle \min_{M \in \mathcal{M}_{\epsilon,\frac{1}{n}}} \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| M(X_{1 \cdots n}) - \mu \|_2^2) = \frac{d}{n} + \tilde\Theta\left(\frac{d^2}{\epsilon^2 n^2}\right). \ \ \ \ \ (2)$

Recall that ${\tilde \Theta(f(n))}$ refers to a function which is both ${O(f(n) \log^{c_1} f(n))}$ and ${\Omega(f(n) \log^{c_2} f(n))}$ for some constants ${c_1, c_2}$. The proof of this lower bound is based on robust tracing attacks, also called membership inference attacks, which were developed in a chain of papers [BUV14, DSSUV15, BSU17, SU17a, SU17b, KLSU19]. We remark that this lower bound is almost identical to the minimax bound for mean estimation proven in the much more recent work of Cai, Wang, and Zhang [CWZ19], but it lacks tight dependence on the parameter ${\delta}$, which we discuss in the following remark.

Remark 3 The choice of ${\delta = 1/n}$ in (2) may look strange at first. For the upper bound this choice is arbitrary—as we will see, we can upper bound the rate for any ${\delta > 0}$ at a cost of a factor of ${O(\log(1/\delta))}$. The lower bound applies only when ${\delta \leq 1/n}$. Note that the rate is qualitatively different when ${\delta \gg 1/n}$. However, we emphasize that ${(\epsilon,\delta)}$-differential privacy is not a meaningful privacy notion unless ${\delta \ll 1/n}$. In particular, the mechanism that randomly outputs ${\delta n}$ elements of the sample satisfies ${(0,\delta)}$-differential privacy. However, when ${\delta \gg 1/n}$, this mechanism completely violates the privacy of ${\gg 1}$ person in the dataset. Moreover, taking the empirical mean of these ${\delta n}$ samples gives rate ${d/\delta n}$, which would violate our lower bound when ${\delta}$ is large enough. On the other hand, we would expect the minimax rate to become slower when ${\delta \ll 1/n}$. This expectation is, in fact, correct, however the proof we present does not give the tight dependence on the parameter ${\delta}$. See [SU17a] for a refinement that can obtain the right dependence on ${\delta}$, and [CWZ19] for the details of how to apply this refinement in the i.i.d. setting.

3.1. A Simple Upper Bound

Theorem 4 For every ${n \in {\mathbb N}}$, and every ${\epsilon,\delta > 0}$, there exists an ${(\epsilon,\delta)}$-differentially private private mechanism ${M}$ such that

$\displaystyle \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P}(\| M(X_{1 \cdots n}) - \mu \|_2^2) \leq \frac{d}{n} + \frac{2 d^2 \log(2/\delta)}{\epsilon^2 n^2}. \ \ \ \ \ (3)$

Proof: Define the mechanism

$\displaystyle M(X_{1 \cdots n}) = \overline{X} + \mathcal{N}\left(0, \frac{2 d \log(2/\delta)}{\varepsilon^2 n^2} \cdot \mathbb{I}_{d \times d} \right). \ \ \ \ \ (4)$

This mechanism satisfies ${(\epsilon,\delta)}$-differential privacy by Lemma 3, noting that for any pair of adjacent samples ${X_{1 \cdots n}}$ and ${X'_{1 \cdots n}}$, ${\| \overline{X} - \overline{X}'\|_2^2 \leq \frac{d}{n^2}}$.

Let ${\sigma^2 = \frac{2 d \log(2/\delta)}{\varepsilon^2 n^2}}$. Note that since the Gaussian noise has mean ${0}$ and is independent of ${\overline{X} - \mu}$, we have

$\displaystyle \begin{array}{rll} \mathop{\mathbb E}(\| M(X_{1 \cdots n}) - \mu \|_2^2) ={} &\mathop{\mathbb E}(\| \overline{X} - \mu \|_2^2) + \mathop{\mathbb E}(\| M(X_{1 \cdots n}) - \overline{X} \|_2^2 ) \\ \leq{} &\frac{d}{n} + \mathop{\mathbb E}(\| M(X_{1 \cdots n}) - \overline{X} \|_2^2 ) \\ ={} &\frac{d}{n} + \mathop{\mathbb E}(\| \mathcal{N}(0, \sigma^2 \mathbb{I}_{d \times d}) \|_2^2 ) \\ ={} &\frac{d}{n} + \sigma^2 d \\ ={} &\frac{d}{n} + \frac{2 d^2 \log(2/\delta)}{\epsilon^2 n^2}. \end{array}$

$\Box$

3.2. Minimax Lower Bounds via Tracing

Theorem 5 For every ${n, d \in {\mathbb N}}$, ${\epsilon > 0}$, and ${\delta < 1/96n}$, if ${\mathcal{P}}$ is the class of all product distributions on ${\{\pm 1\}^{d}}$, then for some constant ${C > 0}$,

$\displaystyle \min_{M \in \mathcal{M}_{\epsilon,\delta}} \max_{P \in \mathcal{P}} \mathop{\mathbb E}_{X_{1 \cdots n} \sim P,M}(\| M(X_{1 \cdots n}) - \mu \|_2^2) = \Omega\left(\min \left\{ \frac{d^2}{ \epsilon^2 n^2}, d \right\}\right).$

Note that it is trivial to achieve error ${d}$ for any distribution using the mechanism ${M(X_{1 \cdots n}) \equiv 0}$, so the result says that the error must be ${\Omega(d^2/\epsilon^2 n^2)}$ whenever this error is significantly smaller than the trivial error of ${d}$.

Tracing Attacks.

Before giving the formal proof, we will try to give some intuition for the high-level proof strategy. The proof can be viewed as constructing a tracing attack [DSSU17] (sometimes called a membership inference attack) of the following form. There is an attacker who has the data of some individual ${Y}$ chosen in one of the two ways: either ${Y}$ is a random element of the sample ${X}$, or ${Y}$ is an independent random sample from the population ${P}$. The attacker is given access to the true distribution ${P}$ and the outcome of the mechanism ${M(X)}$, and wants to determine which of the two is the case. If the attacker can succeed, then ${M}$ cannot be differentially private. To understand why this is the case, if ${Y}$ is a member of the dataset, then the attacker should say ${Y}$ is in the dataset, but if we consider the adjacent dataset ${X'}$ where we replace ${Y}$ with some independent sample from ${P}$, then the attacker will now say ${Y}$ is independent of the dataset. Thus, ${M(X)}$ and ${M(X')}$ cannot be close in the sense required by differential privacy.

Thus, the proof works by constructing a test statistic ${Z = Z(M(X),Y,P),}$ that the attacker can use to distinguish the two possibilities for ${Y}$. In particular, we show that there is a distribution over populations ${P}$ such that ${\mathop{\mathbb E}(Z)}$ is small when ${Y}$ is independent of ${X}$, but for every sufficiently accurate mechanism ${M}$, ${\mathop{\mathbb E}(Z)}$ is large when ${Y}$ is a random element of ${X}$.

Proof of Theorem 5.

The proof that we present closely follows the one that appears in Thomas Steinke’s Ph.D. thesis [Ste16].

We start by constructing a “hard distribution” over the family of product distributions ${\mathcal{P}}$. Let ${\mu = (\mu^1,\dots,\mu^d) \in [-1,1]^d}$ consist of ${d}$ independent draws from the uniform distribution on ${[-1,1]}$ and let ${P_{\mu}}$ be the product distribution over ${\{\pm 1\}^{d}}$ with mean ${\mu}$. Let ${X_1,\dots,X_n \sim P_{\mu}}$ and ${X = (X_1,\dots,X_n)}$.

Let ${M \colon \{\pm 1\}^{n \times d} \rightarrow [\pm 1]^d}$ be any ${(\epsilon,\delta)}$-differentially private mechanism and let

$\displaystyle \alpha^2 = \mathop{\mathbb E}_{\mu,X,M}(\| M(X) - \mu\|_2^2 ) \ \ \ \ \ (5)$

be its expected loss. We will prove the desired lower bound on ${\alpha^2}$.

For every element ${i}$, we define the random variables

$\displaystyle Z_i = Z_i(M(X),X_i,\mu) = \left\langle M(X) - \mu, X_i - \mu \right\rangle \\ Z'_{i} = Z'_i(M(X_{\sim i}), X_i, \mu) = \left\langle M(X_{\sim i}) - \mu, X_i - \mu \right\rangle,$

where ${X_{\sim i}}$ denotes ${(X_1,\dots,X'_i,\dots,X_n)}$ where ${X'_i}$ is an independent sample from ${P_\mu}$. Our goal will be to show that, privacy and accuracy imply both upper and lower bounds on ${\mathop{\mathbb E}(\sum_i Z_i)}$ that depend on ${\alpha}$, and thereby obtain a bound on ${\alpha^2}$.

The first claim says that, when ${X_i}$ is not in the sample, then the likelihood random variable has mean ${0}$ and variance controlled by the expected ${\ell_2^2}$ error of the mechanism.

Claim 1 For every ${i}$, ${\mathop{\mathbb E}(Z'_i) = 0}$, ${\mathrm{Var}(Z'_i) \leq 4\alpha^2}$, and ${\|Z'_i\|_\infty \leq 4d}$.

Proof: Conditioned on any value of ${\mu}$, ${M(X_{\sim i})}$ is independent from ${X_i}$. Moreover, ${\mathop{\mathbb E}(X_i - \mu) = 0}$, so we have

$\displaystyle \begin{array}{rll} &\mathop{\mathbb E}_{\mu,X,M}(\langle M(X_{\sim i}) - \mu, X_i - \mu \rangle) \\ = &\mathop{\mathbb E}_{\mu}(\mathop{\mathbb E}_{X,M}(\langle M(X_{\sim i}) - \mu, X_i - \mu \rangle)) \\ = &\mathop{\mathbb E}_{\mu}(\left\langle \mathop{\mathbb E}_{X,M}(M(X_{\sim i}) - \mu), \mathop{\mathbb E}_{X,M}(X_i - \mu) \right \rangle ) \\ = &\mathop{\mathbb E}_{\mu}(\left\langle \mathop{\mathbb E}_{X,M}(M(X_{\sim i}) - \mu), 0 \right \rangle ) \\ = &0. \end{array}$

For the second part of the claim, since ${(X_i - \mu)^2 \leq 4}$, we have ${\mathrm{Var}(Z'_i) \leq 4 \cdot \mathop{\mathbb E}(\| M(X) - \mu \|_2^2) = 4\alpha^2}$. The final part of the claim follows from the fact that every entry of ${M(X_{\sim i}) - \mu}$ and ${X_i - \mu}$ is bounded by ${2}$ in absolute value, and ${Z'_i}$ is a sum of ${d}$ such entries, so its absolute value is always at most ${4d}$. $\Box$

The next claim says that, because ${M}$ is differentially private, ${Z_i}$ has similar expectation to ${Z'_i}$, and thus its expectation is also small.

Claim 2 ${\mathop{\mathbb E}(\sum_{i=1}^{n} Z_i) \leq 4n\alpha \epsilon + 8n \delta d.}$

Proof: The proof is a direct calculation using the following inequality, whose proof is relatively simple using the definition of differential privacy:

$\displaystyle \mathop{\mathbb E}(Z_i) \leq \mathop{\mathbb E}(Z'_i) + 2\epsilon \sqrt{\mathrm{Var}(Z'_i)} + 2\delta \| Z'_i \|_\infty.$

Given the inequality and Claim 1, we have

$\displaystyle \mathop{\mathbb E}(Z_i) \leq 0 + (2\epsilon)(2\alpha) + (2\delta)(2d) = 4\epsilon \alpha + 8 \delta d .$

The claim now follows by summing over all ${i}$. $\Box$

The final claim says that, because ${M}$ is accurate, the expected sum of the random variables ${Z_i}$ is large.

Claim 3 ${\mathop{\mathbb E}(\sum_{i=1}^{n} Z_i) \geq \frac{d}{3} - \alpha^2.}$

The proof relies on the following key lemma, whose proof we omit.

Lemma 6 (Fingerprinting Lemma [BSU17])  If ${\mu \in [\pm 1]}$ is sampled uniformly, ${X_1,\dots,X_n \in \{\pm 1\}^{n}}$ are sampled independently with mean ${\mu}$, and ${f \colon \{\pm 1\}^n \rightarrow [\pm 1]}$ is any function, then

$\displaystyle \mathop{\mathbb E}_{\mu,X}((f(X) - \mu) \cdot \sum_{i=1}^{n} (X_i - \mu)) \geq \frac{1}{3} - \mathop{\mathbb E}_{\mu,X}((f(X) - \mu)^2).$

The lemma is somewhat technical, but for intuition, consider the case where ${f(X) = \frac{1}{n}\sum_{i} X_i}$ is the empirical mean. In this case we have

$\displaystyle \begin{array}{rcl} \mathop{\mathbb E}_{\mu,X}((f(X) - \mu) \cdot \sum_{i=1}^n (X_i - \mu)) ={} \mathop{\mathbb E}_{\mu}(\frac{1}{n} \sum_i \mathop{\mathbb E}_{X}( (X_i - \mu)^2) ) ={} \mathop{\mathbb E}_{\mu}(\mathrm{Var}(X_i)) = \frac{1}{3}. \end{array}$

The lemma says that, when ${\mu}$ is sampled this way, then any modification of ${f}$ that reduces the correlation between ${f(X)}$ and ${\sum_i X_i}$ will increase the mean-squared-error of ${f}$ proportionally.

We now prove Claim 3.

Proof: We can apply the lemma to each coordinate of the estimate ${M(X)}$.

$\displaystyle \begin{array}{rll} \mathop{\mathbb E}(\sum_{i=1}^{n} Z_i) ={} &\mathop{\mathbb E}(\sum_{i=1}^{n} \left\langle M(X) - \mu, X_i - \mu \right\rangle) \\ ={} &\sum_{j=1}^{d} \mathop{\mathbb E}((M^j(X) - \mu^j)\cdot \sum_{i=1}^{n} (X_i^j - \mu^j)) \\ \geq{} &\sum_{j=1}^{d} \left( \frac{1}{3} - \mathop{\mathbb E}((M^j(X) - \mu^j)^2) \right) \\ ={} &\frac{d}{3} - \mathop{\mathbb E}(\| M(X) - \mu \|_2^2) ={} \frac{d}{3} - \alpha^2. \end{array}$

The inequality is Lemma 6. $\Box$

Combining Claims 2 and 3 gives

$\displaystyle \frac{d}{3} - \alpha^2 \leq 4n\alpha \epsilon + 8n \delta d. \ \ \ \ \ (6)$

Now, if ${\alpha^2 \geq \frac{d}{6}}$ then we’re done, so we’ll assume that ${\alpha^2 \leq \frac{d}{6}}$. Further, by our assumption on the value of ${\delta}$, ${8n \delta d \leq \frac{d}{12}}$. In this case we can rearrange terms and square both sides to obtain

$\displaystyle \alpha^2 \geq{} \frac{1}{16 \epsilon^2 n^2} \left(\frac{d}{3} - \alpha^2 - 8 n\delta d\right)^2 \geq \frac{1}{16 \epsilon^2 n^2} \left(\frac{d}{12}\right)^2 = \frac{d^2}{2304 \epsilon^2 n^2}. \ \ \ \ \ (7)$

Combining the two cases for ${\alpha^2}$ gives ${\alpha^2 \geq \min\{ \frac{d}{6}, \frac{d^2}{2304 \epsilon^2 n^2} \}}$, as desired.

Bibliography

[BDRS18] Mark Bun, Cynthia Dwork, Guy N. Rothblum, and Thomas Steinke. Composable and versatile privacy via truncated CDP. STOC ’18.

[BS16] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. TCC ’16-B.

[BSU17] Mark Bun, Thomas Steinke, and Jonathan Ullman. Make up your mind: The price of online queries in differential privacy. SODA ’17.

[BUV14] Mark Bun, Jonathan Ullman, and Salil Vadhan. Fingerprinting codes and the price of approximate differential privacy. STOC ’14.

[CSS11] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security, 14(3):26, 2011.

[CWZ19] T. Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv, 1902.04495, 2019.

[DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. TCC ’06.

[DNPR10] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N. Rothblum. Differential privacy under continual observation. STOC ’10.

[DR16] Cynthia Dwork and Guy N. Rothblum. Concentrated differential privacy. arXiv, 1603.01887, 2016.

[DRS19] Jinshuo Dong, Aaron Roth, and Weijie J. Su. Gaussian differential privacy. arXiv, 1905.02383, 2019.

[DSSU17] Cynthia Dwork, Adam Smith, Thomas Steinke, Jonathan Ullman, and Salil Vadhan. Robust traceability from trace amounts. FOCS ’15.

[DSSUV15] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. Exposed! a survey of attacks on private data. Annual Review of Statistics and Its Application, 4:61–84, 2017.

[FRY10] Stephen E. Fienberg, Alessandro Rinaldo, and Xiaolin Yang. Differential privacy and the risk-utility tradeoff for multi-dimensional contingency tables. PSD ’10.

[KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. COLT ’19.

[Mir17] Ilya Mironov. Rényi differential privacy. CSF ’17.

[Ste16] Thomas Alexander Steinke. Upper and Lower Bounds for Privacy and Adaptivity in Algorithmic Data Analysis. PhD thesis, 2016.

[SU17a] Thomas Steinke and Jonathan Ullman. Between pure and approximate differential privacy. Journal of Privacy and Confidentiality, 7(2), 2017.

[SU17b] Thomas Steinke and Jonathan Ullman. Tight lower bounds for differentially private selection. FOCS ’17.

[VS09] Duy Vu and Aleksandra Slavković. Differential privacy for clinical trial data: Preliminary evaluations. ICDMW ’09.

## ICALP (and LICS) 2020 – Relocation and Extended Deadline

Due to the Wuhan coronavirus outbreak, the organizers of ICALP and LICS have made the difficult decision to relocate both (co-located) conferences from Beijing, China, to Saarbrücken, Germany. Speaking specifically about ICALP now (I do not have further information about LICS): As a result of previous uncertainty regarding the situation, the deadline has been extended by about six days, until Tuesday February 18, 2020, at 6 AM GMT. The dates of the conference remain (roughly) the same, July 8 – 11, 2020.
The following is a more official message from ICALP Track A Chair, Artur Czumaj.

The ICALP and the LICS steering committee have agreed together with the conference chairs in Beijing to relocate the two conferences.
ICALP and LICS 2020 will take place in Saarbrücken, Germany, July 8 – 11 2020 (with satellite workshops on July 6 – 7 2020).
The deadline is extended, see below.

Call for Papers – ICALP 2020
July 8 – 11 2020, Saarbrücken, Germany

NEW Paper submission deadline: Tuesday February 18, 2020, 6am GMT
https://easychair.org/conferences/?conf=icalp2020

ICALP (International Colloquium on Automata, Languages and Programming) is the main European conference in Theoretical Computer Science and annual meeting of the European Association for Theoretical Computer Science (EATCS). ICALP 2020 will be hosted on the Saarland Informatics Campus in Saarbrücken, in co-location with LICS 2020 (ACM/IEEE Symposium on Logic in Computer Science).

Invited speakers:
Track A: Virginia Vassilevska (MIT), Robert Krauthgamer (Weizmann)
Track B: Stefan Kiefer (Oxford)
Joint ICALP-LICS: Andrew Yao (Tsinghua), Jérôme Leroux (Bordeaux)

Submission Guidelines: see https://easychair.org/conferences/?conf=icalp2020

NEW Paper submission deadline: February 18, 2020, 6am GMT

Topics: ICALP 2020 will have the two traditional tracks
A (Algorithms, Complexity and Games – including Algorithmic Game Theory, Distributed Algorithms and Parallel, Distributed and External Memory Computing) and
B (Automata, Logic, Semantics and Theory of Programming).
(Notice that the old tracks A and C have been merged into a single track A.)
Papers presenting original, unpublished research on all aspects of theoretical computer science are sought.

Typical, but not exclusive topics are:

Track A — Algorithmic Aspects of Networks and Networking, Algorithms for Computational Biology, Algorithmic Game Theory, Combinatorial Optimization, Combinatorics in Computer Science, Computational Complexity, Computational Geometry, Computational Learning Theory, Cryptography, Data Structures, Design and Analysis of Algorithms, Foundations of Machine Learning, Foundations of Privacy, Trust and Reputation in Network, Network Models for Distributed Computing, Network Economics and Incentive-Based Computing Related to Networks, Network Mining and Analysis, Parallel, Distributed and External Memory Computing, Quantum Computing, Randomness in Computation, Theory of Security in Networks

Track B — Algebraic and Categorical Models, Automata, Games, and Formal Languages, Emerging and Non-standard Models of Computation, Databases, Semi-Structured Data and Finite Model Theory, Formal and Logical Aspects of Learning, Logic in Computer Science, Theorem Proving and Model Checking, Models of Concurrent, Distributed, and Mobile Systems, Models of Reactive, Hybrid and Stochastic Systems, Principles and Semantics of Programming Languages, Program Analysis and Transformation, Specification, Verification and Synthesis, Type Systems and Theory, Typed Calculi

PC Track A chair: Artur Czumaj (University  of Warwick)
PC Track B chair: Anuj Dawar (University of Cambridge)

Contact
All questions about submissions should be emailed to the PC Track chairs:
Artur Czumaj A.Czumaj@warwick.ac.uk<mailto:A.Czumaj@warwick.ac.uk>
Anuj Dawar Anuj.Dawar@cl.cam.ac.uk<mailto:Anuj.Dawar@cl.cam.ac.uk>

## Theory and Practice of Differential Privacy 2019

While I’m a relative newcomer to differential privacy (my first paper on it was only in 2017), I’ve found the community to be a pleasure to interact with: paradoxically, simultaneously tight-knit yet highly welcoming to newcomers. I partially credit this culture to the number of workshops and programs which bring people together, including, but not limited to, a BIRS workshop, the Privacy Tools project at Harvard, a semester at the Simons Institute, the forthcoming Shonan workshop, and the Theory and Practice of Differential Privacy (TPDP) Workshop.

I’m writing this post to draw attention to the imminent deadline of TPDP 2019, co-located with CCS 2019 in London. I’ll spare you the full details (click the link for more information), but most pressing is the deadline tomorrow, June 21, 2019, anywhere on Earth (let me know if this presents hardship for you, and I can pass concerns on to the chair). Essentially anything related to the theory or practice of differential privacy is welcome. Submissions are limited to four pages in length and are lightly refereed, based on originality, relevance, interest, and clarity. There are no published proceedings, and previously published results are welcome. If you’ve been looking to get to know the community, consider either submitting or attending the workshop!

## Hello World!

Welcome to my blog! My name is Gautam Kamath, and I just started as an assistant professor in computer science at the University of Waterloo (for more info, see About).

This blog will, broadly speaking, be about topics relevant to those interested in the the theory of computer science, statistics, and machine learning. Posts will range from technical, to informational, to meta (read: basically whatever I want to write about, but I’ll do my best to keep it topical). Stay tuned!

The unofficial theme song of this blog is “Mathematics” by Mos Def.