When analyzing confidential data through a privacy filter, a data scientist often needs to decide which queries will best support their intended analysis. For example, an analyst may wish to study noisy two-way marginals in a dataset produced by a mechanism M1. But if the data are relatively sparse, the analyst may instead choose to examine noisy one-way marginals produced by a mechanism M2. Since the choice between M1 and M2 is data-dependent, a typical differentially private workflow first splits the privacy loss budget rho into two parts, rho1 and rho2, then uses rho1 to determine which mechanism to use and the remainder rho2 to obtain noisy answers from the chosen mechanism. In a sense, the first step seems wasteful because it takes away part of the privacy loss budget that could have been used to make the query answers more accurate. In this paper, we consider the question of whether the choice between M1 and M2 can be performed without wasting any privacy loss budget. For linear queries, we propose a method for decomposing M1 and M2 into three parts: (1) a mechanism M* that captures their shared information, (2) a mechanism M1' that captures information that is specific to M1, and (3) a mechanism M2' that captures information that is specific to M2. Running M* and M1' together is completely equivalent to running M1 (both in terms of query answer accuracy and total privacy cost rho). Similarly, running M* and M2' together is completely equivalent to running M2. Since M* will be used no matter what, the analyst can use its output to decide whether to subsequently run M1' (thus recreating the analysis supported by M1) or M2' (recreating the analysis supported by M2), without wasting privacy loss budget.
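To make the cost of the budget split concrete, here is a minimal sketch. It assumes, which the abstract does not state, that rho is a zero-concentrated differential privacy (zCDP) budget and that answers are released with the Gaussian mechanism, for which noise of standard deviation sigma on a query with L2 sensitivity Delta costs Delta^2 / (2 sigma^2). The function name and the numbers are illustrative only.

```python
import math

def gaussian_sigma(l2_sensitivity, rho):
    """Noise standard deviation of a Gaussian mechanism costing rho-zCDP."""
    return l2_sensitivity / math.sqrt(2.0 * rho)

# Hypothetical budget: 20% is spent on choosing the mechanism.
rho = 0.5
rho1, rho2 = 0.1, 0.4

sigma_full = gaussian_sigma(1.0, rho)    # noise if the whole budget went to answers
sigma_split = gaussian_sigma(1.0, rho2)  # noise after paying rho1 for the selection step
```

Under these assumptions the split inflates the noise standard deviation by a factor of sqrt(rho / rho2), which is the loss the decomposition into M*, M1' and M2' is meant to avoid.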
translated by Google Translate
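We study private synthetic data generation for query release, where the goal is to construct a sanitized version of a sensitive dataset, subject to differential privacy, that approximately preserves the answers to a large collection of statistical queries. We first present an algorithmic framework that unifies a long line of iterative algorithms in the literature. Under this framework, we propose two new methods. The first method, private entropy projection (PEP), can be viewed as an advanced variant of MWEM that adaptively reuses past query measurements to boost accuracy. Our second method, generative networks with the exponential mechanism (GEM), circumvents the computational bottlenecks in algorithms such as MWEM and PEP by optimizing over generative models parameterized by neural networks, which capture a rich family of distributions while enabling fast gradient-based optimization. We demonstrate that PEP and GEM empirically outperform existing algorithms. Furthermore, we show that GEM nicely incorporates prior information from public data while overcoming the limitations of PMW^Pub, the existing state-of-the-art method that also leverages public data.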
Although query-based systems (QBSes) have become one of the main solutions for sharing data anonymously, building QBSes that robustly protect the privacy of individuals contributing to the dataset is a hard problem. Theoretical solutions relying on differential privacy guarantees are difficult to implement correctly with reasonable accuracy, while ad-hoc solutions might contain unknown vulnerabilities. Evaluating the privacy provided by QBSes must thus be done by evaluating the accuracy of a wide range of privacy attacks. However, existing attacks require time and expertise to develop, need to be manually tailored to the specific systems attacked, and are limited in scope. In this paper, we develop QuerySnout (QS), the first method to automatically discover vulnerabilities in QBSes. QS takes as input a target record and the QBS as a black box, analyzes its behavior on one or more datasets, and outputs a multiset of queries together with a rule for combining their answers in order to reveal the sensitive attribute of the target record. QS uses evolutionary search techniques based on a novel mutation operator to find a multiset of queries likely to lead to an attack, and a machine learning classifier to infer the sensitive attribute from the answers to the selected queries. We showcase the versatility of QS by applying it to two attack scenarios, three real-world datasets, and a variety of protection mechanisms. We show that the attacks found by QS consistently match or outperform, sometimes by a large margin, the best attacks from the literature. We finally show how QS can be extended to QBSes that require a budget, and apply QS to a simple QBS based on the Laplace mechanism. Taken together, our results show how powerful and accurate attacks against QBSes can already be found by an automated system, allowing for highly complex QBSes to be automatically tested "at the pressing of a button".
We study fine-grained error bounds for differentially private algorithms for counting under continual observation. Our main insight is that the matrix mechanism, when using lower-triangular matrices, can be used in the continual observation model. More specifically, we give an explicit factorization for the counting matrix $M_\mathsf{count}$ and upper bound the error explicitly. We also give a fine-grained analysis, specifying the exact constant in the upper bound. Our analysis is based on upper and lower bounds of the completely bounded norm (cb-norm) of $M_\mathsf{count}$. Along the way, we improve the best-known bound on the cb-norm of $M_\mathsf{count}$, which had stood for 28 years and is due to Mathias (SIAM Journal on Matrix Analysis and Applications, 1993), for a large range of the dimension of $M_\mathsf{count}$. Furthermore, we are the first to give concrete error bounds for various problems under continual observation such as binary counting, maintaining a histogram, releasing an approximately cut-preserving synthetic graph, many graph-based statistics, and substring and episode counting. Finally, we note that our result can be used to get a fine-grained error bound for non-interactive local learning and the first lower bounds on the additive error for $(\epsilon,\delta)$-differentially-private counting under continual observation. Subsequent to this work, Henzinger et al. (SODA 2023) showed that our factorization also achieves fine-grained mean-squared error.
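We study the problem of differentially private synthetic data generation for hierarchical datasets in which individual data points are grouped together (e.g., people within households). In particular, to measure the similarity between the synthetic dataset and the underlying private one, we frame our objective under the problem of private query release, generating a synthetic dataset that preserves answers for some collection of queries (i.e., statistics like mean aggregate counts). However, while the application of private synthetic data to the problem of query release has been well studied, such research is restricted to non-hierarchical data domains, raising an initial question: which queries are important when considering data of this form? Moreover, it has not yet been established how one can generate synthetic data at both the group and individual level while capturing such statistics. In light of these challenges, we first formalize the problem of hierarchical query release, in which the goal is to release a collection of statistics for some hierarchical dataset. Specifically, we provide a general set of statistical queries that captures relationships between attributes at both the group and individual level. Subsequently, we introduce private synthetic data algorithms for hierarchical query release and evaluate them on hierarchical datasets derived from the American Community Survey and Allegheny Family Screening Tool data. Finally, we look to the American Community Survey, whose inherent hierarchical structure gives rise to another set of domain-specific queries that we run experiments with.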
The "Propose-Test-Release" (PTR) framework is a classic recipe for designing differentially private (DP) algorithms that are data-adaptive, i.e., those that add less noise when the input dataset is nice. We extend PTR to a more general setting by privately testing data-dependent privacy losses rather than local sensitivity, hence making it applicable beyond the standard noise-adding mechanisms, e.g., to queries with unbounded or undefined sensitivity. We demonstrate the versatility of generalized PTR using private linear regression as a case study. Additionally, we apply our algorithm to solve an open problem from "Private Aggregation of Teacher Ensembles (PATE)" -- privately releasing the entire model with a delicate data-dependent analysis.
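We introduce a new method for releasing answers to statistical queries with differential privacy, based on the Johnson-Lindenstrauss lemma. The key idea is to randomly project the query answers to a lower-dimensional space so that the distance between any two vectors of feasible query answers is preserved up to an additive error. Then we answer the projected queries using a simple noise-adding mechanism, and lift the answers up to the original dimension. Using this method, we give, for the first time, purely differentially private mechanisms with optimal worst-case sample complexity under average error for answering a workload of $k$ queries over a universe of size $N$. As other applications, we give the first purely private efficient mechanisms with optimal sample complexity for computing the covariance of a bounded high-dimensional distribution, and for answering 2-way marginal queries. We also show that, up to the dependence on the error, a variant of our mechanism is nearly optimal for every given query workload.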
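We consider a sequential setting in which a single dataset of individuals is used to perform adaptively chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points, which incur little privacy loss by participating in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for Rényi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.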
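The bookkeeping behind individual accounting can be sketched as follows. This is a simplified illustration rather than the authors' filter: it assumes all losses are measured at one fixed Rényi order, that a per-individual loss estimate for each analysis is supplied by the caller, and that an individual whose budget would be exceeded is simply left out of that analysis. The class and method names are hypothetical.

```python
class IndividualRenyiFilter:
    """Tracks per-individual Renyi privacy loss against a common budget."""

    def __init__(self, n_individuals, budget):
        self.budget = budget
        self.spent = [0.0] * n_individuals

    def admit(self, step_losses):
        """Return indices of individuals who may take part in this analysis.

        step_losses[i] is the (assumed given) loss individual i would incur.
        Admitted individuals are charged; the others are skipped and not charged.
        """
        admitted = []
        for i, loss in enumerate(step_losses):
            if self.spent[i] + loss <= self.budget:
                self.spent[i] += loss
                admitted.append(i)
        return admitted
```

A "typical" point with small per-step losses stays active for many more steps than a worst-case bound would allow, which is the effect the abstract describes.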
The first large-scale deployment of private federated learning uses differentially private counting in the continual release model as a subroutine (Google AI blog titled "Federated Learning with Formal Differential Privacy Guarantees"). In this case, a concrete bound on the error is very relevant to reduce the privacy parameter. The standard mechanism for continual counting is the binary mechanism. We present a novel mechanism and show that its mean squared error is both asymptotically optimal and a factor of 10 smaller than the error of the binary mechanism. We also show that the constants in our analysis are almost tight by giving non-asymptotic lower and upper bounds that differ only in the constants of lower-order terms. Our algorithm is a matrix mechanism for the counting matrix and takes constant time per release. We also use our explicit factorization of the counting matrix to give an upper bound on the excess risk of the private learning algorithm of Denisov et al. (NeurIPS 2022). Our lower bound for any continual counting mechanism is the first tight lower bound on continual counting under approximate differential privacy. It is achieved using a new lower bound on a certain factorization norm, denoted by $\gamma_F(\cdot)$, in terms of the singular values of the matrix. In particular, we show that for any complex matrix $A \in \mathbb{C}^{m \times n}$, \[ \gamma_F(A) \geq \frac{1}{\sqrt{m}}\|A\|_1, \] where $\|\cdot\|_1$ denotes the Schatten-1 norm. We believe this technique will be useful in proving lower bounds for a larger class of linear queries. To illustrate the power of this technique, we show the first lower bound on the mean squared error for answering parity queries.
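A matrix mechanism for the counting matrix can be sketched as below. The abstract does not spell out its factorization, so the choice here is an assumption made for illustration: the lower-triangular all-ones matrix is written as C @ C, where C is the lower-triangular Toeplitz matrix whose entries are the power-series coefficients of $(1-x)^{-1/2}$. The function names and the noise scale `sigma` are hypothetical, calibrating `sigma` to a privacy target is omitted, and this dense version does not attempt constant time per release.

```python
import numpy as np

def sqrt_counting_factor(n):
    """Lower-triangular Toeplitz C with C @ C equal to the n x n counting matrix."""
    coeffs = np.ones(n)
    for k in range(1, n):
        # Coefficients of (1 - x)^(-1/2): c_k = c_{k-1} * (2k - 1) / (2k).
        coeffs[k] = coeffs[k - 1] * (2 * k - 1) / (2 * k)
    C = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1):
            C[i, j] = coeffs[i - j]
    return C

def noisy_prefix_sums(x, sigma, rng):
    """Matrix-mechanism sketch: noise the right factor, then apply the left one."""
    x = np.asarray(x, dtype=float)
    C = sqrt_counting_factor(len(x))
    noise = rng.normal(0.0, sigma, size=len(x))
    return C @ (C @ x + noise)

# Example: noisy running counts of a bit stream (sigma chosen arbitrarily).
counts = noisy_prefix_sums([1, 0, 1, 1, 0], sigma=1.0, rng=np.random.default_rng(0))
```

Because both factors are lower-triangular, the t-th output depends only on the first t inputs and the first t noise values, which is what makes a factorization of this shape usable under continual observation.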
A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and on Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
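We give the first polynomial-time, polynomial-sample, differentially private estimator for an arbitrary Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0, \Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.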
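Performing low-rank matrix completion with sensitive user data calls for privacy-preserving approaches. In this work, we propose a novel noise addition mechanism for preserving differential privacy, in which the noise distribution is inspired by the Huber loss, a well-known loss function in robust statistics. The proposed mechanism is evaluated against existing differential privacy mechanisms while solving the matrix completion problem using the alternating least squares approach. We also propose using the iteratively re-weighted least squares algorithm to complete low-rank matrices, and study the performance of different noise mechanisms on both synthetic and real datasets. We prove that the proposed mechanism achieves $\epsilon$-differential privacy, similar to the Laplace mechanism. Furthermore, empirical results indicate that the Huber mechanism outperforms the Laplace and Gaussian mechanisms in some cases and is comparable otherwise.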
We study the task of training regression models with the guarantee of label differential privacy (DP). Based on a global prior distribution on label values, which could be obtained privately, we derive a label DP randomization mechanism that is optimal under a given regression loss function. We prove that the optimal mechanism takes the form of a "randomized response on bins", and propose an efficient algorithm for finding the optimal bin values. We carry out a thorough experimental evaluation on several datasets demonstrating the efficacy of our algorithm.
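One way to read "randomized response on bins" is sketched below. It is not the paper's optimal mechanism: the bin values are supplied by hand rather than found by the authors' algorithm, mapping a label to its nearest bin is an assumption, and the function name is hypothetical. What the sketch does show is standard k-ary randomized response applied to bin values instead of raw labels.

```python
import math
import random

def rr_on_bins(label, bins, epsilon, rng):
    """Release a bin value for `label` via k-ary randomized response.

    The label is first mapped to its nearest bin value (an assumption); that
    bin is kept with probability e^eps / (e^eps + k - 1), otherwise one of
    the other k - 1 bin values is returned uniformly at random.
    """
    k = len(bins)
    true_idx = min(range(k), key=lambda i: abs(bins[i] - label))
    p_keep = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if k == 1 or rng.random() < p_keep:
        return bins[true_idx]
    others = [b for i, b in enumerate(bins) if i != true_idx]
    return rng.choice(others)

# Example: privatize one regression label with hand-picked bin values.
noisy_label = rr_on_bins(2.2, [0.0, 2.0, 5.0], epsilon=1.0, rng=random.Random(0))
```

For any two labels, each output bin value is returned with probabilities that differ by a factor of at most e^epsilon, which is the label-DP guarantee for this simple variant.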
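There is a disconnect between how researchers and practitioners handle privacy-utility tradeoffs. Researchers primarily operate from a privacy-first perspective, setting strict privacy requirements and minimizing risk subject to these constraints. Practitioners often desire an accuracy-first perspective, possibly satisfied with the greatest privacy they can get subject to obtaining sufficiently small error. Ligett et al. have introduced a "noise reduction" algorithm to address the latter perspective. The authors show that by adding correlated Laplace noise and progressively reducing it on demand, it is possible to produce a sequence of increasingly accurate estimates of a private parameter while only paying a privacy cost for the least noisy iterate released. In this work, we generalize noise reduction to the setting of Gaussian noise, introducing the Brownian mechanism. The Brownian mechanism works by first adding Gaussian noise of high variance, corresponding to the final point of a simulated Brownian motion. Then, at the practitioner's discretion, noise is gradually decreased by tracing back along the Brownian path to an earlier time. Our mechanism is more naturally applicable to the common setting of bounded $\ell_2$-sensitivity, empirically outperforms existing work on common statistical tasks, and provides customizable control of privacy loss over the entire interaction with the practitioner. We complement our Brownian mechanism with ReducedAboveThreshold, a generalization of the classical AboveThreshold algorithm that provides adaptive privacy guarantees. Overall, our results demonstrate that one can meet utility constraints while still maintaining strong levels of privacy.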
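The noise-reduction idea can be sketched for a scalar statistic as follows. This is an illustration under stated assumptions, not the authors' implementation: the release times are fixed in advance, the function name is hypothetical, and no privacy accounting or stopping rule is included. A standard Brownian motion is simulated at the chosen times, and the value is released with noise B(t) starting from the largest time and tracing back to smaller ones.

```python
import numpy as np

def brownian_releases(value, times, rng):
    """Return [(t, value + B(t))] ordered from the largest time to the smallest.

    B is a simulated standard Brownian motion, so the noise in the release
    at time t has variance t and later releases are less noisy.
    """
    ts = np.sort(np.asarray(times, dtype=float))        # increasing times
    steps = np.diff(np.concatenate(([0.0], ts)))        # time increments
    path = np.cumsum(rng.normal(0.0, np.sqrt(steps)))   # B(ts[0]), ..., B(ts[-1])
    return [(t, value + b) for t, b in zip(ts[::-1], path[::-1])]

# Example: three progressively less noisy looks at the same statistic.
releases = brownian_releases(3.0, [4.0, 1.0, 0.25], np.random.default_rng(0))
```

Because every release is read off the same path, the noise values are correlated across releases; a practitioner would stop requesting releases once the estimate is accurate enough.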