The problem of learning threshold functions is a fundamental one in machine learning. Classical learning theory implies sample complexity of $O(\xi^{-1} \log(1/\beta))$ (for generalization error $\xi$ with confidence $1-\beta$). The private version of the problem, however, is more challenging and in particular, the sample complexity must depend on the size $|X|$ of the domain. Progress on quantifying this dependence, via lower and upper bounds, was made in a line of works over the past decade. In this paper, we finally close the gap for approximate-DP and provide a nearly tight upper bound of $\tilde{O}(\log^* |X|)$, which matches a lower bound by Alon et al (that applies even with improper learning) and improves over a prior upper bound of $\tilde{O}((\log^* |X|)^{1.5})$ by Kaplan et al. We also provide matching upper and lower bounds of $\tilde{\Theta}(2^{\log^*|X|})$ for the additive error of private quasi-concave optimization (a related and more general problem). Our improvement is achieved via the novel Reorder-Slice-Compute paradigm for private data analysis which we believe will have further applications.
translated by 谷歌翻译
In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrary tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
translated by 谷歌翻译
我们给出了第一个多项式算法来估计$ d $ -variate概率分布的平均值,从$ \ tilde {o}(d)$独立的样本受到纯粹的差异隐私的界限。此问题的现有算法无论是呈指数运行时间,需要$ \ OMEGA(D ^ {1.5})$样本,或仅满足较弱的集中或近似差分隐私条件。特别地,所有先前的多项式算法都需要$ d ^ {1+ \ omega(1)} $ samples,以保证“加密”高概率,1-2 ^ { - d ^ {\ omega(1) $,虽然我们的算法保留$ \ tilde {o}(d)$ SAMPS复杂性即使在此严格设置中也是如此。我们的主要技术是使用强大的方块方法(SOS)来设计差异私有算法的新方法。算法的证据是在高维算法统计数据中的许多近期作品中的一个关键主题 - 显然需要指数运行时间,但可以通过低度方块证明可以捕获其分析可以自动变成多项式 - 时间算法具有相同的可证明担保。我们展示了私有算法的类似证据现象:工作型指数机制的实例显然需要指数时间,但可以用低度SOS样张分析的指数时间,可以自动转换为多项式差异私有算法。我们证明了捕获这种现象的元定理,我们希望在私人算法设计中广泛使用。我们的技术还在高维度之间绘制了差异私有和强大统计数据之间的新连接。特别是通过我们的校验算法镜头来看,几次研究的SOS证明在近期作品中的算法稳健统计中直接产生了我们差异私有平均估计算法的关键组成部分。
translated by 谷歌翻译
Differentially private algorithms for common metric aggregation tasks, such as clustering or averaging, often have limited practicality due to their complexity or to the large number of data points that is required for accurate results. We propose a simple and practical tool, $\mathsf{FriendlyCore}$, that takes a set of points ${\cal D}$ from an unrestricted (pseudo) metric space as input. When ${\cal D}$ has effective diameter $r$, $\mathsf{FriendlyCore}$ returns a "stable" subset ${\cal C} \subseteq {\cal D}$ that includes all points, except possibly few outliers, and is {\em certified} to have diameter $r$. $\mathsf{FriendlyCore}$ can be used to preprocess the input before privately aggregating it, potentially simplifying the aggregation or boosting its accuracy. Surprisingly, $\mathsf{FriendlyCore}$ is light-weight with no dependence on the dimension. We empirically demonstrate its advantages in boosting the accuracy of mean estimation and clustering tasks such as $k$-means and $k$-GMM, outperforming tailored methods.
translated by 谷歌翻译
We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.
translated by 谷歌翻译
Learning problems form an important category of computational tasks that generalizes many of the computations researchers apply to large real-life data sets. We ask: what concept classes can be learned privately, namely, by an algorithm whose output does not depend too heavily on any one input or specific training example? More precisely, we investigate learning algorithms that satisfy differential privacy, a notion that provides strong confidentiality guarantees in contexts where aggregate information is released about a database containing sensitive information about individuals.Our goal is a broad understanding of the resources required for private learning in terms of samples, computation time, and interaction. We demonstrate that, ignoring computational constraints, it is possible to privately agnostically learn any concept class using a sample size approximately logarithmic in the cardinality of the concept class. Therefore, almost anything learnable is learnable privately: specifically, if a concept class is learnable by a (non-private) algorithm with polynomial sample complexity and output size, then it can be learned privately using a polynomial number of samples. We also present a computationally efficient private PAC learner for the class of parity functions. This result dispels the similarity between learning with noise and private learning (both must be robust to small changes in inputs), since parity is thought to be very hard to learn given random classification noise.Local (or randomized response) algorithms are a practical class of private algorithms that have received extensive investigation. We provide a precise characterization of local private learning algorithms. We show that a concept class is learnable by a local algorithm if and only if it is learnable in the statistical query (SQ) model. Therefore, for local private learning algorithms, the similarity to learning with noise is stronger: local learning is equivalent to SQ learning, and SQ algorithms include most known noise-tolerant learning algorithms. Finally, we present a separation between the power of interactive and noninteractive local learning algorithms. Because of the equivalence to SQ learning, this result also separates adaptive and nonadaptive SQ learning.
translated by 谷歌翻译
We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.
translated by 谷歌翻译
我们启动差异私有(DP)估计的研究,并访问少量公共数据。为了对D维高斯人进行私人估计,我们假设公共数据来自高斯人,该高斯与私人数据的基础高斯人的总变化距离可能消失了。我们表明,在纯或集中DP的约束下,D+1个公共数据样本足以从私人样本复杂性中删除对私人数据分布的范围参数的任何依赖性,而在没有公共数据的情况下,这是必不可少的。对于分离的高斯混合物,我们假设基本的公共和私人分布是相同的,我们考虑两个设置:(1)当给出独立于维度的公共数据时,可以根据多种方式改善私人样本复杂性混合组件的数量以及对分布范围参数的任何依赖性都可以在近似DP情况下去除; (2)当在维度上给出了一定数量的公共数据线性时,即使在集中的DP下,也可以独立于范围参数使私有样本复杂性使得可以对整体样本复杂性进行其他改进。
translated by 谷歌翻译
差异隐私通常使用比理论更大的隐私参数应用于理想的理想。已经提出了宽大隐私参数的各种非正式理由。在这项工作中,我们考虑了部分差异隐私(DP),该隐私允许以每个属性为基础量化隐私保证。在此框架中,我们研究了几个基本数据分析和学习任务,并设计了其每个属性隐私参数的算法,其较小的人(即所有属性)的最佳隐私参数比最佳的隐私参数。
translated by 谷歌翻译
聚类是数据分析中的一个根本问题。在差别私有聚类中,目标是识别$ k $群集中心,而不披露各个数据点的信息。尽管研究进展显着,但问题抵制了实际解决方案。在这项工作中,我们的目的是提供简单的可实现的差异私有聚类算法,当数据“简单”时,提供实用程序,例如,当簇之间存在显着的分离时。我们提出了一个框架,允许我们将非私有聚类算法应用于简单的实例,并私下结合结果。在高斯混合的某些情况下,我们能够改善样本复杂性界限,并获得$ k $ -means。我们与合成数据的实证评估补充了我们的理论分析。
translated by 谷歌翻译
给定真实的假设类$ \ mathcal {h} $,我们在什么条件下调查有一个差异的私有算法,它从$ \ mathcal {h} $给出的最佳假设.I.i.d.数据。灵感来自最近的成果的二进制分类的相关环境(Alon等,2019; Bun等,2020),其中显示了二进制类的在线学习是必要的,并且足以追随其私人学习,Jung等人。 (2020)显示,在回归的设置中,$ \ mathcal {h} $的在线学习是私人可读性所必需的。这里的在线学习$ \ mathcal {h} $的特点是其$ \ eta $-sequentient胖胖子的优势,$ {\ rm sfat} _ \ eta(\ mathcal {h})$,适用于所有$ \ eta> 0 $。就足够的私人学习条件而言,Jung等人。 (2020)显示$ \ mathcal {h} $私下学习,如果$ \ lim _ {\ eta \ downarrow 0} {\ rm sfat} _ \ eta(\ mathcal {h})$是有限的,这是一个相当限制的健康)状况。我们展示了在轻松的条件下,\ LIM \ INF _ {\ eta \ downarrow 0} \ eta \ cdot {\ rm sfat} _ \ eta(\ mathcal {h})= 0 $,$ \ mathcal {h} $私人学习,为\ \ rm sfat} _ \ eta(\ mathcal {h})$ \ eta \ dockarrow 0 $ divering建立第一个非参数私人学习保证。我们的技术涉及一种新颖的过滤过程,以输出非参数函数类的稳定假设。
translated by 谷歌翻译
在共享数据的统计学习和分析中,在联合学习和元学习等平台上越来越广泛地采用,有两个主要问题:隐私和鲁棒性。每个参与的个人都应该能够贡献,而不会担心泄露一个人的敏感信息。与此同时,系统应该在恶意参与者的存在中插入损坏的数据。最近的算法在学习中,学习共享数据专注于这些威胁中的一个,使系统容易受到另一个威胁。我们弥合了这个差距,以获得估计意思的规范问题。样品。我们介绍了素数,这是第一算法,实现了各种分布的隐私和鲁棒性。我们通过新颖的指数时间算法进一步补充了这一结果,提高了素数的样本复杂性,实现了近最优保证并匹配(非鲁棒)私有平均估计的已知下限。这证明没有额外的统计成本同时保证隐私和稳健性。
translated by 谷歌翻译
我们研究动态算法,以便在$ N $插入和删除流中最大化单调子模块功能的问题。我们显示任何维护$(0.5+ epsilon)$ - 在基数约束下的近似解决方案的算法,对于任何常数$ \ epsilon> 0 $,必须具有$ \ mathit {polynomial} $的摊销查询复杂性$ n $。此外,需要线性摊销查询复杂性,以维持0.584美元 - 批量的解决方案。这与近期[LMNF + 20,MON20]的最近动态算法相比,达到$(0.5- \ epsilon)$ - 近似值,与$ \ mathsf {poly} \ log(n)$摊销查询复杂性。在正面,当流是仅插入的时候,我们在基数约束下的问题和近似的Matroid约束下提供有效的算法,近似保证$ 1-1 / e-\ epsilon $和摊销查询复杂性$ \ smash {o (\ log(k / \ epsilon)/ \ epsilon ^ 2)} $和$ \ smash {k ^ {\ tilde {o}(1 / \ epsilon ^ 2)} \ log n} $,其中$ k $表示基数参数或Matroid的等级。
translated by 谷歌翻译
经典流算法在(并非总是合理的)假设下运行的,即输入流已预先固定。最近,对于设计可靠的流算法,即使在执行过程中自适应地选择输入流也可以提供可证明的保证,越来越有兴趣。我们提出了一个新的框架,用于强大的流媒体,该框架结合了Hassidim等人最近建议的两个框架的技术。[神经2020]以及伍德拉夫和周[焦点2021]。这些最近建议的框架依赖于非常不同的想法,每个想法都具有自己的优势和劣势。我们将这两个框架组合到一个单一的混合框架中,该框架获得了``两全其美的'',从而解决了Woodruff和Zhou留下的问题。
translated by 谷歌翻译
我们提出了改进的算法,并为身份测试$ n $维分布的问题提供了统计和计算下限。在身份测试问题中,我们将作为输入作为显式分发$ \ mu $,$ \ varepsilon> 0 $,并访问对隐藏分布$ \ pi $的采样甲骨文。目标是区分两个分布$ \ mu $和$ \ pi $是相同的还是至少$ \ varepsilon $ -far分开。当仅从隐藏分布$ \ pi $中访问完整样本时,众所周知,可能需要许多样本,因此以前的作品已经研究了身份测试,并额外访问了各种有条件采样牙齿。我们在这里考虑一个明显弱的条件采样甲骨文,称为坐标Oracle,并在此新模型中提供了身份测试问题的相当完整的计算和统计表征。我们证明,如果一个称为熵的分析属性为可见分布$ \ mu $保留,那么对于任何使用$ \ tilde {o}(n/\ tilde {o}),有一个有效的身份测试算法Varepsilon)$查询坐标Oracle。熵的近似张力是一种经典的工具,用于证明马尔可夫链的最佳混合时间边界用于高维分布,并且最近通过光谱独立性为许多分布族建立了最佳的混合时间。我们将算法结果与匹配的$ \ omega(n/\ varepsilon)$统计下键进行匹配的算法结果补充,以供坐标Oracle下的查询数量。我们还证明了一个计算相变:对于$ \ {+1,-1,-1 \}^n $以上的稀疏抗抗铁磁性模型,在熵失败的近似张力失败的状态下,除非RP = np,否则没有有效的身份测试算法。
translated by 谷歌翻译
我们研究了用于线性回归的主动采样算法,该算法仅旨在查询目标向量$ b \ in \ mathbb {r} ^ n $的少量条目,并将近最低限度输出到$ \ min_ {x \ In \ mathbb {r} ^ d} \ | ax-b \ | $,其中$ a \ in \ mathbb {r} ^ {n \ times d} $是一个设计矩阵和$ \ | \ cdot \ | $是一些损失函数。对于$ \ ell_p $ norm回归的任何$ 0 <p <\ idty $,我们提供了一种基于Lewis权重采样的算法,其使用只需$ \ tilde {o}输出$(1+ \ epsilon)$近似解决方案(d ^ {\ max(1,{p / 2})} / \ mathrm {poly}(\ epsilon))$查询到$ b $。我们表明,这一依赖于$ D $是最佳的,直到对数因素。我们的结果解决了陈和Derezi的最近开放问题,陈和Derezi \'{n} Ski,他们为$ \ ell_1 $ norm提供了附近的最佳界限,以及$ p \中的$ \ ell_p $回归的次优界限(1,2) $。我们还提供了$ O的第一个总灵敏度上限(D ^ {\ max \ {1,p / 2 \} \ log ^ 2 n)$以满足最多的$ p $多项式增长。这改善了Tukan,Maalouf和Feldman的最新结果。通过将此与我们的技术组合起来的$ \ ell_p $回归结果,我们获得了一个使$ \ tilde o的活动回归算法(d ^ {1+ \ max \ {1,p / 2 \}} / \ mathrm {poly}。 (\ epsilon))$疑问,回答陈和德里兹的另一个打开问题{n}滑雪。对于Huber损失的重要特殊情况,我们进一步改善了我们对$ \ tilde o的主动样本复杂性的绑定(d ^ {(1+ \ sqrt2)/ 2} / \ epsilon ^ c)$和非活跃$ \ tilde o的样本复杂性(d ^ {4-2 \ sqrt 2} / \ epsilon ^ c)$,由于克拉克森和伍德拉夫而改善了Huber回归的以前的D ^ 4 $。我们的敏感性界限具有进一步的影响,使用灵敏度采样改善了各种先前的结果,包括orlicz规范子空间嵌入和鲁棒子空间近似。最后,我们的主动采样结果为每种$ \ ell_p $ norm提供的第一个Sublinear时间算法。
translated by 谷歌翻译
可实现和不可知性的可读性的等价性是学习理论的基本现象。与PAC学习和回归等古典设置范围的变种,近期趋势,如对冲强劲和私人学习,我们仍然缺乏统一理论;等同性的传统证据往往是不同的,并且依赖于强大的模型特异性假设,如统一的收敛和样本压缩。在这项工作中,我们给出了第一个独立的框架,解释了可实现和不可知性的可读性的等价性:三行黑箱减少简化,统一,并在各种各样的环境中扩展了我们的理解。这包括没有已知的学报的模型,例如学习任意分布假设或一般损失,以及许多其他流行的设置,例如强大的学习,部分学习,公平学习和统计查询模型。更一般地,我们认为可实现和不可知的学习的等价性实际上是我们调用属性概括的更广泛现象的特殊情况:可以满足有限的学习算法(例如\噪声公差,隐私,稳定性)的任何理想性质假设类(可能在某些变化中)延伸到任何学习的假设类。
translated by 谷歌翻译
使用差异隐私(DP)学习的大多数工作都集中在每个用户具有单个样本的设置上。在这项工作中,我们考虑每个用户持有M $ Samples的设置,并且在每个用户数据的级别强制执行隐私保护。我们展示了,在这个设置中,我们可以学习少数用户。具体而言,我们表明,只要每个用户收到足够多的样本,我们就可以通过$(\ epsilon,\ delta)$ - dp算法使用$ o(\ log(1 / \ delta)来学习任何私人学习的课程/ \ epsilon)$用户。对于$ \ epsilon $ -dp算法,我们展示我们即使在本地模型中也可以使用$ o _ {\ epsilon}(d)$用户学习,其中$ d $是概率表示维度。在这两种情况下,我们在所需用户数量上显示了几乎匹配的下限。我们的结果的一个关键组成部分是全局稳定性的概括[Bun等,Focs 2020]允许使用公共随机性。在这种轻松的概念下,我们采用相关的采样策略来表明全局稳定性可以在样品数量的多项式牺牲中被提升以任意接近一个。
translated by 谷歌翻译
我们给出了第一个多项式时间和样本$(\ epsilon,\ delta)$ - 差异私有(DP)算法,以估计存在恒定的对抗性异常分数的平均值,协方差和更高的时刻。我们的算法成功用于分布的分布系列,以便在经济估计上满足两个学习的良好性质:定向时刻的可证明的子销售,以及2度多项式的可证式超分子。我们的恢复保证持有“右仿射效率规范”:Mahalanobis距离的平均值,乘法谱和相对Frobenius距离保证,适用于更高时刻的协方差和注射规范。先前的作品获得了私有稳健算法,用于界限协方差的子静脉分布的平均估计。对于协方差估算,我们的是第一算法(即使在没有异常值的情况下也是在没有任何条件号的假设的情况下成功的。我们的算法从一个新的框架出现,该框架提供了一种用于修改凸面放宽的一般蓝图,以便在算法在其运行中产生正确的正确性的证人,以满足适当的参数规范中的强烈最坏情况稳定性。我们验证了用于修改标准的平方(SOS)SEMIDEFINITE编程放松的担保,以实现鲁棒估算。我们的隐私保障是通过将稳定性保证与新的“估计依赖性”噪声注入机制相结合来获得,其中噪声比例与估计的协方差的特征值。我们认为,此框架更加有用,以获得强大的估算器的DP对应者。独立于我们的工作,Ashtiani和Liaw [Al21]还获得了高斯分布的多项式时间和样本私有鲁棒估计算法。
translated by 谷歌翻译
我们给出了第一个多项式 - 时间,多项式 - 样本,差异私人估算器,用于任意高斯分发$ \ mathcal {n}(\ mu,\ sigma)$ in $ \ mathbb {r} ^ d $。所有以前的估算器都是非变性的,具有无限的运行时间,或者要求用户在参数$ \ mu $和$ \ sigma $上指定先验的绑定。我们算法中的主要新技术工具是一个新的差别私有预处理器,它从任意高斯$ \ mathcal {n}(0,\ sigma)$中采用样本,并返回矩阵$ a $,使得$ a \ sigma a ^ t$具有恒定的条件号。
translated by 谷歌翻译