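We study the problem of robustly estimating the parameter $p$ of an Erdős-Rényi random graph on $n$ nodes, where a $\gamma$ fraction of the nodes may be adversarially corrupted. After showing the deficiencies of canonical estimators, we design a computationally efficient spectral algorithm that estimates $p$ up to accuracy $\tilde O(\sqrt{p(1-p)}/n + \gamma\sqrt{p(1-p)}/\sqrt{n} + \gamma/n)$ for $\gamma < 1/60$. Furthermore, we give an inefficient algorithm with similar accuracy for all $\gamma < 1/2$, the information-theoretic limit. Finally, we prove a nearly matching statistical lower bound, showing that the error of our algorithms is optimal up to logarithmic factors.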
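To see how the three terms of the accuracy bound above trade off, the following is a minimal sketch that evaluates the expression inside $\tilde O(\cdot)$ with all constants and logarithmic factors dropped. The function name `er_error_bound` is hypothetical and is not taken from the paper; the code only evaluates the stated formula and does not implement the spectral algorithm.

```python
import math


def er_error_bound(n: int, p: float, gamma: float) -> float:
    """Evaluate sqrt(p(1-p))/n + gamma*sqrt(p(1-p))/sqrt(n) + gamma/n.

    Assumption: constants and log factors hidden by the tilde-O are ignored,
    so only the relative size of the terms is meaningful.
    """
    if not (0.0 <= p <= 1.0):
        raise ValueError("p must be a probability")
    if not (0.0 <= gamma < 0.5):
        raise ValueError("gamma must lie in [0, 1/2)")
    s = math.sqrt(p * (1.0 - p))
    sampling_term = s / n                     # error with no corruption
    corruption_term = gamma * s / math.sqrt(n)  # corruption interacting with edge noise
    degree_term = gamma / n                   # residual corruption term
    return sampling_term + corruption_term + degree_term


if __name__ == "__main__":
    for gamma in (0.0, 0.001, 0.01):
        print(gamma, er_error_bound(n=10_000, p=0.5, gamma=gamma))
```

With $\gamma = 0$ only the first term remains, which is the uncorrupted sampling error; as $\gamma$ grows, the middle term dominates once $\gamma \gg 1/\sqrt{n}$.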
translated by 谷歌翻译
We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
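We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample-efficient and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb{R}^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error $O(\epsilon^{1-1/t})$ with sample complexity $m = (k\log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based proofs-to-algorithms approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.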
We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.
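In this work, we study the problem of community detection in the stochastic block model with adversarial node corruptions. Our main result is an efficient algorithm that can tolerate an $\epsilon$-fraction of corruptions and achieves error $O(\epsilon) + e^{-\frac{C}{2}(1 \pm o(1))}$, where $C = (\sqrt{a} - \sqrt{b})^2$ is the signal-to-noise ratio and $a/n$ and $b/n$ are the inter-community and intra-community connection probabilities respectively. These bounds essentially match the minimax rates for the SBM without corruptions. We also give robust algorithms for $\mathbb{Z}_2$-synchronization. At the heart of our algorithm is a new semidefinite program that uses global information to robustly boost the accuracy of a rough clustering. Moreover, we show that our algorithms are doubly robust in the sense that they work in an even more challenging noise model that mixes adversarial corruptions with unbounded monotone changes, from the semi-random model.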
In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches the best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
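We give the first polynomial-time algorithm to estimate the mean of a $d$-variate probability distribution with bounded covariance from $\tilde{O}(d)$ independent samples subject to pure differential privacy. Prior algorithms for this problem either incur exponential running time, require $\Omega(d^{1.5})$ samples, or satisfy only the weaker concentrated or approximate differential privacy conditions. In particular, all prior polynomial-time algorithms require $d^{1+\Omega(1)}$ samples to guarantee small privacy loss with "cryptographically" high probability, $1-2^{-d^{\Omega(1)}}$, while our algorithm retains $\tilde{O}(d)$ sample complexity even in this stringent setting. Our main technique is a new approach to using the powerful Sum of Squares method (SoS) to design differentially private algorithms. Proofs-to-algorithms is a key theme in numerous recent works in high-dimensional algorithmic statistics: estimators that apparently require exponential running time, but whose analysis can be captured by low-degree Sum of Squares proofs, can be automatically turned into polynomial-time algorithms with the same provable guarantees. We demonstrate a similar proofs-to-private-algorithms phenomenon: instances of the workhorse exponential mechanism that apparently require exponential time, but that can be analyzed with low-degree SoS proofs, can be automatically turned into polynomial-time differentially private algorithms. We prove a meta-theorem capturing this phenomenon, which we expect to be of broad use in private algorithm design. Our techniques also draw new connections between differentially private and robust statistics in high dimensions. In particular, viewed through our proofs-to-private-algorithms lens, several well-studied SoS proofs from recent works in algorithmic robust statistics directly yield key components of our differentially private mean estimation algorithm.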
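We give the first polynomial-time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n \geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha) = (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu},\hat{\Sigma})$ such that the total variation distance $TV(\mathcal{N}(\mu_*,\Sigma_*),\mathcal{N}(\hat{\mu},\hat{\Sigma})) < 1-O_{\alpha}(1)$. This is the statistically strongest notion of distance and implies multiplicative spectral and relative Frobenius distance approximation for the parameters with dimension-independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree-2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery, due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020), and Bakshi and Kothari (2020). These results need super-polynomial time to obtain any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery, which in particular allows obtaining $2^{-\mathsf{poly}(d)}$ error in polynomial time. Our result also implies an improved algorithm for clustering non-spherical mixtures.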
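In this work, we study the problem of robustly learning a Mallows model. We give an algorithm that can accurately estimate the central ranking even when a constant fraction of its samples are arbitrarily corrupted. Moreover, our robustness guarantees are dimension-independent in the sense that our overall accuracy does not depend on the number of alternatives being ranked. Our work can be thought of as a natural infusion of perspectives from algorithmic robust statistics into one of the central inference problems in voting and information aggregation. Specifically, our voting rule is efficiently computable, and its outcome cannot be changed by much by a large group of colluding voters.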
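We present improved algorithms and matching statistical and computational lower bounds for the problem of identity testing $n$-dimensional distributions. In the identity testing problem, we are given as input an explicit distribution $\mu$, an $\varepsilon > 0$, and access to a sampling oracle for a hidden distribution $\pi$. The goal is to distinguish whether the two distributions $\mu$ and $\pi$ are identical or are at least $\varepsilon$-far apart. When there is only access to full samples from the hidden distribution $\pi$, it is known that many samples may be needed, and hence previous works have studied identity testing with additional access to various conditional sampling oracles. We consider here a significantly weaker conditional sampling oracle, called the Coordinate Oracle, and provide a fairly complete computational and statistical characterization of the identity testing problem in this new model. We prove that if an analytic property known as approximate tensorization of entropy holds for the visible distribution $\mu$, then there is an efficient identity testing algorithm for any hidden $\pi$ that uses $\tilde{O}(n/\varepsilon)$ queries to the Coordinate Oracle. Approximate tensorization of entropy is a classical tool for proving optimal mixing time bounds of Markov chains for high-dimensional distributions, and it has recently been established for many families of distributions via spectral independence. We complement our algorithmic result for identity testing with a matching $\Omega(n/\varepsilon)$ statistical lower bound on the number of queries under the Coordinate Oracle. We also prove a computational phase transition: for sparse antiferromagnetic Ising models over $\{+1,-1\}^n$, in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless RP = NP.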
Robust mean estimation is one of the most important problems in statistics: given a set of samples in $\mathbb{R}^d$ where an $\alpha$ fraction are drawn from some distribution $D$ and the rest are adversarially corrupted, we aim to estimate the mean of $D$. A surge of recent research interest has been focusing on the list-decodable setting where $\alpha \in (0, \frac12]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean. In this paper, we consider that the underlying distribution $D$ is Gaussian with $k$-sparse mean. Our main contribution is the first polynomial-time algorithm that enjoys sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e. poly-logarithmic in the dimension. One of our core algorithmic ingredients is using low-degree sparse polynomials to filter outliers, which may find more applications.
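We develop an efficient algorithm for weak recovery in a robust version of the stochastic block model. The algorithm matches the statistical guarantees of the best known algorithms for the vanilla version of the stochastic block model. In this sense, our results show that there is no price of robustness in the stochastic block model. Our work is heavily inspired by recent work of Banks, Mohanty, and Raghavendra (SODA 2021) that provided an efficient algorithm for the corresponding distinguishing problem. Our algorithm and its analysis depart significantly from previous ones for robust recovery. A key challenge is the peculiar optimization landscape underlying our algorithm: the planted partition may be far from optimal, in the sense that completely unrelated solutions could achieve the same objective value. This phenomenon is related to the push-out effect at the BBP phase transition for PCA. To the best of our knowledge, our algorithm is the first to achieve robust recovery in the presence of such a push-out effect in a non-asymptotic setting. Our algorithm is an instantiation of a framework based on convex optimization (related to but distinct from sum-of-squares), which may be useful for other robust matrix estimation problems. A by-product of our analysis is a general technique that boosts the probability of success (over the randomness of the input) of an arbitrary robust weak-recovery algorithm from constant (or slowly vanishing) probability to exponentially high probability.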
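We study a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution $\mathcal{D}$. The definitions of these adversaries specify the type of allowable corruptions (noise model) as well as when these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries, which can only corrupt the distribution $\mathcal{D}$, and adaptive adversaries, whose corruptions can depend on the specific sample $S$ that is drawn from $\mathcal{D}$. In this work, we investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries across all noise models studied in the literature. Specifically, can the behavior of an algorithm $\mathcal{A}$ in the presence of oblivious adversaries always be well approximated by that of an algorithm $\mathcal{A}'$ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of statistical query algorithms, under all reasonable noise models. We then show that in the specific case of additive noise, this equivalence holds for all algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.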
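We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$), the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde\Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max\left\{\frac{\varepsilon_1}{\varepsilon_2^2}, \left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.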
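As a rough illustration of how the sample complexity above interpolates between the two extreme cases, here is a minimal sketch that evaluates the expression inside $\tilde\Theta(\cdot)$, with hidden constants and the extra logarithmic factor ignored. The function name `tolerant_testing_samples` is hypothetical; the code evaluates the stated formula only and is not a tester.

```python
import math


def tolerant_testing_samples(n: int, eps1: float, eps2: float) -> float:
    """Evaluate sqrt(n)/eps2^2 + (n/log n) * max(r, r^2), with r = eps1/eps2^2.

    Assumption: constants and the tilde-Theta log factor are dropped, so the
    output is only meaningful up to those factors.
    """
    if n < 2:
        raise ValueError("n must be at least 2 so that log n > 0")
    if not (0.0 <= eps1 < eps2 <= 1.0):
        raise ValueError("need 0 <= eps1 < eps2 <= 1")
    ratio = eps1 / eps2**2  # the quantity the abstract identifies as dictating the cost
    noiseless_term = math.sqrt(n) / eps2**2
    tolerance_term = (n / math.log(n)) * max(ratio, ratio**2)
    return noiseless_term + tolerance_term


if __name__ == "__main__":
    n, eps2 = 1_000_000, 0.5
    for eps1 in (0.0, 0.001, 0.01, 0.1, 0.25):
        print(eps1, round(tolerant_testing_samples(n, eps1, eps2)))
```

Setting `eps1 = 0` leaves only the $\sqrt{n}/\varepsilon_2^2$ term, and for constant `eps2` with `eps1 = eps2 / 2` the second term scales as $n/\log n$, which is consistent with the two previously known regimes.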
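We introduce a universal framework for characterizing the statistical efficiency of a statistical estimation problem with differential privacy guarantees. Our framework, which we call High-dimensional Propose-Test-Release (HPTR), builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism. Gluing all these together is the concept of resilience, which is central to robust statistical estimation. Resilience guides the design of the algorithm, the sensitivity analysis, and the success probability analysis of the test step in Propose-Test-Release. The key insight is that if we design an exponential mechanism that accesses the data only via one-dimensional robust statistics, then the resulting local sensitivity can be dramatically reduced. Using resilience, we can provide tight local sensitivity bounds. These tight bounds readily translate into near-optimal utility guarantees in several cases. We give a general recipe for applying HPTR to a given instance of a statistical estimation problem and demonstrate it on the canonical problems of mean estimation, linear regression, covariance estimation, and principal component analysis. We introduce a general utility analysis technique that proves that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.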
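We establish optimal Statistical Query (SQ) lower bounds for robustly learning certain families of discrete high-dimensional distributions. In particular, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted binary product distribution can learn its mean within $\ell_2$-error $o(\epsilon\sqrt{\log(1/\epsilon)})$. Similarly, we show that no efficient SQ algorithm with access to an $\epsilon$-corrupted ferromagnetic high-temperature Ising model can learn the model to total variation distance $o(\epsilon\log(1/\epsilon))$. Our SQ lower bounds match the error guarantees of known algorithms for these problems, providing evidence that the current upper bounds for these tasks are the best possible. At the technical level, we develop a generic SQ lower bound for discrete high-dimensional distributions starting from low-dimensional moment-matching constructions, which we believe will find other applications. Additionally, we introduce new ideas to analyze these moment-matching constructions for discrete univariate distributions.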
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency: given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
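In statistical learning and analysis from shared data, which is increasingly widely adopted in platforms such as federated learning and meta-learning, there are two major concerns: privacy and robustness. Each participating individual should be able to contribute without fear of leaking their sensitive information. At the same time, the system should be robust in the presence of malicious participants inserting corrupted data. Recent algorithmic advances in learning from shared data focus on only one of these threats, leaving the system vulnerable to the other. We bridge this gap for the canonical problem of estimating the mean from i.i.d. samples. We introduce PRIME, the first efficient algorithm that achieves both privacy and robustness for a wide range of distributions. We further complement this result with a novel exponential-time algorithm that improves the sample complexity of PRIME, achieving a near-optimal guarantee and matching a known lower bound for (non-robust) private mean estimation. This proves that there is no extra statistical cost to simultaneously guaranteeing privacy and robustness.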
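We investigate the problem of testing whether a discrete probability distribution over an ordered domain is a histogram on a specified number of bins. One of the most common tools for the succinct approximation of data, $k$-histograms over $[n]$ are probability distributions that are piecewise constant over a set of $k$ intervals. The histogram testing problem is the following: given samples from an unknown distribution $\mathbf{p}$ on $[n]$, we want to distinguish between the cases that $\mathbf{p}$ is a $k$-histogram versus $\varepsilon$-far from any $k$-histogram, in total variation distance. Our main result is a sample near-optimal and computationally efficient algorithm for this testing problem, and a nearly matching (within logarithmic factors) sample complexity lower bound. Specifically, we show that the histogram testing problem has sample complexity $\widetilde\Theta(\sqrt{nk}/\varepsilon + k/\varepsilon^2 + \sqrt{n}/\varepsilon^2)$.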
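To make the definition of a $k$-histogram concrete, the sketch below checks whether an explicitly given probability vector over $[n]$ is piecewise constant on at most $k$ contiguous intervals. This is not the sample-based tester from the paper, which only sees samples from $\mathbf{p}$; the helper names `num_constant_pieces` and `is_k_histogram`, the tolerance parameter, and the "at most $k$" reading of the definition are all assumptions made for illustration.

```python
from typing import Sequence


def num_constant_pieces(p: Sequence[float], tol: float = 1e-12) -> int:
    """Count the maximal runs of (numerically) equal consecutive entries of p."""
    if len(p) == 0:
        return 0
    pieces = 1
    for prev, cur in zip(p, p[1:]):
        if abs(prev - cur) > tol:  # a change of value starts a new interval
            pieces += 1
    return pieces


def is_k_histogram(p: Sequence[float], k: int, tol: float = 1e-12) -> bool:
    """Return True if p is piecewise constant on at most k intervals.

    Assumption: p is given explicitly as a probability vector; "at most k"
    is used because an interval can always be split without changing p.
    """
    return num_constant_pieces(p, tol) <= k


if __name__ == "__main__":
    p = [0.1, 0.1, 0.2, 0.2, 0.2, 0.2]  # two flat pieces, sums to 1
    print(num_constant_pieces(p), is_k_histogram(p, 2), is_k_histogram(p, 1))
```

The testing problem in the abstract is harder than this check because the vector $\mathbf{p}$ is unknown, so the number of pieces has to be inferred from samples up to distance $\varepsilon$.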