We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given $n$ i.i.d. samples from the distribution $\mathcal{N}\left(\theta,I_d\right)$ (with unknown $\theta$), of which a small fraction has been arbitrarily corrupted. Under the promise that $\|\theta\|_0\le s$, we want to correctly distinguish whether $\|\theta\|_2=0$ or $\|\theta\|_2>\gamma$, for some input parameter $\gamma>0$. We show that any algorithm for this task requires $n=\Omega\left(s\log\frac{ed}{s}\right)$ samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, $\|\theta\|_q\le s$ for any $0 < q < 2$. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be $s$-sparse. Here too we assume that an $\epsilon$-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n=\Omega\left(\min(s\log d,{1}/{\gamma^4})\right)$ samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.
translated by 谷歌翻译
We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.
translated by 谷歌翻译
我们研究了在存在$ \ epsilon $ - 对抗异常值的高维稀疏平均值估计的问题。先前的工作为此任务获得了该任务的样本和计算有效算法,用于辅助性Subgaussian分布。在这项工作中,我们开发了第一个有效的算法,用于强大的稀疏平均值估计,而没有对协方差的先验知识。对于$ \ Mathbb r^d $上的分布,带有“认证有限”的$ t $ tum-矩和足够轻的尾巴,我们的算法达到了$ o(\ epsilon^{1-1/t})$带有样品复杂性$的错误(\ epsilon^{1-1/t}) m =(k \ log(d))^{o(t)}/\ epsilon^{2-2/t} $。对于高斯分布的特殊情况,我们的算法达到了$ \ tilde o(\ epsilon)$的接近最佳错误,带有样品复杂性$ m = o(k^4 \ mathrm {polylog}(d)(d))/\ epsilon^^ 2 $。我们的算法遵循基于方形的总和,对算法方法的证明。我们通过统计查询和低度多项式测试的下限来补充上限,提供了证据,表明我们算法实现的样本时间 - 错误权衡在质量上是最好的。
translated by 谷歌翻译
鉴于$ n $ i.i.d.从未知的分发$ P $绘制的样本,何时可以生成更大的$ n + m $ samples,这些标题不能与$ n + m $ i.i.d区别区别。从$ p $绘制的样品?(AXELROD等人2019)将该问题正式化为样本放大问题,并为离散分布和高斯位置模型提供了最佳放大程序。然而,这些程序和相关的下限定制到特定分布类,对样本扩增的一般统计理解仍然很大程度上。在这项工作中,我们通过推出通常适用的放大程序,下限技术和与现有统计概念的联系来放置对公司统计基础的样本放大问题。我们的技术适用于一大类分布,包括指数家庭,并在样本放大和分配学习之间建立严格的联系。
translated by 谷歌翻译
We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.
translated by 谷歌翻译
We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.
translated by 谷歌翻译
我们介绍了一个普遍的框架,用于表征差异隐私保证的统计估算问题的统计效率。我们的框架,我们呼叫高维建议 - 试验释放(HPTR),在三个重要组件上建立:指数机制,强大的统计和提议 - 试验释放机制。将所有这些粘在一起是恢复力的概念,这是强大的统计估计的核心。弹性指导算法的设计,灵敏度分析和试验步骤的成功概率分析。关键识别是,如果我们设计了一种仅通过一维鲁棒统计数据访问数据的指数机制,则可以大大减少所产生的本地灵敏度。使用弹性,我们可以提供紧密的本地敏感界限。这些紧张界限在几个案例中容易转化为近乎最佳的实用程序。我们给出了将HPTR应用于统计估计问题的给定实例的一般配方,并在平均估计,线性回归,协方差估计和主成分分析的规范问题上证明了它。我们介绍了一般的公用事业分析技术,证明了HPTR几乎在文献中研究的若干场景下实现了最佳的样本复杂性。
translated by 谷歌翻译
In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrary tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].
translated by 谷歌翻译
我们建立了最佳的统计查询(SQ)下限,以鲁棒地学习某些离散高维分布的家庭。特别是,我们表明,没有访问$ \ epsilon $ -Cruntupted二进制产品分布的有效SQ算法可以在$ \ ell_2 $ -error $ o(\ epsilon \ sqrt {\ log(\ log(1/\ epsilon))内学习其平均值})$。同样,我们表明,没有访问$ \ epsilon $ - 腐败的铁磁高温岛模型的有效SQ算法可以学习到总变量距离$ O(\ Epsilon \ log(1/\ Epsilon))$。我们的SQ下限符合这些问题已知算法的错误保证,提供证据表明这些任务的当前上限是最好的。在技​​术层面上,我们为离散的高维分布开发了一个通用的SQ下限,从低维矩匹配构建体开始,我们认为这将找到其他应用程序。此外,我们介绍了新的想法,以分析这些矩匹配的结构,以进行离散的单变量分布。
translated by 谷歌翻译
We study the following independence testing problem: given access to samples from a distribution $P$ over $\{0,1\}^n$, decide whether $P$ is a product distribution or whether it is $\varepsilon$-far in total variation distance from any product distribution. For arbitrary distributions, this problem requires $\exp(n)$ samples. We show in this work that if $P$ has a sparse structure, then in fact only linearly many samples are required. Specifically, if $P$ is Markov with respect to a Bayesian network whose underlying DAG has in-degree bounded by $d$, then $\tilde{\Theta}(2^{d/2}\cdot n/\varepsilon^2)$ samples are necessary and sufficient for independence testing.
translated by 谷歌翻译
Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.
translated by 谷歌翻译
我们提出了改进的算法,并为身份测试$ n $维分布的问题提供了统计和计算下限。在身份测试问题中,我们将作为输入作为显式分发$ \ mu $,$ \ varepsilon> 0 $,并访问对隐藏分布$ \ pi $的采样甲骨文。目标是区分两个分布$ \ mu $和$ \ pi $是相同的还是至少$ \ varepsilon $ -far分开。当仅从隐藏分布$ \ pi $中访问完整样本时,众所周知,可能需要许多样本,因此以前的作品已经研究了身份测试,并额外访问了各种有条件采样牙齿。我们在这里考虑一个明显弱的条件采样甲骨文,称为坐标Oracle,并在此新模型中提供了身份测试问题的相当完整的计算和统计表征。我们证明,如果一个称为熵的分析属性为可见分布$ \ mu $保留,那么对于任何使用$ \ tilde {o}(n/\ tilde {o}),有一个有效的身份测试算法Varepsilon)$查询坐标Oracle。熵的近似张力是一种经典的工具,用于证明马尔可夫链的最佳混合时间边界用于高维分布,并且最近通过光谱独立性为许多分布族建立了最佳的混合时间。我们将算法结果与匹配的$ \ omega(n/\ varepsilon)$统计下键进行匹配的算法结果补充,以供坐标Oracle下的查询数量。我们还证明了一个计算相变:对于$ \ {+1,-1,-1 \}^n $以上的稀疏抗抗铁磁性模型,在熵失败的近似张力失败的状态下,除非RP = np,否则没有有效的身份测试算法。
translated by 谷歌翻译
我们开发机器以设计有效的可计算和一致的估计,随着观察人数而达到零的估计误差,因为观察的次数增长,当面对可能损坏的答复,除了样本的所有品,除了每种量之外的ALL。作为具体示例,我们调查了两个问题:稀疏回归和主成分分析(PCA)。对于稀疏回归,我们实现了最佳样本大小的一致性$ n \ gtrsim(k \ log d)/ \ alpha ^ $和最佳错误率$ o(\ sqrt {(k \ log d)/(n \ cdot \ alpha ^ 2))$ N $是观察人数,$ D $是尺寸的数量,$ k $是参数矢量的稀疏性,允许在数量的数量中为逆多项式进行逆多项式样品。在此工作之前,已知估计是一致的,当Inliers $ \ Alpha $ IS $ O(1 / \ log \ log n)$,即使是(非球面)高斯设计矩阵时也是一致的。结果在弱设计假设下持有,并且在这种一般噪声存在下仅被D'Orsi等人最近以密集的设置(即一般线性回归)显示。 [DNS21]。在PCA的上下文中,我们在参数矩阵上的广泛尖端假设下获得最佳错误保证(通常用于矩阵完成)。以前的作品可以仅在假设下获得非琐碎的保证,即与最基于的测量噪声以$ n $(例如,具有方差1 / n ^ 2 $的高斯高斯)。为了设计我们的估算,我们用非平滑的普通方(如$ \ ell_1 $ norm或核规范)装备Huber丢失,并以一种新的方法来分析损失的新方法[DNS21]的方法[DNS21]。功能。我们的机器似乎很容易适用于各种估计问题。
translated by 谷歌翻译
我们重新审视耐受分发测试的问题。也就是说,给出来自未知分发$ P $超过$ \ {1,\ dots,n \} $的样本,它是$ \ varepsilon_1 $ -close到或$ \ varepsilon_2 $ -far从引用分发$ q $(总变化距离)?尽管过去十年来兴趣,但在极端情况下,这个问题很好。在无噪声设置(即,$ \ varepsilon_1 = 0 $)中,样本复杂性是$ \ theta(\ sqrt {n})$,强大的域大小。在频谱的另一端时,当$ \ varepsilon_1 = \ varepsilon_2 / 2 $时,样本复杂性跳转到勉强su​​blinear $ \ theta(n / \ log n)$。然而,非常少于中级制度。我们充分地表征了分发测试中的公差价格,作为$ N $,$ varepsilon_1 $,$ \ varepsilon_2 $,最多一个$ \ log n $ factor。具体来说,我们显示了\ [\ tilde \ theta \ left的样本复杂性(\ frac {\ sqrt {n}} {\ varepsilon_2 ^ {2}} + \ frac {n} {\ log n} \ cdot \ max \左\ {\ frac {\ varepsilon_1} {\ varepsilon_2 ^ 2},\ left(\ frac {\ varepsilon_1} {\ varepsilon_2 ^ 2} \右)^ {\!\!\!2} \ \ \} \右) ,\]提供两个先前已知的案例之间的顺利折衷。我们还为宽容的等价测试问题提供了类似的表征,其中$ p $和$ q $均未赘述。令人惊讶的是,在这两种情况下,对样本复杂性的主数量是比率$ \ varepsilon_1 / varepsilon_2 ^ 2 $,而不是更直观的$ \ varepsilon_1 / \ varepsilon_2 $。特别是技术兴趣是我们的下限框架,这涉及在以往的工作中处理不对称所需的新颖近似性理论工具,从而缺乏以前的作品。
translated by 谷歌翻译
我们为高维分布的身份测试提供了改进的差异私有算法。具体来说,对于带有已知协方差$ \ sigma $的$ d $二维高斯分布,我们可以测试该分布是否来自$ \ Mathcal {n}(\ mu^*,\ sigma)$,对于某些固定$ \ mu^** $或从某个$ \ MATHCAL {n}(\ mu,\ sigma)$,总变化距离至少$ \ alpha $ from $ \ mathcal {n}(\ mu^*,\ sigma)$(\ varepsilon) ,0)$ - 微分隐私,仅使用\ [\ tilde {o} \ left(\ frac {d^{1/2}}} {\ alpha^2} + \ frac {d^{1/3}} {1/3}} { \ alpha^{4/3} \ cdot \ varepsilon^{2/3}}} + \ frac {1} {\ alpha \ cdot \ cdot \ cdot \ varepsilon} \ right)\]唯一\ [\ tilde {o} \ left(\ frac {d^{1/2}}} {\ alpha^2} + \ frac {d^{1/4}} {\ alpha \ alpha \ cdot \ cdot \ cdot \ varepsilon} \ right )\]用于计算有效算法的样品。我们还提供了一个匹配的下限,表明我们的计算效率低下的算法具有最佳的样品复杂性。我们还将算法扩展到各种相关问题,包括对具有有限但未知协方差的高斯人的平均测试,对$ \ { - 1,1,1 \}^d $的产品分布的均匀性测试以及耐受性测试。我们的结果改善了Canonne等人的先前最佳工作。 (\ frac {\ sqrt {d}} {\ alpha^2} \ right)$在许多标准参数设置中。此外,我们的结果表明,令人惊讶的是,可以使用$ d $二维高斯的私人身份测试,可以用少于离散分布的私人身份测试尺寸$ d $ \ cite {actharyasz18}的私人身份测试来完成,以重组猜测〜\ cite {canonnekmuz20}的下限。
translated by 谷歌翻译
对于高维和非参数统计模型,速率最优估计器平衡平方偏差和方差是一种常见的现象。虽然这种平衡被广泛观察到,但很少知道是否存在可以避免偏差和方差之间的权衡的方法。我们提出了一般的策略,以获得对任何估计方差的下限,偏差小于预先限定的界限。这表明偏差差异折衷的程度是不可避免的,并且允许量化不服从其的方法的性能损失。该方法基于许多抽象的下限,用于涉及关于不同概率措施的预期变化以及诸如Kullback-Leibler或Chi-Sque-diversence的信息措施的变化。其中一些不平等依赖于信息矩阵的新概念。在该物品的第二部分中,将抽象的下限应用于几种统计模型,包括高斯白噪声模型,边界估计问题,高斯序列模型和高维线性回归模型。对于这些特定的统计应用,发生不同类型的偏差差异发生,其实力变化很大。对于高斯白噪声模型中集成平方偏置和集成方差之间的权衡,我们将较低界限的一般策略与减少技术相结合。这允许我们将原始问题与估计的估计器中的偏差折衷联动,以更简单的统计模型中具有额外的对称性属性。在高斯序列模型中,发生偏差差异的不同相位转换。虽然偏差和方差之间存在非平凡的相互作用,但是平方偏差的速率和方差不必平衡以实现最小估计速率。
translated by 谷歌翻译
我们重新审视量子状态认证的基本问题:给定混合状态$ \ rho \中的副本\ mathbb {c} ^ {d \ times d} $和混合状态$ \ sigma $的描述,决定是否$ \ sigma = \ rho $或$ \ | \ sigma - \ rho \ | _ {\ mathsf {tr}} \ ge \ epsilon $。当$ \ sigma $最大化时,这是混合性测试,众所周知,$ \ omega(d ^ {\ theta(1)} / \ epsilon ^ 2)$副本是必要的,所以确切的指数取决于测量类型学习者可以使[OW15,BCL20],并且在许多这些设置中,有一个匹配的上限[OW15,Bow19,BCL20]。可以避免这种$ d ^ {\ theta(1)} $依赖于某些类型的混合状态$ \ sigma $,例如。大约低等级的人?更常见地,是否存在一个简单的功能$ f:\ mathbb {c} ^ {d \ times d} \ to \ mathbb {r} _ {\ ge 0} $,其中一个人可以显示$ \ theta(f( \ sigma)/ \ epsilon ^ 2)$副本是必要的,并且足以就任何$ \ sigma $的国家认证?这种实例 - 最佳边界在经典分布测试的背景下是已知的,例如, [VV17]。在这里,我们为量子设置提供了这个性质的第一个界限,显示(达到日志因子),即使用非接受不连贯测量的状态认证的复杂性复杂性基本上是通过复制复杂性进行诸如$ \ sigma $之间的保真度的复杂性。和最大混合的状态。令人惊讶的是,我们的界限与经典问题的实例基本上不同,展示了两个设置之间的定性差异。
translated by 谷歌翻译
我们研究了测试有序域上的离散概率分布是否是指定数量的垃圾箱的直方图。$ k $的简洁近似值的最常见工具之一是$ k $ [n] $,是概率分布,在一组$ k $间隔上是分段常数的。直方图测试问题如下:从$ [n] $上的未知分布中给定样品$ \ mathbf {p} $,我们想区分$ \ mathbf {p} $的情况从任何$ k $ - 组织图中,总变化距离的$ \ varepsilon $ -far。我们的主要结果是针对此测试问题的样本接近最佳和计算有效的算法,以及几乎匹配的(在对数因素内)样品复杂性下限。具体而言,我们表明直方图测试问题具有样品复杂性$ \ widetilde \ theta(\ sqrt {nk} / \ varepsilon + k / \ varepsilon^2 + \ sqrt {n} / \ varepsilon^2)$。
translated by 谷歌翻译
聚类是无监督学习中的基本原始,它引发了丰富的计算挑战性推理任务。在这项工作中,我们专注于将$ D $ -dimential高斯混合的规范任务与未知(和可能的退化)协方差集成。最近的作品(Ghosh等人。恢复在高斯聚类实例中种植的某些隐藏结构。在许多类似的推理任务上的工作开始,这些较低界限强烈建议存在群集的固有统计到计算间隙,即群集任务是\ yringit {statistically}可能但没有\ texit {多项式 - 时间}算法成功。我们考虑的聚类任务的一个特殊情况相当于在否则随机子空间中找到种植的超立体载体的问题。我们表明,也许令人惊讶的是,这种特定的聚类模型\ extent {没有展示}统计到计算间隙,即使在这种情况下继续应用上述的低度和SOS下限。为此,我们提供了一种基于Lenstra - Lenstra - Lovasz晶格基础减少方法的多项式算法,该方法实现了$ D + 1 $样本的统计上最佳的样本复杂性。该结果扩展了猜想统计到计算间隙的问题的类问题可以通过“脆弱”多项式算法“关闭”,突出显示噪声在统计到计算间隙的发作中的关键而微妙作用。
translated by 谷歌翻译
在负面的感知问题中,我们给出了$ n $数据点$({\ boldsymbol x} _i,y_i)$,其中$ {\ boldsymbol x} _i $是$ d $ -densional vector和$ y_i \ in \ { + 1,-1 \} $是二进制标签。数据不是线性可分离的,因此我们满足自己的内容,以找到最大的线性分类器,具有最大的\ emph {否定}余量。换句话说,我们想找到一个单位常规矢量$ {\ boldsymbol \ theta} $,最大化$ \ min_ {i \ le n} y_i \ langle {\ boldsymbol \ theta},{\ boldsymbol x} _i \ rangle $ 。这是一个非凸优化问题(它相当于在Polytope中找到最大标准矢量),我们在两个随机模型下研究其典型属性。我们考虑比例渐近,其中$ n,d \ to \ idty $以$ n / d \ to \ delta $,并在最大边缘$ \ kappa _ {\ text {s}}(\ delta)上证明了上限和下限)$或 - 等效 - 在其逆函数$ \ delta _ {\ text {s}}(\ kappa)$。换句话说,$ \ delta _ {\ text {s}}(\ kappa)$是overparametization阈值:以$ n / d \ le \ delta _ {\ text {s}}(\ kappa) - \ varepsilon $一个分类器实现了消失的训练错误,具有高概率,而以$ n / d \ ge \ delta _ {\ text {s}}(\ kappa)+ \ varepsilon $。我们在$ \ delta _ {\ text {s}}(\ kappa)$匹配,以$ \ kappa \ to - \ idty $匹配。然后,我们分析了线性编程算法来查找解决方案,并表征相应的阈值$ \ delta _ {\ text {lin}}(\ kappa)$。我们观察插值阈值$ \ delta _ {\ text {s}}(\ kappa)$和线性编程阈值$ \ delta _ {\ text {lin {lin}}(\ kappa)$之间的差距,提出了行为的问题其他算法。
translated by 谷歌翻译