Smart Paper Notes

Support Recovery in Mixture Models with Sparse Parameters

Arya Mazumdar , Soumyabrata Pal

Categories: Machine Learning | Machine Learning (Statistics)

2022-02-24

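Mixture models are widely used to fit complex and multimodal datasets. In this paper we study mixtures with high-dimensional sparse latent parameter vectors and consider the problem of support recovery of those vectors. While parameter learning in mixture models is well studied, the sparsity constraint remains relatively unexplored. Sparsity of parameter vectors is a natural constraint in a variety of settings, and support recovery is a major step towards parameter estimation. We provide efficient algorithms for support recovery whose sample complexity depends logarithmically on the dimensionality of the latent space. Our algorithms are quite general: they apply to 1) mixtures of many different canonical distributions, including Uniform, Poisson, Laplace, and Gaussian, and 2) mixtures of linear regressions and linear classifiers with Gaussian covariates, under different assumptions on the unknown parameters. In most of these settings our results are the first guarantees on the problem, while in the rest they improve on existing work.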

Translated by Google Translate

Support Recovery of Sparse Signals from a Mixture of Linear Measurements

Venkata Gandikota , Arya Mazumdar , Soumyabrata Pal

Categories: Machine Learning (Statistics) | Machine Learning

2021-06-10

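Recovering the support of a sparse vector from simple measurements is a widely studied problem, considered under the frameworks of compressed sensing, 1-bit compressed sensing, and more general single-index models. We consider generalizations of this problem: mixtures of linear regressions and mixtures of linear classifiers, where the goal is to recover the supports of multiple sparse vectors using only a small number of possibly noisy linear and 1-bit measurements, respectively. The key challenge is that the measurements from different vectors are randomly mixed. Both of these problems have also received attention recently. In mixtures of linear classifiers, an observation corresponds to the side of the queried hyperplane on which a random unknown vector lies, whereas in mixtures of linear regressions we observe the projection of a random unknown vector onto the queried hyperplane. The primary step in recovering the unknown vectors from the mixture is to first identify the supports of all the individual component vectors. In this work, we study the number of measurements sufficient for recovering the supports of all the component vectors in a mixture in both of these models. We provide algorithms that use a number of measurements polynomial in $k$ and $\log n$ and quasi-polynomial in $\ell$ to recover, with high probability, the supports of all $\ell$ unknown vectors in the mixture when each individual component is a $k$-sparse $n$-dimensional vector.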

Robust Sparse Mean Estimation via Sum of Squares

Ilias Diakonikolas , Daniel M. Kane , Sushrut Karmalkar , Ankit Pensia , Thanasis Pittas

Categories: Machine Learning | Machine Learning (Statistics)

2022-06-07

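We study the problem of high-dimensional sparse mean estimation in the presence of an $\epsilon$-fraction of adversarial outliers. Prior work obtained sample-efficient and computationally efficient algorithms for this task for identity-covariance subgaussian distributions. In this work, we develop the first efficient algorithms for robust sparse mean estimation without a priori knowledge of the covariance. For distributions on $\mathbb{R}^d$ with "certifiably bounded" $t$-th moments and sufficiently light tails, our algorithm achieves error $O(\epsilon^{1-1/t})$ with sample complexity $m = (k \log(d))^{O(t)}/\epsilon^{2-2/t}$. For the special case of the Gaussian distribution, our algorithm achieves near-optimal error $\tilde O(\epsilon)$ with sample complexity $m = O(k^4 \mathrm{polylog}(d))/\epsilon^2$. Our algorithms follow the Sum-of-Squares based "proofs to algorithms" approach. We complement our upper bounds with Statistical Query and low-degree polynomial testing lower bounds, providing evidence that the sample-time-error tradeoffs achieved by our algorithms are qualitatively the best possible.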

Clustering Mixtures with Almost Optimal Separation in Polynomial Time

Jerry Li , Allen Liu

Categories: Machine Learning | Machine Learning (Statistics)

2021-12-01

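We consider the problem of clustering mixtures of mean-separated Gaussians in high dimensions. We are given samples from a mixture of $k$ identity-covariance Gaussians such that the minimum pairwise distance between any two means is at least $\Delta$, for some parameter $\Delta > 0$, and the goal is to recover the ground-truth clustering of these samples. It is folklore that separation $\Delta = \Theta(\sqrt{\log k})$ is both necessary and sufficient to recover a good clustering, at least information-theoretically. However, the estimators that achieve this guarantee are inefficient. We give the first algorithm that runs in polynomial time and almost matches this guarantee. More precisely, we give an algorithm that takes polynomially many samples and polynomial time and successfully recovers a good clustering as long as the separation is $\Delta = \Omega(\log^{1/2 + c} k)$, for any $c > 0$. Previously, polynomial-time algorithms were known for this problem only when the separation was polynomial in $k$, and all algorithms that could tolerate $\textsf{poly}(\log k)$ separation required quasipolynomial time. We also extend our result to mixtures of translations of a distribution that satisfies the Poincar\'{e} inequality, under additional mild assumptions. Our main technical tool, which we believe is of independent interest, is a novel way to implicitly represent and estimate high-degree moments of a distribution, which allows us to extract important information about high-degree moments without ever writing down the full moment tensors explicitly.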

Outlier-Robust Sparse Mean Estimation for Heavy-Tailed Distributions

Ilias Diakonikolas , Daniel M. Kane , Jasper C. H. Lee , Ankit Pensia

Categories: Machine Learning | Machine Learning (Statistics)

2022-11-29

We study the fundamental task of outlier-robust mean estimation for heavy-tailed distributions in the presence of sparsity. Specifically, given a small number of corrupted samples from a high-dimensional heavy-tailed distribution whose mean $\mu$ is guaranteed to be sparse, the goal is to efficiently compute a hypothesis that accurately approximates $\mu$ with high probability. Prior work had obtained efficient algorithms for robust sparse mean estimation of light-tailed distributions. In this work, we give the first sample-efficient and polynomial-time robust sparse mean estimator for heavy-tailed distributions under mild moment assumptions. Our algorithm achieves the optimal asymptotic error using a number of samples scaling logarithmically with the ambient dimension. Importantly, the sample complexity of our method is optimal as a function of the failure probability $\tau$, having an additive $\log(1/\tau)$ dependence. Our algorithm leverages the stability-based approach from the algorithmic robust statistics literature, with crucial (and necessary) adaptations required in our setting. Our analysis may be of independent interest, involving the delicate design of a (non-spectral) decomposition for positive semi-definite matrices satisfying certain sparsity properties.

Symmetric Sparse Boolean Matrix Factorization and Applications

Sitan Chen , Zhao Song , Runzhou Tao , Ruizhe Zhang

Categories: Machine Learning | Machine Learning (Statistics)

2021-02-02

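In this work, we study a variant of nonnegative matrix factorization in which we wish to find a symmetric factorization of a given input matrix into a sparse Boolean matrix. Formally, given $\mathbf{M} \in \mathbb{Z}^{m \times m}$, we want to find $\mathbf{W} \in \{0,1\}^{m \times r}$ such that $\| \mathbf{M} - \mathbf{W}\mathbf{W}^\top \|_0$ is minimized among all $\mathbf{W}$ whose rows are $k$-sparse. This question turns out to be closely related to a number of other questions, such as recovering a hypergraph from its line graph, as well as to reconstruction attacks on private neural network training. As this problem is hard in the worst case, we study a natural average-case variant that arises in the context of these reconstruction attacks: $\mathbf{M} = \mathbf{W}\mathbf{W}^{\top}$ for $\mathbf{W}$ a random Boolean matrix with $k$-sparse rows, and the goal is to recover $\mathbf{W}$ up to column permutation. Equivalently, this can be thought of as recovering a uniformly random $k$-uniform hypergraph from its line graph. Our main result is a polynomial-time algorithm for this problem based on bootstrapping higher-order information about $\mathbf{W}$ and then decomposing an appropriate tensor. The key ingredient in our analysis, which may be of independent interest, is to show that such a matrix $\mathbf{W}$ has full column rank with high probability as soon as $m = \widetilde{\Omega}(r)$, which we do using tools from Littlewood-Offord theory and estimates for binary Krawtchouk polynomials.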

List-Decodable Covariance Estimation

Misha Ivkov , Pravesh K. Kothari

Categories: Machine Learning | Machine Learning (Statistics)

2022-06-22

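We give the first polynomial-time algorithm for \emph{list-decodable covariance estimation}. For any $\alpha > 0$, our algorithm takes as input a sample $Y \subseteq \mathbb{R}^d$ of size $n \geq d^{\mathsf{poly}(1/\alpha)}$ obtained by adversarially corrupting $(1-\alpha)n$ points in an i.i.d. sample $X$ of size $n$ from the Gaussian distribution with unknown mean $\mu_*$ and covariance $\Sigma_*$. In $n^{\mathsf{poly}(1/\alpha)}$ time, it outputs a constant-size list of $k = k(\alpha) = (1/\alpha)^{\mathsf{poly}(1/\alpha)}$ candidate parameters that, with high probability, contains a $(\hat{\mu}, \hat{\Sigma})$ such that the total variation distance satisfies $TV(\mathcal{N}(\mu_*, \Sigma_*), \mathcal{N}(\hat{\mu}, \hat{\Sigma})) < 1 - O_{\alpha}(1)$. This is the statistically strongest notion of distance, and it implies multiplicative spectral and relative Frobenius distance approximation for the parameters with dimension-independent error. Our algorithm works more generally for $(1-\alpha)$-corruptions of any distribution $D$ that possesses low-degree sum-of-squares certificates of two natural analytic properties: 1) anti-concentration of one-dimensional marginals and 2) hypercontractivity of degree-2 polynomials. Prior to our work, the only known results for estimating covariance in the list-decodable setting were for the special cases of list-decodable linear regression and subspace recovery, due to Karmarkar, Klivans, and Kothari (2019), Raghavendra and Yau (2019 and 2020), and Bakshi and Kothari (2020). These results need superpolynomial time to obtain any subconstant error in the underlying dimension. Our result implies the first polynomial-time \emph{exact} algorithm for list-decodable linear regression and subspace recovery, which allows, in particular, obtaining $2^{-\mathsf{poly}(d)}$ error in polynomial time. Our result also implies an improved algorithm for clustering non-spherical mixtures.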

Active Sampling for Linear Regression Beyond the $\ell_2$ Norm

Cameron Musco , Christopher Musco , David P. Woodruff , Taisuke Yasuda

Categories: Machine Learning | Machine Learning (Statistics)

2021-11-09

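We study active sampling algorithms for linear regression, which aim to query only a small number of entries of a target vector $b \in \mathbb{R}^n$ and output a near minimizer of $\min_{x \in \mathbb{R}^d} \|Ax - b\|$, where $A \in \mathbb{R}^{n \times d}$ is a design matrix and $\|\cdot\|$ is some loss function. For $\ell_p$ norm regression with any $0 < p < \infty$, we give an algorithm based on Lewis weight sampling that outputs a $(1+\epsilon)$-approximate solution using just $\tilde{O}(d^{\max(1, p/2)}/\mathrm{poly}(\epsilon))$ queries to $b$. We show that this dependence on $d$ is optimal, up to logarithmic factors. Our result resolves a recent open question of Chen and Derezi\'{n}ski, who gave near-optimal bounds for the $\ell_1$ norm and suboptimal bounds for $\ell_p$ regression with $p \in (1,2)$. We also provide the first total sensitivity upper bound of $O(d^{\max\{1, p/2\}} \log^2 n)$ for loss functions with at most degree-$p$ polynomial growth. This improves a recent result of Tukan, Maalouf, and Feldman. By combining this with our techniques for the $\ell_p$ regression result, we obtain an active regression algorithm making $\tilde O(d^{1+\max\{1, p/2\}}/\mathrm{poly}(\epsilon))$ queries, answering another open question of Chen and Derezi\'{n}ski. For the important special case of the Huber loss, we further improve our bound to an active sample complexity of $\tilde O(d^{(1+\sqrt2)/2}/\epsilon^c)$ and a non-active sample complexity of $\tilde O(d^{4-2\sqrt 2}/\epsilon^c)$, improving a previous $d^4$ bound for Huber regression due to Clarkson and Woodruff. Our sensitivity bounds have further implications, improving a variety of previous results that use sensitivity sampling, including Orlicz norm subspace embeddings and robust subspace approximation. Finally, our active sampling results give the first sublinear-time algorithms for Kronecker product regression under every $\ell_p$ norm.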

What Makes A Good Fisherman? Linear Regression under Self-Selection Bias

Yeshwanth Cherapanamjeri , Constantinos Daskalakis , Andrew Ilyas , Manolis Zampetakis

Categories: Machine Learning | Machine Learning (Statistics)

2022-05-06

In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.

Lattice-Based Methods Surpass Sum-of-Squares in Clustering

Ilias Zadik , Min Jae Song , Alexander S. Wein , Joan Bruna

Categories: Machine Learning | Machine Learning (Statistics)

2021-12-07

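Clustering is a fundamental primitive in unsupervised learning that gives rise to a rich class of computationally challenging inference tasks. In this work, we focus on the canonical task of clustering $d$-dimensional Gaussian mixtures with unknown (and possibly degenerate) covariance. Recent works (Ghosh et al. ...) have established lower bounds against low-degree polynomial methods and the sum-of-squares (SoS) hierarchy for recovering certain hidden structures planted in Gaussian clustering instances. Prior work on many similar inference tasks portends that such lower bounds strongly suggest the presence of an inherent statistical-to-computational gap for clustering, that is, a parameter regime where the clustering task is \textit{statistically} possible but no \textit{polynomial-time} algorithm succeeds. One special case of the clustering task we consider is equivalent to the problem of finding a planted hypercube vector in an otherwise random subspace. We show that, perhaps surprisingly, this particular clustering model \textit{does not exhibit} a statistical-to-computational gap, even though the aforementioned low-degree and SoS lower bounds continue to apply in this case. To achieve this, we give a polynomial-time algorithm based on the Lenstra-Lenstra-Lovasz lattice basis reduction method that achieves the statistically optimal sample complexity of $d+1$ samples. This result extends the class of problems whose conjectured statistical-to-computational gaps can be "closed" by "brittle" polynomial-time algorithms, highlighting the crucial but subtle role of noise in the onset of statistical-to-computational gaps.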

List-Decodable Sparse Mean Estimation

Shiwei Zeng , Jie Shen

Categories: Machine Learning

2022-05-28

Robust mean estimation is one of the most important problems in statistics: given a set of samples in $\mathbb{R}^d$ where an $\alpha$ fraction are drawn from some distribution $D$ and the rest are adversarially corrupted, we aim to estimate the mean of $D$. A surge of recent research interest has been focusing on the list-decodable setting where $\alpha \in (0, \frac12]$, and the goal is to output a finite number of estimates among which at least one approximates the target mean. In this paper, we consider that the underlying distribution $D$ is Gaussian with $k$-sparse mean. Our main contribution is the first polynomial-time algorithm that enjoys sample complexity $O\big(\mathrm{poly}(k, \log d)\big)$, i.e. poly-logarithmic in the dimension. One of our core algorithmic ingredients is using low-degree sparse polynomials to filter outliers, which may find more applications.

Robustness Implies Privacy in Statistical Estimation

Samuel B. Hopkins , Gautam Kamath , Mahbod Majid , Shyam Narayanan

Categories: Machine Learning (Statistics)

2022-12-09

We study the relationship between adversarial robustness and differential privacy in high-dimensional algorithmic statistics. We give the first black-box reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental high-dimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearly-optimal polynomial-time robust estimators for the mean and covariance of high-dimensional Gaussians which are based on the Sum-of-Squares method, we design the first polynomial-time private estimators for these problems with nearly-optimal samples-accuracy-privacy tradeoffs. Our algorithms are also robust to a constant fraction of adversarially-corrupted samples.

Learning General Halfspaces with General Massart Noise under the Gaussian Distribution

Ilias Diakonikolas , Daniel M. Kane , Vasilis Kontonis , Christos Tzamos , Nikos Zarifis

Categories: Machine Learning | Machine Learning (Statistics)

2021-08-19

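We study the problem of PAC learning halfspaces with Massart noise under the Gaussian distribution. In the Massart model, an adversary is allowed to flip the label of each point $\mathbf{x}$ with unknown probability $\eta(\mathbf{x}) \leq \eta$, for some parameter $\eta \in [0, 1/2]$. The goal is to find a hypothesis with misclassification error $\mathrm{OPT} + \epsilon$, where $\mathrm{OPT}$ is the error of the target halfspace. This problem had previously been studied under two assumptions: (i) the target halfspace is homogeneous (i.e., the separating hyperplane goes through the origin), and (ii) the parameter $\eta$ is strictly smaller than $1/2$. Prior to this work, no nontrivial bounds were known when either of these assumptions is removed. We study the general problem and establish the following. For $\eta < 1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $d^{O_{\eta}(\log(1/\gamma))} \mathrm{poly}(1/\epsilon)$, where $\gamma = \max\{\epsilon, \min\{\mathbf{Pr}[f(\mathbf{x}) = 1], \mathbf{Pr}[f(\mathbf{x}) = -1]\}\}$ is the bias of the target halfspace $f$. Prior efficient algorithms could only handle the special case $\gamma = 1/2$. Interestingly, we establish a qualitatively matching lower bound of $d^{\Omega(\log(1/\gamma))}$ on the complexity of any Statistical Query (SQ) algorithm. For $\eta = 1/2$, we give a learning algorithm for general halfspaces with sample and computational complexity $O_\epsilon(1) d^{O(\log(1/\epsilon))}$. This result is new even for the subclass of homogeneous halfspaces; prior algorithms for homogeneous Massart halfspaces provide vacuous guarantees for $\eta = 1/2$. We complement our upper bound with a nearly matching SQ lower bound of $d^{\Omega(\log(1/\epsilon))}$, which holds even for the special case of homogeneous halfspaces.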

Tensor decompositions for learning latent variable models

Anima Anandkumar , Rong Ge , Daniel Hsu , Sham M. Kakade , Matus Telgarsky

Categories:

2012-10-29

This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models (including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation) which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
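
As a rough illustration of the power-iteration idea mentioned in this abstract, the sketch below runs the map $u \mapsto T(I, u, u) / \|T(I, u, u)\|$ on a small symmetric third-order tensor with an exact orthogonal decomposition. It is a minimal toy under stated assumptions, not the robust method analysed in the paper: the helper names (`build_tensor`, `contract`, `power_iteration`) are hypothetical, the tensor is noiseless, and deflation to extract further components is not shown.

```python
import math
import random


def build_tensor(weights, vectors):
    """Return T = sum_i w_i * v_i (x) v_i (x) v_i as nested lists (assumed orthonormal v_i)."""
    d = len(vectors[0])
    T = [[[0.0] * d for _ in range(d)] for _ in range(d)]
    for w, v in zip(weights, vectors):
        for a in range(d):
            for b in range(d):
                for c in range(d):
                    T[a][b][c] += w * v[a] * v[b] * v[c]
    return T


def contract(T, u):
    """Compute the vector T(I, u, u), whose a-th entry is sum_{b,c} T[a][b][c] * u[b] * u[c]."""
    d = len(u)
    return [
        sum(T[a][b][c] * u[b] * u[c] for b in range(d) for c in range(d))
        for a in range(d)
    ]


def normalize(u):
    norm = math.sqrt(sum(x * x for x in u))
    return [x / norm for x in u]


def power_iteration(T, iters=50, seed=0):
    """Run tensor power iteration from a random unit vector.

    Returns an (eigenvalue, eigenvector) estimate. Which component is found
    depends on the random starting point.
    """
    rng = random.Random(seed)
    u = normalize([rng.gauss(0.0, 1.0) for _ in range(len(T))])
    for _ in range(iters):
        u = normalize(contract(T, u))
    eigenvalue = sum(ui * ti for ui, ti in zip(u, contract(T, u)))
    return eigenvalue, u


if __name__ == "__main__":
    s = 1.0 / math.sqrt(2.0)
    weights = [3.0, 2.0, 1.0]
    vectors = [[s, s, 0.0], [s, -s, 0.0], [0.0, 0.0, 1.0]]
    print(power_iteration(build_tensor(weights, vectors)))
```

For an exactly orthogonal decomposition the iterate converges to one of the components $v_i$ and the returned eigenvalue to the matching weight; repeating after subtracting the recovered rank-one term would be the usual next step.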

Clustering a Mixture of Gaussians with Unknown Covariance

Damek Davis , Mateo Díaz , Kaizheng Wang

Categories: Machine Learning (Statistics) | Machine Learning

2021-10-04

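We investigate a clustering problem with data from a mixture of Gaussians that share a common but unknown, and potentially ill-conditioned, covariance matrix. We start by considering Gaussian mixtures with two equally sized components and derive a Max-Cut integer program based on maximum likelihood estimation. We prove that its solutions achieve the optimal misclassification rate when the number of samples grows linearly in the dimension, up to a logarithmic factor. However, solving the Max-Cut problem appears to be computationally intractable. To overcome this, we develop an efficient spectral algorithm that attains the optimal rate but requires a quadratic sample size. Although this sample complexity is worse than that of the Max-Cut problem, we conjecture that no polynomial-time method can perform better. Furthermore, we gather numerical and theoretical evidence that supports the existence of a statistical-computational gap. Finally, we generalize the Max-Cut program to a $k$-means program that handles multi-component mixtures with possibly unequal weights. It enjoys similar optimality guarantees for mixtures of distributions that satisfy a transportation-cost inequality, encompassing Gaussian and strongly log-concave distributions.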

Privately Estimating a Gaussian: Efficient, Robust and Optimal

Daniel Alabi , Pravesh K. Kothari , Pranay Tankala , Prayaag Venkat , Fred Zhang

Categories: Machine Learning (Statistics)

2022-12-15

In this work, we give efficient algorithms for privately estimating a Gaussian distribution in both pure and approximate differential privacy (DP) models with optimal dependence on the dimension in the sample complexity. In the pure DP setting, we give an efficient algorithm that estimates an unknown $d$-dimensional Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2 \log \kappa)$ samples while tolerating a constant fraction of adversarial outliers. Here, $\kappa$ is the condition number of the target covariance matrix. The sample bound matches the best non-private estimators in the dependence on the dimension (up to a polylogarithmic factor). We prove a new lower bound on differentially private covariance estimation to show that the dependence on the condition number $\kappa$ in the above sample bound is also tight. Prior to our work, only identifiability results (yielding inefficient super-polynomial time algorithms) were known for the problem. In the approximate DP setting, we give an efficient algorithm to estimate an unknown Gaussian distribution up to an arbitrarily tiny total variation error using $\widetilde{O}(d^2)$ samples while tolerating a constant fraction of adversarial outliers. Prior to our work, all efficient approximate DP algorithms incurred a super-quadratic sample cost or were not outlier-robust. For the special case of mean estimation, our algorithm achieves the optimal sample complexity of $\widetilde O(d)$, improving on a $\widetilde O(d^{1.5})$ bound from prior work. Our pure DP algorithm relies on a recursive private preconditioning subroutine that utilizes the recent work on private mean estimation [Hopkins et al., 2022]. Our approximate DP algorithms are based on a substantial upgrade of the method of stabilizing convex relaxations introduced in [Kothari et al., 2022].

Fuzzy Clustering with Similarity Queries

Wasim Huleihel , Arya Mazumdar , Soumyabrata Pal

Categories: Machine Learning | Machine Learning (Statistics)

2021-06-04

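The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem, extending the clustering capability of $k$-means to datasets that are uncertain, vague, and otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework in which the learner is allowed to interact with an oracle (a domain expert), asking for the similarity between a certain set of chosen items. We study the query and computational complexities of clustering in this framework. We prove that having a few such similarity queries enables one to obtain a polynomial-time approximation algorithm for an otherwise conjecturally NP-hard problem. In particular, we provide algorithms for fuzzy clustering in this setting that ask $O(\mathsf{poly}(k) \log n)$ similarity queries and run in polynomial time, where $n$ is the number of items. The fuzzy $k$-means objective is nonconvex, has $k$-means as a special case, and is equivalent to some other generic nonconvex problems such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or alternating minimization algorithms) can get stuck at a local minimum. Our results show that by making a few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms on real-world datasets, showing their effectiveness in real-world applications.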
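
The abstract does not spell out the fuzzy $k$-means objective, so the sketch below assumes the standard fuzzy c-means form, $\sum_i \sum_j u_{ij}^m \|x_i - c_j\|^2$ with fuzzifier $m > 1$ and memberships $u_{ij} \propto \|x_i - c_j\|^{-2/(m-1)}$. It only evaluates memberships and the objective for fixed centers; it is not the query-based algorithm of the paper, and the function names are hypothetical.

```python
def squared_distance(x, c):
    return sum((a - b) ** 2 for a, b in zip(x, c))


def memberships(points, centers, m=2.0):
    """Soft assignments for fixed centers (assumed standard fuzzy c-means rule).

    Each row sums to 1. A point that coincides with one or more centers
    puts all of its mass on those centers.
    """
    rows = []
    for x in points:
        sq = [squared_distance(x, c) for c in centers]
        zeros = [j for j, s in enumerate(sq) if s == 0.0]
        if zeros:
            row = [1.0 / len(zeros) if j in zeros else 0.0 for j in range(len(centers))]
        else:
            inv = [s ** (-1.0 / (m - 1.0)) for s in sq]
            total = sum(inv)
            row = [v / total for v in inv]
        rows.append(row)
    return rows


def fuzzy_objective(points, centers, m=2.0):
    """Evaluate sum_i sum_j u_ij^m * ||x_i - c_j||^2 at the memberships above."""
    u = memberships(points, centers, m)
    return sum(
        (u[i][j] ** m) * squared_distance(x, c)
        for i, x in enumerate(points)
        for j, c in enumerate(centers)
    )


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
    ctrs = [(0.0, 0.0), (2.0, 0.0)]
    print(memberships(pts, ctrs))
    print(fuzzy_objective(pts, ctrs))
```

As $m$ approaches 1 the memberships concentrate on the nearest center, which is one way to see ordinary $k$-means as the special case the abstract mentions.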

Distributed Sparse Linear Regression with Sublinear Communication

Chen Amiraz , Robert Krauthgamer , Boaz Nadler

Categories: Machine Learning (Statistics) | Machine Learning

2022-09-15

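We study the problem of high-dimensional sparse linear regression in a distributed setting under both computational and communication constraints. Specifically, we consider a star-topology network in which several machines are connected to a fusion center, with which they can exchange relatively short messages. Each machine holds noisy samples from a linear regression model with the same unknown sparse $d$-dimensional vector $\theta$. The goal of the fusion center is to estimate the vector $\theta$ and its support using few computations and limited communication at each machine. In this work, we consider distributed algorithms based on Orthogonal Matching Pursuit (OMP) and theoretically study their ability to exactly recover the support of $\theta$. We prove that, under certain conditions, even when individual machines are unable to detect the support of $\theta$, distributed-OMP methods correctly recover it with total communication sublinear in $d$. In addition, we present simulations that illustrate the performance of distributed OMP-based algorithms and show that they perform similarly to more sophisticated and computationally intensive methods, and in some cases even outperform them.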
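
For readers unfamiliar with OMP, the following sketch shows the classical single-machine greedy procedure for support recovery: repeatedly pick the column most correlated with the current residual, then refit by least squares on the selected columns. It is a toy under stated assumptions, not the distributed variant studied in the paper; the function names are hypothetical and the least-squares step uses the normal equations with a small Gaussian-elimination helper.

```python
import random


def solve_linear_system(A, b):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def omp_support(X, y, k):
    """Greedy OMP: return the sorted indices of k columns of X chosen to explain y.

    X is a list of n rows, each of length d; y has length n.
    """
    n, d = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(d)]
    support, residual = [], list(y)
    for _ in range(k):
        # Score each unselected column by its absolute correlation with the residual.
        scores = [abs(sum(col[i] * residual[i] for i in range(n))) for col in cols]
        for j in support:
            scores[j] = -1.0
        support.append(max(range(d), key=lambda j: scores[j]))
        # Refit on the selected columns via the normal equations G * coef = rhs.
        G = [[sum(cols[a][i] * cols[b][i] for i in range(n)) for b in support] for a in support]
        rhs = [sum(cols[a][i] * y[i] for i in range(n)) for a in support]
        coef = solve_linear_system(G, rhs)
        residual = [
            y[i] - sum(cf * cols[j][i] for cf, j in zip(coef, support))
            for i in range(n)
        ]
    return sorted(support)


if __name__ == "__main__":
    rng = random.Random(0)
    n, d = 200, 20
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    theta = [0.0] * d
    theta[2], theta[7], theta[15] = 2.0, -3.0, 1.5
    y = [sum(x * t for x, t in zip(row, theta)) for row in X]
    print(omp_support(X, y, 3))
```

The demo uses a noiseless Gaussian design with many more samples than nonzero coefficients, a regime where the greedy selection is expected to find the true support; the low signal-to-noise distributed regime discussed in the abstract is much harder.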

Consistent Estimation for PCA and Sparse Regression with Oblivious Outliers

Tommaso d'Orsi , Chih-Hung Liu , Rajai Nasser , Gleb Novikov , David Steurer , Stefan Tiegel

Categories: Machine Learning | Machine Learning (Statistics)

2021-11-04

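We develop machinery to design efficiently computable and consistent estimators, achieving estimation error approaching zero as the number of observations grows, when facing an oblivious adversary that may corrupt the responses in all but an $\alpha$ fraction of the samples. As concrete examples, we investigate two problems: sparse regression and principal component analysis (PCA). For sparse regression, we achieve consistency for optimal sample size $n \gtrsim (k \log d)/\alpha^2$ and optimal error rate $O(\sqrt{(k \log d)/(n \cdot \alpha^2)})$, where $n$ is the number of observations, $d$ is the number of dimensions, and $k$ is the sparsity of the parameter vector, allowing the fraction of inliers to be inverse-polynomial in the number of samples. Prior to this work, no estimator was known to be consistent when the fraction of inliers $\alpha$ is $o(1/\log \log n)$, even for (non-spherical) Gaussian design matrices. Results holding under weak design assumptions and in the presence of such general noise have only been shown in the dense setting (i.e., general linear regression), very recently by d'Orsi et al. [dNS21]. In the context of PCA, we attain optimal error guarantees under broad spikiness assumptions on the parameter matrix (usually used in matrix completion). Previous works could obtain non-trivial guarantees only under the assumption that the measurement noise corresponding to the inliers is polynomially small in $n$ (e.g., Gaussian with variance $1/n^2$). To devise our estimators, we equip the Huber loss with non-smooth regularizers such as the $\ell_1$ norm or the nuclear norm, and extend d'Orsi et al.'s approach [dNS21] in a novel way to analyze the loss function. Our machinery appears to be easily applicable to a wide range of estimation problems.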
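
To make "the Huber loss equipped with an $\ell_1$ regularizer" concrete, here is a minimal sketch of such an objective for sparse regression. The threshold `delta`, the penalty weight `lam`, and the function names are illustrative assumptions; the abstract does not give the exact parametrisation or how the estimator is minimised, and no optimiser is shown here.

```python
def huber(residual, delta=1.0):
    """Huber penalty: quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def huber_l1_objective(X, y, beta, lam, delta=1.0):
    """Sum of Huber losses of the residuals plus lam times the l1 norm of beta.

    X is a list of rows, y the responses, beta the candidate parameter vector.
    """
    loss = 0.0
    for row, target in zip(X, y):
        prediction = sum(x * b for x, b in zip(row, beta))
        loss += huber(target - prediction, delta)
    return loss + lam * sum(abs(b) for b in beta)


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]
    y = [0.5, 3.0]
    print(huber_l1_objective(X, y, [0.0, 0.0], lam=2.0))
    print(huber_l1_objective(X, y, [0.5, 3.0], lam=1.0))
```

Because the Huber penalty grows only linearly for large residuals, a few grossly corrupted responses move the objective far less than they would under squared loss, which is the intuition behind using it against outliers.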

The Price of Tolerance in Distribution Testing

Clément L. Canonne , Ayush Jain , Gautam Kamath , Jerry Li

Categories: Machine Learning (Statistics)

2021-06-25

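We revisit the problem of tolerant distribution testing. That is, given samples from an unknown distribution $p$ over $\{1, \dots, n\}$, is it $\varepsilon_1$-close to or $\varepsilon_2$-far from a reference distribution $q$ (in total variation distance)? Despite significant interest over the past decade, this problem is well understood only in the extreme cases. In the noiseless setting (i.e., $\varepsilon_1 = 0$) the sample complexity is $\Theta(\sqrt{n})$, strongly sublinear in the domain size. At the other end of the spectrum, when $\varepsilon_1 = \varepsilon_2/2$, the sample complexity jumps to the barely sublinear $\Theta(n/\log n)$. However, very little is known about the intermediate regime. We fully characterize the price of tolerance in distribution testing as a function of $n$, $\varepsilon_1$, $\varepsilon_2$, up to a single $\log n$ factor. Specifically, we show the sample complexity to be \[\tilde \Theta\left(\frac{\sqrt{n}}{\varepsilon_2^{2}} + \frac{n}{\log n} \cdot \max \left\{\frac{\varepsilon_1}{\varepsilon_2^2}, \left(\frac{\varepsilon_1}{\varepsilon_2^2}\right)^{\!\!2}\right\}\right),\] providing a smooth tradeoff between the two previously known cases. We also provide a similar characterization for the problem of tolerant equivalence testing, where both $p$ and $q$ are unknown. Surprisingly, in both cases, the main quantity dictating the sample complexity is the ratio $\varepsilon_1/\varepsilon_2^2$, and not the more intuitive $\varepsilon_1/\varepsilon_2$. Of particular technical interest is our lower bound framework, which involves novel approximation-theoretic tools required to handle the asymmetry between $\varepsilon_1$ and $\varepsilon_2$, a challenge absent from previous works.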
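
The stated bound can be explored numerically. The sketch below evaluates the expression inside the $\tilde\Theta(\cdot)$, namely $\sqrt{n}/\varepsilon_2^2 + (n/\log n) \cdot \max\{r, r^2\}$ with $r = \varepsilon_1/\varepsilon_2^2$, ignoring the hidden constants and logarithmic factors. The function name is hypothetical and the returned value is only a scaling proxy, not an actual sample count.

```python
import math


def tolerant_testing_bound(n, eps1, eps2):
    """Evaluate sqrt(n)/eps2^2 + (n/log n) * max(r, r^2), where r = eps1/eps2^2.

    Constants and log factors hidden by the tilde-Theta notation are ignored.
    """
    if not (n >= 2 and 0.0 <= eps1 < eps2 <= 1.0):
        raise ValueError("need n >= 2 and 0 <= eps1 < eps2 <= 1")
    ratio = eps1 / eps2 ** 2
    return math.sqrt(n) / eps2 ** 2 + (n / math.log(n)) * max(ratio, ratio ** 2)


if __name__ == "__main__":
    n, eps2 = 10_000, 0.5
    for eps1 in (0.0, 0.01, 0.05, 0.25):
        print(eps1, tolerant_testing_bound(n, eps1, eps2))
```

With $\varepsilon_1 = 0$ only the $\sqrt{n}/\varepsilon_2^2$ term remains, and with $\varepsilon_1 = \varepsilon_2/2$ and constant $\varepsilon_2$ the $n/\log n$ term dominates, matching the two extreme cases described in the abstract.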
