We give the first polynomial-time algorithm for performing linear or polynomial regression resilient to adversarial corruptions in both examples and labels. Given a sufficiently large (polynomial-size) training set drawn i.i.d. from a distribution D and subsequently corrupted on some fraction of points, our algorithm outputs a linear function whose squared error is close to the squared error of the best-fitting linear function with respect to D, assuming that the marginal distribution of D over the input space is certifiably hypercontractive. This natural property is satisfied by many well-studied distributions, such as Gaussians, strongly log-concave distributions, and the uniform distribution on the hypercube, among others. We also give a simple statistical lower bound showing that some distributional assumption is necessary to succeed in this setting. These results are the first of their kind and were not known to be even information-theoretically possible prior to our work. Our approach is based on the sum-of-squares (SoS) method and is inspired by the recent applications of the method to parameter recovery problems in unsupervised learning. Our algorithm can be seen as a natural convex relaxation of the following conceptually simple non-convex optimization problem: find a linear function and a large subset of the corrupted input sample such that the least-squares loss of the function over the subset is minimized over all possible large subsets.
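The non-convex problem in the closing sentence can be illustrated by a simple alternating heuristic, a sketch only (numpy-based, with illustrative names), not the paper's SoS relaxation and without its guarantees: fit least squares, keep the points with the smallest residuals, and refit.

```python
import numpy as np

def trimmed_least_squares(X, y, keep_frac=0.9, iters=20):
    # Greedy heuristic for the non-convex objective the SoS method relaxes:
    # alternately fit least squares on a subset and re-select the keep_frac
    # fraction of points with smallest squared residuals.
    n = len(y)
    k = int(keep_frac * n)
    idx = np.arange(n)                      # start with all points
    for _ in range(iters):
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        resid = (X @ beta - y) ** 2
        idx = np.argsort(resid)[:k]         # keep the k best-fit points
    return beta

rng = np.random.default_rng(0)
n, d = 500, 3
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.01 * rng.normal(size=n)
y[:25] += 50.0                              # adversarially corrupt 5% of labels
beta_hat = trimmed_least_squares(X, y)
```

With a 5% corruption rate the corrupted points incur enormous residuals after the first fit and are trimmed away, so the estimate lands near the true vector; the SoS relaxation is what makes a guarantee of this kind provable.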
We give new algorithms based on the sum-of-squares method for tensor decomposition. Our results improve the best known running times from quasi-polynomial to polynomial for several problems, including decomposing random overcomplete 3-tensors and learning overcomplete dictionaries with constant relative sparsity. We also give the first robust analysis for decomposing overcomplete 4-tensors in the smoothed analysis model. A key ingredient of our analysis is to establish small spectral gaps in moment matrices derived from solutions to sum-of-squares relaxations. To enable this analysis we augment sum-of-squares relaxations with spectral analogs of maximum entropy constraints.
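For contrast with the overcomplete regime the abstract targets, the classical tensor power iteration already works in the easy orthogonal, undercomplete setting; the toy tensor and function names below are our own illustration, not the paper's algorithm.

```python
import numpy as np

def recover_component(T, v0, iters=50):
    # Classical tensor power iteration: repeat v <- T(v, v, .) / norm.
    # Converges to a component of an orthogonally decomposable 3-tensor.
    v = v0 / np.linalg.norm(v0)
    for _ in range(iters):
        w = np.einsum('ijk,i,j->k', T, v, v)   # contract two modes with v
        v = w / np.linalg.norm(w)
    return v

# Toy orthogonal rank-2 tensor T = 2*a1^(x3) + a2^(x3).
a1 = np.array([1.0, 0.0, 0.0])
a2 = np.array([0.0, 1.0, 0.0])
T = 2.0 * np.einsum('i,j,k->ijk', a1, a1, a1) \
    + np.einsum('i,j,k->ijk', a2, a2, a2)
v = recover_component(T, v0=np.array([0.6, 0.5, 0.2]))
```

The iteration converges to the heavier component here; random overcomplete 3-tensors break this simple analysis, which is where the sum-of-squares machinery of the abstract comes in.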
We study the problem of list-decodable (robust) Gaussian mean estimation and the related problem of learning mixtures of separated spherical Gaussians. In the former problem, we are given a set T of points in R^n with the promise that an α-fraction of the points in T, where 0 < α < 1/2, are drawn from an unknown-mean identity-covariance Gaussian G, and no assumptions are made about the remaining points. The goal is to output a small list of candidate vectors with the guarantee that at least one of the candidates is close to the mean of G. In the latter problem, we are given samples from a k-mixture of spherical Gaussians on R^n and the goal is to estimate the unknown model parameters up to small accuracy. We develop a set of techniques that yield new efficient algorithms with significantly improved guarantees for these problems. Specifically, our main contributions are as follows: List-Decodable Mean Estimation. Fix any d ∈ Z_+ and 0 < α < 1/2. We design an algorithm with sample complexity O_d(poly(n^d/α)) and runtime O_d(poly(n/α)^d) that outputs a list of O(1/α) candidate vectors such that, with high probability, one of the candidates is within ℓ_2-distance O_d(α^{−1/(2d)}) from the mean of G. The only previous algorithm for this problem [CSV17] achieved error Õ(α^{−1/2}) under second moment conditions. For d = O(1/ε), where ε > 0 is a constant, our algorithm runs in polynomial time and achieves error O(α^ε). For d = Θ(log(1/α)), our algorithm runs in time (n/α)^{O(log(1/α))} and achieves error O(log^{3/2}(1/α)), almost matching the information-theoretically optimal bound of Θ(log^{1/2}(1/α)) that we establish. We also give a Statistical Query (SQ) lower bound suggesting that the complexity of our algorithm is qualitatively close to best possible. Learning Mixtures of Spherical Gaussians.
We give a learning algorithm for mixtures of spherical Gaussians, with unknown spherical covariances, that succeeds under significantly weaker separation assumptions compared to prior work. For the prototypical case of a uniform k-mixture of identity-covariance Gaussians we obtain the following: For any ε > 0, if the pairwise separation between the means is at least Ω(k^ε + log(1/δ)), our algorithm learns the unknown parameters within accuracy δ with sample complexity and running time poly(n, 1/δ, (k/ε)^{1/ε}). Moreover, our algorithm is robust to a small dimension-independent fraction of corrupted data. The previously best known polynomial-time algorithm [VW02] required separation at least k^{1/4} polylog(k/δ). Finally, our algorithm works under separation Õ(log^{3/2}(k) + log(1/δ)) with sample complexity and running time poly(n, 1/δ, k^{log k}). This bound is close to the information-theoretically minimal separation of Ω(√(log k)) [RV17]. Our main technical contribution is a new technique, using degree-d multivariate polynomials, to remove outliers from high-dimensional datasets where the majority of the points are corrupted.
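As a baseline showing why separation matters, here is a hypothetical sketch of nearest-mean clustering (farthest-point seeding plus Lloyd updates), which already succeeds when the separation is very large; the abstract's contribution is handling separation far below what such baselines need.

```python
import numpy as np

def two_cluster_means(X, iters=10):
    # Farthest-point initialization: under large separation the two seeds
    # land in different clusters, after which Lloyd updates converge.
    c1 = X[0]
    c2 = X[np.argmax(((X - c1) ** 2).sum(axis=1))]
    for _ in range(iters):
        d1 = ((X - c1) ** 2).sum(axis=1)
        d2 = ((X - c2) ** 2).sum(axis=1)
        c1 = X[d1 <= d2].mean(axis=0)
        c2 = X[d1 > d2].mean(axis=0)
    return c1, c2

rng = np.random.default_rng(0)
d = 10
mu1 = np.zeros(d)
mu2 = np.zeros(d); mu2[0] = 20.0          # very large pairwise separation
X = np.vstack([rng.normal(size=(1000, d)) + mu1,
               rng.normal(size=(1000, d)) + mu2])
c1, c2 = two_cluster_means(X)
```
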

2018-11-07

We describe a general technique that yields the first Statistical Query lower bounds for a range of fundamental high-dimensional learning problems involving Gaussian distributions. Our main results are for the problems of (1) learning Gaussian mixture models (GMMs), and (2) robust (agnostic) learning of a single unknown Gaussian distribution. For each of these problems, we show a super-polynomial gap between the (information-theoretic) sample complexity and the computational complexity of any Statistical Query algorithm for the problem. Statistical Query (SQ) algorithms are a class of algorithms that are only allowed to query expectations of functions of the distribution rather than directly access samples. This class of algorithms is quite broad: a wide range of known algorithmic techniques in machine learning are known to be implementable using SQs. Moreover, for the unsupervised learning problems studied in this paper, all known algorithms with non-trivial performance guarantees are SQ or are easily implementable using SQs. Our SQ lower bound for Problem (1) is qualitatively matched by known learning algorithms for GMMs. At a conceptual level, this result implies that, as far as SQ algorithms are concerned, the computational complexity of learning GMMs is inherently exponential in the dimension of the latent space, even though there is no such information-theoretic barrier. Our lower bound for Problem (2) implies that the accuracy of the robust learning algorithm in [DKK+16] is essentially best possible among all polynomial-time SQ algorithms. On the positive side, we also give a new (SQ) learning algorithm for Problem (2) achieving the information-theoretically optimal accuracy, up to a constant factor, whose running time essentially matches our lower bound. Our algorithm relies on a filtering technique generalizing [DKK+16] that removes outliers based on higher-order tensors.
Our SQ lower bounds are attained via a unified moment-matching technique that is useful in other contexts and may be of broader interest. Our technique yields nearly-tight lower bounds for a number of related unsupervised estimation problems. Specifically, for the problems of (3) robust covariance estimation in spectral norm, and (4) robust sparse mean estimation, we establish a quadratic statistical-computational tradeoff for SQ algorithms, matching known upper bounds. Finally, our technique can be used to obtain tight sample complexity lower bounds for high-dimensional testing problems. Specifically, for the classical problem of robustly testing an unknown mean (known covariance) Gaussian, our technique implies an information-theoretic sample lower bound that scales linearly in the dimension. Our sample lower bound matches the sample complexity of the corresponding robust learning problem and separates the sample complexity of robust testing from standard (non-robust) testing. This separation is surprising because such a gap does not exist for the corresponding learning problem.
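The SQ access model discussed above can be simulated in a few lines: the learner may only ask for the expectation of a query function, and the oracle may perturb the answer adversarially within a tolerance τ. A minimal sketch with illustrative names, not any algorithm from the paper.

```python
import numpy as np

def sq_oracle(samples, q, tau, rng):
    # Simulated Statistical Query oracle: returns E[q(x)] up to additive
    # tolerance tau. The learner never sees the samples directly.
    est = np.mean([q(x) for x in samples])
    return est + rng.uniform(-tau, tau)     # adversarial noise within tolerance

rng = np.random.default_rng(0)
samples = rng.normal(loc=1.0, size=5000)    # 1-D Gaussian with unknown mean 1.0
# A mean query suffices to estimate the mean of this Gaussian.
mu_hat = sq_oracle(samples, lambda x: x, tau=0.05, rng=rng)
```

Lower bounds in this model count how many such queries, at what tolerance, any algorithm must make.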
We use the Sum of Squares method to develop new efficient algorithms for learning well-separated mixtures of Gaussians and robust mean estimation, both in high dimensions, that substantially improve upon the statistical guarantees achieved by previous efficient algorithms. Our contributions are: • Mixture models with separated means: We study mixtures of k distributions in d dimensions, where the means of every pair of distributions are separated by at least k^ε. In the special case of spherical Gaussian mixtures, we give a (dk)^{O(1/ε^2)}-time algorithm that learns the means assuming separation at least k^ε, for any ε > 0. This is the first algorithm to improve on greedy ("single-linkage") and spectral clustering, breaking a long-standing barrier for efficient algorithms at separation k^{1/4}. • Robust estimation: When an unknown (1−ε)-fraction of X_1, …, X_n are chosen from a sub-Gaussian distribution with mean µ but the remaining points are chosen adversarially, we give an algorithm recovering µ to error ε^{1−1/t} in time d^{O(t^2)}, so long as sub-Gaussianity up to O(t) moments can be certified by a Sum of Squares proof. This is the first polynomial-time algorithm with guarantees approaching the information-theoretic limit for non-Gaussian distributions. Previous algorithms could not achieve error better than ε^{1/2}. Both of these results are based on a unified technique. Inspired by recent algorithms of Diakonikolas et al. in robust statistics, we devise an SDP based on the Sum of Squares method for the following setting: given X_1, …, X_n ∈ R^d for large d and n = poly(d) with the promise that a subset of X_1, …, X_n were sampled from a probability distribution with bounded moments, recover some information about that distribution.
We present a general approach to rounding semidefinite programming relaxations obtained by the Sum-of-Squares method (Lasserre hierarchy). Our approach is based on using the connection between these relaxations and the Sum-of-Squares proof system to transform a combining algorithm, i.e., an algorithm that maps a distribution over solutions into a (possibly weaker) solution, into a rounding algorithm that maps a solution of the relaxation to a solution of the original problem. Using this approach, we obtain algorithms that yield improved results for natural variants of three well-known problems: 1. We give a quasipolynomial-time algorithm that approximates max_{‖x‖_2=1} P(x) within an additive factor of ε‖P‖_spectral, where ε > 0 is a constant, P is a degree-d (d = O(1)), n-variate polynomial with nonnegative coefficients, and ‖P‖_spectral is the spectral norm of a matrix corresponding to P's coefficients. Beyond being of interest in its own right, obtaining such an approximation for general polynomials (with possibly negative coefficients) is a long-standing open question in quantum information theory, and our techniques have already led to improved results in this area (Brandão and Harrow, STOC '13). 2. We give a polynomial-time algorithm that, given a subspace V ⊆ R^n of dimension d that (almost) contains the characteristic function of a set of size n/k, finds a vector v ∈ V that satisfies E_i[v_i^4] ≥ Ω(d^{−1/3} k (E_i[v_i^2])^2). This is a natural analytical relaxation of the problem of finding the sparsest element in a subspace, and is also motivated by a connection to the Small Set Expansion problem shown by Barak et al. (STOC 2012). In particular, our results yield an improvement of the previous best known algorithms for small set expansion in a certain range of parameters. 3. We use this notion of L_4 vs.
L_2 sparsity to obtain a polynomial-time algorithm with substantially improved guarantees for recovering a planted sparse vector v in a random d-dimensional subspace of R^n. If v has µn nonzero coordinates, we can recover it with high probability whenever µ ≤ O(min(1, n/d^2)). In particular, when d ≤ √n, this recovers a planted vector with up to Ω(n) nonzero coordinates. When d ≤ n^{2/3}, our algorithm improves upon existing methods based on comparing the L_1 and L_∞ norms, which intrinsically require µ ≤ O(1/√d).
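The L_4-vs-L_2 dichotomy behind item 3 can be checked numerically: for a unit vector with µn nonzero coordinates the ratio E_i v_i^4 / (E_i v_i^2)^2 is about 1/µ, while a random dense direction concentrates near 3. A sketch of the test statistic only, not the recovery algorithm; all names are illustrative.

```python
import numpy as np

def l4_l2_ratio(u):
    # The sparsity proxy: mean fourth power over squared mean second power.
    # Scale-invariant; large for sparse vectors, about 3 for Gaussian ones.
    return np.mean(u ** 4) / np.mean(u ** 2) ** 2

rng = np.random.default_rng(0)
n, mu = 1000, 0.01
sparse = np.zeros(n)
sparse[rng.choice(n, int(mu * n), replace=False)] = 1.0   # 10 spikes
dense = rng.normal(size=n)                                 # a random direction
```

Here `l4_l2_ratio(sparse)` equals 1/µ = 100 while the dense vector stays near 3, which is the gap the rounding approach exploits.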
For f a weighted voting scheme used by n voters to choose between two candidates, the n Shapley-Shubik indices (or Shapley values) of f provide a measure of how much control each voter can exert over the overall outcome of the vote. Shapley-Shubik indices were introduced by Lloyd Shapley and Martin Shubik in 1954 [SS54] and are widely studied in social choice theory as a measure of the "influence" of voters. The Inverse Shapley Value Problem is the problem of designing a weighted voting scheme which (approximately) achieves a desired input vector of values for the Shapley-Shubik indices. Despite much interest in this problem, no provably correct and efficient algorithm was known prior to our work. We give the first efficient algorithm with provable performance guarantees for the Inverse Shapley Value Problem. For any constant ǫ > 0, our algorithm runs in fixed poly(n) time (the degree of the polynomial is independent of ǫ) and has the following performance guarantee: given as input a vector of desired Shapley values, if any "reasonable" weighted voting scheme (roughly, one in which the threshold is not too skewed) approximately matches the desired vector of values to within some small error, then our algorithm explicitly outputs a weighted voting scheme that achieves this vector of Shapley values to within error ǫ. If there is a "reasonable" voting scheme in which all voting weights are integers at most poly(n) that approximately achieves the desired Shapley values, then our algorithm runs in time poly(n) and outputs a weighted voting scheme that achieves the target vector of Shapley values to within error ǫ = n^{−1/8}.
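For small n, Shapley-Shubik indices can be computed exactly by enumerating voter orderings in the standard coalition (quota) form of a weighted voting scheme; this brute-force sketch is exponential in n and purely illustrative, with hypothetical example weights (the inverse problem above is, of course, much harder).

```python
from itertools import permutations
from math import factorial

def shapley_shubik(weights, quota):
    # A voter is pivotal in an ordering if adding it turns the running
    # coalition from losing (< quota) to winning (>= quota). The index of
    # voter i is the fraction of all n! orderings in which i is pivotal.
    n = len(weights)
    pivots = [0] * n
    for order in permutations(range(n)):
        running = 0
        for i in order:
            running += weights[i]
            if running >= quota:
                pivots[i] += 1
                break
    return [p / factorial(n) for p in pivots]

# Hypothetical example: weights (3, 2, 1, 1), quota 4. The heavy voter is
# pivotal in half of all orderings, far above its 3/7 weight share.
idx = shapley_shubik([3, 2, 1, 1], 4)
```
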
We study the efficient learnability of geometric concept classes, specifically low-degree polynomial threshold functions (PTFs) and intersections of halfspaces, when a fraction of the training data is adversarially corrupted. We give the first polynomial-time PAC learning algorithms for these concept classes with dimension-independent error guarantees in the presence of nasty noise under the Gaussian distribution. In the nasty noise model, an omniscient adversary can arbitrarily corrupt a small fraction of both the unlabeled data points and their labels. This model generalizes well-studied noise models, including the malicious noise model and the agnostic (adversarial label noise) model. Prior to our work, the only concept class for which efficient malicious learning algorithms were known was the class of origin-centered halfspaces [KLS09, ABL17]. Specifically, our robust learning algorithm for low-degree PTFs succeeds under a number of tame distributions, including the Gaussian distribution and, more generally, any log-concave distribution with (approximately) known low-degree moments. For LTFs under the Gaussian distribution, we give a polynomial-time algorithm that achieves error O(ǫ), where ǫ is the noise rate. At the core of our PAC learning results is an efficient algorithm to approximate the low-degree Chow parameters of any bounded function in the presence of nasty noise. To achieve this, we employ an iterative spectral method for outlier detection and removal, inspired by recent work in robust unsupervised learning. Our aforementioned algorithm succeeds for a range of distributions satisfying mild concentration bounds and moment assumptions. The correctness of our robust learning algorithm for intersections of halfspaces makes essential use of a novel robust inverse independence lemma that may be of broader interest.

2016-04-21
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an ε-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions that may be applicable to many other problems.
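One concrete instance of such a detect-and-correct recipe is spectral filtering: while the empirical covariance has an abnormally large eigenvalue, remove the points projecting furthest along its top eigenvector. A hedged sketch with illustrative thresholds, not the paper's certified algorithm.

```python
import numpy as np

def filter_mean(X, eps, thresh=10.0, max_rounds=50):
    # Spectral filtering sketch: a large top eigenvalue of the empirical
    # covariance signals that outliers are biasing the mean; repeatedly
    # drop the eps-tail of projections onto the top eigenvector.
    X = X.copy()
    for _ in range(max_rounds):
        mu = X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(X.T))
        if vals[-1] < thresh:                  # covariance looks benign: stop
            break
        proj = np.abs((X - mu) @ vecs[:, -1])
        cutoff = np.quantile(proj, 1 - eps)    # drop the eps tail
        X = X[proj < cutoff]
    return X.mean(axis=0)

rng = np.random.default_rng(1)
d, n = 10, 2000
good = rng.normal(size=(int(0.9 * n), d))
bad = rng.normal(size=(int(0.1 * n), d)) + 20.0   # shifted outliers
X = np.vstack([good, bad])
naive = X.mean(axis=0)
mu_hat = filter_mean(X, eps=0.1)
```

The naive mean is dragged far from the origin by the 10% outliers, while the filtered estimate is close to the true mean 0; the paper's algorithms make this quantitative and dimension-independent.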
We study the computational complexity of approximating the 2-to-q norm of linear operators (defined as ‖A‖_{2→q} = max_{v≠0} ‖Av‖_q/‖v‖_2) for q > 2, as well as connections between this question and issues arising in quantum information theory and the study of Khot's Unique Games Conjecture (UGC). We show the following: 1. For any constant even integer q ≥ 4, a graph G is a small-set expander if and only if the projector onto the span of the top eigenvectors of G's adjacency matrix has bounded 2→q norm. As a corollary, a good approximation to the 2→q norm will refute the Small-Set Expansion Conjecture, a close variant of the UGC. We also show that such a good approximation can be computed in exp(n^{2/q}) time, thus obtaining a different proof of the known subexponential algorithm for Small-Set Expansion. 2. Constant rounds of the "Sum of Squares" semidefinite programming hierarchy certify an upper bound on the 2→4 norm of the projector onto low-degree polynomials over the Boolean cube, as well as certify the unsatisfiability of the "noisy cube" and "short code" based instances of Unique Games considered by prior works. This improves on the previous upper bound of exp(log^{O(1)} n) rounds (for the "short code"), as well as separates the "Sum of Squares"/"Lasserre" hierarchy from weaker hierarchies that were known to require ω(1) rounds. 3. We show reductions between computing the 2→4 norm and computing the injective tensor norm of a tensor, a problem with connections to quantum information theory.
Three corollaries are: (i) the 2→4 norm is NP-hard to approximate to precision inverse-polynomial in the dimension, (ii) the 2→4 norm does not have a good approximation (in the sense above) unless 3-SAT can be solved in time exp(√n · polylog(n)), and (iii) known algorithms for the quantum separability problem imply a non-trivial additive approximation for the 2→4 norm.
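Although the abstract shows the 2→4 norm is hard to approximate in general, local ascent cheaply produces lower bounds, and on a diagonal matrix it reaches the true norm. A sketch with illustrative names, assuming numpy; this is not a certification method of the kind the paper studies.

```python
import numpy as np

def two_to_four_lower_bound(A, iters=100):
    # Local ascent on ||Av||_4 over the unit sphere: w = A^T (Av)^3 is the
    # gradient direction of ||Av||_4^4. Returns the value at a local
    # optimum, hence a lower bound on ||A||_{2->4}.
    v = np.ones(A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = A.T @ (A @ v) ** 3
        v = w / np.linalg.norm(w)
    return np.linalg.norm(A @ v, 4)

A = np.diag([3.0, 2.0, 1.0])   # diagonal, so the true 2->4 norm is 3
val = two_to_four_lower_bound(A)
```

The hardness results above say precisely that no efficient method can turn such lower bounds into matching upper bounds for general operators.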

2018-11-29

We prove two main results on how arbitrary linear threshold functions f(x) = sign(w · x − θ) over the n-dimensional Boolean hypercube can be approximated by simple threshold functions. Our first result shows that every n-variable threshold function f is ǫ-close to a threshold function depending only on Inf(f)^2 · poly(1/ǫ) many variables, where Inf(f) denotes the total influence or average sensitivity of f. This is an exponential sharpening of Friedgut's well-known theorem [Fri98], which states that every Boolean function f is ǫ-close to a function depending only on 2^{O(Inf(f)/ǫ)} many variables, for the case of threshold functions. We complement this upper bound by showing that Ω(Inf(f)^2 + 1/ǫ^2) many variables are required for ǫ-approximating threshold functions. Our second result is a proof that every n-variable threshold function is ǫ-close to a threshold function with integer weights at most poly(n) · 2^{Õ(1/ǫ^{2/3})}. This is a significant improvement, in the dependence on the error parameter ǫ, on an earlier result of [Ser07] which gave a poly(n) · 2^{Õ(1/ǫ^2)} bound. Our improvement is obtained via a new proof technique that uses strong anti-concentration bounds from probability theory. The new technique also gives a simple and modular proof of the original [Ser07] result, and extends to give low-weight approximators for threshold functions under a range of probability distributions beyond just the uniform distribution.
We study the prototypical problem of high-dimensional linear regression in a robust model where an ε-fraction of the samples can be adversarially corrupted. We focus on the fundamental setting where the covariates of the uncorrupted samples are drawn from a Gaussian distribution N(0, Σ) on R^d. We give nearly tight upper bounds and computational lower bounds for this problem. Specifically, our main contributions are as follows: • For the case that the covariance matrix is known to be the identity, we give a sample near-optimal and computationally efficient algorithm that draws Õ(d/ε^2) labeled examples and outputs a candidate hypothesis vector β̂ that approximates the unknown regression vector β within ℓ_2-norm O(ε log(1/ε)σ), where σ is the standard deviation of the random observation noise. An error of Ω(εσ) is information-theoretically necessary, even with infinite sample size. Hence, the error guarantee of our algorithm is optimal, up to a logarithmic factor in 1/ε. Prior work gave an algorithm for this problem with sample complexity Ω̃(d^2/ε^2) whose error guarantee scales with the ℓ_2-norm of β. • For the case of unknown covariance Σ, we show that we can efficiently achieve the same error guarantee of O(ε log(1/ε)σ), as in the known-covariance case, using an additional Õ(d^2/ε^2) unlabeled examples. On the other hand, an error of O(εσ) can be information-theoretically attained with O(d/ε^2) samples. We prove a Statistical Query (SQ) lower bound providing evidence that this quadratic tradeoff in the sample size is inherent. More specifically, we show that any polynomial-time SQ learning algorithm for robust linear regression (in Huber's contamination model) with estimation complexity O(d^{2−c}), where c > 0 is an arbitrarily small constant, must incur an error of Ω(√ε · σ).
We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise-where an ε-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error O(ε) in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of √ 2 and the running time is polynomial in d and 1/ε. When both the mean and covariance are unknown, the running time is polynomial in d and quasipolynomial in 1/ε. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings.
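A one-dimensional sanity check of the classical guarantee being lifted to high dimensions: with an ε-fraction of adversarial samples, the median still estimates a Gaussian mean to O(ε) error while the empirical mean fails badly. The numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, n = 0.1, 10000
good = rng.normal(loc=5.0, size=int((1 - eps) * n))   # true mean is 5.0
bad = np.full(int(eps * n), 1e6)                       # adversarial outliers
samples = np.concatenate([good, bad])

naive = samples.mean()      # dragged arbitrarily far by the outliers
robust = np.median(samples) # shifted by only O(eps) standard deviations
```

The hard part, which the abstract addresses, is matching this kind of dimension-independent error bound with efficient algorithms when the data is high-dimensional.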
The Chow parameters of a Boolean function f : {−1, 1}^n → {−1, 1} are its n + 1 degree-0 and degree-1 Fourier coefficients. It has been known since 1961 [Cho61, Tan61] that the (exact values of the) Chow parameters of any linear threshold function f uniquely specify f within the space of all Boolean functions, but until recently [OS11] nothing was known about efficient algorithms for reconstructing f (exactly or approximately) from exact or approximate values of its Chow parameters. We refer to this reconstruction problem as the Chow Parameters Problem. Our main result is a new algorithm for the Chow Parameters Problem which, given (sufficiently accurate approximations to) the Chow parameters of any linear threshold function f, runs in time Õ(n^2) · (1/ǫ)^{O(log^2(1/ǫ))} and with high probability outputs a representation of an LTF f′ that is ǫ-close to f. The only previous algorithm [OS11] had running time poly(n) · 2^{2^{Õ(1/ǫ^2)}}. As a byproduct of our approach, we show that for any linear threshold function f over {−1, 1}^n, there is a linear threshold function f′ which is ǫ-close to f and has all weights that are integers at most √n · (1/ǫ)^{O(log^2(1/ǫ))}. This significantly improves the best previous result of [DS09], which gave a poly(n) · 2^{Õ(1/ǫ^{2/3})} weight bound, and is close to the known lower bound of max{√n, (1/ǫ)^{Ω(log log(1/ǫ))}} [Gol06, Ser07]. Our techniques also yield improved algorithms for related problems in learning theory. In addition to being significantly stronger than previous work, our results are obtained using conceptually simpler proofs.
The two main ingredients underlying our results are (1) a new structural result showing that for f any linear threshold function and g any bounded function, if the Chow parameters of f are close to the Chow parameters of g then f is close to g; and (2) a new boosting-like algorithm that, given approximations to the Chow parameters of a linear threshold function, outputs a bounded function whose Chow parameters are close to those of f.
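The Chow parameters themselves are straightforward to compute exactly for small n by enumerating the hypercube (exponential time, illustration only); majority on 3 bits serves as a toy LTF, and all names below are our own.

```python
import numpy as np
from itertools import product

def chow_parameters(f, n):
    # Exact Chow parameters: the degree-0 coefficient E[f(x)] and the n
    # degree-1 coefficients E[f(x) x_i], over uniform x in {-1,1}^n.
    pts = np.array(list(product([-1.0, 1.0], repeat=n)))
    vals = np.array([f(x) for x in pts])
    return vals.mean(), (pts * vals[:, None]).mean(axis=0)

maj3 = lambda x: np.sign(x.sum())   # majority on 3 bits, an LTF (no ties)
c0, c1 = chow_parameters(maj3, 3)
```

For majority on 3 bits the degree-0 parameter is 0 and each degree-1 parameter is 1/2; the reconstruction problem above runs this map in reverse, from (approximate) parameters back to an LTF.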
We give a new approach to the dictionary learning (also known as "sparse coding") problem of recovering an unknown n × m matrix A (for m ≥ n) from examples of the form y = Ax + e, where x is a random vector in R^m with at most τm nonzero coordinates, and e is a random noise vector in R^n with bounded magnitude. For the case m = O(n), our algorithm recovers every column of A within arbitrarily good constant accuracy in time m^{O(log m / log(τ^{−1}))}, in particular achieving polynomial time if τ = m^{−δ} for any δ > 0, and time m^{O(log m)} if τ is (a sufficiently small) constant. Prior algorithms with comparable assumptions on the distribution required the vector x to be much sparser, with at most √n nonzero coordinates, and there were intrinsic barriers preventing these algorithms from applying for denser x. We achieve this by designing an algorithm for noisy tensor decomposition that can recover, under quite general conditions, an approximate rank-one decomposition of a tensor T, given access to a tensor T′ that is τ-close to T in the spectral norm (when considered as a matrix). To our knowledge, this is the first algorithm for tensor decomposition that works in the constant spectral-norm noise regime, where there is no guarantee that the local optima of T and T′ have similar structures. Our algorithm is based on a novel approach to using and analyzing the Sum of Squares semidefinite programming hierarchy (Parrilo 2000, Lasserre 2001), and it can be viewed as an indication of the utility of this very general and powerful tool for unsupervised learning problems.

2018-11-05


2018-07-30

Let x be a random vector coming from any k-wise independent distribution over {−1, 1}^n. For an n-variate degree-2 polynomial p, we prove that E[sgn(p(x))] is determined up to an additive ε for k = poly(1/ε). This answers an open question of Diakonikolas et al. (FOCS 2009). Using standard constructions of k-wise independent distributions, we obtain a broad class of explicit generators that ε-fool the class of degree-2 threshold functions with seed length log n · poly(1/ε). Our approach is quite robust: it easily extends to yield that the intersection of any constant number of degree-2 threshold functions is ε-fooled by poly(1/ε)-wise independence. Our results also hold if the entries of x are k-wise independent standard normals, implying for example that bounded independence derandomizes the Goemans-Williamson hyperplane rounding scheme. To achieve our results, we introduce a technique we dub multivariate FT-mollification, a generalization of the univariate form introduced by Kane et al. (SODA 2010) in the context of streaming algorithms. Along the way we prove a generalized hypercontractive inequality for quadratic forms which takes the operator norm of the associated matrix into account. These techniques may be of independent interest.