We study randomized sketching methods for approximately solving least-squares problems with a general convex constraint. The quality of a least-squares approximation can be assessed in different ways: either in terms of the value of the quadratic objective function (cost approximation), or in terms of some distance measure between the approximate minimizer and the true minimizer (solution approximation). Focusing on the latter criterion, our first main result provides a general lower bound on any randomized method that sketches both the data matrix and vector in a least-squares problem; as a surprising consequence, the most widely used least-squares sketch is sub-optimal for solution approximation. We then present a new method known as the iterative Hessian sketch, and show that it can be used to obtain approximations to the original least-squares problem using a projection dimension proportional to the statistical complexity of the least-squares minimizer, and a logarithmic number of iterations. We illustrate our general theory with simulations for both unconstrained and constrained versions of least squares, including ℓ1-regularization and nuclear norm constraints. We also numerically demonstrate the practicality of our approach in a real facial expression classification experiment.
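To make the idea concrete, here is a minimal NumPy sketch of the unconstrained iterative Hessian sketch iteration, assuming a Gaussian sketch matrix; the function name, sketch dimension, and iteration count are illustrative, and the constrained version would replace the linear solve with a constrained quadratic program.

```python
import numpy as np

def iterative_hessian_sketch(A, b, m, num_iters=10, rng=None):
    """Unconstrained iterative Hessian sketch (a simplified sketch of the idea).

    A : (n, d) data matrix, b : (n,) response, m : sketch dimension (m >= d).
    At each iteration the Hessian A^T A is approximated by a sketched version
    (1/m) A^T S^T S A, while the gradient A^T (b - A x) uses the full data.
    """
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_iters):
        S = rng.standard_normal((m, n))   # Gaussian sketch (one of several valid choices)
        SA = S @ A / np.sqrt(m)           # sketched data matrix
        grad = A.T @ (b - A @ x)          # exact (negated) gradient of 0.5*||Ax - b||^2
        # Newton-like step with the sketched Hessian approximation
        x = x + np.linalg.solve(SA.T @ SA, grad)
    return x

# Usage: compare against the exact least-squares solution
rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 50))
b = A @ rng.standard_normal(50) + 0.1 * rng.standard_normal(2000)
x_ihs = iterative_hessian_sketch(A, b, m=300, num_iters=8, rng=1)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_ihs - x_ls) / np.linalg.norm(x_ls))
```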
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure: (i) estimate conditional quantile functions at different levels in a parallel computing environment; (ii) construct a conditional quantile regression process through projection based on these estimated quantile curves. Our general quantile regression framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. As an important application, statistical inference on conditional distribution functions is considered. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. These approaches directly utilize the availability of estimators from sub-samples and can be carried out at almost no additional computational cost. Simulations confirm our statistical inferential theory.
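A minimal sketch of step (i) together with naive averaging of the sub-sample estimates (the projection-based construction of the quantile process in step (ii) is not reproduced here); the helper name and the use of statsmodels' QuantReg are illustrative choices.

```python
import numpy as np
from statsmodels.regression.quantile_regression import QuantReg

def distributed_quantile_fits(X, y, num_machines, taus, rng=None):
    """Step (i) of a divide-and-conquer scheme (simplified sketch): split the
    data across machines, fit quantile regressions at each level tau on every
    sub-sample, and average the sub-sample coefficients per level.
    """
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(y))
    blocks = np.array_split(idx, num_machines)   # stand-ins for the distributed units
    estimates = {}
    for tau in taus:
        coefs = [QuantReg(y[block], X[block]).fit(q=tau).params for block in blocks]
        estimates[tau] = np.mean(coefs, axis=0)  # simple averaging of sub-sample estimators
    return estimates

# Usage on synthetic heteroscedastic data
rng = np.random.default_rng(0)
n, d = 5000, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, d - 1))])
y = X @ np.array([1.0, 2.0, -1.0]) + (1 + 0.5 * np.abs(X[:, 1])) * rng.standard_normal(n)
fits = distributed_quantile_fits(X, y, num_machines=10, taus=[0.25, 0.5, 0.75], rng=1)
print({tau: np.round(beta, 2) for tau, beta in fits.items()})
```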
When randomized ensembles such as bagging or random forests are used for binary classification, the prediction error of the ensemble tends to decrease and stabilize as the number of classifiers increases. However, the precise relationship between prediction error and ensemble size is unknown in practice. In the standard case when classifiers are aggregated by majority vote, the present work offers a way to quantify this convergence in terms of "algorithmic variance," i.e., the variance of prediction error due only to the randomized training algorithm. Specifically, we study a theoretical upper bound on this variance, and show that it is sharp, in the sense that it is attained by a specific family of randomized classifiers. Next, we address the problem of estimating the unknown value of the bound, which leads to a unique twist on the classical problem of non-parametric density estimation. In particular, we develop an estimator for the bound and show that its MSE matches optimal non-parametric rates under certain conditions. (Concurrent with this work, some closely related results have also been considered in Cannings and Samworth [1] and Lopes [2].)
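A small Monte Carlo illustration of the quantity being studied, not the paper's estimator of the bound: the variance of the test error across refits of a random forest that differ only in the training algorithm's random seed, for several ensemble sizes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# "Algorithmic variance" here is the variance of the test error across refits
# that differ only in the random seed of the training algorithm.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n_trees in (5, 25, 100):
    errs = []
    for seed in range(20):   # repeated runs of the randomized training algorithm
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        errs.append(1.0 - clf.fit(X_tr, y_tr).score(X_te, y_te))
    print(n_trees, "trees: algorithmic variance ~", np.var(errs))
```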
We study statistical inference and distributionally robust solution methods for stochastic optimization problems, focusing on confidence intervals for optimal values and solutions that asymptotically achieve exact coverage. We develop a generalized empirical likelihood framework, based on distributional uncertainty sets constructed from nonparametric $f$-divergence balls, for Hadamard differentiable functionals and, in particular, stochastic optimization problems. As a consequence of this theory, we provide a principled method for choosing the size of the distributional uncertainty region so as to obtain one- and two-sided confidence intervals that achieve exact coverage. We also give an asymptotic expansion of our distributionally robust formulation, showing how robustification regularizes problems through their variance. Finally, we show that optimizers of the distributionally robust formulations we study enjoy (essentially) the same consistency properties as those of classical sample average approximation. Our general approach applies to quickly mixing stationary sequences, including geometrically ergodic Harris recurrent Markov chains.
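A minimal sketch of the variance-expansion view mentioned above, assuming a smooth $f$-divergence ball of radius $\rho/n$; the surrogate below replaces the exact robust value (a small convex program) with its first-order expansion, and the function name and radius are illustrative.

```python
import numpy as np

def variance_regularized_objective(losses, rho):
    """Heuristic surrogate for the distributionally robust objective over an
    f-divergence ball of radius rho/n: empirical mean plus a variance penalty.
    The expansion in the abstract says the two agree to first order.
    """
    n = len(losses)
    return losses.mean() + np.sqrt(2.0 * rho * losses.var(ddof=1) / n)

# Usage: compare two parameter settings of a toy model by their robust objectives
rng = np.random.default_rng(0)
losses_a = rng.exponential(1.0, size=500)        # low mean, high variance
losses_b = 1.1 + 0.1 * rng.standard_normal(500)  # higher mean, low variance
for name, losses in [("a", losses_a), ("b", losses_b)]:
    print(name, "empirical:", round(losses.mean(), 3),
          "robust:", round(variance_regularized_objective(losses, rho=4.0), 3))
```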
We provide fast algorithms for overconstrained $\ell_p$ regression and related problems: for an $n \times d$ input matrix $A$ and vector $b \in \mathbb{R}^n$, in $O(nd \log n)$ time we reduce the problem $\min_{x \in \mathbb{R}^d} \|Ax - b\|_p$ to the same problem with input matrix $\tilde{A}$ of dimension $s \times d$ and corresponding $\tilde{b}$ of dimension $s \times 1$. Here, $\tilde{A}$ and $\tilde{b}$ are a coreset for the problem, consisting of sampled and rescaled rows of $A$ and $b$; and $s$ is independent of $n$ and polynomial in $d$. Our results improve on the best previous algorithms when $n \gg d$, for all $p \in [1, \infty)$ except $p = 2$; in particular, they improve the $O(nd^{1.376+})$ running time of Sohler and Woodruff (STOC, 2011) for $p = 1$, that uses asymptotically fast matrix multiplication, and the $O(nd^5 \log n)$ time of Dasgupta et al. (SICOMP, 2009) for general $p$, that uses ellipsoidal rounding. We also provide a suite of improved results for finding well-conditioned bases via ellipsoidal rounding, illustrating tradeoffs between running time and conditioning quality, including a one-pass conditioning algorithm for general $\ell_p$ problems. To complement this theory, we provide a detailed empirical evaluation of implementations of our algorithms for $p = 1$, comparing them with several related algorithms. Among other things, our empirical results clearly show that, in the asymptotic regime, the theory is a very good guide to the practical performance of these algorithms. Our algorithms use our faster constructions of well-conditioned bases for $\ell_p$ spaces and, for $p = 1$, a fast subspace embedding of independent interest that we call the Fast Cauchy Transform: a distribution over matrices $\Pi: \mathbb{R}^n \to \mathbb{R}^{O(d \log d)}$, found obliviously to $A$, that approximately preserves the $\ell_1$ norms: that is, with large probability, simultaneously for all $x$, $\|Ax\|_1 \approx \|\Pi A x\|_1$, with distortion $O(d^{2+\eta})$, for an arbitrarily small constant $\eta > 0$; and, moreover, $\Pi A$ can be computed in $O(nd \log d)$ time. The techniques underlying our Fast Cauchy Transform include fast Johnson-Lindenstrauss transforms, low-coherence matrices, and rescaling by Cauchy random variables.
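A toy illustration of the $\ell_1$-embedding property, using a dense matrix of i.i.d. Cauchy entries rather than the Fast Cauchy Transform, and with a scaling constant chosen for readability rather than matching the theory.

```python
import numpy as np

# Empirically check that ||Pi @ A @ x||_1 tracks ||A @ x||_1 up to a bounded
# distortion, simultaneously over many directions x.
rng = np.random.default_rng(0)
n, d, r = 5000, 10, 400
A = rng.standard_normal((n, d))
Pi = rng.standard_cauchy((r, n)) / r             # dense Cauchy sketch, illustrative scaling

ratios = []
for _ in range(200):
    x = rng.standard_normal(d)
    ratios.append(np.abs(Pi @ (A @ x)).sum() / np.abs(A @ x).sum())
ratios = np.array(ratios)
print("distortion spread (max/min ratio):", ratios.max() / ratios.min())
```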
This survey highlights recent advances in algorithms for numerical linear algebra that have come from the technique of linear sketching, whereby, given a matrix, one first compresses it to a much smaller matrix by multiplying it by a (usually) random matrix with certain properties. Much of the expensive computation can then be performed on the smaller matrix, thereby accelerating the solution of the original problem. In this survey we consider least-squares and robust regression problems, low-rank approximation, and graph analysis. We also discuss a number of variants of these problems. Finally, we discuss the limitations of sketching methods.
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.
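A minimal NumPy sketch of the basic prototype in this framework (random test matrix, orthonormal range basis, deterministic SVD of the compressed matrix); the oversampling amount and function name are illustrative, and refinements such as power iterations are omitted.

```python
import numpy as np

def randomized_low_rank(A, rank, oversample=10, rng=None):
    """Basic randomized low-rank approximation: random sampling finds a
    subspace capturing most of the action of A, then a small deterministic
    SVD is computed on the compressed matrix."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                       # orthonormal basis for the sampled range
    B = Q.T @ A                                          # compress A into the subspace
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small[:, :rank]
    return U, s[:rank], Vt[:rank]

# Usage: approximate a numerically low-rank matrix
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30)) @ rng.standard_normal((30, 800))
U, s, Vt = randomized_low_rank(A, rank=30, rng=1)
print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```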
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates of stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems.
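A toy sketch of a two-point zero-order search over a linear policy gain on a small linear-quadratic instance with additive driving noise; the system matrices, horizon, and step sizes are illustrative and are not the schedules analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), 0.1 * np.eye(1)

def rollout_cost(K, horizon=50, noise=0.01, seed=None):
    """Average quadratic cost of the linear policy u = -K x from a noisy rollout."""
    r = np.random.default_rng(seed)
    x, cost = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + noise * r.standard_normal(2)
    return cost / horizon

# Two-point zero-order search over the policy gain K (illustrative step sizes)
K = np.zeros((1, 2))
delta, step = 0.05, 0.02
for it in range(200):
    U = rng.standard_normal(K.shape)
    g = (rollout_cost(K + delta * U, seed=it) -
         rollout_cost(K - delta * U, seed=it)) / (2 * delta) * U
    K = K - step * g
print("learned gain:", K, "cost:", rollout_cost(K, seed=999))
```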
This paper presents a class of new algorithms for distributed statistical estimation that exploit divide-and-conquer approaches. We show that one of the key benefits of the divide-and-conquer strategy is robustness, an important characteristic of large distributed systems. We establish connections between the performance of these distributed algorithms and the rates of convergence in normal approximation, and prove non-asymptotic deviation guarantees, as well as limit theorems, for the resulting estimators. We illustrate our techniques with several examples: in particular, we obtain new results for median-based estimators, and provide performance guarantees for distributed maximum likelihood estimation.
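A minimal sketch of one divide-and-conquer estimator of this flavor, a median-of-means for a scalar mean; the block count and the heavy-tailed test distribution are illustrative.

```python
import numpy as np

def median_of_means(samples, num_blocks, rng=None):
    """Divide-and-conquer mean estimation: compute the sample mean on each
    block and return the median of the block means, which is robust to heavy
    tails and to a few badly behaved blocks."""
    rng = np.random.default_rng(rng)
    idx = rng.permutation(len(samples))
    block_means = [samples[b].mean() for b in np.array_split(idx, num_blocks)]
    return np.median(block_means)

# Usage on heavy-tailed data: compare to the plain sample mean over repetitions
rng = np.random.default_rng(0)
errs_mean, errs_mom = [], []
for rep in range(200):
    x = rng.standard_t(df=2, size=2000)          # heavy-tailed, true mean 0
    errs_mean.append(abs(x.mean()))
    errs_mom.append(abs(median_of_means(x, num_blocks=20, rng=rep)))
print("sample mean MAE:", np.mean(errs_mean), " median-of-means MAE:", np.mean(errs_mom))
```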
The relative effectiveness of model-based versus model-free methods is a long-standing question in reinforcement learning (RL). Motivated by the recent empirical successes of RL on continuous control tasks, we study the sample complexity of popular model-based and model-free algorithms on the linear quadratic regulator (LQR). We show that, for policy evaluation, a simple model-based plug-in method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension. For policy optimization, we study a simple family of problem instances and show that nominal (certainty equivalence principle) control also requires a factor of the state dimension fewer samples than policy gradient methods to reach the same level of control performance on these instances. Furthermore, the gap persists even when employing baselines commonly used in practice. To the best of our knowledge, this is the first theoretical result demonstrating a separation in sample complexity between model-based and model-free methods on a continuous control task.
We study the query complexity of estimating the covariance matrix $T$ of a distribution $\mathcal{D}$ over $d$-dimensional vectors, under the assumption that $T$ is Toeplitz. This assumption arises in many signal processing problems, where the covariance between any two measurements depends only on the time or distance between those measurements. We are interested in estimation strategies that may choose to view only a subset of the entries in each vector sample $x^{(\ell)} \sim \mathcal{D}$, which often corresponds to reduced hardware and communication requirements in wireless signal processing applications and advanced imaging. Our goal is to minimize both (1) the number of vector samples drawn from $\mathcal{D}$ and (2) the number of entries accessed in each sample. We provide some of the first non-asymptotic bounds on these sample complexity measures that exploit the Toeplitz structure of $T$, and in doing so significantly improve on results for generic covariance matrices. Our bounds follow from an analysis of classical and widely used estimation algorithms (along with some new variants), including methods based on selecting entries from each vector sample according to a so-called sparse ruler. In many cases, we pair our upper bounds with matching or nearly matching lower bounds. Beyond results that apply to any Toeplitz $T$, we further study the important setting where $T$ is close to low rank, which is often the case in practice. We show that methods based on sparse rulers perform even better in this setting, with sample complexity scaling linearly in $d$. Motivated by this finding, we develop a new covariance estimation strategy that further improves on all existing methods in the low-rank case: when $T$ is rank-$k$ or nearly rank-$k$, it achieves sample complexity depending polynomially on $k$ and only logarithmically on $d$.
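A minimal sketch of the classical averaging estimator for a Toeplitz covariance (each lag's autocovariance is averaged along the corresponding diagonal); this simplified version reads every entry of every sample, i.e. it does not implement the sparse-ruler subsampling discussed above, and it assumes mean-zero data.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_covariance_estimate(X):
    """Estimate a Toeplitz covariance from mean-zero samples X of shape
    (num_samples, d) by averaging products along each diagonal (lag)."""
    num_samples, d = X.shape
    autocov = np.zeros(d)
    for lag in range(d):
        autocov[lag] = np.mean(X[:, : d - lag] * X[:, lag:])  # average over samples and positions
    return toeplitz(autocov)

# Usage: recover a known Toeplitz covariance from Gaussian samples
rng = np.random.default_rng(0)
d = 50
T_true = toeplitz(0.8 ** np.arange(d))
L = np.linalg.cholesky(T_true)
X = rng.standard_normal((500, d)) @ L.T
T_hat = toeplitz_covariance_estimate(X)
print(np.linalg.norm(T_hat - T_true, 2) / np.linalg.norm(T_true, 2))
```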
Least squares approximation is a technique to find an approximate solution to a system of linear equations that has no exact solution. In a typical setting, one lets $n$ be the number of constraints and $d$ be the number of variables, with $n \gg d$. Then, existing exact methods find a solution vector in $O(nd^2)$ time. We present two randomized algorithms that provide very accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms. Both of our algorithms preprocess the data with the Randomized Hadamard Transform. One then uniformly randomly samples constraints and solves the smaller problem on those constraints, and the other performs a sparse random projection and solves the smaller problem on those projected coordinates. In both cases, solving the smaller problem provides relative-error approximations, and, if $n$ is sufficiently larger than $d$, the approximate solution can be computed in $O(nd \log d)$ time.
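A simplified, dense sketch of the first algorithm (randomize with random signs and a Hadamard transform, then uniformly sample constraints); a practical implementation would apply the transform in $O(n \log n)$ time per column rather than forming the Hadamard matrix explicitly, and $n$ is assumed to be a power of two here.

```python
import numpy as np
from scipy.linalg import hadamard

def srht_least_squares(A, b, sample_size, rng=None):
    """Sketch-and-solve least squares: precondition with random signs D and an
    orthonormal Hadamard matrix H, then uniformly sample rows of (HDA, HDb)
    and solve the small problem."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix (n a power of two)
    signs = rng.choice([-1.0, 1.0], size=n)
    HA, Hb = H @ (signs[:, None] * A), H @ (signs * b)
    rows = rng.choice(n, size=sample_size, replace=False)  # uniform row sampling
    x, *_ = np.linalg.lstsq(HA[rows], Hb[rows], rcond=None)
    return x

# Usage: compare against the exact solution
rng = np.random.default_rng(0)
n, d = 1024, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
x_sketch = srht_least_squares(A, b, sample_size=200, rng=1)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(x_sketch - x_exact) / np.linalg.norm(x_exact))
```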
We analyze convergence rates of stochastic optimization procedures for non-smooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for non-smooth optimization. We give several applications of our results to statistical estimation problems, and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal.
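A bare-bones sketch of the randomized smoothing idea, without the accelerated gradient machinery of the paper: the stochastic subgradient is evaluated at a Gaussian-perturbed point, which in expectation is a gradient of the smoothed objective; the step sizes and smoothing radius are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 10
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.05 * rng.standard_normal(n)

def stochastic_subgrad(x, batch):
    """Stochastic subgradient of the non-smooth objective f(x) = mean |a_i'x - b_i|."""
    r = A[batch] @ x - b[batch]
    return A[batch].T @ np.sign(r) / len(batch)

# Randomized-smoothing stochastic (sub)gradient descent
x, mu, step = np.zeros(d), 0.1, 0.5
for it in range(1, 2001):
    batch = rng.integers(n, size=32)
    z = rng.standard_normal(d)                   # Gaussian perturbation for smoothing
    x = x - (step / np.sqrt(it)) * stochastic_subgrad(x + mu * z, batch)
print("distance to x_star:", np.linalg.norm(x - x_star))
```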
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an ε-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions that may be applicable to many other problems.
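A heavily simplified sketch in the spirit of the detect-and-correct recipe mentioned above (not the paper's algorithms or their guarantees): repeatedly inspect the top eigenvector of the empirical covariance and discard the points that deviate most along it; all thresholds are illustrative.

```python
import numpy as np

def filtered_mean(X, num_rounds=10, keep_frac=0.98):
    """Toy spectral filtering: in each round, find the direction of largest
    empirical variance, score points by their deviation along it, and drop the
    most extreme ones; return the mean of the surviving points."""
    X = X.copy()
    for _ in range(num_rounds):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        v = eigvecs[:, -1]                               # direction of largest variance
        proj = (X - mu) @ v
        scores = np.abs(proj - np.median(proj))
        keep = scores <= np.quantile(scores, keep_frac)  # drop the most extreme points
        X = X[keep]
    return X.mean(axis=0)

# Usage: 5% of samples are adversarially placed far away in one direction
rng = np.random.default_rng(0)
d = 50
clean = rng.standard_normal((950, d))
outliers = 5.0 * np.ones((50, d))
X = np.vstack([clean, outliers])
print("naive mean error:   ", np.linalg.norm(X.mean(axis=0)))
print("filtered mean error:", np.linalg.norm(filtered_mean(X)))
```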
We reconsider randomized algorithms for the low-rank approximation of symmetric positive semi-definite (SPSD) matrices such as Laplacian and kernel matrices that arise in data analysis and machine learning applications. Our main results consist of an empirical evaluation of the performance quality and running time of sampling and projection methods on a diverse suite of SPSD matrices. Our results highlight complementary aspects of sampling versus projection methods; they characterize the effects of common data preprocessing steps on the performance of these algorithms; and they point to important differences between uniform sampling and nonuniform sampling methods based on leverage scores. In addition, our empirical results illustrate that existing theory is so weak that it does not provide even a qualitative guide to practice. Thus, we complement our empirical results with a suite of worst-case theoretical bounds for both random sampling and random projection methods. These bounds are qualitatively superior to existing bounds (e.g., improved additive-error bounds for spectral and Frobenius norm error, and relative-error bounds for trace norm error), and they point to future directions to make these algorithms useful in even larger-scale machine learning applications.
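A minimal sketch of one of the sampling methods in this family, a uniform-sampling Nyström approximation of an SPSD kernel matrix; swapping the uniform column choice for leverage-score-based probabilities gives the nonuniform variant contrasted above.

```python
import numpy as np

def nystrom_approximation(K, num_landmarks, rng=None):
    """Uniform-sampling Nystrom approximation of an SPSD matrix K:
    K is approximated by C W^+ C^T, where C holds sampled columns and W is
    the corresponding intersection block."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    cols = rng.choice(n, size=num_landmarks, replace=False)
    C = K[:, cols]                               # sampled columns
    W = K[np.ix_(cols, cols)]                    # intersection block
    return C @ np.linalg.pinv(W) @ C.T

# Usage on an RBF kernel matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((800, 5))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq)
K_ny = nystrom_approximation(K, num_landmarks=100, rng=1)
print(np.linalg.norm(K - K_ny) / np.linalg.norm(K))
```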
Fitting high-dimensional statistical models often requires the use of non-linear parameter estimation procedures. As a consequence, it is generally impossible to obtain an exact characterization of the probability distribution of the parameter estimates. This in turn implies that it is extremely challenging to quantify the uncertainty associated with a certain parameter estimate. Concretely, no commonly accepted procedure exists for computing classical measures of uncertainty and statistical significance, such as confidence intervals or p-values, for these models. We consider here the high-dimensional linear regression problem, and propose an efficient algorithm for constructing confidence intervals and p-values. The resulting confidence intervals have nearly optimal size. When testing for the null hypothesis that a certain parameter is vanishing, our method has nearly optimal power. Our approach is based on constructing a 'de-biased' version of regularized M-estimators. The new construction improves over recent work in the field in that it does not assume a special structure on the design matrix. We test our method on synthetic data and on a high-throughput genomic data set about riboflavin production rate, made publicly available by Bühlmann et al. (2014).
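A simplified sketch of a de-biased estimator of this kind, assuming a known noise level and using a ridge-regularized inverse of the sample covariance in place of the paper's more careful construction of the decorrelating matrix; all tuning constants are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X, y, lam, sigma, level=1.96):
    """Correct the lasso estimate with a one-step update M X'(y - X beta)/n
    and report Wald-type confidence intervals."""
    n, p = X.shape
    beta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    Sigma = X.T @ X / n
    M = np.linalg.inv(Sigma + 0.05 * np.eye(p))          # simple regularized inverse (illustrative)
    beta_d = beta + M @ X.T @ (y - X @ beta) / n          # de-biased estimate
    se = sigma * np.sqrt(np.diag(M @ Sigma @ M.T) / n)    # plug-in standard errors
    return beta_d, np.column_stack([beta_d - level * se, beta_d + level * se])

# Usage: sparse signal, check intervals for an active and an inactive coordinate
rng = np.random.default_rng(0)
n, p = 400, 100
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + rng.standard_normal(n)
beta_d, ci = debiased_lasso(X, y, lam=0.1, sigma=1.0)
print("coef 0:", beta_d[0].round(2), "CI:", ci[0].round(2))
print("coef 10:", beta_d[10].round(2), "CI:", ci[10].round(2))
```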
Estimation of covariance matrices has attracted significant attention in the statistics research community over the years, in part due to important applications such as principal component analysis. However, the commonly used empirical covariance estimator (and its modifications) is very sensitive to outliers in the data. As P. J. Huber wrote in 1964, "...This raises a question which could have been asked already by Gauss, but which was, as far as I know, only raised a few years ago (notably by Tukey): what happens if the true distribution deviates slightly from the assumed normal one? As is now well known, the sample mean then may have a catastrophically bad performance..." Motivated by this question, we develop a new estimator of the (element-wise) mean of a random matrix, which includes the covariance estimation problem as a special case. Assuming that the entries of the matrix possess only finite second moments, this new estimator admits sub-Gaussian or sub-exponential concentration around the unknown mean in the operator norm. We will explain the key ideas behind our construction, as well as applications to covariance estimation and matrix completion problems.
Non-Gaussian component analysis (NGCA) is a problem in multidimensional data analysis. Since its formulation in 2006, NGCA has attracted considerable attention in statistics and machine learning. In this problem, we have a random variable $X$ in $n$-dimensional Euclidean space. There is an unknown subspace $U$ of the $n$-dimensional Euclidean space such that the orthogonal projection of $X$ onto $U$ is standard multidimensional Gaussian, while the orthogonal projection of $X$ onto $V$, the orthogonal complement of $U$, is non-Gaussian, in the sense that all of its one-dimensional marginals differ from the Gaussian in a certain metric defined in terms of moments. The NGCA problem is to approximate the non-Gaussian subspace $V$ given samples of $X$. Vectors in $V$ correspond to "interesting" directions, whereas vectors in $U$ correspond to directions in which the data are very noisy. The most interesting applications of the NGCA model are cases in which the magnitude of the noise is comparable to that of the true signal, a setting in which traditional noise-reduction techniques such as PCA cannot be applied directly. NGCA is also related to dimensionality reduction and to other data analysis problems such as ICA. Problems similar to NGCA have long been studied using techniques such as projection pursuit. We give an algorithm that takes time polynomial in the dimension $n$ and has an inverse polynomial dependence on the error parameter measuring the angle between the non-Gaussian subspace and the subspace output by the algorithm. Our algorithm is based on relative entropy as the contrast function and fits within the projection pursuit framework. The techniques we develop for analyzing the algorithm may be of use for other related problems.
We present a high-dimensional analysis of three popular algorithms, namely Oja's method, GROUSE, and PETRELS, for subspace estimation from streaming and highly incomplete observations. We show that, with proper time scaling, the time-varying principal angles between the true subspace and its estimates given by the algorithms converge weakly to deterministic processes as the ambient dimension $n$ tends to infinity. Moreover, the limiting processes can be exactly characterized as the unique solutions of certain ordinary differential equations (ODEs). A finite-sample bound is also given, showing that the rate of convergence towards these limits is $\mathcal{O}(1/\sqrt{n})$. In addition to providing asymptotically exact predictions of the dynamic performance of the algorithms, our high-dimensional analysis yields several insights, including an asymptotic equivalence between Oja's method and GROUSE, and a precise scaling relationship linking the amount of missing data to the signal-to-noise ratio. By analyzing the solutions of the limiting ODEs, we also establish phase transition phenomena associated with the steady-state performance of these techniques.
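A minimal sketch of Oja's method on fully observed streaming data (missing-data handling and the time scalings used in the analysis are omitted); the step size and the planted-subspace test are illustrative.

```python
import numpy as np

def oja_subspace(X_stream, k, step=0.01, rng=None):
    """Oja's method: a streaming stochastic update of an orthonormal basis U
    for the top-k principal subspace, one sample vector at a time."""
    rng = np.random.default_rng(rng)
    d = X_stream.shape[1]
    U, _ = np.linalg.qr(rng.standard_normal((d, k)))
    for x in X_stream:
        U = U + step * np.outer(x, x @ U)        # rank-one stochastic update
        U, _ = np.linalg.qr(U)                   # keep the basis orthonormal
    return U

# Usage: recover a planted 2-dimensional subspace from noisy streaming samples
rng = np.random.default_rng(0)
d, k = 100, 2
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))
X = rng.standard_normal((20000, k)) @ basis.T + 0.1 * rng.standard_normal((20000, d))
U = oja_subspace(X, k, step=0.01, rng=1)
# Largest principal angle between the true and the estimated subspaces
cosines = np.linalg.svd(basis.T @ U, compute_uv=False)
print("largest principal angle (rad):", np.arccos(min(cosines.min(), 1.0)))
```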
We consider derivative-free algorithms for stochastic and non-stochastic convex optimization problems that use only function values rather than gradients. Focusing on non-asymptotic bounds on convergence rates, we show that if pairs of function values are available, algorithms for $d$-dimensional optimization that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{d}$ in convergence rate over traditional stochastic gradient methods. We establish such results for both smooth and non-smooth cases, sharpening previous analyses that suggested a worse dimension dependence, and extend our results to the case of multiple ($m \geq 2$) evaluations. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, establishing the sharpness of our achievable results up to constant (sometimes logarithmic) factors.
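A minimal sketch of the two-point gradient estimate and a plain derivative-free descent loop on a smooth convex quadratic; the step-size schedule is illustrative rather than a tuned one from the analysis.

```python
import numpy as np

def two_point_gradient_estimate(f, x, delta, rng):
    """Gradient estimate from two function values: a random direction u and
    the finite difference (f(x + delta*u) - f(x - delta*u)) / (2*delta)."""
    u = rng.standard_normal(x.shape)
    return (f(x + delta * u) - f(x - delta * u)) / (2.0 * delta) * u

# Derivative-free optimization of a convex quadratic test function
rng = np.random.default_rng(0)
d = 20
A = rng.standard_normal((d, d))
A = A @ A.T / d + np.eye(d)                      # well-conditioned positive definite matrix
f = lambda x: 0.5 * x @ A @ x
x = rng.standard_normal(d)
for it in range(1, 5001):
    g = two_point_gradient_estimate(f, x, delta=1e-3, rng=rng)
    x = x - 0.05 / np.sqrt(it) * g               # decaying step size (illustrative)
print("f(x) after optimization:", f(x))
```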