Large datasets create opportunities as well as analytic challenges. A recent development is to use random projection or sketching methods for dimension reduction in statistics and machine learning. In this work, we study the statistical performance of sketching algorithms for linear regression. Suppose we randomly project the data matrix and the outcome using a random sketching matrix, reducing the sample size, and then do linear regression on the resulting data. How much do we lose compared to the original linear regression? The existing theory does not give a precise enough answer, and this has been a bottleneck for using random projections in practice. In this paper, we introduce a new mathematical approach to the problem, relying on very recent results from asymptotic random matrix theory and free probability theory. This is a perfect fit, because the sketching matrices are random in practice. We allow the dimension and sample sizes to have an arbitrary ratio. We study the most popular sketching methods in a unified framework, including random projection methods (Gaussian and iid projections, uniform orthogonal projections, subsampled randomized Hadamard transforms) as well as sampling methods (including uniform, leverage-based, and greedy sampling). We find precise and simple expressions for the accuracy loss of these methods. These go beyond classical Johnson-Lindenstrauss type results, because they are exact rather than bounded up to constants. Our theoretical formulas are surprisingly accurate in extensive simulations and on two empirical datasets.
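As a concrete illustration of the sketch-and-solve setup described above, the following NumPy snippet (a minimal sketch, not the authors' code) compares ordinary least squares on the full data with least squares on Gaussian-sketched data; the Gaussian projection is only one of the sketches covered in the abstract, and the problem sizes and seed below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 5000, 50, 500          # samples, features, sketch size (m << n)

X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Full ordinary least squares.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian sketch: S has iid N(0, 1/m) entries, reducing n rows to m.
S = rng.standard_normal((m, n)) / np.sqrt(m)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

print("relative estimation error, OLS:   ",
      np.linalg.norm(beta_ols - beta) / np.linalg.norm(beta))
print("relative estimation error, sketch:",
      np.linalg.norm(beta_sketch - beta) / np.linalg.norm(beta))
```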
We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to the weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling and for both exact and near low-rank matrices. Our results are based on measures of the "spikiness" and "low-rankness" of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in weighted Frobenius norm for recovering matrices lying in q-"balls" of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal.
Modern massive datasets pose an enormous computational burden to practitioners. Distributed computation has emerged as a universal approach to ease the burden: datasets are partitioned across machines, which compute locally and communicate short messages. Distributed data also arises due to privacy reasons, for example in medicine. It is important to study how to do statistical inference and machine learning in a distributed setting. In this paper, we study one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine and take a weighted average of the parameters. How much do we lose compared to linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Our results are easy to use in practice. Second, different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase a great deal, while prediction error increases much less. These results match simulations and a data analysis example. We rely on recent results from random matrix theory, and we develop a new calculus of deterministic equivalents as a tool of broader interest.
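A minimal NumPy sketch of the one-step averaging scheme described above, shown with plain unweighted averaging over an even split of the data; the paper analyzes general weighted averages and quantifies the exact efficiency loss, and the sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 10000, 100, 10          # total samples, features, machines

X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Split the data across machines, run OLS locally, then average the estimates.
local = []
for Xi, yi in zip(np.array_split(X, k), np.array_split(y, k)):
    bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    local.append(bi)
beta_avg = np.mean(local, axis=0)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("error, full-data OLS:", np.linalg.norm(beta_full - beta))
print("error, one-step avg :", np.linalg.norm(beta_avg - beta))
```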
在许多领域,从业者需要分析挑战传统单机计算的大型数据集。为了扩展数据分析,越来越需要分布式和并行计算方法。数据集分布在多个计算单元上,这些计算单元在本地进行大部分分析,并传送短消息。在这里,我们研究了该领域的一个基本且非常重要的问题:如何在分布式计算环境中进行岭回归?岭回归是一种非常流行的监督学习方法,具有多种最优性质,因此研究非常重要。我们研究了一次性方法,构建了每台机器上计算的脊力回归估计的加权组合。通过分析高维随机效应模型中的均方误差,其中每个预测器具有小的效应,我们发现了几个新现象。 1.无限工作者限制:分布式估计器适用于非常大的机器,这种现象我们称之为“无限工作者限制”。 2.最佳权重:由于脊的向下偏差,用于组合局部估计量的最佳权重总和而不是单位。因此,所有平均方法都不是最理想的。我们还提出了一种新的最佳加权一次性脊回归算法。我们在模拟研究中确认了我们的结果并使用百万歌数据集作为一个例子。在那里,我们可以节省至少100倍的计算时间,同时保证测试精度。
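The one-shot distributed ridge scheme can be sketched as follows in NumPy. Uniform weights are used here purely as a baseline; per the abstract, the optimal weights sum to more than one because of the downward bias of ridge, and the paper's algorithm chooses them optimally. The random-effects data model, penalty, and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k, lam = 20000, 500, 20, 1.0   # samples, features, machines, ridge penalty

X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)   # random-effects style: many small effects
y = X @ beta + rng.standard_normal(n)

def ridge(Xi, yi, lam):
    pdim = Xi.shape[1]
    return np.linalg.solve(Xi.T @ Xi + lam * np.eye(pdim), Xi.T @ yi)

# One-shot distributed ridge: local estimates combined with (here, uniform) weights.
locals_ = [ridge(Xi, yi, lam)
           for Xi, yi in zip(np.array_split(X, k), np.array_split(y, k))]
beta_dist = np.mean(locals_, axis=0)

beta_central = ridge(X, y, lam)
print("error, centralized ridge:", np.linalg.norm(beta_central - beta))
print("error, one-shot average :", np.linalg.norm(beta_dist - beta))
```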
Factor models are a class of powerful statistical models that have been widely used to deal with dependent measurements arising frequently in applications from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces some new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables, and heterogeneity. High-dimensional robust factor analysis is a powerful toolkit to conquer these challenges. This paper gives a selective overview of high-dimensional factor models and their applications in statistics, including factor-adjusted robust model selection (FarmSelect) and factor-adjusted robust multiple testing (FarmTest). We show that classical methods, especially principal component analysis (PCA), can be tailored to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, and more, and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.
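The factor-adjustment idea behind FarmSelect and FarmTest can be illustrated with a small PCA-based NumPy sketch; this is a generic illustration under an assumed factor model, not the FarmSelect/FarmTest implementations, and the robust (heavy-tail) modifications emphasized in the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, K = 300, 500, 3
F = rng.standard_normal((n, K))              # latent factors
Lam = rng.standard_normal((p, K))            # loadings
X = F @ Lam.T + rng.standard_normal((n, p))  # observed data with strong dependence

# Factor adjustment via PCA: the top-K principal components estimate the
# factor space; regressing them out leaves weakly dependent residuals.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
F_hat = U[:, :K] * s[:K]                     # estimated factors (up to rotation)
X_adj = Xc - F_hat @ np.linalg.lstsq(F_hat, Xc, rcond=None)[0]

def avg_offdiag_corr(M):
    # Mean absolute off-diagonal correlation among the columns of M.
    C = np.corrcoef(M, rowvar=False)
    off = C[~np.eye(C.shape[0], dtype=bool)]
    return np.abs(off).mean()

print("dependence before adjustment:", avg_offdiag_corr(Xc))
print("dependence after adjustment :", avg_offdiag_corr(X_adj))
```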
The current paper presents a novel machinery for studying non-asymptotic minimax estimation of high-dimensional matrices, which yields tight minimax rates for a large collection of loss functions in a variety of problems. Based on the convex geometry of finite-dimensional Banach spaces, we first develop a volume ratio approach for determining minimax estimation rates of unconstrained normal mean matrices under all squared unitarily invariant norm losses. In addition, we establish the minimax rates for estimating mean matrices with submatrix sparsity, where the sparsity constraint introduces an additional term in the rate whose dependence on the norm differs completely from the rate of the unconstrained problem. Moreover, the approach is applicable to the matrix completion problem under the low-rank constraint. The new method also extends beyond the normal mean model. In particular, it yields tight rates in covariance matrix estimation and Poisson rate matrix estimation problems for all unitarily invariant norms.
We analyze a class of estimators based on convex relaxation for solving high-dimensional matrix decomposition problems. The observations are noisy realizations of a linear transformation X of the sum of an (approximately) low rank matrix with a second matrix endowed with a complementary form of low-dimensional structure; this setup includes many statistical models of interest, including factor analysis, multi-task regression and robust covariance estimation. We derive a general theorem that bounds the Frobenius norm error for an estimate of the pair (Θ, Γ) obtained by solving a convex optimization problem that combines the nuclear norm with a general decomposable regularizer. Our results use a "spikiness" condition that is related to, but milder than, singular vector incoherence. We specialize our general result to two cases that have been studied in past work: low rank plus an entrywise sparse matrix, and low rank plus a columnwise sparse matrix. For both models, our theory yields nonasymptotic Frobenius error bounds for both deterministic and stochastic noise matrices, and applies to matrices that can be exactly or approximately low rank, and matrices that can be exactly or approximately sparse. Moreover, for the case of stochastic noise matrices and the identity observation operator, we establish matching lower bounds on the minimax error. The sharpness of our nonasymptotic predictions is confirmed by numerical simulations.
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research demonstrating that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive datasets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.
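The two-stage prototype described above (random sampling to capture the range of the matrix, then deterministic factorization of the reduced matrix) can be written in a few lines of NumPy. This is a bare-bones sketch without the power iterations or structured test matrices discussed in the paper; the oversampling amount and test problem are illustrative assumptions.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, seed=None):
    """Basic randomized SVD: sample the range of A, compress, and factor."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Stage A: random sampling to find a subspace capturing most of A's action.
    Omega = rng.standard_normal((n, rank + n_oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Stage B: compress A onto that subspace and factor the small matrix.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Demonstration on a synthetic low-rank-plus-noise matrix.
rng = np.random.default_rng(3)
L = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 1000))
A = L + 0.01 * rng.standard_normal((2000, 1000))

U, s, Vt = randomized_svd(A, rank=50)
print("relative approximation error:",
      np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```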
We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the spike magnitude of leading eigenvalues, sample size, and dimensionality. This new regime allows high dimensionality and diverging eigenvalue spikes and provides new insights into the roles the leading eigenvalues, sample size, and dimensionality play in principal component analysis. The results are proven by a technical device, which swaps the role of rows and columns and converts the high-dimensional problems into low-dimensional ones. Our results are a natural extension of those in Paul (2007) to a more general setting with new insights, and they solve the rates of convergence problems in Shen et al. (2013). They also reveal the biases of the estimation of leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies.
We study the basic problem of robust subspace recovery. That is, we assume a data set in which some of the points are sampled around a fixed subspace and the rest of them are spread in the whole ambient space, and we aim to recover the fixed underlying subspace. We first estimate a "robust inverse sample covariance" by solving a convex minimization procedure; we then recover the subspace by the bottom eigenvectors of this matrix (their number corresponds to the number of eigenvalues close to 0). We guarantee exact subspace recovery under some conditions on the underlying data. Furthermore, we propose a fast iterative algorithm, which linearly converges to the matrix minimizing the convex problem. We also quantify the effect of noise and regularization and discuss many other practical and theoretical issues for improving the subspace recovery in various settings. When replacing the sum of terms in the convex energy function (that we minimize) with the sum of squares of terms, we obtain that the new minimizer is a scaled version of the inverse sample covariance (when it exists). We thus interpret our minimizer and its subspace (spanned by its bottom eigenvectors) as robust versions of the empirical inverse covariance and the PCA subspace respectively. We compare our method with many other algorithms for robust PCA on synthetic and real data sets and demonstrate state-of-the-art speed and accuracy.
In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits for tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretical lower bound implies that it is impossible to have consistent estimation in general; with moderate SNR, we show that the non-convex maximum likelihood estimation provides optimal solution, but with NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, there are no polynomial-time algorithms performing consistently in general.
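A minimal NumPy implementation of the higher-order orthogonal iteration (HOOI) referenced above, the classical estimator shown to be minimax rate-optimal in the strong-SNR phase. This is a generic sketch with HOSVD initialization, not the authors' code; the signal strength, dimensions, and iteration count are illustrative assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k matricization of a 3-way tensor."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi(T, ranks, n_iter=20):
    """Higher-order orthogonal iteration for the Tucker / tensor SVD model."""
    # Initialize with HOSVD: leading left singular vectors of each unfolding.
    U = [np.linalg.svd(unfold(T, k), full_matrices=False)[0][:, :r]
         for k, r in enumerate(ranks)]
    for _ in range(n_iter):
        # Update each factor with the other two held fixed.
        Y = np.einsum('abc,bj,ck->ajk', T, U[1], U[2])
        U[0] = np.linalg.svd(unfold(Y, 0), full_matrices=False)[0][:, :ranks[0]]
        Y = np.einsum('abc,ai,ck->ibk', T, U[0], U[2])
        U[1] = np.linalg.svd(unfold(Y, 1), full_matrices=False)[0][:, :ranks[1]]
        Y = np.einsum('abc,ai,bj->ijc', T, U[0], U[1])
        U[2] = np.linalg.svd(unfold(Y, 2), full_matrices=False)[0][:, :ranks[2]]
    core = np.einsum('abc,ai,bj,ck->ijk', T, U[0], U[1], U[2])
    return core, U

# Synthetic low Tucker-rank signal plus noise.
rng = np.random.default_rng(0)
d, r = 50, 3
core = 100.0 * rng.standard_normal((r, r, r))
U_true = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(3)]
signal = np.einsum('ijk,ai,bj,ck->abc', core, *U_true)
T = signal + rng.standard_normal((d, d, d))

core_hat, U_hat = hooi(T, ranks=(r, r, r))
T_hat = np.einsum('ijk,ai,bj,ck->abc', core_hat, *U_hat)
print("relative estimation error:",
      np.linalg.norm(T_hat - signal) / np.linalg.norm(signal))
```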
The task of recovering a low-rank matrix from noisy linear measurements plays a central role in computational science. Smooth formulations of the problem often exhibit an undesirable phenomenon: the condition number, classically defined, scales poorly with the dimension of the ambient space. In contrast, we here show that in a variety of concrete circumstances, nonsmooth formulations do not suffer from the same type of ill-conditioning. Consequently, standard algorithms for nonsmooth optimization, such as subgradient and prox-linear methods, converge at a rapid dimension-independent rate when initialized within constant relative error of the solution. Moreover, nonsmooth formulations are naturally robust against outliers. Our framework includes such important computational tasks as phase retrieval, blind deconvolution, quadratic sensing, matrix completion, and robust PCA. Numerical experiments on these problems illustrate the benefits of the proposed approach.
The affine rank minimization problem consists of finding a matrix of minimum rank that satisfies a given system of linear equality constraints. Such problems have appeared in the literature of a diverse set of fields including system identification and control, Euclidean embedding, and collaborative filtering. Although specific instances can often be solved with specialized algorithms, the general affine rank minimization problem is NP-hard, because it contains vector cardinality minimization as a special case. In this paper, we show that if a certain restricted isometry property holds for the linear transformation defining the constraints, the minimum rank solution can be recovered by solving a convex optimization problem, namely the minimization of the nuclear norm over the given affine space. We present several random ensembles of equations where the restricted isometry property holds with overwhelming probability, provided the codimension of the subspace is Ω(r(m + n) log mn), where m, n are the dimensions of the matrix, and r is its rank. The techniques used in our analysis have strong parallels in the compressed sensing framework. We discuss how affine rank minimization generalizes this pre-existing concept and outline a dictionary relating concepts from cardinality minimization to those of rank minimization. We also discuss several algorithmic approaches to solving the norm minimization relaxations, and illustrate our results with numerical examples.
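The nuclear-norm relaxation can be tried out with a simple proximal-gradient (singular value soft-thresholding) routine in NumPy. Note that this solves a penalized surrogate of the equality-constrained program in the abstract, and the measurement ensemble, penalty level, step size, and iteration count below are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: prox operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(4)
m, n, r, q = 30, 30, 2, 400                 # matrix size, rank, number of measurements
X_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
A = rng.standard_normal((q, m * n))         # random linear measurement operator
b = A @ X_true.ravel()

# Proximal gradient on 0.5*||A vec(X) - b||^2 + lam*||X||_*, a penalized
# surrogate for the equality-constrained nuclear-norm program.
lam, step = 1.0, 1.0 / np.linalg.norm(A, 2) ** 2
X = np.zeros((m, n))
for _ in range(500):
    grad = (A.T @ (A @ X.ravel() - b)).reshape(m, n)
    X = svt(X - step * grad, step * lam)

print("relative recovery error:",
      np.linalg.norm(X - X_true) / np.linalg.norm(X_true))
```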
We consider two problems that arise in machine learning applications: the problem of recovering a planted sparse vector in a random linear subspace and the problem of decomposing a random low-rank overcomplete 3-tensor. For both problems, the best known guarantees are based on the sum-of-squares method. We develop new algorithms inspired by analyses of the sum-of-squares method. Our algorithms achieve the same or similar guarantees as sum-of-squares for these problems but the running time is significantly faster. For the planted sparse vector problem, we give an algorithm with running time nearly linear in the input size that approximately recovers a planted sparse vector with up to constant relative sparsity in a random subspace of R^n of dimension up to Ω̃(√n). These recovery guarantees match the best known ones of Barak, Kelner, and Steurer (STOC 2014) up to logarithmic factors. For tensor decomposition, we give an algorithm with running time close to linear in the input size (with exponent ≈ 1.086) that approximately recovers a component of a random 3-tensor over R^n of rank up to Ω̃(n^{4/3}). The best previous algorithm for this problem due to Ge and Ma (RANDOM 2015) works up to rank Ω̃(n^{3/2}) but requires quasipolynomial time.
The problem of variable clustering is that of estimating groups of similar components of a p-dimensional vector X = (X_1, ..., X_p) from n independent copies of X. There exists a large number of algorithms that return data-dependent groups of variables, but their interpretation is limited to the algorithm that produced them. An alternative is model-based clustering, in which one begins by defining population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax-optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we compare our methods with another popular clustering method, spectral clustering. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach. 1. Introduction. The problem of variable clustering is that of grouping similar components of a p-dimensional vector X = (X_1, ..., X_p). These groups are referred to as clusters. In this work we investigate the problem of cluster recovery from a sample of n independent copies of X. Variable clustering has had a long history in a variety of fields, with important examples stemming from gene expression data [41, 19, 23] or protein profile data [8]. The solutions to this problem are typically algorithmic and entirely data based. They include applications of K-means, hierarchical clustering, spectral clustering, or versions of them. The statistical properties of these procedures have received a very limited amount of investigation. It is not currently known what probabilistic cluster models on X can be estimated by these popular techniques, or by their modifications. More generally, model-based variable clustering has received a limited amount of attention. One net advantage o
Consider a dataset of vector-valued observations that consists of noisy inliers, which are explained well by a low-dimensional subspace, along with some number of outliers. This work describes a convex optimization problem, called REAPER, that can reliably fit a low-dimensional model to this type of data. This approach parameterizes linear subspaces using orthogonal projectors, and it uses a relaxation of the set of orthogonal projectors to reach the convex formulation. The paper provides an efficient algorithm for solving the REAPER problem, and it documents numerical experiments which confirm that REAPER can dependably find linear structure in synthetic and natural data. In addition, when the inliers lie near a low-dimensional subspace, there is a rigorous theory that describes when REAPER can approximate this subspace.
Matrix singular value decomposition (SVD) is popular in statistical data analysis, showing superior efficiency in extracting the unknown low-rank singular subspaces embedded in noisy matrix observations. This paper is about statistical inference for the singular subspaces when the noise matrix has i.i.d. entries, and our goal is to construct data-dependent confidence regions of the unknown singular subspaces from a single noisy matrix observation. We provide an explicit representation formula for the empirical spectral projectors. The formula is neat and holds for deterministic matrix perturbations. Based on this representation formula, we calculate, up to fourth-order approximations, the expected joint projection distance between the empirical singular subspaces and the true singular subspaces. Then we prove the normal approximation of the joint projection distance with explicit normalization factors and different levels of bias corrections. In particular, with the fourth-order bias correction, we show asymptotic normality under a signal-to-noise ratio (SNR) condition of $O\big((d_1+d_2)^{9/16}\big)$, where $d_1$ and $d_2$ denote the matrix sizes. We propose shrinkage estimators of the singular values by borrowing recent results from random matrix theory. Equipped with these estimators, we introduce data-dependent centering and normalization factors for the joint projection distance between the empirical singular subspaces and the true singular subspaces. Consequently, we are able to construct data-dependent confidence regions for the true singular subspaces which attain pre-determined confidence levels asymptotically. The rates of convergence of the asymptotic normality are also given. Finally, we provide comprehensive simulation results to illustrate our theoretical findings.
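The joint projection distance analyzed above is straightforward to compute. The following NumPy sketch evaluates it for the empirical singular subspaces of a single noisy observation under an assumed synthetic signal-plus-noise model; the signal strength, sizes, and noise level are illustrative assumptions, and the paper's bias corrections and confidence regions are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(8)
d1, d2, r, sigma = 400, 300, 3, 0.5
U0, _ = np.linalg.qr(rng.standard_normal((d1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((d2, r)))
M = 50.0 * U0 @ V0.T                                  # low-rank signal
A = M + sigma * rng.standard_normal((d1, d2))         # one noisy observation

U_hat, _, Vt_hat = np.linalg.svd(A, full_matrices=False)
U_hat, V_hat = U_hat[:, :r], Vt_hat[:r].T

# Joint projection distance between empirical and true singular subspaces,
# the quantity whose normal approximation the paper studies.
dist = (np.linalg.norm(U_hat @ U_hat.T - U0 @ U0.T, 'fro') ** 2
        + np.linalg.norm(V_hat @ V_hat.T - V0 @ V0.T, 'fro') ** 2)
print("joint projection distance:", dist)
```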
We consider estimating the parametric components of semiparametric multi-index models in high-dimensional and non-Gaussian settings. These models form a rich class of nonlinear models with applications in signal processing, machine learning, and statistics. Our estimators leverage score functions based on first- and second-order Stein's identities, and do not require the covariates to satisfy the Gaussian or elliptical symmetry assumptions common in the literature. Moreover, to handle heavy-tailed covariates and responses, it is important to construct our estimators by carefully truncating their empirical counterparts. We show that our estimators achieve near-optimal statistical rates of convergence in several settings. We complement our theoretical results with simulation experiments that corroborate the theory.
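In the Gaussian special case, the first-order Stein identity already recovers the index direction with a one-line estimator, as the following NumPy sketch illustrates. The paper's actual contribution, score functions for non-Gaussian designs and truncation for heavy tails, is not shown; the link function tanh and the dimensions below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20000, 20
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
beta /= np.linalg.norm(beta)

X = rng.standard_normal((n, p))                          # Gaussian design => score(x) = x
y = np.tanh(X @ beta) + 0.1 * rng.standard_normal(n)     # unknown link, here tanh

# First-order Stein identity: E[y * score(X)] is proportional to beta,
# so the normalized sample average recovers the index direction.
beta_hat = X.T @ y / n
beta_hat /= np.linalg.norm(beta_hat)

print("cosine with true direction:", abs(beta_hat @ beta))
```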
We consider the matrix completion problem with a deterministic pattern of observed entries. In this setting, we aim to answer the question: under what conditions will there exist (at least locally) a unique solution to the matrix completion problem, i.e., when is the underlying true matrix identifiable? We answer the question from a certain point of view and outline a geometric perspective. We give an algebraically verifiable sufficient condition, which we call the well-posedness condition, for the local uniqueness of minimum rank matrix completion (MRMC) solutions. We argue that this condition is necessary for local stability of MRMC solutions, and we use the characteristic rank to show that the condition is generic. We also argue that low-rank approximation approaches are more stable than MRMC, and we further propose a sequential statistical testing procedure to determine the "true" rank from the observed entries. Finally, we provide numerical examples aimed at verifying the validity of the presented theory.
We introduce a new criterion, the Rank Selection Criterion (RSC), for selecting the optimal reduced rank estimator of the coefficient matrix in multivariate response regression models. The corresponding RSC estimator minimizes the Frobenius norm of the fit plus a regularization term proportional to the number of parameters in the reduced rank model. The rank of the RSC estimator provides a consistent estimator of the rank of the coefficient matrix; in general, the rank of our estimator is a consistent estimate of the effective rank, which we define to be the number of singular values of the target matrix that are appropriately large. The consistency results are valid not only in the classic asymptotic regime, when n, the number of responses, and p, the number of predictors, stay bounded, and m, the number of observations, grows, but also when either, or both, n and p grow, possibly much faster than m. We establish minimax optimal bounds on the mean squared errors of our estimators. Our finite sample performance bounds for the RSC estimator show that it achieves the optimal balance between the approximation error and the penalty term. Furthermore, our procedure has very low computational complexity, linear in the number of candidate models, making it particularly appealing for large scale problems. We contrast our estimator with the nuclear norm penalized least squares (NNP) estimator, which has an inherently higher computational complexity than RSC, for multivariate regression models. We show that NNP has estimation properties similar to those of RSC, albeit under stronger conditions. However, it is not as parsimonious as RSC. We offer a simple correction of the NNP estimator which leads to consistent rank estimation. We verify and illustrate our theoretical findings via an extensive simulation study. 1. Introduction. In this paper, we propose and analyze dimension reduction-type estimators for multivariate response regression models. Given m observations of the responses Y_i ∈ R^n and predictors X_i ∈ R^p, we assume that the matrices Y = [Y_1, ..., Y_m] and X = [X_1, ..., X_m] are related via an unknown p × n matrix
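A compact NumPy sketch of the rank selection idea: the penalized fit is obtained by truncating the SVD of the projected response matrix at singular values exceeding the square root of the penalty. The penalty constant below assumes unit noise variance and is only a rough illustrative choice, not the paper's tuned recommendation.

```python
import numpy as np

def rsc(Y, X, mu):
    """Rank Selection Criterion sketch: argmin_B ||Y - X B||_F^2 + mu * rank(B).
    The minimizer keeps singular values of the projected fit above sqrt(mu)."""
    B_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)        # full least-squares fit
    U, s, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    k = int(np.sum(s > np.sqrt(mu)))                     # selected rank
    fit_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # rank-k projected fit
    B_k, *_ = np.linalg.lstsq(X, fit_k, rcond=None)      # rank-k coefficient matrix
    return B_k, k

# m observations, p predictors, n responses, true coefficient rank 3.
rng = np.random.default_rng(0)
m, p, n = 200, 30, 20
A = rng.standard_normal((p, 3)) @ rng.standard_normal((3, n))
X = rng.standard_normal((m, p))
Y = X @ A + rng.standard_normal((m, n))

mu = 2.0 * (np.sqrt(n) + np.sqrt(p)) ** 2   # rough penalty level for unit noise variance
B_hat, k_hat = rsc(Y, X, mu)
print("selected rank:", k_hat,
      " relative estimation error:", np.linalg.norm(B_hat - A) / np.linalg.norm(A))
```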