Large datasets create opportunities as well as analytic challenges. A recent development is the use of random projection or sketching methods for dimension reduction in statistics and machine learning. In this work we study the statistical performance of sketching algorithms for linear regression. Suppose we randomly project the data matrix and the outcome using a random sketching matrix, reducing the sample size, and then do linear regression on the resulting data. Compared with the original linear regression, how much do we lose? Existing theory does not give a precise enough answer, and this has been a bottleneck for using random projections in practice. In this paper we introduce a new mathematical approach to the problem, relying on recent results from asymptotic random matrix theory and free probability theory. This is a perfect fit, because the sketching matrices are random in practice. We allow the dimension and the sample size to have an arbitrary ratio. We study the most popular sketching methods in a unified framework, including random projection methods (Gaussian and iid projections, uniform orthogonal projections, subsampled randomized Hadamard transforms) as well as sampling methods (including uniform, leverage-based, and greedy sampling). We find precise and simple expressions for the accuracy loss of these methods. These go beyond classical Johnson-Lindenstrauss-type results, because they are exact rather than bounded up to constants. Our theoretical formulas are surprisingly accurate in extensive simulations and on two empirical datasets.
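As a rough illustration of the sketching idea described above, the following numpy sketch compares ordinary least squares on the full data with OLS on Gaussian iid-sketched data; the sizes, noise level, and random seed are arbitrary placeholders, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 5000, 50, 500          # sample size, dimension, sketch size (illustrative)
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Full-data OLS.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian iid sketch: project both X and y down to r rows, then run OLS on the sketched data.
S = rng.standard_normal((r, n)) / np.sqrt(r)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# Accuracy loss of sketching relative to the full regression.
print("estimation error, full OLS:    ", np.linalg.norm(beta_full - beta))
print("estimation error, sketched OLS:", np.linalg.norm(beta_sketch - beta))
```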
We consider the matrix completion problem under a form of row/column weighted entrywise sampling, including the case of uniform entrywise sampling as a special case. We analyze the associated random observation operator, and prove that with high probability, it satisfies a form of restricted strong convexity with respect to the weighted Frobenius norm. Using this property, we obtain as corollaries a number of error bounds on matrix completion in the weighted Frobenius norm under noisy sampling, for both exact and near low-rank matrices. Our results are based on measures of the "spikiness" and "low-rankness" of matrices that are less restrictive than the incoherence conditions imposed in previous work. Our technique involves an M-estimator that includes controls on both the rank and spikiness of the solution, and we establish non-asymptotic error bounds in the weighted Frobenius norm for recovering matrices lying in ℓ_q-"balls" of bounded spikiness. Using information-theoretic methods, we show that no algorithm can achieve better estimates (up to a logarithmic factor) over these same sets, showing that our conditions on matrices and associated rates are essentially optimal.
The current paper presents a novel machinery for studying non-asymptotic minimax estimation of high-dimensional matrices, which yields tight minimax rates for a large collection of loss functions in a variety of problems. Based on the convex geometry of finite-dimensional Banach spaces, we first develop a volume ratio approach for determining minimax estimation rates of unconstrained normal mean matrices under all squared unitarily invariant norm losses. In addition, we establish the minimax rates for estimating mean matrices with submatrix sparsity, where the sparsity constraint introduces an additional term in the rate whose dependence on the norm differs completely from the rate of the unconstrained problem. Moreover, the approach is applicable to the matrix completion problem under the low-rank constraint. The new method also extends beyond the normal mean model. In particular, it yields tight rates in covariance matrix estimation and Poisson rate matrix estimation problems for all unitarily invariant norms.
Factor models are a powerful class of statistical models that are widely used to handle dependent measurements arising frequently in a wide range of applications, from genomics and neuroscience to economics and finance. As data are collected at an ever-growing scale, statistical machine learning faces several new challenges: high dimensionality, strong dependence among observed variables, heavy-tailed variables, and heterogeneity. High-dimensional robust factor analysis serves as a powerful toolkit to conquer these challenges. This paper gives a selective overview of high-dimensional factor models and their applications in statistics, including factor-adjusted robust model selection (FarmSelect) and factor-adjusted robust multiple testing (FarmTest). We show that classical methods, in particular principal component analysis (PCA), can be adapted to many new problems and provide powerful tools for statistical estimation and inference. We highlight PCA and its connections to matrix perturbation theory, robust statistics, random projection, false discovery rate, and so on, and illustrate through several applications how insights from these fields yield solutions to modern challenges. We also present far-reaching connections between factor models and popular statistical learning problems, including network analysis and low-rank matrix recovery.
In this paper, we study panel regression models with interactive fixed effects. We propose two estimation methods based on minimizing convex objective functions. The first method minimizes the sum of squared residuals with nuclear (trace) norm regularization. The second method minimizes the nuclear norm of the residuals. We establish the consistency of the two resulting estimators. These estimators have a very important computational advantage over the existing least squares (LS) estimator, in that they are defined as minimizers of convex objective functions. In addition, the nuclear norm penalization helps to resolve a potential identification problem for interactive fixed effects models, in particular when the regressors are low-rank and the number of factors is unknown. We also show how to construct estimators that are asymptotically equivalent to the least squares (LS) estimators in Bai (2009) and Moon and Weidner (2017) by using the nuclear norm regularized or minimized estimators as initial values for a finite number of LS minimizing iteration steps. This iteration avoids any non-convex minimization, while the original LS estimation problem is generally non-convex and can have multiple local minima.
Modern massive datasets pose an enormous computational burden to practitioners. Distributed computing has emerged as a universal approach to easing the burden: datasets are partitioned across machines, which compute locally and communicate short messages. Distributed data also arises for privacy reasons, for instance in medicine. It is important to study how to do statistical inference and machine learning in a distributed setting. In this paper, we study one-step parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine and take a weighted average of the parameters. How much do we lose compared with linear regression on the full data? Here we study the performance loss in estimation error, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We discover several key phenomena. First, averaging is not optimal, and we find the exact performance loss. Our results are simple to use in practice. Second, different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase a lot, while prediction error increases much less. These results match simulations and a data analysis example. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.
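A minimal simulation of the one-step averaging scheme described above is given below; it uses uniform weights and arbitrary problem sizes as placeholders, not the paper's optimal weighting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 6000, 200, 4            # total samples, dimension, number of machines (illustrative)
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + rng.standard_normal(n)

# Full-data OLS for reference.
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step averaging: OLS on each machine's shard, then a (here: uniform) weighted average.
local = []
for Xi, yi in zip(np.array_split(X, k), np.array_split(y, k)):
    bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
    local.append(bi)
beta_avg = np.mean(local, axis=0)

print("estimation error, full-data OLS:", np.linalg.norm(beta_full - beta))
print("estimation error, averaged OLS :", np.linalg.norm(beta_avg - beta))
```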
We study the basic problem of robust subspace recovery. That is, we assume a data set in which some of the points are sampled around a fixed subspace and the rest of them are spread in the whole ambient space, and we aim to recover the fixed underlying subspace. We first estimate a "robust inverse sample covariance" by solving a convex minimization procedure; we then recover the subspace by the bottom eigenvectors of this matrix (their number corresponds to the number of eigenvalues close to 0). We guarantee exact subspace recovery under some conditions on the underlying data. Furthermore, we propose a fast iterative algorithm, which linearly converges to the matrix minimizing the convex problem. We also quantify the effect of noise and regularization and discuss many other practical and theoretical issues for improving subspace recovery in various settings. When replacing the sum of terms in the convex energy function (that we minimize) with the sum of squares of terms, we obtain that the new minimizer is a scaled version of the inverse sample covariance (when it exists). We thus interpret our minimizer and its subspace (spanned by its bottom eigenvectors) as robust versions of the empirical inverse covariance and the PCA subspace, respectively. We compare our method with many other algorithms for robust PCA on synthetic and real data sets and demonstrate state-of-the-art speed and accuracy.
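The sketch below is one natural iteratively reweighted scheme in the spirit of the convex formulation described above (a hedged illustration, not necessarily the authors' exact algorithm): it approximately minimizes the sum of ||Q x_i|| over symmetric trace-one matrices Q and then takes the bottom eigenvectors of the minimizer. The regularization constant, iteration count, and synthetic data are assumptions of this sketch.

```python
import numpy as np

def robust_subspace(X, d, n_iter=50, delta=1e-10):
    """IRLS-style sketch: approximately minimize sum_i ||Q x_i|| over symmetric Q
    with trace(Q) = 1, then return the span of the d bottom eigenvectors of Q."""
    n, p = X.shape
    Q = np.eye(p) / p
    for _ in range(n_iter):
        w = 1.0 / np.maximum(np.linalg.norm(X @ Q, axis=1), delta)  # per-point weights
        C = (X * w[:, None]).T @ X          # weighted scatter matrix sum_i w_i x_i x_i^T
        Q = np.linalg.inv(C)
        Q /= np.trace(Q)                    # renormalize to the trace-one constraint
    vals, vecs = np.linalg.eigh(Q)
    return vecs[:, :d]                      # bottom eigenvectors of Q span the subspace

# Synthetic test: inliers near a 2-dimensional subspace plus gross outliers.
rng = np.random.default_rng(2)
U_true, _ = np.linalg.qr(rng.standard_normal((10, 2)))
inliers = rng.standard_normal((200, 2)) @ U_true.T + 0.01 * rng.standard_normal((200, 10))
outliers = 3.0 * rng.standard_normal((80, 10))
U_hat = robust_subspace(np.vstack([inliers, outliers]), d=2)
# Cosines of principal angles between the recovered and true subspaces (near 1 means success).
print(np.linalg.svd(U_true.T @ U_hat, compute_uv=False))
```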
In this paper, we propose a general framework for tensor singular value decomposition (tensor SVD), which focuses on the methodology and theory for extracting the hidden low-rank structure from high-dimensional tensor data. Comprehensive results are developed on both the statistical and computational limits of tensor SVD. This problem exhibits three different phases according to the signal-to-noise ratio (SNR). In particular, with strong SNR, we show that the classical higher-order orthogonal iteration achieves the minimax optimal rate of convergence in estimation; with weak SNR, the information-theoretic lower bound implies that consistent estimation is impossible in general; with moderate SNR, we show that non-convex maximum likelihood estimation provides an optimal solution, but at an NP-hard computational cost; moreover, under the hardness hypothesis of hypergraphic planted clique detection, there are no polynomial-time algorithms that perform consistently in general.
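For concreteness, here is a minimal numpy sketch of the higher-order orthogonal iteration (HOOI) mentioned above, applied to a synthetic noisy 3-way low-rank tensor; the dimensions, ranks, SNR, and iteration count are illustrative placeholders.

```python
import numpy as np

def unfold(T, mode):
    """Mode-k unfolding of a 3-way tensor (rows indexed by the chosen mode)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hooi(T, ranks, n_iter=20):
    """Higher-order orthogonal iteration: estimate the three loading matrices."""
    U = [np.linalg.svd(unfold(T, k))[0][:, :r] for k, r in enumerate(ranks)]  # HOSVD init
    for _ in range(n_iter):
        for k in range(3):
            # Project along the other two modes, then take leading left singular vectors.
            B = T
            for j in range(3):
                if j != k:
                    B = np.moveaxis(np.tensordot(U[j].T, B, axes=([1], [j])), 0, j)
            U[k] = np.linalg.svd(unfold(B, k))[0][:, :ranks[k]]
    return U

# Synthetic low-rank Tucker tensor plus noise (illustrative SNR).
rng = np.random.default_rng(3)
dims, ranks = (30, 30, 30), (2, 2, 2)
core = 20.0 * rng.standard_normal(ranks)
factors = [np.linalg.qr(rng.standard_normal((d, r)))[0] for d, r in zip(dims, ranks)]
T = core
for k in range(3):
    T = np.moveaxis(np.tensordot(factors[k], T, axes=([1], [k])), 0, k)
T += rng.standard_normal(dims)
U_hat = hooi(T, ranks)
# Smallest cosine of principal angles per mode (near 1 means the subspace is recovered).
print([np.linalg.svd(f.T @ u, compute_uv=False).min() for f, u in zip(factors, U_hat)])
```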
We analyze a class of estimators based on convex relaxation for solving high-dimensional matrix decomposition problems. The observations are noisy realizations of a linear transformation X of the sum of an (approximately) low-rank matrix with a second matrix endowed with a complementary form of low-dimensional structure; this setup includes many statistical models of interest, including factor analysis, multi-task regression, and robust covariance estimation. We derive a general theorem that bounds the Frobenius norm error for an estimate of the pair of unknown matrices obtained by solving a convex optimization problem that combines the nuclear norm with a general decomposable regularizer. Our results use a "spikiness" condition that is related to, but milder than, singular vector incoherence. We specialize our general result to two cases that have been studied in past work: low rank plus an entrywise sparse matrix, and low rank plus a columnwise sparse matrix. For both models, our theory yields nonasymptotic Frobenius error bounds for both deterministic and stochastic noise matrices, and applies to matrices that can be exactly or approximately low rank, and matrices that can be exactly or approximately sparse. Moreover, for the case of stochastic noise matrices and the identity observation operator, we establish matching lower bounds on the minimax error. The sharpness of our nonasymptotic predictions is confirmed by numerical simulations.
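A minimal sketch of this convex-relaxation idea for the identity observation operator and the low-rank-plus-entrywise-sparse case appears below: alternating exact minimization over the two blocks, each of which has a closed-form proximal update (singular value and entrywise soft-thresholding). The penalty levels, sizes, and iteration count are illustrative, not the theoretically calibrated choices from the paper.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(M, tau):
    """Entrywise soft-thresholding: prox of tau * l1 norm."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def decompose(Y, lam_L, lam_S, n_iter=200):
    """Alternating minimization of 0.5*||Y - L - S||_F^2 + lam_L*||L||_* + lam_S*||S||_1."""
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(n_iter):
        L = svt(Y - S, lam_L)
        S = soft(Y - L, lam_S)
    return L, S

# Synthetic low-rank plus entrywise-sparse matrix with noise (illustrative sizes and penalties).
rng = np.random.default_rng(4)
m, n, r = 60, 60, 3
L0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
S0 = (rng.random((m, n)) < 0.05) * 10.0 * rng.standard_normal((m, n))
Y = L0 + S0 + 0.1 * rng.standard_normal((m, n))
L_hat, S_hat = decompose(Y, lam_L=2.0, lam_S=0.5)
print("relative Frobenius error in L:", np.linalg.norm(L_hat - L0) / np.linalg.norm(L0))
```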
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play an important role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.
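The following is a minimal numpy sketch of the two-stage prototype described above (random range finding followed by factorization of the compressed matrix); the oversampling amount, target rank, and test matrix are illustrative assumptions, and refinements such as power iterations are omitted.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, seed=None):
    """Randomized SVD: sample the range of A, compress A to that subspace, factor."""
    rng = np.random.default_rng(seed)
    # Stage A: random sampling to find a subspace capturing most of A's action.
    Omega = rng.standard_normal((A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)               # orthonormal basis for the sampled range
    # Stage B: compress A to that subspace and factor the small matrix deterministically.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Quick check on a synthetic matrix whose column scales decay rapidly.
rng = np.random.default_rng(5)
A = rng.standard_normal((500, 200)) * np.exp(-0.1 * np.arange(200))
U, s, Vt = randomized_svd(A, k=20)
print("relative approximation error:",
      np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```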
The affine rank minimization problem consists of finding a matrix of minimum rank that satisfies a given system of linear equality constraints. Such problems have appeared in the literature of a diverse set of fields including system identification and control, Euclidean embedding, and collaborative filtering. Although specific instances can often be solved with specialized algorithms, the general affine rank minimization problem is NP-hard, because it contains vector cardinality minimization as a special case. In this paper, we show that if a certain restricted isometry property holds for the linear transformation defining the constraints, the minimum rank solution can be recovered by solving a convex optimization problem, namely the minimization of the nuclear norm over the given affine space. We present several random ensembles of equations where the restricted isometry property holds with overwhelming probability, provided the codimension of the subspace is Ω(r(m + n) log mn), where m, n are the dimensions of the matrix, and r is its rank. The techniques used in our analysis have strong parallels in the compressed sensing framework. We discuss how affine rank minimization generalizes this pre-existing concept and outline a dictionary relating concepts from cardinality minimization to those of rank minimization. We also discuss several algorithmic approaches to solving the norm minimization relaxations, and illustrate our results with numerical examples.
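As a small worked example of the nuclear norm relaxation described above, the sketch below recovers a planted low-rank matrix from random affine measurements; it assumes cvxpy with a conic solver such as SCS is installed, and the matrix size, rank, and number of measurements are illustrative choices rather than values from the paper.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(6)
m, n, r, p = 20, 20, 2, 240     # matrix size, planted rank, number of measurements
X0 = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))   # planted low-rank matrix
A = rng.standard_normal((p, m, n))                               # Gaussian sensing matrices
b = np.tensordot(A, X0, axes=([1, 2], [0, 1]))                   # b_i = <A_i, X0>

# Nuclear norm minimization over the affine space {X : <A_i, X> = b_i for all i}.
X = cp.Variable((m, n))
constraints = [cp.sum(cp.multiply(A[i], X)) == b[i] for i in range(p)]
cp.Problem(cp.Minimize(cp.normNuc(X)), constraints).solve(solver=cp.SCS)
print("relative recovery error:", np.linalg.norm(X.value - X0) / np.linalg.norm(X0))
```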
We consider the matrix completion problem with a deterministic pattern of observed entries. In this setting, we aim to answer the question: under what conditions does the matrix completion problem have an (at least locally) unique solution, i.e., when is the underlying true matrix identifiable? We answer the question from a certain point of view and outline a geometric perspective. We give an algebraically verifiable sufficient condition, which we call the well-posedness condition, for the local uniqueness of minimum rank matrix completion (MRMC) solutions. We argue that this condition is necessary for local stability of MRMC solutions, and use the concept of characteristic rank to show that the condition is generic. We also argue that low-rank approximation approaches are more stable than MRMC, and further propose a sequential statistical testing procedure to determine the "true" rank from the observed entries. Finally, we provide numerical examples aimed at verifying the validity of the presented theory.
We consider two problems that arise in machine learning applications: the problem of recovering a planted sparse vector in a random linear subspace and the problem of decomposing a random low-rank overcomplete 3-tensor. For both problems, the best known guarantees are based on the sum-of-squares method. We develop new algorithms inspired by analyses of the sum-of-squares method. Our algorithms achieve the same or similar guarantees as sum-of-squares for these problems but the running time is significantly faster. For the planted sparse vector problem, we give an algorithm with running time nearly linear in the input size that approximately recovers a planted sparse vector with up to constant relative sparsity in a random subspace of R^n of dimension up to Ω̃(√n). These recovery guarantees match the best known ones of Barak, Kelner, and Steurer (STOC 2014) up to logarithmic factors. For tensor decomposition, we give an algorithm with running time close to linear in the input size (with exponent ≈ 1.086) that approximately recovers a component of a random 3-tensor over R^n of rank up to Ω̃(n^{4/3}). The best previous algorithm for this problem, due to Ge and Ma (RANDOM 2015), works up to rank Ω̃(n^{3/2}) but requires quasipolynomial time.
Consider a dataset of vector-valued observations that consists of noisy inliers, which are explained well by a low-dimensional subspace, along with some number of outliers. This work describes a convex optimization problem, called REAPER, that can reliably fit a low-dimensional model to this type of data. This approach parameterizes linear subspaces using orthogonal projectors, and it uses a relaxation of the set of orthogonal projectors to reach the convex formulation. The paper provides an efficient algorithm for solving the REAPER problem, and it documents numerical experiments which confirm that REAPER can dependably find linear structure in synthetic and natural data. In addition, when the inliers lie near a low-dimensional subspace, there is a rigorous theory that describes when REAPER can approximate this subspace.
We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the spike magnitude of the leading eigenvalues, the sample size, and the dimensionality. This new regime allows high dimensionality and diverging eigenvalue spikes and provides new insights into the roles the leading eigenvalues, sample size, and dimensionality play in principal component analysis. The results are proven by a technical device which swaps the roles of rows and columns and converts the high-dimensional problems into low-dimensional ones. Our results are a natural extension of those in Paul (2007) to a more general setting with new insights, and they solve the rates-of-convergence problems in Shen et al. (2013). They also reveal the biases in estimating the leading eigenvalues and eigenvectors by principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in the estimation of risks of large portfolios and of false discovery proportions for dependent test statistics, and are illustrated by simulation studies.
Principal component analysis (PCA) is a classical method for dimensionality reduction based on extracting the dominant eigenvectors of the sample covariance matrix. However, PCA is well known to behave poorly in the "large p, small n" setting, in which the problem dimension p is comparable to or larger than the sample size n. This paper studies PCA in this high-dimensional regime, but under the additional assumption that the maximal eigenvector is sparse, say, with at most k nonzero components. We consider a spiked covariance model in which a base matrix is perturbed by adding a k-sparse maximal eigenvector, and we analyze two computationally tractable methods for recovering the support set of this maximal eigenvector, as follows: (a) a simple diagonal thresholding method, which transitions from success to failure as a function of the rescaled sample size θ_dia(n, p, k) = n/[k² log(p − k)]; and (b) a more sophisticated semidefinite programming (SDP) relaxation, which succeeds once the rescaled sample size θ_sdp(n, p, k) = n/[k log(p − k)] is larger than a critical threshold. In addition, we prove that no method, including the best method which has exponential-time complexity, can succeed in recovering the support if the order parameter θ_sdp(n, p, k) is below a threshold. Our results thus highlight an interesting trade-off between computational and statistical efficiency in high-dimensional inference.
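Below is a minimal sketch of the simpler of the two methods analyzed above, diagonal thresholding: rank the coordinates by their sample variances and keep the k largest. The spiked-covariance parameters and sample size are illustrative assumptions chosen so that the rescaled sample size θ_dia is comfortably above 1.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k, beta = 2000, 500, 10, 3.0     # samples, dimension, sparsity, spike strength
v = np.zeros(p)
support = rng.choice(p, size=k, replace=False)
v[support] = 1.0 / np.sqrt(k)          # k-sparse leading eigenvector
# Spiked covariance model: identity plus beta * v v^T.
X = rng.standard_normal((n, p)) + np.sqrt(beta) * rng.standard_normal((n, 1)) * v

# Diagonal thresholding: keep the k coordinates with the largest sample variances.
sample_var = np.var(X, axis=0)
support_hat = np.argsort(sample_var)[-k:]
print("fraction of support recovered:", len(set(support_hat) & set(support)) / k)
```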
We study an instance of high-dimensional inference in which the goal is to estimate a matrix in R^{m_1 × m_2} on the basis of N noisy observations. The unknown matrix is assumed to be either exactly low rank, or "near" low rank, meaning that it can be well approximated by a matrix with low rank. We consider a standard M-estimator based on regularization by the nuclear or trace norm over matrices, and analyze its performance under high-dimensional scaling. We define the notion of restricted strong convexity (RSC) for the loss function, and use it to derive nonasymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and apply to both exactly low-rank and approximately low-rank matrices. We then illustrate consequences of this general theory for a number of specific matrix models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections. These results involve nonasymptotic random matrix theory to establish that the RSC condition holds, and to determine an appropriate choice of regularization parameter. Simulation results show excellent agreement with the high-dimensional scaling of the error predicted by our theory.
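One standard way to compute such a nuclear-norm-regularized estimator is proximal gradient descent with singular value soft-thresholding; the abstract does not prescribe an algorithm, so the following numpy sketch is an assumption of this illustration. It applies the estimator to the low-rank multivariate regression model mentioned above, with illustrative sizes, step size, and penalty level.

```python
import numpy as np

def svt(M, tau):
    """Prox of tau * nuclear norm (singular value soft-thresholding)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_regression(X, Y, lam, n_iter=300):
    """Proximal gradient for (1/2n) * ||Y - X @ Theta||_F^2 + lam * ||Theta||_*."""
    n = X.shape[0]
    step = n / np.linalg.norm(X, 2) ** 2       # inverse Lipschitz constant of the gradient
    Theta = np.zeros((X.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = X.T @ (X @ Theta - Y) / n
        Theta = svt(Theta - step * grad, step * lam)
    return Theta

# Low-rank multivariate regression example (illustrative sizes and penalty).
rng = np.random.default_rng(8)
n, p, q, r = 300, 40, 30, 3
Theta0 = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
X = rng.standard_normal((n, p))
Y = X @ Theta0 + rng.standard_normal((n, q))
Theta_hat = nuclear_norm_regression(X, Y, lam=0.5)
print("relative error:", np.linalg.norm(Theta_hat - Theta0) / np.linalg.norm(Theta0))
```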
Matrix singular value decomposition (SVD) is popular in statistical data analysis, showing superior efficiency in extracting the unknown low-rank singular subspaces embedded in noisy matrix observations. This paper concerns the statistical inference of singular subspaces when the noise matrix has i.i.d. entries, and our goal is to construct data-dependent confidence regions of the unknown singular subspaces from a single noisy matrix observation. We provide an explicit representation formula for the empirical spectral projectors. The formula is neat and holds for deterministic matrix perturbations. Based on this representation formula, we calculate, up to fourth-order approximations, the expected joint projection distance between the empirical singular subspaces and the true singular subspaces. We then prove the normal approximation of the joint projection distance with explicit normalization factors and different levels of bias corrections. In particular, with a fourth-order bias correction, we show asymptotic normality under the signal-to-noise ratio (SNR) condition O((d_1 + d_2)^{9/16}), where d_1 and d_2 denote the matrix sizes. Borrowing recent results from random matrix theory, we propose shrinkage estimators of the singular values. Equipped with these estimators, we introduce data-dependent centering and normalization factors for the joint projection distance between the empirical singular subspaces and the true singular subspaces. Consequently, we are able to construct data-dependent confidence regions for the true singular subspaces which asymptotically attain pre-determined confidence levels. The convergence rates of the normal approximation are also given. Finally, we provide comprehensive simulation results to illustrate our theoretical findings.
This paper studies the problem of estimating the covariance of a collection of vectors using only highly compressed measurements of each vector. An estimator based on back-projections of these compressive samples is proposed and analyzed. A distribution-free analysis shows that by observing just a single linear measurement of each vector, one can consistently estimate the covariance matrix, in both infinity and spectral norm, and this same analysis leads to precise rates of convergence in both norms. Via information-theoretic techniques, lower bounds showing that this estimator is minimax-optimal for both infinity and spectral norm estimation problems are established. These results are also specialized to give matching upper and lower bounds for estimating the population covariance of a collection of Gaussian vectors, again in the compressive measurement model. The analysis conducted in this paper shows that the effective sample complexity for this problem is scaled by a factor of m²/d², where m is the compression dimension and d is the ambient dimension. Applications to subspace learning (principal components analysis) and learning over distributed sensor networks are also discussed.
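A minimal numpy sketch of the back-projection idea for Gaussian compressive measurements is given below. The debiasing constants follow from elementary Wishart moment identities for mean-zero data and are an assumption of this sketch, not necessarily the paper's exact estimator; all sizes and the population covariance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(9)
d, m, n = 50, 10, 20000            # ambient dim, compression dim, number of vectors
# Population covariance: identity plus a rank-2 spike (illustrative).
V = np.linalg.qr(rng.standard_normal((d, 2)))[0]
Sigma = np.eye(d) + 5.0 * V @ V.T
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)

# One compressive measurement per vector: y_i = A_i x_i with iid N(0,1) entries in A_i.
A = rng.standard_normal((n, m, d))
Y = np.einsum("imd,id->im", A, X)

# Back-project each measurement, then debias the average of the outer products
# (for mean-zero x and Gaussian A: E[x_hat x_hat^T] = ((m+1)/m) Sigma + (tr(Sigma)/m) I).
Xbp = np.einsum("imd,im->id", A, Y) / m                  # back-projected vectors A_i^T y_i / m
S = Xbp.T @ Xbp / n                                      # raw back-projection covariance
trace_hat = np.mean(np.sum(Y ** 2, axis=1)) / m          # plug-in estimate of trace(Sigma)
Sigma_hat = (m / (m + 1)) * (S - trace_hat / m * np.eye(d))

print("relative spectral-norm error:",
      np.linalg.norm(Sigma_hat - Sigma, 2) / np.linalg.norm(Sigma, 2))
```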