It is well known that the sample covariance has a consistent bias in its spectrum; for example, the spectrum of a Wishart matrix follows the Marchenko-Pastur law. In this work we introduce an iterative "concentration" algorithm that actively eliminates this bias and recovers the true spectrum for small and moderate dimensions.
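As a point of reference, here is a minimal NumPy sketch of the bias referred to above: even when the population covariance is the identity, the sample eigenvalues spread out over the Marchenko-Pastur support. The sizes and the Gaussian data are illustrative assumptions, not tied to the paper's algorithm.

```python
import numpy as np

# Eigenvalues of a sample covariance matrix are biased even when the
# population covariance is the identity (Marchenko-Pastur spreading).
rng = np.random.default_rng(0)
n, d = 400, 100                      # sample size and dimension
X = rng.standard_normal((n, d))      # population covariance = I_d
sample_cov = X.T @ X / n
eigs = np.linalg.eigvalsh(sample_cov)

gamma = d / n
mp_edges = ((1 - np.sqrt(gamma)) ** 2, (1 + np.sqrt(gamma)) ** 2)
print("true eigenvalues: all 1.0")
print("sample eigenvalue range:", eigs.min(), eigs.max())
print("Marchenko-Pastur support:", mp_edges)
```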
In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. arXiv:1906.11300 characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum $\ell_2$-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of arXiv:1906.11300, e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative.
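For concreteness, a minimal NumPy sketch (independent of the paper's analysis) of the two estimators discussed above: the minimum $\ell_2$-norm interpolator and its ridge counterpart computed through the dual form. The toy data and parameter values are assumptions made only for illustration.

```python
import numpy as np

# Overparameterized linear regression: minimum-l2-norm interpolation vs. ridge.
rng = np.random.default_rng(1)
n, p = 50, 500                                     # fewer samples than features
X = rng.standard_normal((n, p))
theta_true = np.zeros(p); theta_true[:5] = 1.0
y = X @ theta_true + 0.5 * rng.standard_normal(n)

theta_min_norm = np.linalg.pinv(X) @ y             # interpolates y exactly
lam = 1.0                                          # ridge parameter (the paper asks when lam < 0 is optimal)
theta_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)   # dual form of ridge

print("training residual (min-norm):", np.linalg.norm(X @ theta_min_norm - y))
print("||theta_ridge - theta_true||:", np.linalg.norm(theta_ridge - theta_true))
```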
Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference; developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.
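As a rough illustration of the sample-splitting-plus-self-normalization recipe for the one-sample mean problem (the exact statistic in the paper may differ in its details), consider the following sketch; the function name and toy data are assumptions for illustration.

```python
import numpy as np

# Split the sample, estimate a mean direction from one half, project the other
# half onto it, and studentize; the result is asymptotically N(0,1) under H0.
def cross_mean_test(X, rng):
    n = X.shape[0]
    idx = rng.permutation(n)
    X1, X2 = X[idx[: n // 2]], X[idx[n // 2:]]
    direction = X2.mean(axis=0)            # estimated mean direction from second half
    h = X1 @ direction                     # projections of first half onto it
    return np.sqrt(len(h)) * h.mean() / h.std(ddof=1)

rng = np.random.default_rng(2)
X_null = rng.standard_normal((400, 50))    # H0: E[X] = 0 holds
X_alt = X_null + 0.2                       # small mean shift
print("stat under H0:", cross_mean_test(X_null, rng))
print("stat under H1:", cross_mean_test(X_alt, rng))
```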
The Johnson-Lindenstrauss guarantee ensures that random projections preserve certain topological structure when embedding high-dimensional deterministic vectors into a low-dimensional space. In this work, we seek to understand how random projections affect the norm of a random vector. In particular, we show that the distribution of the norm of a random vector $X \in \mathbb{R}^n$, whose entries are i.i.d. random variables, is preserved by a random projection $S: \mathbb{R}^n \to \mathbb{R}^m$. More precisely,
\[
\frac{X^T S^T S X - mn}{\sqrt{\sigma^2 m^2 n + 2mn^2}} \xrightarrow[\; m/n \to 0 \;]{m, n \to \infty} \mathcal{N}(0,1).
\]
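A quick Monte Carlo check of this limit, under the additional assumption that both $X$ and $S$ have i.i.d. standard Gaussian entries (so that $\sigma^2 = \operatorname{Var}(x_i^2) = 2$); the sizes below are illustrative only.

```python
import numpy as np

# Standardize X^T S^T S X as in the displayed limit and check mean/std.
rng = np.random.default_rng(3)
n, m, trials = 1000, 20, 1000          # m/n kept small, as in the stated limit
sigma2 = 2.0                           # Var(x_i^2) for standard Gaussian entries
stats = np.empty(trials)
for t in range(trials):
    X = rng.standard_normal(n)
    S = rng.standard_normal((m, n))
    q = np.sum((S @ X) ** 2)           # X^T S^T S X
    stats[t] = (q - m * n) / np.sqrt(sigma2 * m**2 * n + 2 * m * n**2)

print("empirical mean (should be ~ 0):", round(stats.mean(), 3))
print("empirical std  (should be ~ 1):", round(stats.std(), 3))
```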
The teacher-student model provides a framework in which the typical-case behavior of high-dimensional supervised learning can be described in closed form. However, the assumption of Gaussian i.i.d. input data underlying the canonical teacher-student model may be regarded as too restrictive to capture the behavior of realistic data sets. In this paper, we introduce a Gaussian covariate generalization of the model in which the teacher and student can act on different spaces, generated by fixed but generic feature maps. While still solvable in closed form, this generalization is able to capture the learning curves of a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework. Our contribution is twofold: first, we prove rigorous formulas for the asymptotic training loss and generalization error. Second, we present a number of situations where the learning curves of the model capture those of realistic data sets learned with kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms, or with pre-learned features, e.g., features learned by training multi-layer neural networks. We discuss the power and limitations of the framework.
Constant-specified concentration inequalities are essential for finite-sample theory in machine learning and high-dimensional statistics. We obtain sharper, constant-specified concentration inequalities for sums of independent sub-Weibull random variables, which lead to a mixture of two tails: sub-Gaussian for small deviations and sub-Weibull for large deviations. These bounds are new and improve on existing bounds with sharper constants. In addition, a new sub-Weibull parameter is proposed, which recovers tight concentration inequalities for random variables (vectors). For statistical applications, we give an $\ell_2$-error bound for the estimated coefficients in negative binomial regression when the heavy-tailed covariates are sub-Weibull distributed with a sparse structure, which is a new result for negative binomial regression. Among applications to random matrices, we derive a non-asymptotic version of the Bai-Yin theorem for sub-Weibull entries with exponential tail bounds. Finally, by demonstrating a confidence region for a log-truncated Z-estimator without the second-moment condition, we discuss and define sub-Weibull-type robust estimators for independent observations $\{X_i\}_{i=1}^{n}$ without exponential moment conditions.
Markowitz mean-variance portfolios with sample mean and covariance as input parameters suffer from numerous issues in practice. They perform poorly out of sample due to estimation error, and they exhibit extreme weights together with high sensitivity to changes in the input parameters. The heavy-tail characteristics of financial time series are in fact the cause of these erratic fluctuations of weights, which consequently create substantial transaction costs. To robustify the weights, we present a toolbox for stabilizing costs and weights for global minimum-variance Markowitz portfolios. Utilizing a projected gradient descent (PGD) technique, we avoid the estimation and inversion of the covariance operator as a whole and concentrate on robust estimation of the gradient descent increment. Using modern tools of robust statistics, we construct a computationally efficient estimator with almost Gaussian properties based on median-of-means, uniformly over weights. This robustified Markowitz approach is confirmed by empirical studies on equity markets. We demonstrate that robustified portfolios reach the lowest turnover compared to shrinkage-based and constrained portfolios while preserving or slightly improving out-of-sample performance.
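A hedged sketch of the general idea, not the paper's exact estimator: projected gradient descent for the global minimum-variance portfolio in which the gradient $2\Sigma w$ is replaced by a coordinate-wise median-of-means estimate over blocks of returns. The block count, step size, and synthetic heavy-tailed returns are assumptions.

```python
import numpy as np

def robust_min_variance(returns, n_blocks=10, lr=0.01, iters=1000):
    n, d = returns.shape
    w = np.full(d, 1.0 / d)                        # start from equal weights
    blocks = np.array_split(np.arange(n), n_blocks)
    for _ in range(iters):
        # per-block estimates of the gradient d/dw (w' Sigma w) = 2 Sigma w
        grads = np.stack([2.0 * returns[b].T @ (returns[b] @ w) / len(b) for b in blocks])
        g = np.median(grads, axis=0)               # median of means, coordinate-wise
        w = w - lr * g
        w = w - (w.sum() - 1.0) / d                # project back onto {w : sum(w) = 1}
    return w

rng = np.random.default_rng(4)
scales = np.linspace(0.5, 2.0, 25)
R = rng.standard_t(df=3, size=(1000, 25)) * scales  # heavy-tailed synthetic returns
w = robust_min_variance(R)
print("weights sum to one:", w.sum())
print("portfolio variance:", w @ np.cov(R, rowvar=False) @ w)
```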
For graph-valued data sampled from a distribution $\mu$, sample moments are computed with respect to a chosen metric. In this work, we equip the set of graphs with the pseudometric defined by the $\ell_2$ norm between the eigenvalues of the respective adjacency matrices. We use this pseudometric and the respective sample moments of a graph-valued data set to infer the parameters of a distribution $\hat{\mu}$, which we interpret as an approximation of $\mu$. We verify experimentally that complex distributions $\mu$ can be approximated well in this way.
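A minimal sketch of the pseudometric itself, assuming graphs on a common number of nodes represented by symmetric adjacency matrices; the Erdos-Renyi toy graphs below are for illustration only.

```python
import numpy as np

# l2 distance between the sorted adjacency-matrix eigenvalues of two graphs.
def spectral_pseudometric(A, B):
    ea = np.sort(np.linalg.eigvalsh(A))
    eb = np.sort(np.linalg.eigvalsh(B))
    return np.linalg.norm(ea - eb)

rng = np.random.default_rng(5)
def er_graph(n, p):
    """Symmetric adjacency matrix of an Erdos-Renyi graph G(n, p)."""
    A = (rng.random((n, n)) < p).astype(float)
    A = np.triu(A, 1)
    return A + A.T

A, B, C = er_graph(60, 0.1), er_graph(60, 0.1), er_graph(60, 0.5)
print("same edge density :", spectral_pseudometric(A, B))   # relatively small
print("diff edge density :", spectral_pseudometric(A, C))   # relatively large
```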
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
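A short sketch of the prototype algorithm in this framework, the randomized range finder followed by a deterministic SVD of the reduced matrix; the oversampling amount and matrix sizes are illustrative choices.

```python
import numpy as np

# Randomized SVD: random sampling captures the range of A, then the small
# compressed matrix is factorized deterministically.
def randomized_svd(A, k, oversample=10, seed=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))   # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for range(A @ Omega)
    B = Q.T @ A                                        # small (k + oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(6)
A = rng.standard_normal((500, 40)) @ rng.standard_normal((40, 300))   # rank-40 matrix
U, s, Vt = randomized_svd(A, k=40, seed=0)
print("relative error:", np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```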
The spectra of random feature matrices provide essential information on the conditioning of the linear system used in random feature regression problems and are thus connected to the consistency and generalization of random feature models. Random feature matrices are asymmetric rectangular nonlinear matrices depending on two input variables, the data and the weights, which can make their characterization challenging. We consider two settings for the two input variables: either both are random variables, or one is a random variable and the other is well-separated, i.e., there is a minimum distance between points. With conditions on the dimension, the complexity ratio, and the sampling variance, we show that the singular values of these matrices concentrate near their full expectation and near one with high probability. In particular, since the dimension depends only on the logarithm of the number of random weights or the number of data points, our complexity bounds can be achieved even in moderate dimensions for many practical settings. The theoretical results are verified with numerical experiments.
We investigate a general matrix factorization for deviance-based data losses, extending the ubiquitous singular value decomposition beyond squared error loss. While similar approaches have been explored before, our method leverages classical statistical methodology from generalized linear models (GLMs) and provides an efficient algorithm that is flexible enough to allow for structural zeros and entry weights. Moreover, by adapting results from GLM theory, we provide support for these decompositions by (i) showing strong consistency under the GLM setup, (ii) checking the adequacy of a chosen exponential family via a generalized Hosmer-Lemeshow test, and (iii) determining the rank of the decomposition via a maximum eigenvalue gap method. To further support our findings, we conduct simulation studies to assess robustness to decomposition assumptions and extensive case studies using benchmark datasets from image face recognition, natural language processing, network analysis, and biomedical studies. Our theoretical and empirical results indicate that the proposed decomposition is more flexible, general, and robust, and can thus provide improved performance when compared to similar methods. To facilitate applications, an R package with efficient model fitting and family and rank determination is also provided.
We systematically study the spectrum of kernel-based graph Laplacians (GL) constructed from high-dimensional and noisy random point clouds in the non-null setting, where the point cloud is sampled from a low-dimensional geometric object, such as a manifold, and corrupted by high-dimensional noise. We quantify how the signal and noise interact in different regimes of the signal-to-noise ratio (SNR) and report the resulting peculiar spectral behavior of the GL. In addition, we explore the choice of kernel bandwidth for the GL spectrum: different SNR regimes lead to different adaptive choices of bandwidth, which coincides with common practice on real data. The results provide theoretical support for practitioners working with noisy data sets.
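A small sketch of the setting, assuming a one-dimensional manifold (a circle) embedded in high dimension and corrupted by isotropic Gaussian noise; the bandwidth rule below is an ad hoc stand-in for the adaptive choices studied in the paper.

```python
import numpy as np

# Kernel-based graph Laplacian from a noisy high-dimensional point cloud.
rng = np.random.default_rng(9)
n, D, noise = 300, 200, 0.5
t = rng.uniform(0.0, 2.0 * np.pi, n)
clean = np.zeros((n, D))
clean[:, 0], clean[:, 1] = np.cos(t), np.sin(t)          # circle in the first two coordinates
X = clean + noise * rng.standard_normal((n, D)) / np.sqrt(D)

sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T           # pairwise squared distances
h = np.median(d2)                                        # ad hoc kernel bandwidth
W = np.exp(-d2 / h)                                      # Gaussian kernel affinities
L = np.diag(W.sum(axis=1)) - W                           # unnormalized graph Laplacian
eigs = np.linalg.eigvalsh(L)
print("smallest graph Laplacian eigenvalues:", np.round(eigs[:5], 4))
```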
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models, including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation, which exploits a certain tensor structure in their low-order observable moments (typically, of second- and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
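A hedged sketch of the core idea: tensor power iteration with deflation on an orthogonally decomposable symmetric third-order tensor. The robustness devices analyzed in the paper (multiple restarts, whitening from second-order moments) are omitted here.

```python
import numpy as np

def tensor_apply(T, u):
    # T(I, u, u): contract the last two modes with u
    return np.einsum('ijk,j,k->i', T, u, u)

def tensor_power_method(T, n_components, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    T = T.copy()
    lams, vecs = [], []
    for _ in range(n_components):
        u = rng.standard_normal(T.shape[0]); u /= np.linalg.norm(u)
        for _ in range(iters):
            u = tensor_apply(T, u); u /= np.linalg.norm(u)
        lam = np.einsum('ijk,i,j,k->', T, u, u, u)
        lams.append(lam); vecs.append(u)
        T = T - lam * np.einsum('i,j,k->ijk', u, u, u)   # deflate the found component
    return np.array(lams), np.array(vecs)

# Build T = sum_k lambda_k v_k^{(x)3} with orthonormal v_k and recover the lambdas.
d = 6
V = np.linalg.qr(np.random.default_rng(10).standard_normal((d, d)))[0][:, :3]
lam_true = np.array([3.0, 2.0, 1.0])
T = sum(l * np.einsum('i,j,k->ijk', v, v, v) for l, v in zip(lam_true, V.T))
lams, vecs = tensor_power_method(T, 3)
print("recovered eigenvalues:", np.round(lams, 3))
```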
This survey aims to provide an introduction to linear models and the theory behind them. Our goal is to give a rigorous introduction to readers with prior exposure to ordinary least squares. In machine learning, the output is usually a nonlinear function of the input. Deep learning even aims to find nonlinear dependencies with many layers, which requires a large amount of computation. However, most of these algorithms build on simple linear models. We therefore describe linear models from different views and develop the properties and theory behind them. The linear model is the main technique in regression problems, and its primary tool is the least squares approximation, which minimizes the sum of squared errors. This is a natural choice when we are interested in finding the regression function that minimizes the corresponding expected squared error. This survey is primarily a summary of the purpose and significance of the important theories behind linear models, e.g., distribution theory and minimum variance estimators. We first describe ordinary least squares from three different points of view, upon which we disturb the model with random noise and Gaussian noise. Through Gaussian noise, the model gives rise to a likelihood, so we introduce the maximum likelihood estimator, and some distribution theory is developed under this Gaussian disturbance. The distribution theory of least squares will help us answer various questions and introduce related applications. We then prove that least squares is the best unbiased linear estimator in the sense of mean squared error and, most importantly, that it actually approaches the theoretical limit. We end with the linear model from the Bayesian approach and beyond.
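A minimal sketch of the survey's central object: ordinary least squares via the normal equations, which coincides with the maximum likelihood estimator under i.i.d. Gaussian noise. The toy design and coefficients are illustrative.

```python
import numpy as np

# OLS via the normal equations, plus the unbiased residual variance estimate.
rng = np.random.default_rng(11)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])   # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.3 * rng.standard_normal(n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                       # (X'X)^{-1} X'y
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - X.shape[1])    # unbiased variance estimate
print("beta_hat  :", np.round(beta_hat, 3))
print("sigma2_hat:", round(sigma2_hat, 4))
```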
The study of stability and sensitivity of statistical methods or algorithms with respect to their data is an important problem in machine learning and statistics. The performance of the algorithm under resampling of the data is a fundamental way to measure its stability and is closely related to generalization or privacy of the algorithm. In this paper, we study the resampling sensitivity for the principal component analysis (PCA). Given an $ n \times p $ random matrix $ \mathbf{X} $, let $ \mathbf{X}^{[k]} $ be the matrix obtained from $ \mathbf{X} $ by resampling $ k $ randomly chosen entries of $ \mathbf{X} $. Let $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ denote the principal components of $ \mathbf{X} $ and $ \mathbf{X}^{[k]} $. In the proportional growth regime $ p/n \to \xi \in (0,1] $, we establish the sharp threshold for the sensitivity/stability transition of PCA. When $ k \gg n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically orthogonal. On the other hand, when $ k \ll n^{5/3} $, the principal components $ \mathbf{v} $ and $ \mathbf{v}^{[k]} $ are asymptotically collinear. In words, we show that PCA is sensitive to the input data in the sense that resampling even a negligible portion of the input may completely change the output.
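A small simulation in the spirit of this setup, comparing the top principal components before and after resampling $k$ entries; at finite $n$ this only illustrates the trend across $k$, not the sharp $n^{5/3}$ threshold.

```python
import numpy as np

def resample_entries(X, k, rng):
    Xk = X.copy()
    n, p = X.shape
    idx = rng.choice(n * p, size=k, replace=False)
    Xk.ravel()[idx] = rng.standard_normal(k)           # redraw the chosen entries
    return Xk

def top_pc(X):
    # top right singular vector (principal component)
    return np.linalg.svd(X, full_matrices=False)[2][0]

rng = np.random.default_rng(12)
n, p = 300, 150
X = rng.standard_normal((n, p))
v = top_pc(X)
for k in (100, 10_000, 40_000):
    vk = top_pc(resample_entries(X, k, rng))
    print(f"k={k:6d}  |<v, v^[k]>| = {abs(v @ vk):.3f}")
```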
We study the distribution of singular values of products of random matrices pertinent to the analysis of deep neural networks. The matrices resemble products of sample covariance matrices; an important difference, however, is that the population covariance matrices, which in statistics and random matrix theory are assumed to be non-random, or random but independent of the random data matrices, are now certain functions of the random data matrices (the synaptic weight matrices in deep neural network terminology). The problem has been treated in recent work [25, 13] using techniques of free probability theory. Since free probability theory deals with population covariance matrices that are independent of the data matrices, however, its applicability here has to be justified. A justification was given in [22] for Gaussian data matrices with independent entries, a standard analytical model of free probability, using a version of the techniques of random matrix theory. In this paper, using another, more streamlined version of the techniques of random matrix theory, we generalize the results of [22] to the case where the entries of the synaptic weight matrices are just independent identically distributed random variables with zero mean and finite fourth moment. In particular, this extends the property of so-called macroscopic universality to the random matrices under consideration.
This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local statistics to derive practical implications for deep neural networks based on a realistic model of their Hessians. In particular, we derive universal aspects of outliers in the spectra of deep neural networks and demonstrate the important role of random matrix local laws in popular pre-conditioned gradient descent algorithms. We also present insights into deep neural network loss surfaces from quite general arguments based on tools from statistical physics and random matrix theory.
We give the first polynomial-time, polynomial-sample, differentially private estimator for arbitrary Gaussian distributions $\mathcal{N}(\mu, \Sigma)$ in $\mathbb{R}^d$. All previous estimators are either nonconstructive, with unbounded running time, or require the user to specify a priori bounds on the parameters $\mu$ and $\Sigma$. The primary new technical tool in our algorithm is a new differentially private preconditioner that takes samples from an arbitrary Gaussian $\mathcal{N}(0, \Sigma)$ and returns a matrix $A$ such that $A \Sigma A^T$ has constant condition number.
This paper is concerned with the asymptotic distribution of the largest eigenvalues of some nonlinear random matrix ensembles arising in the study of neural networks. More precisely, we consider $M = \frac{1}{m} YY^\top$ with $Y = f(WX)$, where $W$ and $X$ are random rectangular matrices with i.i.d. centered entries. This models the data covariance matrix or the conjugate kernel of a single-layer random feed-forward neural network. The function $f$ is applied entrywise and can be seen as the activation function of the neural network. We show that the largest eigenvalue has the same limit (in probability) as that of some well-known linear random matrix ensembles. In particular, we relate the asymptotic limit of the largest eigenvalue of the nonlinear model to that of an information-plus-noise random matrix, establishing a possible phase transition depending on the function $f$ and the distribution of $W$ and $X$. This may be of interest for applications to machine learning.
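A quick numerical illustration of the ensemble $M = \frac{1}{m} Y Y^\top$ with $Y = f(WX)$ for two activation functions; the scalings below are conventional choices for such models, not necessarily those of the paper.

```python
import numpy as np

# Largest eigenvalue of the conjugate-kernel-type matrix M = (1/m) Y Y^T.
rng = np.random.default_rng(13)
n0, n1, m = 400, 400, 800
X = rng.standard_normal((n0, m))                      # data matrix, i.i.d. centered entries
W = rng.standard_normal((n1, n0)) / np.sqrt(n0)       # weight matrix, entries of size O(1/sqrt(n0))

for name, f in [("tanh", np.tanh), ("relu", lambda z: np.maximum(z, 0.0))]:
    Y = f(W @ X)                                      # f applied entrywise
    M = (Y @ Y.T) / m
    print(name, "largest eigenvalue of M:", round(float(np.linalg.eigvalsh(M)[-1]), 3))
```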
We study meta-learning for support recovery (i.e., recovery of the set of non-zero entries) in high-dimensional principal component analysis. We reduce the sufficient sample complexity in a novel task with information learned from auxiliary tasks. We assume that each task is a different random principal component (PC) matrix with a possibly different support, and that the union of the supports of the PC matrices is small. We then pool the data from all the tasks to perform an improper estimation of a single PC matrix by maximizing the $\ell_1$-regularized predictive covariance, and establish that, with high probability, the true support union can be recovered, provided a sufficient number of tasks $m$ and a sufficient number of samples $O\left(\frac{\log(p)}{m}\right)$ for each task, for $p$-dimensional vectors. Then, for a novel task, we prove that maximizing the $\ell_1$-regularized predictive covariance with the additional constraint that the support is a subset of the estimated support union reduces the sufficient sample complexity of successful support recovery to $O(\log |J|)$, where $J$ is the support union recovered from the auxiliary tasks. Typically, $|J|$ is much smaller than $p$ for sparse matrices. Finally, we demonstrate the validity of our results through numerical simulations.