Generalized Eigenvalue Problems (GEPs) encompass a range of interesting dimensionality reduction methods. Development of efficient stochastic approaches to these problems would allow them to scale to larger datasets. Canonical Correlation Analysis (CCA) is one example of a GEP for dimensionality reduction which has found extensive use in problems with two or more views of the data. Deep learning extensions of CCA require large mini-batch sizes, and therefore large memory consumption, in the stochastic setting to achieve good performance, and this has limited their application in practice. Inspired by the Generalized Hebbian Algorithm, we develop an approach to solving stochastic GEPs in which all constraints are softly enforced by Lagrange multipliers. Then, by considering the integral of this Lagrangian function (its pseudo-utility) and inspired by recent formulations of Principal Components Analysis and GEPs as games with differentiable utilities, we develop a game-theory-inspired approach to solving GEPs. We show that our approaches share much of the theoretical grounding of the previous Hebbian and game-theoretic approaches in the linear case, but, unlike them, our method permits extension to general function approximators such as neural networks for certain dimensionality-reduction GEPs, including CCA; it can therefore be used for deep multiview representation learning. We demonstrate the effectiveness of our method for solving GEPs in the stochastic setting on canonical multiview datasets and demonstrate state-of-the-art performance for optimizing Deep CCA.
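For orientation, the linear CCA instance of this problem can be written as a GEP $Av = \lambda Bv$, with $A$ holding the cross-covariances and $B$ the within-view covariances. The NumPy sketch below solves this classical, full-batch case via a Cholesky reduction; it is an illustrative baseline only (function and variable names are ours), not the stochastic Hebbian/game-theoretic method described in the abstract:

```python
import numpy as np

def linear_cca(X, Y, k=1, reg=1e-3):
    """Full-batch linear CCA as the GEP A v = lambda B v, with
    A = [[0, Cxy], [Cyx, 0]] and B = blockdiag(Cxx, Cyy).
    Illustrative classical baseline, not the stochastic method above."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n, p = X.shape
    q = Y.shape[1]
    Cxx = X.T @ X / n + reg * np.eye(p)
    Cyy = Y.T @ Y / n + reg * np.eye(q)
    Cxy = X.T @ Y / n
    A = np.block([[np.zeros((p, p)), Cxy], [Cxy.T, np.zeros((q, q))]])
    B = np.block([[Cxx, np.zeros((p, q))], [np.zeros((q, p)), Cyy]])
    # Reduce the GEP to an ordinary symmetric eigenproblem via Cholesky:
    # B = L L^T, so A v = lambda B v  <=>  (L^-1 A L^-T) w = lambda w.
    L = np.linalg.cholesky(B)
    Linv = np.linalg.inv(L)
    vals, W = np.linalg.eigh(Linv @ A @ Linv.T)
    order = np.argsort(vals)[::-1]  # top eigenvalues = canonical correlations
    return vals[order][:k], (Linv.T @ W)[:, order][:, :k]

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))  # shared latent signal across the two views
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
Y = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 1))])
corrs, _ = linear_cca(X, Y, k=1)  # top canonical correlation, close to 1
```

The top eigenvalue of this GEP equals the top canonical correlation, which is near 1 here because both views share the latent signal `z`.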
The generalized eigenvalue problem (GEP) is a fundamental concept in numerical linear algebra. It captures the solution to many classical machine learning problems such as canonical correlation analysis, independent component analysis, partial least squares, linear discriminant analysis, principal components, successor features, and others. Despite this, most general-purpose solvers are prohibitively expensive when dealing with massive datasets, and research has instead concentrated on finding efficient solutions for specific problem instances. In this work, we develop a game-theoretic formulation of the top-$k$ GEP whose Nash equilibrium is the set of generalized eigenvectors. We also present a practical algorithm with guaranteed asymptotic convergence to the Nash equilibrium. Current state-of-the-art methods require $\mathcal{O}(d^2 k)$ complexity, which is prohibitively expensive when the number of dimensions ($d$) is large. We show how to achieve $\mathcal{O}(dk)$ complexity, scaling to datasets $100\times$ larger than those evaluated by previous methods. Empirically, we demonstrate that our algorithm is able to solve a variety of GEP problem instances, including a large-scale analysis of neural network activations.
Many machine learning problems encode their data as a matrix with a possibly very large number of rows and columns. In several applications like neuroscience, image compression or deep reinforcement learning, the principal subspace of such a matrix provides a useful, low-dimensional representation of individual data. Here, we are interested in determining the $d$-dimensional principal subspace of a given matrix from sample entries, i.e. from small random submatrices. Although a number of sample-based methods exist for this problem (e.g. Oja's rule \citep{oja1982simplified}), these assume access to full columns of the matrix or particular matrix structure such as symmetry and cannot be combined as-is with neural networks \citep{baldi1989neural}. In this paper, we derive an algorithm that learns a principal subspace from sample entries, can be applied when the approximate subspace is represented by a neural network, and hence can be scaled to datasets with an effectively infinite number of rows and columns. Our method consists in defining a loss function whose minimizer is the desired principal subspace, and constructing a gradient estimate of this loss whose bias can be controlled. We complement our theoretical analysis with a series of experiments on synthetic matrices, the MNIST dataset \citep{lecun2010mnist} and the reinforcement learning domain PuddleWorld \citep{sutton1995generalization} demonstrating the usefulness of our approach.
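Oja's rule, cited above as a prior sample-based method, can be sketched in a few lines. This illustrative version assumes access to full rows of the data matrix, which is exactly the limitation the abstract points out; the function name and hyper-parameters are ours:

```python
import numpy as np

def oja_subspace(X, d, lr=0.01, epochs=20, seed=0):
    """Oja's rule for the top-d principal subspace, one sample at a time.
    Illustrative baseline only: it needs full rows of X, whereas the paper's
    contribution is learning the subspace from individual matrix entries."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(size=(p, d)) / np.sqrt(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            y = W.T @ x                       # project sample onto current basis
            W += lr * np.outer(x - W @ y, y)  # Oja update: Hebbian term + decay
    Q, _ = np.linalg.qr(W)                    # orthonormalize for comparison
    return Q

rng = np.random.default_rng(1)
# data with a dominant 2-D subspace plus small isotropic noise
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]
X = rng.normal(size=(2000, 2)) @ basis.T + 0.05 * rng.normal(size=(2000, 10))
Q = oja_subspace(X, d=2)
# Frobenius norm of basis^T Q approaches sqrt(2) when the subspaces coincide
overlap = np.linalg.norm(basis.T @ Q)
```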
Principal component analysis (PCA) is the workhorse tool for dimensionality reduction in the era of big data. While often overlooked, the purpose of PCA is not only to reduce data dimensionality, but also to yield features that are uncorrelated. Furthermore, the ever-increasing volume of data in the modern world often requires storage of data samples across multiple machines, which precludes the use of centralized PCA algorithms. This paper focuses on the dual objectives of PCA, namely, dimensionality reduction and decorrelation of features, but in a distributed setting. This requires estimating the eigenvectors of the data covariance matrix, as opposed to only estimating the subspace spanned by the eigenvectors, when data are distributed across a network of machines. Although several distributed solutions to the PCA problem have recently been proposed, the convergence guarantees and/or communication overheads of these solutions remain a concern. With an eye toward communication efficiency, this paper introduces a feedforward-neural-network-based, one-time-scale distributed PCA algorithm, termed Distributed Sanger's Algorithm (DSA), which estimates the eigenvectors of the data covariance matrix when data are distributed across an undirected, connected network of machines. Furthermore, the proposed algorithm is shown to converge linearly to a neighborhood of the true solution. Numerical results are also provided to demonstrate the efficacy of the proposed solution.
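The centralized Sanger's rule (the Generalized Hebbian Algorithm) that DSA distributes can be sketched as follows. Unlike pure subspace methods, it recovers the individual eigenvectors, i.e., uncorrelated features. This is an illustrative single-machine sketch with our own naming, not the networked DSA update:

```python
import numpy as np

def sanger(X, d, lr=0.005, epochs=30, seed=0):
    """Centralized Sanger's rule (Generalized Hebbian Algorithm): recovers the
    top-d eigenvectors of the data covariance individually, not just their
    span. DSA distributes this update over a network of machines; this
    single-machine version is for illustration only."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(size=(p, d)) / np.sqrt(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = X[i]
            y = W.T @ x
            # Sanger deflation: component j is only corrected by components
            # k <= j, which decouples the features (triu in this
            # features-by-components convention).
            W += lr * (np.outer(x, y) - W @ np.triu(np.outer(y, y)))
    return W / np.linalg.norm(W, axis=0)

rng = np.random.default_rng(2)
# covariance with well-separated eigenvalues along the coordinate axes
scales = np.array([3.0, 2.0, 1.0, 0.5, 0.25])
X = rng.normal(size=(3000, 5)) * scales
W = sanger(X, d=2)
# each learned column should align with a coordinate axis (up to sign)
```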
This survey aims to provide an introduction to linear models and the theories behind them. Our goal is to give a rigorous introduction to readers with prior exposure to ordinary least squares. In machine learning, the output is usually a nonlinear function of the input. Deep learning even aims to find nonlinear dependencies with many layers, which requires a large amount of computation. However, most of these algorithms build upon simple linear models. We then describe linear models from different views and find the properties and theories behind the models. The linear model is the main technique in regression problems, and its primary tool is the least squares approximation, which minimizes the sum of squared errors. This is a natural choice when we are interested in finding the regression function that minimizes the corresponding expected squared error. This survey is primarily a summary of the purpose and significance of the important theories behind linear models, e.g., distribution theory and minimum-variance estimators. We first describe ordinary least squares from three different points of view, and then disturb the model with random noise and Gaussian noise. With Gaussian noise, the model gives rise to a likelihood, so we introduce the maximum likelihood estimator. Some distribution theory is also developed through this Gaussian disturbance. The distribution theory of least squares will help us answer various questions and introduce related applications. We then prove that least squares is the best linear unbiased estimator in the mean-squared-error sense and, most importantly, that it actually approaches the theoretical limit. We conclude with linear models under the Bayesian approach and beyond.
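The survey's central object, ordinary least squares for $y = X\beta + \varepsilon$, fits in a few lines. A minimal sketch (data and names are ours) checking that the normal-equation solution agrees with a library solver:

```python
import numpy as np

# Ordinary least squares for y = X beta + eps with Gaussian noise.
rng = np.random.default_rng(4)
n, p = 200, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)   # Gaussian disturbance

beta_ne = np.linalg.solve(X.T @ X, X.T @ y)    # (X^T X)^{-1} X^T y
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

agree = np.allclose(beta_ne, beta_ls)          # both minimize ||y - X beta||^2
err = np.linalg.norm(beta_ne - beta_true)      # small when n >> p, low noise
```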
Self-supervised learning (SSL) surmises that inputs and pairwise positive relationships are enough to learn meaningful representations. Although SSL has recently reached a milestone, outperforming supervised methods in many modalities, the theoretical foundations are limited, method-specific, and fail to provide principled design guidelines to practitioners. In this paper, we propose a unifying framework under the helm of spectral manifold learning to address those limitations. Through the course of this study, we will rigorously demonstrate that VICReg, SimCLR, BarlowTwins, et al. correspond to eponymous spectral methods such as Laplacian eigenmaps, multidimensional scaling, etc. This unification will then allow us to obtain (i) the closed-form optimal representation for each method, (ii) the closed-form optimal network parameters in the linear regime for each method, (iii) the impact of the pairwise relations used during training on each of those quantities and on downstream task performance, and, most importantly, (iv) the first theoretical bridge between contrastive and non-contrastive methods, toward global and local spectral embedding methods respectively, hinting at the benefits and limitations of each. For example, (i) if the pairwise relation is aligned with the downstream task, any SSL method can be employed successfully and will recover supervised performance, but in the low-data regime, VICReg's invariance hyper-parameter should be high; (ii) if the pairwise relation is misaligned with the downstream task, VICReg with a small invariance hyper-parameter should be preferred over SimCLR or BarlowTwins.
The dual tasks of quantum Hamiltonian learning and quantum Gibbs sampling are relevant to many important problems in physics and chemistry. In the low-temperature regime, algorithms for these tasks often suffer from intractabilities, for example from poor sample or time complexity. With the aim of addressing such intractabilities, we introduce a generalization of quantum natural gradient descent to parameterized mixed states, as well as provide a robust first-order approximating algorithm, quantum-probabilistic mirror descent. We prove data sample efficiency for the dual tasks using tools from information geometry and quantum metrology, thus generalizing the seminal result of classical Fisher efficiency to a variational quantum algorithm for the first time. Our approaches extend previously sample-efficient techniques to allow for flexibility in model choice, including quantum-Hamiltonian-based models, which may circumvent intractable time complexities. Our first-order algorithm is derived using a novel quantum generalization of the classical mirror descent duality. Both results require a special choice of metric, namely, the Bogoliubov-Kubo-Mori metric. To test our proposed algorithms numerically, we compare their performance to existing baselines on the task of quantum Gibbs sampling for the transverse-field Ising model. Finally, we propose an initialization strategy leveraging geometric locality for modeling sequences of states, such as those arising from quantum-stochastic processes. We demonstrate its effectiveness empirically for both real and imaginary time evolution while defining a broader class of potential applications.
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
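The key algebraic fact that makes Kronecker factorization cheap is that a Kronecker product can be inverted factor by factor: $(A \otimes G)^{-1} = A^{-1} \otimes G^{-1}$, so $F^{-1}\,\mathrm{vec}(V) = \mathrm{vec}(G^{-1} V A^{-1})$ for symmetric factors. A small numerical check of this identity (toy shapes and names of our choosing, not the full K-FAC algorithm):

```python
import numpy as np

# If a Fisher block is approximated as F = A (kron) G, then F^{-1} v can be
# computed using only the small factors, since (A kron G)^{-1} = A^{-1} kron G^{-1}.
rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive-definite matrix (stand-in for a factor)."""
    M = rng.normal(size=(n, n))
    return M @ M.T + n * np.eye(n)

A, G = spd(4), spd(3)        # input-side and output-side factors
V = rng.normal(size=(3, 4))  # a gradient for a 3x4 weight matrix

# Naive route: build the 12x12 Kronecker product and invert it directly.
F = np.kron(A, G)
naive = (np.linalg.inv(F) @ V.reshape(-1, order="F")).reshape(3, 4, order="F")

# Kronecker-factored route: invert only the small factors.
# mat(F^{-1} vec(V)) = G^{-1} V A^{-1}   (column-major vec, symmetric A, G)
cheap = np.linalg.inv(G) @ V @ np.linalg.inv(A)

print(np.allclose(naive, cheap))  # True
```

The cost of the naive route scales with the cube of the product dimension, while the factored route only inverts the two small matrices, which is why the storage and inversion cost does not grow with the amount of data used to estimate the factors.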
Methods for supervised principal component analysis (SPCA) aim to incorporate label information into principal component analysis (PCA) so that the extracted features are more useful for a prediction task of interest. Prior work on SPCA has focused primarily on optimizing prediction error, and has ignored the value of maximizing the variance explained by the extracted features. We propose a new method for SPCA that jointly addresses both of these objectives, and empirically demonstrate that our method dominates existing approaches, i.e., outperforms them with respect to both prediction error and variation explained. Our approach accommodates arbitrary supervised learning losses and, through a statistical reformulation, provides a novel low-rank extension of generalized linear models.
We study the learning dynamics of self-predictive learning for reinforcement learning, a family of algorithms that learn representations by minimizing the prediction error of their own future latent representations. Despite its recent empirical success, such algorithms have an apparent defect: trivial representations (such as constants) minimize the prediction error, yet it is obviously undesirable to converge to such solutions. Our central insight is that careful designs of the optimization dynamics are critical to learning meaningful representations. We identify that a faster paced optimization of the predictor and semi-gradient updates on the representation, are crucial to preventing the representation collapse. Then in an idealized setup, we show self-predictive learning dynamics carries out spectral decomposition on the state transition matrix, effectively capturing information of the transition dynamics. Building on the theoretical insights, we propose bidirectional self-predictive learning, a novel self-predictive algorithm that learns two representations simultaneously. We examine the robustness of our theoretical insights with a number of small-scale experiments and showcase the promise of the novel representation learning algorithm with large-scale experiments.
Network data are frequently collected in a variety of applications, representing either directly measured or statistically inferred connections between features of interest. In an increasing number of domains, these networks are collected over time, such as interactions between users of a social media platform on different days, or across multiple subjects, such as in multi-subject studies of brain connectivity. When analyzing multiple large networks, dimensionality-reduction techniques are often used to embed the networks into a more tractable low-dimensional space. To this end, we develop a framework for principal components analysis (PCA) on collections of networks via a specialized tensor decomposition we term Semi-Symmetric Tensor PCA, or SS-TPCA. We derive computationally efficient algorithms for computing our proposed SS-TPCA decomposition and establish statistical efficiency of our approach under a standard low-rank signal-plus-noise model. Remarkably, we show that SS-TPCA achieves the same estimation accuracy as classical matrix PCA, with error proportional to the square root of the number of vertices in the network, rather than the number of edges as might be expected. Our framework inherits many of the strengths of classical PCA and is suitable for a wide range of unsupervised learning tasks, including identifying principal networks, isolating meaningful changepoints or outlying observations, and characterizing the "variability network" of the most varying edges. Finally, we demonstrate the effectiveness of our proposal on simulated data and on an example from empirical legal studies. The techniques used to establish our main consistency results are surprisingly straightforward and may find use in a variety of other network analysis problems.
The low-dimensional manifold hypothesis posits that the data found in many applications, such as those involving natural images, lie (approximately) on low-dimensional manifolds embedded in high-dimensional Euclidean spaces. In this setting, a typical neural network defines a function that takes a finite number of vectors in the embedding space as input. However, one often needs to consider evaluating the optimized network at points outside the training distribution. This paper considers the case in which the training data are distributed in a linear subspace of $\mathbb{R}^d$. We derive estimates on the variation of the learned function, defined by a neural network, in directions transversal to the subspace. We study the potential regularization effects associated with the depth of the network and noise in the codimension of the data manifold. We also present additional side effects in training due to the presence of noise.
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
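In its simplest form, the modular framework described above reduces to a few lines: sample the range of the matrix with a Gaussian test matrix, orthonormalize, then decompose the small compressed matrix deterministically. A minimal sketch with our own naming and a small oversampling parameter:

```python
import numpy as np

def randomized_svd(A, k, p=5, seed=0):
    """Basic randomized SVD: randomly sample the range of A, orthonormalize,
    then run a deterministic SVD on the compressed matrix. p is the
    oversampling parameter; no power iterations (an illustrative sketch)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.normal(size=(n, k + p))   # Gaussian test matrix
    Y = A @ Omega                         # sample the range of A
    Q, _ = np.linalg.qr(Y)                # orthonormal basis for the sample
    B = Q.T @ A                           # small (k+p) x n compressed matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]

rng = np.random.default_rng(3)
# (numerically) rank-5 matrix plus tiny noise
A = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 100))
A += 1e-6 * rng.normal(size=(200, 100))
U, s, Vt = randomized_svd(A, k=5)
err = np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A)  # tiny
```

Note that forming `Y = A @ Omega` touches `A` only through matrix-vector products, which is what enables the single-pass and sparse-friendly variants discussed in the abstract.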
Spectral methods, which represent data points by eigenvectors of kernel matrices or graph Laplacian matrices, have been a primary tool in unsupervised data analysis. In many application scenarios, parametrizing the spectral embedding by a neural network that can be trained on data samples provides a promising way to achieve automatic out-of-sample extension as well as computational scalability. Such an approach was taken in the original paper of SpectralNet (Shaham et al., 2018), which we call SpecNet1. The current paper introduces a new neural network approach, named SpecNet2, to compute spectral embeddings, which optimizes an equivalent objective of the eigen-problem and removes the orthogonalization layer in SpecNet1. SpecNet2 also allows separating the sampling of rows and columns of the graph affinity matrix by tracking the neighbors of each data point through the gradient formula. Theoretically, we show that any local minimizer of the new orthogonalization-free objective reveals the leading eigenvectors. Furthermore, global convergence for this new orthogonalization-free objective using a batch-based gradient descent method is proved. Numerical experiments demonstrate the improved performance and computational efficiency of SpecNet2 on simulated data and image datasets.
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models-including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation-which exploits a certain tensor structure in their low-order observable moments (typically, of second-and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
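The basic power-iteration primitive analyzed above maps $u \mapsto T(I, u, u)/\|T(I, u, u)\|$ for a symmetric third-order tensor $T$. A plain, non-robust sketch (no deflation and none of the paper's robustness modifications) on a toy tensor with a known orthogonal decomposition:

```python
import numpy as np

def tensor_power_iteration(T, iters=100, seed=0):
    """Plain tensor power method on a symmetric 3-way tensor:
    u <- T(I, u, u) / ||T(I, u, u)||. For a tensor with an orthogonal
    decomposition, a random start converges to one of the components.
    Illustrative sketch of the primitive the paper analyzes, without the
    deflation or robustness machinery."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=T.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(iters):
        v = np.einsum('ijk,j,k->i', T, u, u)   # contraction T(I, u, u)
        u = v / np.linalg.norm(v)
    lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # eigenvalue T(u, u, u)
    return lam, u

# Build T = 2 * e1^{(x)3} + 1 * e2^{(x)3} in R^3: an orthogonal decomposition
# with components e1, e2 and eigenvalues 2 and 1.
e1, e2 = np.eye(3)[0], np.eye(3)[1]
T = 2.0 * np.einsum('i,j,k->ijk', e1, e1, e1) + np.einsum('i,j,k->ijk', e2, e2, e2)
lam, u = tensor_power_iteration(T)
# (lam, u) should be one of the planted eigenpairs: (2, e1) or (1, e2)
```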
In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
In this paper, we tackle a significant challenge in PCA: heterogeneity. When data are collected from different sources with heterogeneous trends while still sharing some congruency, it is critical to extract shared knowledge while retaining the unique features of each source. To this end, we propose personalized PCA (PerPCA), which uses mutually orthogonal global and local principal components to encode both unique and shared features. We show that, under mild conditions, both unique and shared features can be identified and recovered by a constrained optimization problem, even if the covariance matrices are immensely different. Also, we design a fully federated algorithm based on distributed Stiefel gradient descent to solve the problem. The algorithm introduces a new group of operations, called generalized retractions, to handle orthogonality constraints, and only requires global PCs to be shared across sources. We prove linear convergence of the algorithm under suitable assumptions. Comprehensive numerical experiments highlight PerPCA's superior performance in feature extraction and prediction on heterogeneous datasets. As a systematic approach to decoupling shared and unique features from heterogeneous datasets, PerPCA finds applications in several tasks, including video segmentation, topic extraction, and distributed clustering.
Contrastive learning has achieved state-of-the-art performance in various self-supervised learning tasks and even outperforms its supervised counterpart. Despite its empirical success, theoretical understanding of why contrastive learning works is still limited. In this paper, (i) we provably show that contrastive learning outperforms the autoencoder, a classical unsupervised learning method, on both feature recovery and downstream tasks; (ii) we also illustrate the role of labeled data in supervised contrastive learning. This provides theoretical support for recent findings that contrastive learning with labels improves the performance of learned representations on in-domain downstream tasks, but can harm performance in transfer learning. We verify our theory with numerical experiments.
Motivated by the problem of online canonical correlation analysis, we propose the \emph{Stochastic Scaled-Gradient Descent} (SSGD) algorithm for minimizing the expectation of a stochastic function over a generic Riemannian manifold. SSGD generalizes the idea of projected stochastic gradient descent and allows the use of scaled stochastic gradients instead of stochastic gradients. In the special case of a spherical constraint, which arises in generalized eigenvector problems, we establish a nonasymptotic finite-sample bound of $\sqrt{1/T}$, and show that this rate is minimax optimal, up to a polylogarithmic factor of relevant parameters. On the asymptotic side, a novel trajectory-averaging argument allows us to achieve local asymptotic normality with a rate that matches that of Ruppert-Polyak-Juditsky averaging. We bring these ideas together in an application to online canonical correlation analysis, deriving, for the first time in the literature, an optimal one-time-scale algorithm with local asymptotic convergence to normality. Numerical studies of canonical correlation analysis on synthetic data are also provided.
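In the spherical special case mentioned above, the projected-SGD idea that SSGD generalizes amounts to a stochastic gradient step followed by a retraction (renormalization) onto the sphere. A minimal sketch for the top eigenvector of $\mathbb{E}[xx^\top]$, without the scaling matrix or trajectory averaging that the paper actually studies (all names and step-size choices are ours):

```python
import numpy as np

def projected_sgd_top_eig(draw, d, steps=5000, lr0=0.05, seed=0):
    """Projected stochastic gradient ascent on the unit sphere for the top
    eigenvector of E[x x^T] (equivalently, minimizing the negative Rayleigh
    quotient). Each step: stochastic gradient, then retract by normalizing.
    Illustrative only; SSGD additionally allows a scaling matrix and uses
    trajectory averaging to obtain its asymptotic guarantees."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for t in range(steps):
        x = draw(rng)                         # one fresh data sample
        g = (x @ w) * x                       # stochastic gradient of w^T C w / 2
        w = w + (lr0 / (1.0 + 0.01 * t)) * g  # decaying step size
        w /= np.linalg.norm(w)                # retraction onto the sphere
    return w

# samples with diagonal covariance; top eigenvector is the first axis
scales = np.array([2.0, 1.0, 0.5, 0.25])
draw = lambda rng: rng.normal(size=4) * scales
w = projected_sgd_top_eig(draw, d=4)
top_alignment = abs(w[0])  # close to 1 when w aligns with the top eigenvector
```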
We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We show that the magnitudes of the extremal eigenvalues of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the generalized Gauss-Newton approximation of the Hessian. As a consequence of our theorems, we derive an analytical expression for the maximal learning rate as a function of batch size, informing practical training regimens for both stochastic gradient descent (linear scaling) and adaptive algorithms such as Adam (square-root scaling) on smooth, non-convex deep neural networks. While the linear scaling rule for stochastic gradient descent has previously been derived under more restrictive conditions, which we generalize, the square-root scaling rule for adaptive optimizers is, to our knowledge, entirely novel. For stochastic second-order and adaptive methods, we derive that the minimal damping coefficient is proportional to the ratio of the learning rate to the batch size. We validate our claims with VGG/WideResNet architectures on the CIFAR-$100$ and ImageNet datasets. Based on our investigation of the sub-sampled Hessian, we develop a stochastic Lanczos quadrature based, on-the-fly learning rate and momentum learner, which avoids the need for expensive multiple evaluations of these key hyper-parameters and shows good preliminary results on a pre-activated residual architecture on CIFAR-$100$.