Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from O(n^3) to O(n^2). Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
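As a rough sketch of the blackbox matrix-multiplication access pattern BBMM builds on (not the GPyTorch implementation itself), the following computes a GP predictive mean using only matrix-vector products with the kernel via a plain conjugate gradients loop; the RBF kernel, noise level, and tolerance are placeholder assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, outputscale=1.0):
    # Squared-exponential kernel; hyperparameter values are placeholders.
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return outputscale * np.exp(-0.5 * d2 / lengthscale**2)

def conjugate_gradients(matmul, b, tol=1e-4, max_iter=1000):
    # Solve K x = b using only matrix-vector products with K (the "blackbox" access).
    x = np.zeros_like(b)
    r = b - matmul(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Kp = matmul(p)
        alpha = rs_old / (p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Toy usage: predictive mean of a GP regression model.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
K = rbf_kernel(X, X) + 0.1 * np.eye(200)          # K_hat = K + sigma^2 I
alpha = conjugate_gradients(lambda v: K @ v, y)   # K_hat^{-1} y via CG
X_test = rng.standard_normal((5, 2))
mean = rbf_kernel(X_test, X) @ alpha              # predictive mean
```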
Low-precision arithmetic has had a transformative effect on the training of neural networks, reducing computation, memory, and energy requirements. However, despite its promise, low-precision arithmetic has received little attention for Gaussian processes (GPs), largely because GPs require sophisticated linear algebra routines that are unstable in low precision. We study the different failure modes that can occur when training GPs in half precision. To circumvent these failure modes, we propose a multi-faceted approach involving conjugate gradients with re-orthogonalization, mixed precision, and preconditioning. Our approach significantly improves the numerical stability and practical performance of conjugate gradients in low precision over a wide variety of settings, enabling GPs to train on $1.8$ million data points in $10$ hours on a single GPU, without any sparse approximations.
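A toy illustration of the mixed-precision idea (not the paper's exact recipe): the expensive kernel matrix-vector products run in half precision, while the CG vectors and scalar reductions stay in single precision. Sizes and tolerances are assumptions.

```python
import numpy as np

def mixed_precision_cg(K, b, tol=1e-3, max_iter=500):
    """CG where the O(n^2) matvec runs in float16 and everything else stays in float32."""
    K16 = K.astype(np.float16)
    x = np.zeros_like(b, dtype=np.float32)
    r = b.astype(np.float32).copy()
    p = r.copy()
    rs_old = float(r @ r)
    for _ in range(max_iter):
        # Matrix-vector product in half precision, accumulated back in float32.
        Kp = (K16 @ p.astype(np.float16)).astype(np.float32)
        alpha = rs_old / float(p @ Kp)
        x += alpha * p
        r -= alpha * Kp
        rs_new = float(r @ r)
        if np.sqrt(rs_new) < tol * np.sqrt(float(b @ b)):
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```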
Gaussian process hyperparameter optimization requires linear solves and log-determinants over large kernel matrices. Iterative numerical techniques relying on the conjugate gradient method (CG) for the linear solves and on stochastic trace estimation for the log-determinant are becoming increasingly popular. This work introduces new algorithmic and theoretical insights for preconditioning these computations. While preconditioning is well understood in the context of CG, we demonstrate that it can also accelerate convergence and reduce the variance of the estimates of the log-determinant and its derivatives. We prove general probabilistic error bounds for the preconditioned computation of the log-determinant, the log marginal likelihood, and their derivatives. Additionally, we derive specific rates for a range of kernel-preconditioner combinations, showing that up to exponential convergence can be achieved. Our theoretical results enable provably efficient optimization of kernel hyperparameters, which we validate empirically on large-scale benchmark problems. There, our approach accelerates training by up to an order of magnitude.
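A minimal sketch of the stochastic trace estimation these iterative techniques rely on: Hutchinson probe vectors combined with preconditioned CG solves estimate $\operatorname{tr}(K^{-1}\,\partial K/\partial\theta)$, the derivative of the log-determinant. The diagonal (Jacobi) preconditioner here is only an illustrative stand-in; the paper analyzes far better choices.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def logdet_derivative_estimate(K, dK, num_probes=30, rng=None):
    """Estimate tr(K^{-1} dK), i.e. d/dtheta log det K(theta), with Hutchinson
    probe vectors and preconditioned CG solves."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    diag = np.diag(K)
    M = LinearOperator((n, n), matvec=lambda v: v / diag, dtype=K.dtype)
    total = 0.0
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=n)      # Rademacher probe vector
        Kinv_z, _ = cg(K, z, M=M)                # K^{-1} z via preconditioned CG
        total += Kinv_z @ (dK @ z)               # unbiased sample of tr(K^{-1} dK)
    return total / num_probes
```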
While recent work on conjugate gradient methods and Lanczos decompositions has achieved scalable Gaussian process inference, in several implementations these iterative methods appear to struggle with numerical instabilities when learning kernel hyperparameters, and with poor test likelihoods. By investigating CG tolerance, preconditioner rank, and Lanczos decomposition rank, we provide a particularly simple prescription to correct these issues: we recommend using a small CG tolerance ($\epsilon \leq 0.01$) and a large root decomposition size ($r \geq 5000$). Moreover, we show that L-BFGS-B is a compelling optimizer for iterative GPs, achieving convergence with fewer gradient updates.
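In GPyTorch, this prescription maps to context-manager settings; the setting names below are assumptions about the current API and should be checked against the installed version.

```python
import gpytorch

# Assumed setting names: a tight CG tolerance and a large Lanczos/root
# decomposition size, mirroring the prescription above.
with gpytorch.settings.cg_tolerance(0.01), \
     gpytorch.settings.max_root_decomposition_size(5000):
    # ...evaluate the marginal log likelihood and its gradients here, and feed
    # them to an L-BFGS-B optimizer (e.g. scipy.optimize.minimize(method="L-BFGS-B")).
    pass
```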
Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets. This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed, either explicitly or implicitly, to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis. The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
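A minimal sketch of the prototype algorithm from this line of work: a Gaussian random test matrix identifies a subspace capturing most of the action of $A$, after which a small deterministic SVD finishes the job. The oversampling amount is an arbitrary assumption.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, rng=None):
    """Rank-k truncated SVD via a randomized range finder."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    Omega = rng.standard_normal((n, k + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ Omega)                    # orthonormal basis for the sampled range
    B = Q.T @ A                                       # small (k + p) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_small[:, :k], s[:k], Vt[:k]
```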
We develop a computational procedure to estimate the covariance hyperparameters for semiparametric Gaussian process regression models with additive noise. Namely, the presented method can be used to efficiently estimate the variance of the correlated error and the variance of the noise based on maximizing the marginal likelihood function. Our approach involves suitably reducing the dimensionality of the hyperparameter space to simplify the estimation procedure to a univariate root-finding problem. Moreover, we derive bounds and asymptotes of the marginal likelihood function and its derivatives, which are useful for narrowing the initial range of the hyperparameter search. Using numerical examples, we demonstrate the computational advantages and robustness of the presented approach compared to traditional parameter optimization.
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse. It is derived by approximating various large blocks of the Fisher (corresponding to entire layers) as being the Kronecker product of two much smaller matrices. While only several times more expensive to compute than the plain stochastic gradient, the updates produced by K-FAC make much more progress optimizing the objective, which results in an algorithm that can be much faster than stochastic gradient descent with momentum in practice. And unlike some previously proposed approximate natural-gradient/Newton methods which use high-quality non-diagonal curvature matrices (such as Hessian-free optimization), K-FAC works very well in highly stochastic optimization regimes. This is because the cost of storing and inverting K-FAC's approximation to the curvature matrix does not depend on the amount of data used to estimate it, which is a feature typically associated only with diagonal or low-rank approximations to the curvature matrix.
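A minimal sketch of the Kronecker-factored update for a single fully connected layer: the layer's Fisher block is approximated as a Kronecker product of input-activation and output-gradient second-moment matrices, so applying its inverse to the gradient only needs two small solves. The damping value and statistics are placeholder assumptions.

```python
import numpy as np

def kfac_layer_update(grad_W, a_batch, g_batch, damping=1e-3):
    """
    Approximate natural-gradient step for one dense layer with weight W (out x in).
    grad_W : gradient dL/dW, shape (out, in)
    a_batch: layer inputs (activations), shape (batch, in)
    g_batch: gradients w.r.t. layer pre-activations, shape (batch, out)
    """
    # Kronecker factors of the layer's Fisher block: F ~= A (x) G
    A = a_batch.T @ a_batch / a_batch.shape[0]   # (in, in) input second moments
    G = g_batch.T @ g_batch / g_batch.shape[0]   # (out, out) output-gradient second moments
    A += damping * np.eye(A.shape[0])            # Tikhonov damping (placeholder value)
    G += damping * np.eye(G.shape[0])
    # (A (x) G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1}): two small solves
    return np.linalg.solve(G, grad_W) @ np.linalg.inv(A)
```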
When comparing approximate Gaussian process (GP) models, it can be helpful to be able to generate data from any GP. If we are interested in how approximate methods perform at scale, we may wish to generate very large synthetic datasets to evaluate them. Na\"{i}vely doing so would cost \(\mathcal{O}(n^3)\) flops and \(\mathcal{O}(n^2)\) memory to generate a size \(n\) sample. We demonstrate how to scale such data generation to large \(n\) whilst still providing guarantees that, with high probability, the sample is indistinguishable from a sample from the desired GP.
Gaussian processes (GPs) are an important tool in machine learning and statistics, with applications ranging from the social and natural sciences through engineering. They constitute a powerful kernelized non-parametric method with well-calibrated uncertainty estimates; however, off-the-shelf GP inference procedures are limited to datasets with a few thousand data points because of their cubic computational complexity. For this reason, many sparse GP techniques have been developed over the past years. In this paper, we focus on GP regression tasks and propose a new approach based on aggregating predictions from several local and correlated experts. Thereby, the degree of correlation between the experts can vary between independent and fully correlated experts. The individual predictions of the experts are aggregated taking their correlation into account, resulting in consistent uncertainty estimates. Our method recovers the independent product of experts, sparse GP, and full GP in limiting cases. The presented framework can deal with a general kernel function and multiple variables, and it has a time and space complexity that is linear in the number of experts and data samples, which makes our approach highly scalable. We demonstrate the superior performance of our proposed method against state-of-the-art GP approximation methods on synthetic as well as several real-world datasets, with deterministic and stochastic optimization.
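The abstract notes that the independent product of experts is recovered as a limiting case; as a point of reference (not the paper's correlated aggregation), the sketch below shows that precision-weighted combination of local expert predictions.

```python
import numpy as np

def product_of_experts(means, variances):
    """
    Independent product-of-experts aggregation of per-expert GP predictions
    at the same test points: combine by precision weighting.
    means, variances: arrays of shape (num_experts, num_test)
    """
    precisions = 1.0 / np.asarray(variances)
    var_agg = 1.0 / precisions.sum(axis=0)                    # aggregated predictive variance
    mean_agg = var_agg * (precisions * np.asarray(means)).sum(axis=0)
    return mean_agg, var_agg
```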
We present an algorithmic framework for quantum-inspired classical algorithms on close-to-low-rank matrices, generalizing the series of results started by Tang's breakthrough quantum-inspired algorithm for recommendation systems [STOC'19]. Motivated by quantum linear algebra algorithms and the quantum singular value transformation (SVT) framework of Gilyén, Su, Low, and Wiebe [STOC'19], we develop classical algorithms for SVT that run under suitable quantum-inspired sampling assumptions. Our results give compelling evidence that, in the corresponding QRAM data structure input model, quantum SVT does not yield exponential quantum speedups. Since the quantum SVT framework generalizes essentially all known techniques for quantum linear algebra, our results, combined with sampling lemmas from previous work, suffice to generalize all recent results about dequantizing quantum machine learning algorithms. In particular, our classical SVT framework recovers and often improves the dequantization results on recommendation systems, principal component analysis, supervised clustering, support vector machines, low-rank regression, and semidefinite program solving. We also give additional dequantization results for low-rank Hamiltonian simulation and discriminant analysis. Our improvements come from identifying the key feature of the quantum-inspired input model that lies at the core of all prior quantum-inspired results: $\ell^2$-norm sampling can approximate matrix products in time independent of their dimension. We reduce all our main results to this fact, making our exposition concise, self-contained, and intuitive.
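The key primitive named at the end of the abstract is length-squared ($\ell^2$-norm) sampling for approximate matrix products; a minimal sketch, with the number of samples left as an assumption:

```python
import numpy as np

def sampled_matrix_product(A, B, num_samples, rng=None):
    """Approximate A.T @ B by sampling rows of A (and B) with probability
    proportional to the squared row norms of A (length-squared sampling)."""
    rng = np.random.default_rng(rng)
    row_norms_sq = (A ** 2).sum(axis=1)
    p = row_norms_sq / row_norms_sq.sum()
    idx = rng.choice(A.shape[0], size=num_samples, p=p)
    # Importance-weighted sum of rank-one terms; unbiased estimator of A.T @ B.
    return sum(np.outer(A[i], B[i]) / (num_samples * p[i]) for i in idx)
```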
We provide guarantees for approximate Gaussian process (GP) regression resulting from two common low-rank kernel approximations: one based on random Fourier features, and one based on truncating the kernel's Mercer expansion. In particular, we bound the Kullback-Leibler divergence between an exact GP and one resulting from one of the aforementioned low-rank approximations to its kernel, as well as between their corresponding predictive densities. We also bound the error between the predictive mean vectors and between the predictive covariance matrices computed using the approximate GP versus those computed using the exact GP. We provide experiments on both simulated data and standard benchmarks to evaluate the effectiveness of our theoretical bounds.
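A minimal sketch of the first of the two approximations, random Fourier features for the RBF kernel; the lengthscale and number of features are placeholder assumptions.

```python
import numpy as np

def random_fourier_features(X, num_features=200, lengthscale=1.0, rng=None):
    """Feature map phi such that phi(X) @ phi(Y).T approximates the RBF kernel k(X, Y)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    W = rng.standard_normal((d, num_features)) / lengthscale  # spectral samples of the RBF kernel
    b = rng.uniform(0.0, 2.0 * np.pi, num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Usage: phi = random_fourier_features(X); K_approx = phi @ phi.T  (low-rank kernel approximation)
```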
Stochastic gradient descent (SGD) and its variants have established themselves as the go-to algorithms for large-scale machine learning problems with independent samples, due to their generalization performance and intrinsic computational advantages. However, the fact that the stochastic gradient is a biased estimator of the full gradient when samples are correlated has led to a lack of theoretical understanding of how SGD behaves in correlated settings and has hindered its use in such cases. In this paper, we focus on approximate hyperparameter estimation for Gaussian processes (GPs) and take a step toward breaking this barrier by proving that minibatch SGD converges to a critical point of the full log-likelihood loss function and recovers the model hyperparameters at a rate of $O(\frac{1}{K})$ in $K$ iterations, up to a statistical error term that depends on the minibatch size. Our theoretical guarantees hold provided that the kernel function exhibits exponential or polynomial eigendecay, which is satisfied by a wide range of kernels commonly used in GPs. Numerical studies on both simulated and real datasets demonstrate that minibatch SGD generalizes better than state-of-the-art GP methods while reducing the computational burden and opening a new, previously unexplored, data-size regime for GPs.
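A minimal sketch of one minibatch step, treating a random subsample as its own small GP marginal likelihood; the RBF kernel form, parameterization, and learning rate are assumptions rather than the paper's exact estimator.

```python
import numpy as np

def rbf(X, lengthscale):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2), d2

def minibatch_grad_step(X, y, log_ell, log_noise, lr=0.01, batch_size=256, rng=None):
    """One SGD step on the negative log marginal likelihood of a random minibatch."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(X.shape[0], size=min(batch_size, X.shape[0]), replace=False)
    Xb, yb = X[idx], y[idx]
    ell, noise = np.exp(log_ell), np.exp(log_noise)
    K, d2 = rbf(Xb, ell)
    Khat = K + noise * np.eye(len(idx))
    Kinv = np.linalg.inv(Khat)
    alpha = Kinv @ yb
    # dL/dtheta = 0.5 * tr((Kinv - alpha alpha^T) dK/dtheta) for the negative log likelihood
    aux = Kinv - np.outer(alpha, alpha)
    dK_dlogell = K * d2 / ell**2           # dK / d(log lengthscale)
    dK_dlognoise = noise * np.eye(len(idx))
    grad_ell = 0.5 * np.sum(aux * dK_dlogell)
    grad_noise = 0.5 * np.sum(aux * dK_dlognoise)
    return log_ell - lr * grad_ell, log_noise - lr * grad_noise
```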
The convergence rates of iterative methods for solving a linear system $\mathbf{A}x = b$ typically depend on the condition number of the matrix $\mathbf{A}$. Preconditioning is a common way of speeding up these methods by reducing that condition number in a computationally inexpensive way. In this paper, we revisit the decades-old problem of how best to improve $\mathbf{A}$'s condition number by left or right diagonal rescaling. We make progress on this problem in several directions. First, we provide new bounds for the classic heuristic of scaling $\mathbf{A}$ by its diagonal values (a.k.a. Jacobi preconditioning). We prove that this approach reduces $\mathbf{A}$'s condition number to within a quadratic factor of the best possible scaling. Second, we give a solver for structured mixed packing and covering semidefinite programs (MPC SDPs) which computes a near-optimal diagonal scaling of $\mathbf{A}$ in $\widetilde{O}(\text{nnz}(\mathbf{A}) \cdot \text{poly}(\kappa^\star))$ time; this matches the cost of solving the linear system after scaling, up to $\widetilde{O}(\text{poly}(\kappa^\star))$ factors. Third, we demonstrate that a sufficiently general width-independent MPC SDP solver would imply near-optimal runtimes for the scaling problems we consider, as well as for natural variants concerned with measures of average conditioning. Finally, we highlight connections of our preconditioning techniques to semi-random noise models, as well as applications to risk reduction in several statistical regression models.
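A minimal sketch of the classic heuristic the abstract discusses first, Jacobi (diagonal) preconditioning inside CG; the badly scaled test matrix is a placeholder.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

def jacobi_preconditioned_solve(A, b):
    """Solve A x = b with CG, preconditioned by the inverse of A's diagonal."""
    d = np.diag(A)
    M = LinearOperator(A.shape, matvec=lambda v: v / d, dtype=A.dtype)
    x, info = cg(A, b, M=M, maxiter=1000)
    return x

# Example: a badly row/column-scaled SPD matrix benefits from diagonal rescaling.
rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
scales = 10.0 ** rng.uniform(-3, 3, 200)
A = np.diag(scales) @ (B @ B.T + 200 * np.eye(200)) @ np.diag(scales)
x = jacobi_preconditioned_solve(A, rng.standard_normal(200))
```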
We develop heuristic interpolation methods for the functions $t \mapsto \log\det\left(\mathbf{A} + t\,\mathbf{B}\right)$ and $t \mapsto \operatorname{trace}\left((\mathbf{A} + t\,\mathbf{B})^{p}\right)$, where the matrices $\mathbf{A}$ and $\mathbf{B}$ are Hermitian and positive (semi-)definite and $p$ and $t$ are real variables. These functions feature prominently in many applications in statistics, machine learning, and computational physics. The proposed interpolation functions are based on modifications of sharp bounds for these functions. We demonstrate the accuracy and performance of the proposed method with numerical examples, namely the marginal maximum likelihood estimation for Gaussian process regression and the estimation of the regularization parameter of ridge regression with the generalized cross-validation method.
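A generic sketch of the underlying task, interpolating $t \mapsto \log\det(\mathbf{A} + t\mathbf{B})$ from a few exact Cholesky evaluations; the anchor points and the cubic-spline interpolant are illustrative assumptions, not the paper's bound-based functional forms.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def logdet(M):
    # log det of a symmetric positive-definite matrix via Cholesky
    return 2.0 * np.log(np.diag(np.linalg.cholesky(M))).sum()

def make_logdet_interpolant(A, B, t_anchors):
    """Fit an interpolant to t -> log det(A + t B) from a few exact evaluations."""
    values = [logdet(A + t * B) for t in t_anchors]
    return CubicSpline(t_anchors, values)

# Usage sketch: a handful of Cholesky factorizations, then cheap evaluation anywhere.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))
A = X @ X.T + 100 * np.eye(100)     # SPD
B = np.eye(100)                     # e.g. log det(K + t*I), as in a GP noise-variance search
f = make_logdet_interpolant(A, B, t_anchors=np.geomspace(1e-3, 1e3, 7))
approx = f(4.2)
```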
Models in which the covariance matrix has the structure of a sparse matrix plus a low rank perturbation are ubiquitous in machine learning applications. It is often desirable for learning algorithms to take advantage of such structures, avoiding costly matrix computations that often require cubic time and quadratic storage. This is often accomplished by performing operations that maintain such structures, e.g. matrix inversion via the Sherman-Morrison-Woodbury formula. In this paper we consider the matrix square root and inverse square root operations. Given a low rank perturbation to a matrix, we argue that a low-rank approximate correction to the (inverse) square root exists. We do so by establishing a geometric decay bound on the true correction's eigenvalues. We then proceed to frame the correction as the solution of an algebraic Riccati equation, and discuss how a low-rank solution to that equation can be computed. We analyze the approximation error incurred when approximately solving the algebraic Riccati equation, providing spectral and Frobenius norm forward and backward error bounds. Finally, we describe several applications of our algorithms, and demonstrate their utility in numerical experiments.
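As context for the structure-exploiting operations mentioned at the start, a minimal sketch of applying the inverse of a sparse-plus-low-rank matrix via the Sherman-Morrison-Woodbury formula; this illustrates the baseline identity, not the paper's square-root correction.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def sparse_plus_lowrank_solve(S, U, b):
    """Solve (S + U U^T) x = b via Sherman-Morrison-Woodbury, where S is sparse
    and U is n x k with small k, so only sparse solves and k x k algebra are needed."""
    lu = splu(sp.csc_matrix(S))
    Sinv_b = lu.solve(b)
    Sinv_U = lu.solve(U)                          # n x k
    cap = np.eye(U.shape[1]) + U.T @ Sinv_U       # small capacitance matrix
    return Sinv_b - Sinv_U @ np.linalg.solve(cap, U.T @ Sinv_b)
```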
Randomly pivoted Cholesky (RPCholesky) is a natural algorithm for computing a rank-$k$ approximation of an $N \times N$ positive-semidefinite (PSD) matrix. RPCholesky can be implemented with just a few lines of code. It requires only $(k+1)N$ entry evaluations and $O(k^2 N)$ additional arithmetic operations. This paper offers the first serious investigation of its experimental and theoretical behavior. Empirically, RPCholesky matches or improves on the performance of alternative algorithms for low-rank PSD approximation. Moreover, RPCholesky provably achieves near-optimal approximation guarantees. The simplicity, effectiveness, and robustness of this algorithm strongly support its use in scientific computing and machine learning applications.
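A minimal sketch of the algorithm, written against a dense input array for clarity (a practical implementation would evaluate only the $(k+1)N$ required entries):

```python
import numpy as np

def rp_cholesky(A, k, rng=None):
    """Randomly pivoted partial Cholesky: returns F with A ≈ F @ F.T (rank k)."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    F = np.zeros((n, k))
    d = np.array(np.diag(A), dtype=float)         # diagonal of the current residual
    for i in range(k):
        p = rng.choice(n, p=d / d.sum())          # pivot sampled ~ residual diagonal
        col = A[:, p] - F[:, :i] @ F[p, :i]       # residual column at the pivot
        F[:, i] = col / np.sqrt(col[p])
        d -= F[:, i] ** 2
        d = np.clip(d, 0.0, None)                 # guard against round-off
    return F
```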
This survey aims to provide an introduction to linear models and the theory behind them. Our goal is to give a rigorous introduction to readers with prior exposure to ordinary least squares. In machine learning, the output is usually a nonlinear function of the input; deep learning even aims to find nonlinear dependencies through many layers, which requires a large amount of computation. However, most of these algorithms build upon simple linear models. We describe linear models from different views and examine the properties and theory behind them. The linear model is the main technique in regression problems, and its primary tool is least squares approximation, which minimizes the sum of squared errors. This is a natural choice when we are interested in finding the regression function that minimizes the corresponding expected squared error. This survey is mainly a summary of the purpose and significance of the important theories behind linear models, such as distribution theory and minimum-variance estimators. We first describe ordinary least squares from three different points of view, and then perturb the model with random noise and Gaussian noise. With Gaussian noise, the model gives rise to a likelihood, so we introduce the maximum likelihood estimator; some distribution theory is also developed under this Gaussian disturbance. The distribution theory of least squares helps us answer various questions and introduces related applications. We then prove that least squares is the best unbiased linear model in the sense of mean squared error and, most importantly, that it actually approaches the theoretical limit. We conclude with the linear model from the Bayesian approach and beyond.
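A small illustration of the survey's primary tool, ordinary least squares with a Gaussian noise disturbance; the data and noise level are arbitrary.

```python
import numpy as np

# Ordinary least squares: minimize ||y - X beta||^2. The closed form solves the
# normal equations X^T X beta = X^T y; np.linalg.lstsq does this stably via SVD.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(100)   # Gaussian noise disturbance
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```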
We study iterative methods based on Krylov subspaces for low-rank approximation under any Schatten-$p$ norm. Here, given access to a matrix $A$ only through matrix-vector products, the goal is to output an orthonormal $n \times k$ matrix $Z$ such that $\|A(I - ZZ^\top)\|_{S_p} \leq (1+\epsilon)\min_{U^\top U = I_k}\|A(I - UU^\top)\|_{S_p}$, where $\|M\|_{S_p}$ denotes the $\ell_p$ norm of the singular values of $M$. For the special cases of $p = 2$ (Frobenius norm) and $p = \infty$ (spectral norm), Musco and Musco (NeurIPS 2015) obtained an algorithm based on Krylov methods that uses $\tilde{O}(k/\sqrt{\epsilon})$ matrix-vector products, improving on the naive $\tilde{O}(k/\epsilon)$ dependence obtainable by the power method, where $\tilde{O}$ suppresses poly$(\log(dk/\epsilon))$ factors. Our main result is an algorithm that uses only $\tilde{O}(kp^{1/6}/\epsilon^{1/3})$ matrix-vector products and works for all $p \geq 1$. For $p = 2$, our bound improves the previous $\tilde{O}(k/\epsilon^{1/2})$ bound to $\tilde{O}(k/\epsilon^{1/3})$. Since the Schatten-$p$ and Schatten-$\infty$ norms are the same up to a $(1+\epsilon)$ factor when $p \geq (\log d)/\epsilon$, our bound recovers the result of Musco and Musco for $p = \infty$. Further, we prove a matrix-vector query lower bound of $\Omega(1/\epsilon^{1/3})$ for any fixed constant $p \geq 1$, showing that, surprisingly, $\tilde{\Theta}(1/\epsilon^{1/3})$ is the optimal complexity for constant $k$. To obtain our results, we introduce several new techniques, including optimizing over multiple Krylov subspaces simultaneously and pinching inequalities for partitioned operators. Our lower bound for $p \in [1,2]$ uses the Araki-Lieb-Thirring trace inequality, whereas for $p > 2$ we appeal to a norm-compression inequality for aligned partitioned operators.
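For reference, a minimal sketch of the single-block Krylov iteration this line of work builds on (the paper's improvement optimizes over multiple Krylov subspaces simultaneously, which is not shown); the depth $q$ is an assumption.

```python
import numpy as np

def block_krylov_lowrank(A, k, q=4, rng=None):
    """Rank-k approximation via a (single) block Krylov subspace: returns Z
    (n x k, orthonormal columns) so that A @ Z @ Z.T approximates A."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    G = rng.standard_normal((n, k))
    Y = A @ G
    blocks = [Y]
    for _ in range(q):
        Y = A @ (A.T @ Y)                   # powers of A A^T applied to the block
        blocks.append(Y)
    Q, _ = np.linalg.qr(np.hstack(blocks))  # orthonormal basis for the Krylov block
    _, _, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Vt[:k].T                         # top right singular directions within the subspace
```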