Gaussian processes scale prohibitively with the size of the dataset. In response, many approximation methods have been developed, which inevitably introduce approximation error. This additional source of uncertainty, due to limited computation, is entirely ignored when using the approximate posterior. Therefore in practice, GP models are often as much about the approximation method as they are about the data. Here, we develop a new class of methods that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended. The most common GP approximations map to an instance in this class, such as methods based on the Cholesky factorization, conjugate gradients, and inducing points. For any method in this class, we prove (i) convergence of its posterior mean in the associated RKHS, (ii) decomposability of its combined posterior covariance into mathematical and computational covariances, and (iii) that the combined variance is a tight worst-case bound for the squared error between the method's posterior mean and the latent function. Finally, we empirically demonstrate the consequences of ignoring computational uncertainty and show how implicitly modeling it improves generalization performance on benchmark datasets.
translated by 谷歌翻译
高斯工艺高参数优化需要大核矩阵的线性溶解和对数确定因子。迭代数值技术依赖于线性溶液的共轭梯度方法(CG)和对数数据的随机痕迹估计的迭代数值技术变得越来越流行。这项工作介绍了用于预处理这些计算的新算法和理论见解。虽然在CG的背景下对预处理有充分的理解,但我们证明了它也可以加速收敛并减少对数数据及其衍生物的估计值的方差。我们证明了对数确定性,对数 - 界限可能性及其衍生物的预处理计算的一般概率误差界限。此外,我们得出了一系列内核 - 前提组合的特定速率,这表明可以达到指数收敛。我们的理论结果可以证明对内核超参数的有效优化,我们在大规模的基准问题上进行经验验证。我们的方法可以加速训练,最多可以达到数量级。
translated by 谷歌翻译
Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from O(n 3 ) to O(n 2 ). Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
translated by 谷歌翻译
Linear partial differential equations (PDEs) are an important, widely applied class of mechanistic models, describing physical processes such as heat transfer, electromagnetism, and wave propagation. In practice, specialized numerical methods based on discretization are used to solve PDEs. They generally use an estimate of the unknown model parameters and, if available, physical measurements for initialization. Such solvers are often embedded into larger scientific models or analyses with a downstream application such that error quantification plays a key role. However, by entirely ignoring parameter and measurement uncertainty, classical PDE solvers may fail to produce consistent estimates of their inherent approximation error. In this work, we approach this problem in a principled fashion by interpreting solving linear PDEs as physics-informed Gaussian process (GP) regression. Our framework is based on a key generalization of a widely-applied theorem for conditioning GPs on a finite number of direct observations to observations made via an arbitrary bounded linear operator. Crucially, this probabilistic viewpoint allows to (1) quantify the inherent discretization error; (2) propagate uncertainty about the model parameters to the solution; and (3) condition on noisy measurements. Demonstrating the strength of this formulation, we prove that it strictly generalizes methods of weighted residuals, a central class of PDE solvers including collocation, finite volume, pseudospectral, and (generalized) Galerkin methods such as finite element and spectral methods. This class can thus be directly equipped with a structured error estimate and the capability to incorporate uncertain model parameters and observations. In summary, our results enable the seamless integration of mechanistic models as modular building blocks into probabilistic models.
translated by 谷歌翻译
近年来目睹了采用灵活的机械学习模型进行乐器变量(IV)回归的兴趣,但仍然缺乏不确定性量化方法的发展。在这项工作中,我们为IV次数回归提出了一种新的Quasi-Bayesian程序,建立了最近开发的核化IV模型和IV回归的双/极小配方。我们通过在$ l_2 $和sobolev规范中建立最低限度的最佳收缩率,并讨论可信球的常见有效性来分析所提出的方法的频繁行为。我们进一步推出了一种可扩展的推理算法,可以扩展到与宽神经网络模型一起工作。实证评价表明,我们的方法对复杂的高维问题产生了丰富的不确定性估计。
translated by 谷歌翻译
低精度算术对神经网络的训练产生了变革性的影响,从而减少了计算,记忆和能量需求。然而,尽管有希望,低精确的算术对高斯流程(GPS)的关注很少,这主要是因为GPS需要在低精确度中不稳定的复杂线性代数例程。我们研究以一半精度训练GP时可能发生的不同故障模式。为了避免这些故障模式,我们提出了一种多方面的方法,该方法涉及具有重新构造,混合精度和预处理的共轭梯度。我们的方法大大提高了低精度在各种设置中的偶联梯度的数值稳定性和实践性能,从而使GPS能够在单个GPU上以10美元的$ 10 $ 10 $ 10 $ 10 $ 10的数据点进行培训,而没有任何稀疏的近似值。
translated by 谷歌翻译
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets.This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed-either explicitly or implicitly-to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
translated by 谷歌翻译
基于内核的模型,例如内核脊回归和高斯工艺在机器学习应用程序中无处不在,用于回归和优化。众所周知,基于内核的模型的主要缺点是高计算成本。给定$ n $样本的数据集,成本增长为$ \ Mathcal {o}(n^3)$。在某些情况下,现有的稀疏近似方法可以大大降低计算成本,从而有效地将实际成本降低到$ \ natercal {o}(n)$。尽管取得了显着的经验成功,但由于近似值而导致的误差的分析范围的现有结果仍然存在显着差距。在这项工作中,我们为NyStr \“ Om方法和稀疏变分高斯过程近似方法提供新颖的置信区间,我们使用模型的近似(代理)后差解释来建立这些方法。我们的置信区间可改善性能。回归和优化问题的界限。
translated by 谷歌翻译
从大型套装中选择不同的和重要的项目,称为地标是机器学习兴趣的问题。作为一个具体示例,为了处理大型训练集,内核方法通常依赖于基于地标的选择或采样的低等级矩阵NYSTR \“OM近似值。在此上下文中,我们提出了一个确定性和随机的自适应算法在培训数据集中选择地标点。这些地标与克尼利克里斯特步函数序列的最小值有关。除了ChristOffel功能和利用分数之间的已知联系,我们的方法也有限决定性点过程(DPP)也是如此解释。即,我们的建设以类似于DPP的方式促进重要地标点之间的多样性。此外,我们解释了我们的随机自适应算法如何影响内核脊回归的准确性。
translated by 谷歌翻译
高斯进程(GPS)是通过工程学的社会和自然科学的应用程序学习和统计数据的重要工具。它们构成具有良好校准的不确定性估计的强大的内核非参数方法,然而,由于其立方计算复杂度,从货架上的GP推理程序仅限于具有数千个数据点的数据集。因此,在过去几年中已经开发出许多稀疏的GPS技术。在本文中,我们专注于GP回归任务,并提出了一种基于来自几个本地和相关专家的聚合预测的新方法。因此,专家之间的相关程度可以在独立于完全相关的专家之间变化。考虑到他们的相关性导致了一致的不确定性估算,汇总了专家的个人预测。我们的方法在限制案件中恢复了专家的独立产品,稀疏GP和全GP。呈现的框架可以处理一般的内核函数和多个变量,并且具有时间和空间复杂性,在专家和数据样本的数量中是线性的,这使得我们的方法是高度可扩展的。我们展示了我们提出的方法的卓越性能,这是我们提出的综合性和几个实际数据集的最先进的GP近似方法的卓越性能,以及具有确定性和随机优化的若干现实世界数据集。
translated by 谷歌翻译
我们提供了来自两个常见的低级内核近似产生的近似高斯过程(GP)回归的保证:基于随机傅里叶功能,并基于截断内核的Mercer扩展。特别地,我们将kullback-leibler在精确的gp和由一个上述低秩近似的一个与其内核中的一个引起的kullback-leibler发散相结合,以及它们的相应预测密度之间,并且我们还绑定了预测均值之间的误差使用近似GP使用精确的GP计算的矢量和预测协方差矩阵之间的载体。我们为模拟数据和标准基准提供了实验,以评估我们理论界的有效性。
translated by 谷歌翻译
我们提供了来自两个常见的低级内核近似产生的近似高斯过程(GP)回归的保证:基于随机傅里叶功能,并基于截断内核的Mercer扩展。特别地,我们将kullback-leibler在精确的gp和由一个上述低秩近似的一个与其内核中的一个引起的kullback-leibler发散相结合,以及它们的相应预测密度之间,并且我们还绑定了预测均值之间的误差使用近似GP使用精确的GP计算的矢量和预测协方差矩阵之间的载体。我们为模拟数据和标准基准提供了实验,以评估我们理论界的有效性。
translated by 谷歌翻译
许多现代数据集,从神经影像和地统计数据等领域都以张量数据的随机样本的形式来说,这可以被理解为对光滑的多维随机功能的嘈杂观察。来自功能数据分析的大多数传统技术被维度的诅咒困扰,并且随着域的尺寸增加而迅速变得棘手。在本文中,我们提出了一种学习从多维功能数据样本的持续陈述的框架,这些功能是免受诅咒的几种表现形式的。这些表示由一组可分离的基函数构造,该函数被定义为最佳地适应数据。我们表明,通过仔细定义的数据的仔细定义的减少转换的张测仪分解可以有效地解决所得到的估计问题。使用基于差分运算符的惩罚,并入粗糙的正则化。也建立了相关的理论性质。在模拟研究中证明了我们对竞争方法的方法的优点。我们在神经影像动物中得出真正的数据应用。
translated by 谷歌翻译
高斯过程状态空间模型通过在转换功能上放置高斯过程来以原则方式捕获复杂的时间依赖性。这些模型具有自然的解释,作为离散的随机微分方程,但困难的长期序列的推断是困难的。快速过渡需要紧密离散化,而慢速转换需要在长副图层上备份梯度。我们提出了一种由多个组件组成的新型高斯过程状态空间架构,每个组件都培训不同的分辨率,以对不同时间尺度进行模拟效果。组合模型允许在自适应刻度上进行时间进行时间,为具有复杂动态的任意长序列提供有效推断。我们在半合成数据和发动机建模任务上基准我们的新方法。在这两个实验中,我们的方法对其最先进的替代品仅比单一时间级运行的最先进的替代品。
translated by 谷歌翻译
考虑了建立UNKONWN地面真相函数值的样本外界限的问题。内核及其相关的希尔伯特空间是本文所采用的主要形式主义,以及一个观察模型,在该模型中,输出被有限的测量噪声损坏。噪声可以源于任何紧凑的分布,并且没有对可用数据进行独立假设。在这种情况下,我们显示计算紧密的,有限样本的不确定性范围等于求解参数四次约束线性程序。接下来,建立了我们方法的属性,并研究了其与另一种方法的关系。提出了数值实验,以说明如何在许多情况下应用理论,并将其与其他封闭形式的替代方案进行对比。
translated by 谷歌翻译
This work considers a computationally and statistically efficient parameter estimation method for a wide class of latent variable models-including Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation-which exploits a certain tensor structure in their low-order observable moments (typically, of second-and third-order). Specifically, parameter estimation is reduced to the problem of extracting a certain (orthogonal) decomposition of a symmetric tensor derived from the moments; this decomposition can be viewed as a natural generalization of the singular value decomposition for matrices. Although tensor decompositions are generally intractable to compute, the decomposition of these specially structured tensors can be efficiently obtained by a variety of approaches, including power iterations and maximization approaches (similar to the case of matrices). A detailed analysis of a robust tensor power method is provided, establishing an analogue of Wedin's perturbation theorem for the singular vectors of matrices. This implies a robust and computationally tractable estimation approach for several popular latent variable models.
translated by 谷歌翻译
我们提供了新的基于梯度的方法,以便有效解决广泛的病态化优化问题。我们考虑最小化函数$ f:\ mathbb {r} ^ d \ lightarrow \ mathbb {r} $的问题,它是隐含的可分解的,作为$ m $未知的非交互方式的总和,强烈的凸起功能并提供方法这解决了这个问题,这些问题是缩放(最快的对数因子)作为组件的条件数量的平方根的乘积。这种复杂性绑定(我们证明几乎是最佳的)可以几乎指出的是加速梯度方法的几乎是指数的,这将作为$ F $的条件数量的平方根。此外,我们提供了求解该多尺度优化问题的随机异标变体的有效方法。而不是学习$ F $的分解(这将是过度昂贵的),而是我们的方法应用一个清洁递归“大步小步”交错标准方法。由此产生的算法使用$ \ tilde {\ mathcal {o}}(d m)$空间,在数字上稳定,并打开门以更细粒度的了解凸优化超出条件号的复杂性。
translated by 谷歌翻译
最近出现了变异推断,成为大规模贝叶斯推理中古典马尔特·卡洛(MCMC)的流行替代品。变异推断的核心思想是贸易统计准确性以达到计算效率。它旨在近似后部,以降低计算成本,但可能损害其统计准确性。在这项工作中,我们通过推论模型选择中的案例研究研究了这种统计和计算权衡。侧重于具有对角和低级精度矩阵的高斯推论模型(又名变异近似族),我们在两个方面启动了对权衡的理论研究,贝叶斯后期推断误差和频繁的不确定性不确定定量误差。从贝叶斯后推理的角度来看,我们表征了相对于精确后部的变异后部的误差。我们证明,鉴于固定的计算预算,较低的推论模型会产生具有较高统计近似误差的变异后期,但计算误差较低。它减少了随机优化的方差,进而加速收敛。从频繁的不确定性定量角度来看,我们将变异后部的精度矩阵视为不确定性估计值。我们发现,相对于真实的渐近精度,变异近似遭受了来自数据的采样不确定性的附加统计误差。此外,随着计算预算的增加,这种统计误差成为主要因素。结果,对于小型数据集,推论模型不必全等级即可达到最佳估计误差。我们最终证明了在经验研究之间的这些统计和计算权衡推论,从而证实了理论发现。
translated by 谷歌翻译
We study a natural extension of classical empirical risk minimization, where the hypothesis space is a random subspace of a given space. In particular, we consider possibly data dependent subspaces spanned by a random subset of the data, recovering as a special case Nystrom approaches for kernel methods. Considering random subspaces naturally leads to computational savings, but the question is whether the corresponding learning accuracy is degraded. These statistical-computational tradeoffs have been recently explored for the least squares loss and self-concordant loss functions, such as the logistic loss. Here, we work to extend these results to convex Lipschitz loss functions, that might not be smooth, such as the hinge loss used in support vector machines. This unified analysis requires developing new proofs, that use different technical tools, such as sub-gaussian inputs, to achieve fast rates. Our main results show the existence of different settings, depending on how hard the learning problem is, for which computational efficiency can be improved with no loss in performance.
translated by 谷歌翻译
Interacting particle or agent systems that display a rich variety of swarming behaviours are ubiquitous in science and engineering. A fundamental and challenging goal is to understand the link between individual interaction rules and swarming. In this paper, we study the data-driven discovery of a second-order particle swarming model that describes the evolution of $N$ particles in $\mathbb{R}^d$ under radial interactions. We propose a learning approach that models the latent radial interaction function as Gaussian processes, which can simultaneously fulfill two inference goals: one is the nonparametric inference of {the} interaction function with pointwise uncertainty quantification, and the other one is the inference of unknown scalar parameters in the non-collective friction forces of the system. We formulate the learning problem as a statistical inverse problem and provide a detailed analysis of recoverability conditions, establishing that a coercivity condition is sufficient for recoverability. Given data collected from $M$ i.i.d trajectories with independent Gaussian observational noise, we provide a finite-sample analysis, showing that our posterior mean estimator converges in a Reproducing kernel Hilbert space norm, at an optimal rate in $M$ equal to the one in the classical 1-dimensional Kernel Ridge regression. As a byproduct, we show we can obtain a parametric learning rate in $M$ for the posterior marginal variance using $L^{\infty}$ norm, and the rate could also involve $N$ and $L$ (the number of observation time instances for each trajectory), depending on the condition number of the inverse problem. Numerical results on systems that exhibit different swarming behaviors demonstrate efficient learning of our approach from scarce noisy trajectory data.
translated by 谷歌翻译