Modern deep learning methods are very sensitive to many hyperparameters, and, due to the long training times of state-of-the-art models, vanilla Bayesian hyperparameter optimization is typically computationally infeasible. On the other hand, bandit-based configuration evaluation approaches based on random search lack guidance and do not converge to the best configurations quickly. Here, we propose to combine the benefits of both Bayesian optimization and bandit-based methods, in order to achieve the best of both worlds: strong anytime performance and fast convergence to optimal configurations. We propose a new practical state-of-the-art hyperparameter optimization method, which consistently outperforms both Bayesian optimization and Hyperband on a wide range of problem types, including high-dimensional toy functions, support vector machines, feed-forward neural networks, Bayesian neural networks, deep reinforcement learning, and convolutional neural networks. Our method is robust and versatile, while at the same time being conceptually simple and easy to implement.
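The combination described above can be illustrated with a short sketch. The Python snippet below is not the authors' implementation: the toy objective, the budget schedule, and the way new configurations are proposed near the survivors (a crude stand-in for the model-based component) are all illustrative assumptions. It shows successive halving in which later proposals are guided by earlier results rather than drawn purely at random:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x, budget):
    # Hypothetical noisy 1-D loss whose estimate sharpens with budget.
    return (x - 0.7) ** 2 + rng.normal(0, 1.0 / np.sqrt(budget))

def successive_halving(n=27, min_budget=1, eta=3, rounds=3):
    # Start from random configurations, as plain Hyperband would.
    configs = list(rng.uniform(0, 1, size=n))
    budget = min_budget
    for _ in range(rounds):
        losses = [objective(x, budget) for x in configs]
        order = np.argsort(losses)
        # Keep the top 1/eta fraction and increase their budget by eta.
        keep = max(1, len(configs) // eta)
        configs = [configs[i] for i in order[:keep]]
        # Guided step (stand-in for the model-based proposal): sample new
        # points near the survivors instead of uniformly at random.
        configs += list(np.clip(
            rng.normal(np.mean(configs), 0.1, size=keep), 0, 1))
        budget *= eta
    return min(configs, key=lambda x: objective(x, budget))

best = successive_halving()
```

Because cheap low-budget evaluations prune most configurations early, only a few survivors ever receive the full budget, which is where the anytime performance comes from.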
Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets, training and validating a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.
Different neural network architectures, hyperparameters and training protocols lead to different performances as a function of time. Human experts routinely inspect the resulting learning curves to quickly terminate runs with poor hyperparameter settings and thereby considerably speed up manual hyperparameter optimization. The same information can be exploited in automatic hyperparameter optimization by means of a probabilistic model of learning curves across hyperparameter settings. Here, we study the use of Bayesian neural networks for this purpose and improve their performance by a specialized learning curve layer.
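As a concrete illustration of learning-curve extrapolation, the sketch below fits a power law y(t) ≈ a·t^(−b) + c to early validation errors and extrapolates to a later epoch; a run whose predicted final error is poor could then be terminated early. The power-law form and the simple grid-plus-least-squares fit are illustrative assumptions, not the Bayesian-neural-network model studied in the abstract:

```python
import numpy as np

def fit_power_law(t, y):
    """Fit y ~ a * t**(-b) + c by grid-searching the asymptote c
    and solving for (a, b) with a linear fit in log space."""
    best = None
    for c in np.linspace(0.0, y.min() - 1e-3, 50):
        z = np.log(y - c)
        A = np.vstack([np.ones_like(t), -np.log(t)]).T
        coef, *_ = np.linalg.lstsq(A, z, rcond=None)
        resid = np.sum((A @ coef - z) ** 2)
        if best is None or resid < best[0]:
            best = (resid, np.exp(coef[0]), coef[1], c)
    _, a, b, c = best
    return a, b, c

# Synthetic learning curve: error decays toward an asymptote of 0.1.
t = np.arange(1, 11, dtype=float)
y = 0.9 * t ** -0.5 + 0.1
a, b, c = fit_power_law(t, y)
predicted_final = a * 100.0 ** -b + c   # extrapolate to epoch 100
```

A probabilistic model of learning curves, as in the abstract, would additionally provide uncertainty over `predicted_final`, which is what makes principled early termination possible.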
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations, and as such, massively parallelizing the optimization. In this work, we explore the use of neural networks as an alternative to GPs to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we apply to large scale hyperparameter optimization, rapidly finding competitive models on benchmark object recognition tasks using convolutional networks, and image caption generation using neural language models.
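The core idea, Bayesian linear regression on features produced by a network, can be sketched in a few lines of numpy. Here random tanh features stand in for a trained network's hidden layer (an assumption made to keep the sketch short); the closed-form posterior over the output weights is what yields linear, rather than cubic, scaling in the number of observations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: noisy sine.
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)

# Hidden layer of a small network as basis functions. In the described
# method these weights are learned by training the network; random
# weights here are an illustrative stand-in.
W = rng.normal(0, 1, size=(1, 50))
b = rng.normal(0, 1, size=50)
phi = np.tanh(X @ W + b)               # (n, 50) basis matrix

alpha, beta = 1.0, 100.0               # prior and noise precisions
# Posterior over the output weights: closed-form Bayesian linear
# regression; cost grows linearly in n for a fixed basis size.
S_inv = alpha * np.eye(50) + beta * phi.T @ phi
m = beta * np.linalg.solve(S_inv, phi.T @ y)

def predict(x):
    f = np.tanh(np.atleast_2d(x) @ W + b)
    mean = f @ m
    var = 1.0 / beta + np.einsum('ij,ij->i', f @ np.linalg.inv(S_inv), f)
    return mean, var

mean, var = predict([[0.0]])
```

The predictive mean and variance can then drive any standard acquisition function, just as a GP posterior would.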
Deep Gaussian processes (DGPs) are hierarchical generalizations of Gaussian processes that combine well-calibrated uncertainty estimates with the high flexibility of multilayer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, variational inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of the generally multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior and we apply the Stochastic Gradient Hamiltonian Monte Carlo method to sample from it directly. To efficiently optimize the hyperparameters, we introduce the Moving Window MCEM algorithm. This results in significantly better predictions at a lower computational cost than the VI counterpart. Our method thus establishes a new state of the art for inference in DGPs.
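A minimal sketch of the sampling component, run on a toy bimodal density rather than a DGP posterior (the target, step size, and friction coefficient are illustrative assumptions; in the actual method the gradient would be a stochastic minibatch estimate):

```python
import numpy as np

rng = np.random.default_rng(2)

def grad_U(theta):
    # Gradient of the negative log-density of a bimodal target:
    # an equal mixture of N(-1.5, 1) and N(1.5, 1).
    w1 = np.exp(-(theta + 1.5) ** 2 / 2)
    w2 = np.exp(-(theta - 1.5) ** 2 / 2)
    g1 = w1 * (theta + 1.5)
    g2 = w2 * (theta - 1.5)
    return (g1 + g2) / (w1 + w2)

eps, friction = 0.05, 0.1
theta, v = 0.0, 0.0
samples = []
for step in range(20000):
    # SGHMC-style update: gradient step, friction, and injected noise
    # balanced so the chain targets the (tempered) posterior.
    v = v - eps * grad_U(theta) - friction * v \
        + rng.normal(0, np.sqrt(2 * friction * eps))
    theta = theta + v
    if step > 2000:                 # discard burn-in
        samples.append(theta)
samples = np.array(samples)
```

Unlike a single Gaussian variational fit, the sample cloud covers both modes, which is the multimodality argument made in the abstract.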
Many probabilistic models of interest in scientific computing and machine learning have expensive, black-box likelihoods that prevent the application of standard techniques for Bayesian inference, such as MCMC, which would require access to the gradient or a large number of likelihood evaluations. We introduce here a novel sample-efficient inference framework, Variational Bayesian Monte Carlo (VBMC). VBMC combines variational inference with Gaussian-process-based, active-sampling Bayesian quadrature, using the latter to efficiently approximate the intractable integral in the variational objective. Our method produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection. We demonstrate VBMC on several synthetic likelihoods and on a neuronal model with data from real neurons. Across all tested problems and dimensions (up to $D = 10$), VBMC consistently reconstructs the posterior and model evidence with a limited budget of likelihood evaluations, unlike other methods that work only in very low dimensions. Our framework thus provides a novel tool for posterior and model inference with expensive, black-box likelihoods.
Uncertainty computation in deep learning is essential to design robust and reliable systems. Variational inference (VI) is a promising approach for such computation, but it requires more effort to implement and execute compared to maximum-likelihood methods. In this paper, we propose new natural-gradient algorithms to reduce such effort for Gaussian mean-field VI. Our algorithms can be implemented within the Adam optimizer by perturbing the network weights during gradient evaluations, and uncertainty estimates can be obtained cheaply by using the vector that adapts the learning rate. This requires lower memory, computation, and implementation effort than existing VI methods, while obtaining uncertainty estimates of comparable quality. Our empirical results confirm this and further suggest that the weight perturbation in our algorithm could be useful for exploration in reinforcement learning and stochastic optimization.
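The weight-perturbation idea can be sketched on a toy objective. The snippet below is only loosely inspired by the scheme described above, not the authors' algorithm (which uses natural-gradient updates with bias corrections): gradients are evaluated at perturbed weights, and both the perturbation scale and the reported posterior variance come from the same second-moment vector that adapts the learning rate. All constants, including the stand-in dataset size `n`, are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy objective: minimize (w - 3)^2, standing in for a training loss.
def grad(w):
    return 2 * (w - 3.0)

w, s = 0.0, 1.0                  # mean and second-moment estimate
beta2, lr, prior_prec, n = 0.999, 0.1, 1.0, 100
for _ in range(2000):
    # Variational variance derived from the adaptive second moment.
    sigma2 = 1.0 / (n * s + prior_prec)
    eps = rng.normal(0, np.sqrt(sigma2))
    g = grad(w + eps)                        # perturbed-gradient step
    s = beta2 * s + (1 - beta2) * g ** 2     # Adam-style second moment
    w = w - lr * g / (np.sqrt(s) + 1e-8)     # Adam-style mean update
posterior_std = np.sqrt(1.0 / (n * s + prior_prec))
```

The point of the sketch is structural: no separate variance parameter is trained, so the memory and implementation overhead over plain Adam is essentially zero.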
This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a data set is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.
We develop an automated variational method for inference in models with Gaussian process (GP) priors and general likelihoods. The method supports multiple outputs and multiple latent functions and does not require detailed knowledge of the conditional likelihood, only requiring its evaluation as a black-box function. Using a Gaussian mixture as the variational distribution, we show that the evidence lower bound and its gradients can be estimated efficiently using samples from univariate Gaussian distributions. Furthermore, the method is scalable to large datasets, which is achieved by using an augmented prior via inducing variables. The method supports sparse GP approximations, as well as parallel computation and stochastic optimization. We evaluate our approach quantitatively and qualitatively on small, medium-scale, and large datasets, showing its competitiveness under different likelihood models and sparsity levels. On large-scale experiments involving the prediction of airline delays and the classification of handwritten digits, we show that our method is on par with state-of-the-art hard-coded approaches for scalable GP regression and classification.
Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost.
Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.
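A common way to handle independently evaluated constraints is to weight expected improvement by the modeled probability of feasibility. The sketch below assumes minimization, Gaussian predictive distributions for the objective and constraint (stand-ins for GP posteriors), and a feasibility condition c(x) ≤ 0; it illustrates the general idea rather than the specific framework of the paper:

```python
from math import erf, exp, pi, sqrt

def norm_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def norm_pdf(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def constrained_ei(mu, sigma, best, c_mu, c_sigma):
    """Expected improvement over the incumbent `best` (minimization),
    weighted by the probability that an independent Gaussian constraint
    model predicts feasibility, i.e. c <= 0."""
    z = (best - mu) / sigma
    ei = sigma * (z * norm_cdf(z) + norm_pdf(z))
    p_feasible = norm_cdf(-c_mu / c_sigma)
    return ei * p_feasible
```

A point with excellent predicted objective but a likely constraint violation thus receives a near-zero score, steering the search toward the feasible region.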
Approximate Bayesian computation (ABC) is a method for Bayesian inference when the likelihood is unavailable but it is possible to simulate from the model. However, many ABC algorithms require a large number of simulations, which can be costly. To reduce the computational cost, Bayesian optimization (BO) and surrogate models such as Gaussian processes have been proposed. Bayesian optimization enables one to intelligently decide where to evaluate the model next, but common BO strategies are not designed for the goal of estimating the posterior distribution. Our paper addresses this gap in the literature. We propose to compute the uncertainty in the ABC posterior density, which is due to a lack of simulations to accurately estimate this quantity, and we define a loss function that measures this uncertainty. We then propose to select the next evaluation location to minimize the expected loss. Experiments show that the proposed method often produces the most accurate approximations compared to common BO strategies.
Gaussian process (GP) regression has seen widespread use in robotics due to its generality, its simplicity of use, and the utility of Bayesian predictions. In particular, the predominant implementation of GP regression is kernel-based, as it enables the fitting of arbitrary nonlinear functions by leveraging kernel functions as infinite-dimensional features. While incorporating prior information has the potential to drastically improve the data efficiency of kernel-based GP regression, expressing complex priors through the choice of kernel function and associated hyperparameters is often challenging and unintuitive. Furthermore, the computational complexity of kernel-based GP regression scales poorly with the number of samples, limiting its application in regimes where a large amount of data is available. In this work, we propose ALPaCA, an efficient Bayesian regression algorithm that addresses these issues. ALPaCA uses a dataset of sample functions to learn a domain-specific, finite-dimensional feature encoding, as well as a prior over the associated weights, such that Bayesian linear regression in this feature space yields accurate online predictions of the posterior density. These features are neural networks, trained via a meta-learning approach. ALPaCA extracts all prior information from the dataset, rather than relying on the choice of arbitrary, restrictive kernel hyperparameters. Furthermore, it substantially reduces sample complexity and allows scaling to large systems. We investigate the performance of ALPaCA on two simple regression problems, two simulated robotic systems, and a lane-change driving task performed by humans. We find that our approach outperforms kernel-based GP regression, as well as state-of-the-art meta-learning approaches, thereby providing a promising plug-in tool for many regression tasks in robotics where scalability and data efficiency are important.
We propose SWA-Gaussian (SWAG), a simple, scalable, general-purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning-rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low-rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, in accordance with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of computer vision tasks, including out-of-sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, and temperature scaling.
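The diagonal variant of this construction fits in a few lines. In the sketch below the SGD iterates are synthetic stand-ins (late-training snapshots would be used in practice), the model is a linear map rather than a network, and the low-rank covariance term of the full method is omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for SGD iterates of a 3-parameter model: a noisy trajectory
# around a solution, mimicking late-training snapshots collected under
# a modified (constant or cyclical) learning-rate schedule.
iterates = np.array([[1.0, -2.0, 0.5]]) + rng.normal(0, 0.1, size=(30, 3))

swa_mean = iterates.mean(axis=0)            # SWA solution (first moment)
second_moment = (iterates ** 2).mean(axis=0)
diag_var = np.clip(second_moment - swa_mean ** 2, 1e-12, None)

# SWAG-diagonal posterior: N(swa_mean, diag(diag_var)).
def sample_weights(n):
    return swa_mean + rng.normal(0, 1, size=(n, 3)) * np.sqrt(diag_var)

# Bayesian model averaging: average predictions over weight samples.
def predict(x, n_samples=100):
    ws = sample_weights(n_samples)
    return np.mean(ws @ x, axis=0)          # linear model as a stand-in

pred = predict(np.array([1.0, 1.0, 1.0]))
```

The appeal is that the posterior approximation is a by-product of ordinary SGD training: no second optimization or auxiliary network is required.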
We propose minimum regret search (MRS), a novel acquisition function for Bayesian optimization. MRS bears similarities with information-theoretic approaches such as entropy search (ES). However, while ES aims in each query at maximizing the information gain with respect to the global maximum, MRS aims at minimizing the expected simple regret of its ultimate recommendation for the optimum. While empirically ES and MRS perform similarly in most cases, MRS produces fewer outliers with high simple regret than ES. We provide empirical results both for a synthetic single-task optimization problem as well as for a simulated multi-task robotic control problem.
Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. This paper develops a new approximate Bayesian learning scheme that enables DGPs to be applied to a range of medium to large scale regression problems for the first time. The new method uses an approximate Expectation Propagation procedure and a novel and efficient extension of the probabilistic backpropagation algorithm for learning. We evaluate the new method for non-linear regression on eleven real-world datasets, showing that it always outperforms GP regression and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks. As a by-product, this work provides a comprehensive analysis of six approximate Bayesian methods for training neural networks.
Many modern unsupervised or semi-supervised machine learning algorithms rely on Bayesian probabilistic models. These models are usually intractable and thus require approximate inference. Variational inference (VI) lets us approximate a high-dimensional Bayesian posterior with a simpler variational distribution by solving an optimization problem. This approach has been successfully applied to various models and large-scale applications. In this review, we give an overview of recent trends in variational inference. We first introduce standard mean-field variational inference, then review recent advances focusing on the following aspects: (a) scalable VI, which includes stochastic approximations, (b) generic VI, which extends the applicability of VI to a large class of otherwise intractable models, such as non-conjugate models, (c) accurate VI, which includes variational models beyond the mean-field approximation or with atypical divergences, and (d) amortized VI, which implements the inference over local latent variables with inference networks. Finally, we provide a summary of promising future research directions.
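The standard mean-field setup can be illustrated with a one-dimensional sketch: a Gaussian variational distribution q(θ) = N(m, s²) is fit to an unnormalized target by stochastic gradient ascent on the ELBO, using the reparameterization trick. The Gaussian target (chosen so the correct answer, m = 2 and s = 0.5, is known) and all step sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Unnormalized log-target gradient: a stand-in for an intractable
# posterior, here N(2, 0.5^2) so the fitted q can be checked.
def grad_log_p(theta):
    return -(theta - 2.0) / 0.25

m, log_s, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    s = np.exp(log_s)
    eps = rng.normal()
    theta = m + s * eps                    # reparameterization trick
    g = grad_log_p(theta)
    m += lr * g                            # stochastic d(ELBO)/dm
    log_s += lr * (g * eps * s + 1.0)      # d(ELBO)/d(log s), incl. entropy
```

Scalable, generic, accurate, and amortized VI, the four axes of the review, can all be read as refinements of exactly this loop: minibatched gradients, black-box targets, richer q families, and a network predicting (m, s) per data point.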
Machine learning algorithms frequently require careful tuning of model hyperparameters, regularization terms, and optimization parameters. Unfortunately, this tuning is often a "black art" that requires expert experience, unwritten rules of thumb, or sometimes brute-force search. Much more appealing is the idea of developing automatic approaches that can optimize the performance of a given learning algorithm for the task at hand. In this work, we consider the automatic tuning problem within the Bayesian optimization framework, in which a learning algorithm's generalization performance is modeled as a sample from a Gaussian process (GP). The tractable posterior distribution induced by the GP leads to efficient use of the information gathered by previous experiments, enabling optimal choices about which parameters to try next. Here we show how the effects of the Gaussian process prior and the associated inference procedure can have a large impact on the success or failure of Bayesian optimization. We show that thoughtful choices can lead to results that exceed expert-level performance in tuning machine learning algorithms. We also describe new algorithms that take into account the variable cost (duration) of learning experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization on a variety of contemporary algorithms including latent Dirichlet allocation, structured SVMs, and convolutional neural networks.
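A bare-bones version of this loop, a GP posterior over the objective plus an expected-improvement acquisition, can be written directly in numpy. The squared-exponential kernel with fixed hyperparameters, the toy objective, and the grid search over the acquisition are all simplifying assumptions; the paper's contribution lies precisely in treating these choices (priors, integrated acquisitions, cost, parallelism) more carefully:

```python
import numpy as np
from math import erf

def kernel(a, b, ls=0.5):
    # Squared-exponential kernel with fixed length scale (assumption).
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

def objective(x):
    return np.sin(3 * x) + 0.1 * x ** 2     # toy function to minimize

X = np.array([-1.5, 0.0, 1.5])              # initial design
y = objective(X)
grid = np.linspace(-2, 2, 400)
Phi = np.vectorize(lambda z: 0.5 * (1 + erf(z / np.sqrt(2))))

for _ in range(15):
    K = kernel(X, X) + 1e-8 * np.eye(len(X))    # jitter for stability
    Ks = kernel(grid, X)
    K_inv = np.linalg.inv(K)
    mu = Ks @ K_inv @ y                         # GP posterior mean
    var = np.clip(1.0 - np.einsum('ij,ij->i', Ks @ K_inv, Ks),
                  1e-12, None)                  # GP posterior variance
    sd = np.sqrt(var)
    # Expected improvement over the incumbent best (minimization).
    z = (y.min() - mu) / sd
    ei = sd * (z * Phi(z) + np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi))
    x_next = grid[np.argmax(ei)]
    X = np.append(X, x_next)
    y = np.append(y, objective(x_next))

best_x, best_y = X[np.argmin(y)], y.min()
```

Each iteration spends its single expensive evaluation where the model predicts the largest expected gain, which is what makes the approach sample-efficient.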
Standard Gaussian processes (GPs) model observation noise as constant throughout input space. This is often too restrictive an assumption, but one that is needed for GP inference to be tractable. In this work we present a non-standard variational approximation that allows accurate inference in heteroscedastic GPs (i.e., under input-dependent noise conditions). Computational cost is roughly twice that of the standard GP, and it also scales as O(n^3). Accuracy is verified by comparison with the gold-standard MCMC, and the method's effectiveness is illustrated on several synthetic and real datasets of diverse characteristics. An application to volatility forecasting is also considered.
Large multilayer neural networks trained with backpropagation have recently achieved state-of-the-art results in a wide range of problems. However, learning neural networks with backpropagation still has some disadvantages, such as having to tune a large number of hyperparameters to the data, a lack of calibrated probabilistic predictions, and a tendency to overfit the training data. In principle, the Bayesian approach to learning neural networks does not have these problems. However, existing Bayesian techniques lack scalability to large dataset and network sizes. In this work we present a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP). Similar to classical backpropagation, PBP works by computing a forward propagation of probabilities through the network, followed by a backward computation of gradients. A series of experiments on ten real-world datasets shows that PBP is significantly faster than other techniques, while offering competitive predictive ability. Our experiments also show that PBP provides accurate estimates of the posterior variance of the network weights.
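The forward pass can be sketched by moment matching: each layer maps input means and variances to output means and variances, assuming independent Gaussian weights and Gaussian moment matching through the ReLU. The layer sizes and values below are illustrative, and PBP's backward pass and posterior updates are omitted:

```python
import numpy as np
from math import erf, exp, pi, sqrt

def Phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi(z):
    return exp(-z * z / 2) / sqrt(2 * pi)

def linear_moments(m_in, v_in, M, V):
    """Mean/variance of W @ x for independent Gaussian weights
    (means M, variances V) and inputs (means m_in, variances v_in)."""
    m_out = M @ m_in
    v_out = V @ (m_in ** 2 + v_in) + (M ** 2) @ v_in
    return m_out, v_out

def relu_moments(m, v):
    """Gaussian moment matching through ReLU, elementwise (v > 0)."""
    s = np.sqrt(v)
    z = m / s
    mean = np.array([mi * Phi(zi) + si * phi(zi)
                     for mi, si, zi in zip(m, s, z)])
    second = np.array([(mi ** 2 + si ** 2) * Phi(zi) + mi * si * phi(zi)
                       for mi, si, zi in zip(m, s, z)])
    return mean, np.clip(second - mean ** 2, 1e-12, None)

# One probabilistic forward pass through a single 2 -> 3 layer.
M = np.array([[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]])   # weight means
V = np.full((3, 2), 0.1)                              # weight variances
m, v = linear_moments(np.array([1.0, 2.0]), np.array([0.0, 0.0]), M, V)
m, v = relu_moments(m, v)
```

Because each activation carries a mean and a variance, predictive uncertainty falls out of a single forward pass instead of requiring many sampled networks.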