Modern deep learning methods are very sensitive to many hyperparameters, and, due to the long training times of state-of-the-art models, vanilla Bayesian hyperparameter optimization is typically computationally infeasible. On the other hand, bandit-based configuration evaluation approaches based on random search lack guidance and do not converge quickly to the best configurations. Here, we propose to combine the benefits of both Bayesian optimization and bandit-based methods, in order to achieve the best of both worlds: strong anytime performance and fast convergence to optimal configurations. We propose a new, practical, state-of-the-art hyperparameter optimization method that consistently outperforms both Bayesian optimization and Hyperband on a wide range of problem types, including high-dimensional toy functions, support vector machines, feed-forward neural networks, Bayesian neural networks, deep reinforcement learning, and convolutional neural networks. Our method is robust and versatile, while at the same time being conceptually simple and easy to implement.
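The bandit component referred to above is Hyperband's successive-halving subroutine. Below is a minimal sketch of that subroutine, not the full method from the abstract (which would replace the random sampler with a model-based one); all names and the toy objective are illustrative.

```python
import math
import random

def successive_halving(sample_config, evaluate, min_budget=1, max_budget=27, eta=3):
    """One bracket of Hyperband's successive-halving subroutine: start many
    configurations on a small budget, keep the best 1/eta at each rung,
    and multiply the budget by eta."""
    n_rungs = round(math.log(max_budget / min_budget, eta)) + 1
    configs = [sample_config() for _ in range(eta ** (n_rungs - 1))]
    budget = min_budget
    for _ in range(n_rungs):
        losses = [evaluate(c, budget) for c in configs]
        ranked = sorted(zip(losses, configs))
        configs = [c for _, c in ranked[: max(1, len(configs) // eta)]]
        budget *= eta
    return configs[0]

# Toy objective: the observed loss approaches (x - 0.7)^2 as the budget grows.
def toy_loss(x, budget):
    return (x - 0.7) ** 2 + 1.0 / budget

random.seed(0)
best = successive_halving(random.random, toy_loss)
```

A model-based variant would keep the same rung structure and only change how `sample_config` proposes candidates.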
Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. Despite its success, for large datasets, training and validating a single configuration often takes hours, days, or even weeks, which limits the achievable performance. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods or the recently proposed bandit strategy Hyperband.
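FABOLAS's actual surrogate is a Gaussian process over (configuration, dataset size) together with a cost model; as a much smaller illustration of the underlying extrapolation idea, one can fit a parametric loss-versus-subset-size curve to cheap small-subset evaluations and predict full-data performance. The `a + b/s` form below is an illustrative assumption, not the paper's model.

```python
def extrapolate_loss(sizes, losses, full_size):
    """Least-squares fit of loss(s) = a + b/s on small training-subset
    sizes, then extrapolation to the full dataset size."""
    xs = [1.0 / s for s in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a + b / full_size

# Validation losses measured on subsets of 100..800 points (synthetic, noise-free):
sizes = [100, 200, 400, 800]
losses = [0.1 + 50.0 / s for s in sizes]
pred = extrapolate_loss(sizes, losses, full_size=100_000)
```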
Different neural network architectures, hyperparameters and training protocols lead to different performances as a function of time. Human experts routinely inspect the resulting learning curves to quickly terminate runs with poor hyperparameter settings and thereby considerably speed up manual hyperparameter optimization. The same information can be exploited in automatic hyperparameter optimization by means of a probabilistic model of learning curves across hyperparameter settings. Here, we study the use of Bayesian neural networks for this purpose and improve their performance by a specialized learning curve layer.
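The paper's model is a Bayesian neural network with a specialized learning-curve layer; as a greatly simplified, point-estimate stand-in for the same idea, one can fit a power-law curve to the observed part of a learning curve and terminate a run whose extrapolated asymptote cannot beat the best result seen so far. All function names and the power-law form are illustrative.

```python
def fit_power_law(ts, ys):
    """Fit y(t) = c - a * t**(-b) by grid search over the exponent b,
    with closed-form least squares for (c, a) at each candidate b."""
    best = None
    for b in [x / 10 for x in range(1, 31)]:  # b in 0.1 .. 3.0
        xs = [t ** (-b) for t in ts]
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
            / sum((x - mx) ** 2 for x in xs)
        c = my - slope * mx  # y = c + slope * t^-b, so a = -slope
        sse = sum((c + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, c, -slope, b)
    _, c, a, b = best
    return c, a, b

def should_terminate(ts, ys, best_final_so_far):
    """Stop a run early if its extrapolated asymptote c cannot beat the best
    final accuracy observed so far (point estimate only; the paper instead
    uses a predictive distribution over curves)."""
    c, _, _ = fit_power_law(ts, ys)
    return c < best_final_so_far
```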
Deep Gaussian processes (DGPs) are hierarchical generalizations of Gaussian processes that combine well-calibrated uncertainty estimates with the high flexibility of multi-layer models. One of the biggest challenges with these models is that exact inference is intractable. The current state-of-the-art inference method, variational inference (VI), employs a Gaussian approximation to the posterior distribution. This can be a potentially poor unimodal approximation of what is often a multimodal posterior. In this work, we provide evidence for the non-Gaussian nature of the posterior, and we apply the stochastic gradient Hamiltonian Monte Carlo method to sample from it directly. To efficiently optimize the hyperparameters, we introduce a moving-window MCEM algorithm. This results in significantly better predictions at a lower computational cost than its VI counterpart. Our method thus establishes a new state of the art for inference in DGPs.
Bayesian optimization is an effective methodology for the global optimization of functions with expensive evaluations. It relies on querying a distribution over functions defined by a relatively cheap surrogate model. An accurate model for this distribution over functions is critical to the effectiveness of the approach, and is typically fit using Gaussian processes (GPs). However, since GPs scale cubically with the number of observations, it has been challenging to handle objectives whose optimization requires many evaluations, and as such, massively parallelizing the optimization. In this work, we explore the use of neural networks as an alternative to GPs to model distributions over functions. We show that performing adaptive basis function regression with a neural network as the parametric form performs competitively with state-of-the-art GP-based approaches, but scales linearly with the number of data rather than cubically. This allows us to achieve a previously intractable degree of parallelism, which we apply to large scale hyperparameter optimization, rapidly finding competitive models on benchmark object recognition tasks using convolutional networks, and image caption generation using neural language models.
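A hedged sketch of the core idea, Bayesian linear regression on the output features of a network: here the learned network features are replaced by fixed polynomial features so the example is self-contained, and the weak prior precision `alpha` and noise precision `beta` are illustrative values.

```python
def solve(A, b):
    """Solve A x = b via Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def posterior_mean(Phi, y, alpha=1e-2, beta=25.0):
    """Mean of the Bayesian linear regression posterior on features Phi:
    A = alpha*I + beta*Phi^T Phi,  m = beta * A^{-1} Phi^T y."""
    d = len(Phi[0])
    A = [[alpha * (i == j) + beta * sum(p[i] * p[j] for p in Phi)
          for j in range(d)] for i in range(d)]
    rhs = [beta * sum(p[i] * yi for p, yi in zip(Phi, y)) for i in range(d)]
    return solve(A, rhs)

phi = lambda x: [1.0, x, x * x]   # stand-in for learned network features
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0 + 2.0 * x for x in xs]  # noise-free linear data
m = posterior_mean([phi(x) for x in xs], ys)
pred = sum(mi * fi for mi, fi in zip(m, phi(1.0)))
```

Because only the linear-regression head is Bayesian, refitting after each new observation is cheap, which is what enables the parallelism the abstract describes.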
Recent work on Bayesian optimization has shown its effectiveness in global optimization of difficult black-box objective functions. Many real-world optimization problems of interest also have constraints which are unknown a priori. In this paper, we study Bayesian optimization for constrained problems in the general case that noise may be present in the constraint functions, and the objective and constraints may be evaluated independently. We provide motivating practical examples, and present a general framework to solve such problems. We demonstrate the effectiveness of our approach on optimizing the performance of online latent Dirichlet allocation subject to topic sparsity constraints, tuning a neural network given test-time memory constraints, and optimizing Hamiltonian Monte Carlo to achieve maximal effectiveness in a fixed time, subject to passing standard convergence diagnostics.
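The abstract does not spell out the acquisition function, so take this as a sketch of the standard choice in this setting: expected improvement weighted by the modeled probability of feasibility, with independent Gaussian models for objective and constraint.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def constrained_ei(mu_f, sigma_f, best_feasible, mu_c, sigma_c):
    """Expected improvement (minimization) weighted by the probability
    that the independently modeled constraint satisfies c(x) <= 0."""
    z = (best_feasible - mu_f) / sigma_f
    ei = sigma_f * (z * norm_cdf(z) + norm_pdf(z))
    p_feasible = norm_cdf((0.0 - mu_c) / sigma_c)
    return ei * p_feasible

feasible = constrained_ei(0.5, 0.2, 1.0, -2.0, 1.0)   # constraint likely satisfied
unlikely = constrained_ei(0.5, 0.2, 1.0, 2.0, 1.0)    # constraint likely violated
```

Candidates with identical objective predictions are thus down-weighted exactly in proportion to their feasibility probability.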
We develop an automated variational method for inference in models with Gaussian process (GP) priors and general likelihoods. The method supports multiple outputs and multiple latent functions and does not require detailed knowledge of the conditional likelihood, only needing to evaluate it as a black-box function. Using a Gaussian mixture as the variational distribution, we show that the evidence lower bound and its gradients can be estimated efficiently using samples from univariate Gaussian distributions. Furthermore, the method is scalable to large datasets, which is achieved through an augmented prior using inducing variables. The method supports the most common sparse GP approximations, as well as parallel computation and stochastic optimization. We evaluate our approach quantitatively and qualitatively on small, medium-scale, and large datasets, showing its competitiveness under different likelihood models and sparsity levels. On large-scale experiments involving airline delay prediction and handwritten digit classification, we show that our method is on par with state-of-the-art hard-coded approaches for scalable GP regression and classification.
To solve a machine learning problem, one typically needs to perform data preprocessing, modeling, and hyperparameter tuning, which is known as model selection and hyperparameter optimization. The goal of automated machine learning (AutoML) is to design methods that can automatically perform model selection and hyperparameter optimization on a given dataset without human intervention. In this paper, we propose a meta-learning method that searches for high-performing machine learning pipelines from a predefined set of candidate pipelines in an efficient way by leveraging metadata collected from previous experiments. More specifically, our method combines an adaptive Bayesian regression model with neural network basis functions and an acquisition function from Bayesian optimization. The adaptive Bayesian regression model is able to capture knowledge from previous data, enabling it to predict the performance of machine learning pipelines on a new dataset. The acquisition function is then used to guide the search over candidate pipelines based on these predictions. Experiments demonstrate that our method can quickly identify high-performing pipelines across a range of test datasets and outperforms baseline methods.
We propose SWA-Gaussian (SWAG), a simple, scalable, general-purpose approach for uncertainty representation and calibration in deep learning. Stochastic Weight Averaging (SWA), which computes the first moment of stochastic gradient descent (SGD) iterates with a modified learning rate schedule, has recently been shown to improve generalization in deep learning. With SWAG, we fit a Gaussian using the SWA solution as the first moment and a low-rank plus diagonal covariance also derived from the SGD iterates, forming an approximate posterior distribution over neural network weights; we then sample from this Gaussian distribution to perform Bayesian model averaging. We empirically find that SWAG approximates the shape of the true posterior, consistent with results describing the stationary distribution of SGD iterates. Moreover, we demonstrate that SWAG performs well on a wide variety of computer vision tasks, including out-of-sample detection, calibration, and transfer learning, compared to many popular alternatives including MC dropout, KFAC Laplace, and temperature scaling.
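SWAG's covariance is low-rank plus diagonal; the diagonal part alone already makes a compact sketch of the procedure: keep running first and second moments of the SGD iterates, then sample weights from the resulting Gaussian. This diagonal-only version is a simplification of the full method.

```python
import math
import random

class SwagDiagonal:
    """Running first and second moments of SGD iterates (the SWA mean and
    a diagonal covariance), plus Gaussian sampling for model averaging."""
    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        self.sq_mean = [0.0] * dim

    def collect(self, weights):
        """Fold one SGD iterate into the running moment estimates."""
        self.n += 1
        for i, w in enumerate(weights):
            self.mean[i] += (w - self.mean[i]) / self.n
            self.sq_mean[i] += (w * w - self.sq_mean[i]) / self.n

    def sample(self, rng):
        """Draw one weight vector from N(mean, diag(sq_mean - mean^2))."""
        return [m + math.sqrt(max(s - m * m, 0.0)) * rng.gauss(0.0, 1.0)
                for m, s in zip(self.mean, self.sq_mean)]

swag = SwagDiagonal(dim=2)
for w in ([1.0, -2.0], [1.2, -1.8], [0.8, -2.2]):  # pretend SGD iterates
    swag.collect(w)
draw = swag.sample(random.Random(0))
```

Bayesian model averaging then means averaging predictions over several such `draw`s rather than predicting once at the mean.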
Many probabilistic models of interest in scientific computing and machine learning have expensive, black-box likelihoods that prevent the application of standard techniques for Bayesian inference, such as MCMC, which would require access to gradients or a large number of likelihood evaluations. We introduce here a novel sample-efficient inference framework, Variational Bayesian Monte Carlo (VBMC). VBMC combines variational inference with Gaussian-process-based, active-sampling Bayesian quadrature, using the latter to efficiently approximate the intractable integral in the variational objective. Our method produces both a nonparametric approximation of the posterior distribution and an approximate lower bound of the model evidence, useful for model selection. We demonstrate VBMC on several synthetic likelihoods and on a neuronal model with data from real neurons. Across all tested problems and dimensions (up to $D = 10$), VBMC consistently reconstructs the posterior and model evidence with a limited budget of likelihood evaluations, unlike other methods that work only in very low dimensions. Our framework thus serves as a novel tool for posterior and model inference with expensive, black-box likelihoods.
Hamiltonian Monte Carlo (HMC) sampling methods provide a mechanism for defining distant proposals with high acceptance probabilities in a Metropolis-Hastings framework, enabling more efficient exploration of the state space than standard random-walk proposals. The popularity of such methods has grown significantly in recent years. However, a limitation of HMC methods is the required gradient computation for simulating the Hamiltonian dynamical system; such computation is infeasible in problems involving a large sample size or streaming data. Instead, we must rely on noisy gradient estimates computed from subsets of the data. In this paper, we explore the properties of such a stochastic gradient HMC approach. Surprisingly, the natural implementation of the stochastic approximation can be arbitrarily bad. To address this problem, we introduce a variant that uses second-order Langevin dynamics with a friction term that counteracts the effects of the noisy gradient, maintaining the desired target distribution as the invariant distribution. Results on simulated data validate our theory. We also provide an application of our methods to a classification task using neural networks and to online Bayesian matrix factorization.
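A minimal one-dimensional sketch of the friction-corrected dynamics, with unit mass, the noise-variance estimate B-hat taken to be zero, and illustrative values for the step size, friction, and gradient noise:

```python
import math
import random

def sghmc(noisy_grad_u, n_steps, eps=0.05, friction=1.0, rng=None):
    """SGHMC with a friction term: the momentum v absorbs the noisy
    gradient, while friction drains the extra energy the gradient
    noise injects, keeping the target invariant (assumes B-hat = 0)."""
    theta, v = 0.0, 0.0
    samples = []
    noise_std = math.sqrt(2.0 * friction * eps)
    for _ in range(n_steps):
        v += (-eps * noisy_grad_u(theta)
              - eps * friction * v
              + rng.gauss(0.0, noise_std))
        theta += eps * v
        samples.append(theta)
    return samples

# Target: standard normal, U(x) = x^2 / 2, with a noisy gradient estimate.
rng = random.Random(1)
chain = sghmc(lambda t: t + 0.5 * rng.gauss(0.0, 1.0), n_steps=20_000, rng=rng)
kept = chain[2_000:]  # discard burn-in
```

Dropping the friction term from the update above is exactly the "naive" variant the abstract warns about: the injected gradient noise then accumulates and the chain overheats.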
Large multilayer neural networks trained with backpropagation have recently achieved state-of-the-art results in a wide range of problems. However, learning in neural networks with backpropagation still has some disadvantages, e.g., having to tune a large number of hyperparameters to the data, lack of calibrated probabilistic predictions, and a tendency to overfit the training data. In principle, the Bayesian approach to learning neural networks does not have these problems. However, existing Bayesian techniques lack scalability to large dataset and network sizes. In this work, we present a novel scalable method for learning Bayesian neural networks, called probabilistic backpropagation (PBP). Similar to classical backpropagation, PBP works by computing a forward propagation of probabilities through the network and then performing a backward computation of gradients. A series of experiments on ten real-world datasets shows that PBP is significantly faster than other techniques, while offering competitive predictive abilities. Our experiments also show that PBP provides accurate estimates of the posterior variance of the network weights.
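The forward propagation of probabilities relies on Gaussian moment matching at each layer. As a self-contained illustration of the kind of identity involved (not PBP's full recursion), here is the closed-form first moment of a ReLU applied to a Gaussian pre-activation:

```python
import math

def relu_gaussian_mean(m, v):
    """Closed-form E[max(0, z)] for z ~ N(m, v): the moment-matching
    identity m*Phi(m/sqrt(v)) + sqrt(v)*phi(m/sqrt(v)), where phi and
    Phi are the standard normal pdf and cdf."""
    s = math.sqrt(v)
    pdf = math.exp(-0.5 * (m / s) ** 2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(m / (s * math.sqrt(2.0))))
    return m * cdf + s * pdf
```

PBP chains such moment computations layer by layer, keeping a Gaussian approximation of each activation instead of a point value.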
Uncertainty computation in deep learning is essential for the design of robust and reliable systems. Variational inference (VI) is a promising approach for such computation, but it requires more effort to implement and execute compared to maximum-likelihood methods. In this paper, we propose new natural-gradient algorithms that reduce this effort for Gaussian mean-field VI. Our algorithms can be implemented within the Adam optimizer by perturbing the network weights during gradient evaluations, and uncertainty estimates can be obtained cheaply by using the vector that adapts the learning rate. This requires lower memory, computation, and implementation effort than existing VI methods, while obtaining uncertainty estimates of comparable quality. Our empirical results confirm this and further show that the weight perturbation in our algorithms is useful for exploration in reinforcement learning and stochastic optimization.
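A heavily simplified sketch of the weight-perturbation idea (momentum, bias correction, and several details of the actual algorithm are omitted; all constants are illustrative): the Adam-style scale vector `s` doubles as a precision estimate, so Gaussian noise with variance 1/(N(s + prior)) is added to the weights before each gradient evaluation.

```python
import math
import random

def perturbed_adam_step(w, s, grad_fn, rng, lr=0.1, lam=1.0, n_data=100, beta2=0.99):
    """One simplified variational step in the spirit of the paper: evaluate
    the gradient at noise-perturbed weights, then do an Adam-like update.
    The noise scale is read off the second-moment vector s for free."""
    noise = [rng.gauss(0.0, 1.0) / math.sqrt(n_data * (si + lam)) for si in s]
    g = grad_fn([wi + ni for wi, ni in zip(w, noise)])
    s = [beta2 * si + (1.0 - beta2) * gi * gi for si, gi in zip(s, g)]
    w = [wi - lr * (gi + lam * wi / n_data) / (math.sqrt(si) + lam / n_data)
         for wi, gi, si in zip(w, g, s)]
    return w, s

# Toy problem: minimize (w - 2)^2 over a single scalar weight.
rng = random.Random(0)
w, s = [0.0], [0.0]
for _ in range(1500):
    w, s = perturbed_adam_step(w, s, lambda ws: [2.0 * (ws[0] - 2.0)], rng)
```

The only change relative to a plain Adam loop is the `noise` line, which is what makes the iterates samples from an approximate Gaussian posterior rather than a point estimate.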
Deep Gaussian processes (DGPs) are multi-layer hierarchical generalisations of Gaussian processes (GPs) and are formally equivalent to neural networks with multiple, infinitely wide hidden layers. DGPs are nonparametric probabilistic models and as such are arguably more flexible, have a greater capacity to generalise, and provide better calibrated uncertainty estimates than alternative deep models. This paper develops a new approximate Bayesian learning scheme that enables DGPs to be applied to a range of medium to large scale regression problems for the first time. The new method uses an approximate Expectation Propagation procedure and a novel and efficient extension of the probabilistic backpropagation algorithm for learning. We evaluate the new method for non-linear regression on eleven real-world datasets, showing that it always outperforms GP regression and is almost always better than state-of-the-art deterministic and sampling-based approximate inference methods for Bayesian neural networks. As a by-product, this work provides a comprehensive analysis of six approximate Bayesian methods for training neural networks.
This paper makes two contributions to Bayesian machine learning algorithms. Firstly, we propose stochastic natural gradient expectation propagation (SNEP), a novel alternative to expectation propagation (EP), a popular variational inference algorithm. SNEP is a black box variational algorithm, in that it does not require any simplifying assumptions on the distribution of interest, beyond the existence of some Monte Carlo sampler for estimating the moments of the EP tilted distributions. Further, as opposed to EP which has no guarantee of convergence, SNEP can be shown to be convergent, even when using Monte Carlo moment estimates. Secondly, we propose a novel architecture for distributed Bayesian learning which we call the posterior server. The posterior server allows scalable and robust Bayesian learning in cases where a data set is stored in a distributed manner across a cluster, with each compute node containing a disjoint subset of data. An independent Monte Carlo sampler is run on each compute node, with direct access only to the local data subset, but which targets an approximation to the global posterior distribution given all data across the whole cluster. This is achieved by using a distributed asynchronous implementation of SNEP to pass messages across the cluster. We demonstrate SNEP and the posterior server on distributed Bayesian learning of logistic regression and neural networks.
Hamiltonian Monte Carlo (HMC) samples efficiently from high-dimensional posterior distributions with proposed parameter draws obtained by iterating on a discretized version of the Hamiltonian dynamics. The iterations make HMC computationally costly, especially in problems with large datasets, since it is necessary to compute posterior densities and their derivatives with respect to the parameters. Naively computing the Hamiltonian dynamics on a subset of the data causes HMC to lose its key ability to generate distant parameter proposals with high acceptance probability. The key insight in our article is that efficient subsampling HMC for the parameters is possible if both the dynamics and the acceptance probability are computed from the same data subsample in each complete HMC iteration. We show that this is possible to do in a principled way in an HMC-within-Gibbs framework where the subsample is updated using a pseudo-marginal MH step and the parameters are then updated using an HMC step, based on the current subsample. We show that our subsampling methods are fast and compare favorably to two popular sampling algorithms that utilize gradient estimates from data subsampling. We also explore the current limitations of subsampling HMC algorithms by varying the quality of the variance-reducing control variates used in the estimators of the posterior density and its gradients.
Stochastic gradient Markov chain Monte Carlo (SG-MCMC) has become increasingly popular for simulating posterior samples in large-scale Bayesian models. However, existing SG-MCMC schemes are not tailored to any specific probabilistic model, and even simple modifications of the underlying dynamical system require significant physical intuition. This paper presents the first learning algorithm that allows the automatic design of the underlying continuous dynamics of an SG-MCMC sampler. The learned sampler generalizes Hamiltonian dynamics with state-dependent drift and diffusion, enabling fast traversal and efficient exploration of neural network energy landscapes. Experiments validate the proposed approach on Bayesian fully connected neural network and Bayesian recurrent neural network tasks, showing that the learned sampler outperforms generic, hand-designed SG-MCMC algorithms and generalizes to different datasets and larger architectures.
Bayesian optimization has recently been proposed as a framework for automatically tuning the hyperparameters of machine learning models and has been shown to yield state-of-the-art performance with impressive ease and efficiency. In this paper, we explore whether it is possible to transfer the knowledge gained from previous optimizations to new tasks in order to find optimal hyperparameter settings more efficiently. Our approach is based on extending multi-task Gaussian processes to the framework of Bayesian optimization. We show that this method significantly speeds up the optimization process when compared to the standard single-task approach. We further propose a straightforward extension of our algorithm in order to jointly minimize the average error across multiple tasks and demonstrate how this can be used to greatly speed up k-fold cross-validation. Lastly, we propose an adaptation of a recently developed acquisition function, entropy search, to the cost-sensitive, multi-task setting. We demonstrate the utility of this new acquisition function by leveraging a small dataset to explore hyperparameter settings for a large dataset. Our algorithm dynamically chooses which dataset to query in order to yield the most information per unit cost.
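The multi-task GP extension rests on a product kernel: covariance between two observations is input similarity times a learned task-similarity matrix. A minimal sketch, with a squared-exponential input kernel and a fixed task matrix (both illustrative; in the method above the task covariance is learned from data):

```python
import math

def input_kernel(x, x2, lengthscale=1.0):
    """Squared-exponential kernel on the (here scalar) hyperparameter space."""
    return math.exp(-0.5 * ((x - x2) / lengthscale) ** 2)

def multitask_kernel(x, t, x2, t2, task_cov):
    """Product kernel in the intrinsic-coregionalization style:
    k((x, t), (x', t')) = k_x(x, x') * K_task[t][t']."""
    return input_kernel(x, x2) * task_cov[t][t2]

# Two related tasks (e.g. small vs. large dataset) with correlation 0.8:
K_task = [[1.0, 0.8], [0.8, 1.0]]
same = multitask_kernel(0.3, 0, 0.3, 0, K_task)
cross = multitask_kernel(0.3, 0, 0.3, 1, K_task)
```

With correlated tasks, cheap evaluations on one task shrink the posterior on the other, which is what makes the cost-aware querying in the abstract pay off.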
Bayesian optimization (BO) refers to a suite of techniques for global optimization of expensive black-box functions, which use introspective Bayesian models of the function to efficiently find its optimum. While BO has been successfully applied in many applications, modern optimization tasks usher in new challenges where conventional methods fail. In this work, we present Dragonfly, an open-source Python library for scalable and robust BO. Dragonfly incorporates multiple recently developed methods that allow BO to be applied in challenging real-world settings; these include better methods for handling higher-dimensional domains, methods for handling multi-fidelity evaluations when cheap approximations of an expensive function are available, methods for optimizing over structured combinatorial spaces such as the space of neural network architectures, and methods for handling parallel evaluations. Additionally, we develop new methodological improvements in BO for selecting the Bayesian model, selecting the acquisition function, and optimizing over complex domains with different variable types and additional constraints. We compare Dragonfly to a suite of other packages and algorithms for global optimization and demonstrate that when the above methods are integrated, they can significantly improve the performance of BO. The Dragonfly library is available at dragonfly.github.io.
Many practical applications of machine learning require data-efficient black-box function optimization, e.g., to identify hyperparameters or process settings. However, readily available algorithms are typically designed to be universal optimizers and are therefore often suboptimal for specific tasks. We therefore propose a method for learning optimizers that are automatically adapted to a given class of objective functions, e.g., in the context of sim-to-real applications. Instead of learning optimization from scratch, the proposed approach builds on the well-known Bayesian optimization framework. Only the acquisition function (AF) is replaced by a learned neural network, so the resulting algorithm can still leverage the proven generalization capabilities of Gaussian processes. We conduct experiments on several simulated as well as simulation-to-real transfer tasks. The results show that the learned optimizers (1) consistently perform better than, or on par with, known AFs on general function classes, and (2) can automatically identify structural properties of a function class using cheap simulations and transfer this knowledge to adapt rapidly to real hardware tasks, thereby significantly outperforming existing problem-agnostic AFs.