The case for fully Bayesian optimisation in small-sample trials

Yuji Saikai

Abstract

While sample efficiency is the main motive for use of Bayesian optimisation when black-box functions are expensive to evaluate, the standard approach based on type II maximum likelihood (ML-II) may fail and result in disappointing performance in small-sample trials. The paper provides three compelling reasons to adopt fully Bayesian optimisation (FBO) as an alternative. First, failures of ML-II are more commonplace than implied by the existing studies using the contrived settings. Second, FBO is more robust than ML-II, and the price of robustness is almost trivial. Third, FBO has become simple to implement and fast enough to be practical. The paper supports the argument using relevant experiments, which reflect the current practice regarding models, algorithms, and software platforms. Since the benefits seem to outweigh the costs, researchers should consider adopting FBO for their applications so that they can guard against potential failures that end up wasting precious research resources.

Keywords: BoTorch, expected improvement, fully Bayesian Gaussian process, Pyro, type II maximum likelihood

\addbibresource

references.bib \AtEveryBibitem\clearfieldmonth \AtEveryBibitem\clearfieldday \AtEveryBibitem\clearfieldversion ^†^†affiliationtext: School of Mathematics and Statistics, the University of Melbourne

1 Introduction

Sample efficiency is the predominant motive for use of Bayesian optimisation (BO) when optimising black-box functions that are expensive to evaluate. Function evaluations are said to be expensive when each evaluation costs a significant amount of resources including time and money. Function evaluations take place typically in the form of either real-world trials or computer simulations. While the former case clearly can be expensive, because of the lack of analytic expressions for objective functions and/or constraints, the latter case can also be expensive. For example, researchers used BO to learn robot controllers with only a handful of trials \citepchatzilygeroudis_survey_2020. Consequently, in practical applications of BO, small-sample performance is of the greatest concern.

Gaussian process (GP) is the de facto standard surrogate model for BO, and GPs must be specified in terms of kernel hyperparameters before applying BO. Since hyperparameters suitable for particular problems are rarely known beforehand, these are estimated based on observations and sequentially updated throughout the trials, commonly using type II maximum likelihood (ML-II) \citepjones_efficient_1998, mackay_comparison_1999, lalchand_approximate_2020. In practice, estimating hyperparameters with small samples is a challenging task \citepwang_theoretical_2014. In particular, maximisation of the marginal likelihood is numerically difficult because a wide range of length-scales, both very short and intermediate values, are generally consistent with small data \citeprasmussen_gaussian_2006. Several studies provide detailed accounts of how BO fails in various situations \citepjones_taxonomy_2001, forrester_global_2008,bull_convergence_2011,benassi_robust_2011. Nonetheless, despite the challenge, this is precisely the kind of problems for which researchers use BO—locating inputs that yield good enough function values using only small samples.

To address the challenge, the fully Bayesian approach has been proposed in the context of GP regression \citepohagan_curve_1978,handcock_bayesian_1993 and specifically in the context of BO \citeplocatelli_adaptive_1995,locatelli_bayesian_1997,osborne_gaussian_2009. \citetbenassi_robust_2011 demonstrated that fully Bayesian optimisation (FBO) was statistically more robust in adversarial circumstances than ML-II. However, the study imposed strong assumptions, namely, the use of conjugate prior for output-scale, the lack of automatic relevance determination, and discretisation of length-scale. In today’s hardware and software environments, these assumptions are unnecessarily restrictive and should be relaxed to obtain more general insights.

Over the past decade, a handful of studies adopted FBO that took advantage of more sophisticated algorithms and modern software platforms \citepsnoek_practical_2012,hernandez-lobato_predictive_2014,eriksson_high-dimensional_2021. However, in these studies, FBO was adopted only because it served for evaluating their theoretical claims, and no further justification for the adoption was provided \citepde_ath_how_2021. Apart from these studies, the research on FBO remains sparse, and direct comparison between ML-II and FBO is virtually non-existent in the recent literature.

This paper provides three compelling reasons why researchers should consider adopting FBO for their applications where evaluations are expensive. First, failures of ML-II are more commonplace than implied by the existing studies which used the contrived scenarios. To support this argument, the experiments are set up only using common specifications, which other researchers may find in their applications, including those of the objective function, random initialisation, models, and algorithms. Second, FBO is more robust than ML-II, and the price of the robustness is almost trivial. This is shown in the results obtained from the identical experimental settings. In addition, to make the results relevant in modern applications, the restrictive assumptions in FBO specifications previously used in the literature are relaxed. Third, thanks to the modern software platforms, FBO is now simple to implement and fast enough to be practical. To demonstrate this, the experiments are implemented using Pyro \citepbingham_pyro_2019 and BoTorch \citepbalandat_botorch_2020. In addition, the code used to implement the experiments is written deliberately in a simple fashion so that it can serve as a template for other applications. The last point is a small but practical contribution because the existing code examples for FBO found in BoTorch package are written specifically for the particular study [eriksson_high-dimensional_2021], and modification for other FBO applications is not very straightforward.

In what follows, first, BO is reviewed as a technique to address a dynamic problem, which is deemed a more accurate formulation when involving evaluation budgets. The reader is reminded that, in this formulation, an optimal solution is intractable and suboptimality of BO is by design in practical applications (Section 2). Next, FBO is reviewed where the marginalised acquisition function and its approximation are highlighted as the key difference (Section 3). Then, the experiments are described with emphasis of the roles played by Pyro and BoTorch, and the results are discussed (Section 4). Finally, some concluding remarks are provided for future applications (Section 5). All the code and data are available on the website (https://github.com/ysaikai/case4fbo).

2 Bayesian optimisation

A common formulation of the BO problem is a static optimisation problem:

min x \in X f (x)

(1)

where $f$ is the black-box objective function, and $X$ is a compact subset of Euclidean space. When BO is adopted in practice, however, it is often the case that function evaluations are expensive, and there is an evaluation budget. If an evaluation budget is explicitly modelled, a more accurate problem formulation is the following.

Let $N$ be the total number of evaluations allowed. Then, the BO problem is a sequential decision problem; that is, choose $x_{n} \in X$ for each $n \in {1, \dots, N}$ based on the history, the preceding observations

{(x_{0}, f (x_{0})), \dots, (x_{n - 1}, f (x_{n - 1}))},

in order to minimise the best observed value after $N$ evaluations:

min n \in {1, \dots, N} f (x_{n}) .

Note that this is a dynamic problem as each choice depends on the preceding choices.

Since evaluations are allowed to be made only $N$ times, it is necessary to alternatively search for an optimal sequence $x_{1}^{*}, \dots, x_{N}^{*}$ without directly evaluating $f$ . This is where the notion of surrogate comes in. When BO with a GP surrogate is used to address the problem, an implicit assumption is that $f$ is a sample path of the specified GP as a stochastic process. For the fixed GP, it is known that an optimal sequence ${^x}_{1}, \dots, {^x}_{N}$ in terms of the Bayes risk is, in principle, obtained by dynamic programming \citepbetro_bayesian_1991. However, except for trivial cases, the problem is numerically intractable \citepbenassi_robust_2011. Furthermore, as mentioned above, specifying a suitable GP for a particular problem beforehand is rarely possible in practical problems. Instead, it is specified based on observations and sequentially updated throughout the trials.

The expected improvement (EI) acquisition function is a practical one-step lookahead strategy \citepmockus_application_1978, jones_efficient_1998. As a myopic strategy, EI is necessarily suboptimal \citepginsbourger_towards_2010 but proved reasonable in many applications \citepshahriari_taking_2015. Although many different acquisition functions have been proposed, EI maintains the practical advantage owing to the intuitive notion of “improvement”, the relatively light computational requirement, and the proven performance.

3 Fully Bayesian optimisation

Performance of BO is dictated by the GP surrogate and the acquisition function. Since the former seems to take precedence over the latter [shahriari_taking_2015], and a GP is characterised by kernel hyperparameters, it is vitally important to carefully handle kernel hyperparameters and their uncertainty for successful BO applications. The principled way to handle parameter uncertainty is to follow the Bayesian formalism \citepgelman_bayesian_2013, according to which functions of uncertain parameters are weighted by the posterior probabilities of the parameters. (While the maximum a posteriori estimation is also used in practice to incorporate prior information, in the light of the Bayesian formalism, it does not fully reflect the parameter uncertainty and omitted in this paper.)

Let $θ$ and $D_{n}$ denote respectively the hyperparameter and the data observed for the first $n$ steps. Also, let $α (x | θ, D_{n})$ and $p (θ | D_{n})$ denote respectively the acquisition function and the posterior of $θ$ . Then, the marginalised acquisition function, the weighted average according to $p (θ | D_{n})$ , is

α (x | D_{n}) = E_{θ | D_{n}} α (x | θ, D_{n}) = \int α (x | θ, D_{n}) p (θ | D_{n}) d θ .

(2)

In FBO, this is the function to maximise for the next query point

x_{n + 1} \in argmax x \in X α (x | D_{n}) .

(3)

In other words, due to the randomness in $θ$ , it is a stochastic optimisation problem. Since the above integral generally permits no analytic expression for $α (x | D_{n})$ , in practice, use of approximation methods is required to carry out the maximisation. In BO contexts, the maximisation is commonly addressed by the sample average approximation \citephomem-de-mello_monte_2014, which is well supported by BoTorch as reviewed below.

Except for trivial cases, the posterior $p (θ | D_{n})$ is a complicated distribution, for which no off-the-shelf sampler is available, so it is necessary to use some approximation methods to obtain samples. Among many different methods, Hamiltonian Monte Carlo is the gold standard \citeperiksson_high-dimensional_2021 and, in particular, the No-U-Turn sampler \citephoffman_no-u-turn_2014 is an attractive option as it spares researchers the trouble of manually tuning the step size and path length in Hamiltonian Monte Carlo.

After obtaining $M$ Monte Carlo samples, the marginalised acquisition function is approximated by the finite sum:

α (x | D_{n}) \approx \frac{1}{M} M \sum m = 1 α (x | θ_{m}, D_{n}) .

(4)

Given $M$ samples, the right-hand side is a deterministic function, which is commonly maximised using quasi-Newton methods such as L-BFGS-B [byrd_limited_1995].

4 Experiments

4.1 The test problem

To compare performances of the two approaches, ML-II and FBO, the experiments were conducted using common specifications, which many researchers may find in their applications. The performance metrics is simple “regret” after $N$ evaluations, difference between the true minimum and the best observed value over $N$ evaluations. Different $N \in {1, \dots, 30}$ corresponds to a different evaluation budget. The following table contains key specifications of the test problem.

Objective function	Ackley function on $X \subset R^{2}$
GP Kernel	Matérn 5/2
GP Mean function	0 for all $x$
Noise variance	$10^{- 6}$
Prior for output-scale	Log normal with mean 10 and standard deviation 10
Prior for length-scale	Log normal with mean 0.5 and standard deviation 0.5

The objective function was Ackley function \citepackley_connectionist_1987, one of the common optimisation test functions \citepsurjanovic_optimization_nodate. The function was specified by following the recommendation \citepsurjanovic_optimization_nodate, which is also implemented and the default setting in BoTorch. For example, the function domain was set to $X = [- 32.768, 32.768] \times [- 32.768, 32.768]$ .

The total 101 experiments were conducted, corresponding to 101 distinct $x_{0}$ and random seeds (0-100) appropriately set for reproducibility. Fixing random seeds was necessary not only for the hyperparameter sampling but also for the random multistart used by BoTorch in optimisation of acquisition functions and marginal likelihood. Each experiment was initialised by $x_{0}$ , which was shared between ML-II and FBO. Then, the subsequent $N$ observations were made by following each approach.

The acquisition function is EI, which works relatively well for noise-less test functions \citepde_ath_how_2021. The class of GP is a basic one—the zero-mean function, Matérn(5/2) kernel with automatic relevance determination, and an output-scale parameter. Although the objective function is noise-less, following the convention, the noise variance is set to $10^{- 6}$ for numerical stability. As a result, there were three hyperparameters: one output-scale and two length-scales. In FBO, three prior distributions were assumed to be independent.

While gamma distribution is also common, for both output-scale and length-scales, log normal distribution was adopted as it has strictly diminishing probability towards 0, which helps to avoid sampling unreasonably small length-scales, which could in turn cause numerical problems. The mean for each prior was chosen based on empirical evidence, while the standard deviation was set so that the relative standard deviation was equal to 1.

Note that the prior for length-scale reflects the fact that inputs to GP are normalised to between 0 and 1. Note also that output observations used for GP are standardised to have mean 0 and standard deviation 1, which justifies the use of the zero-mean function. The sampler was Hamiltonian Monte Carlo and the No-U-Turn sampler. Following the recommendation \citeperiksson_high-dimensional_2021, at each iteration, $512$ samples were first drawn as warm-up/burn-in samples and then $256$ samples were drawn, which were thinned to $M = 16 = 256 / 16$ samples by keeping only every 16th sample.

Many of the existing software implementations specify log normal distribution with two parameters for the mean and standard deviation of the corresponding normal distribution. In this case, when the desired mean and standard deviation of log normal distribution are $μ$ and $σ$ , following are the corresponding parameters of normal distribution:

	$μ_{N}$	$= log μ - \frac{1}{2} log (\frac{σ^{2}}{μ^{2}} + 1)$
	$σ_{N}$	$= \sqrt{log (\frac{σ^{2}}{μ^{2}} + 1)} .$

4.2 Software platforms

Software platforms play a crucial role in modern machine learning research and practice. In BO applications, key numerical operations include computation of posterior GP, optimisation of acquisition functions, optimisation of marginal likelihood (for ML-II), and sampling of hyperparameters (for FBO). In this paper, Pyro and BoTorch were used to implement the experiments. The software versions used for the experimentation are 3.9.7, 1.8.0, and 0.6.4 for Python, Pyro, and BoTorch respectively.

Pyro facilitates posterior sampling of kernel hyperparameters. In general, Hamiltonian Monte Carlo and the No-U-Turn sampler are sophisticated algorithms that compute the gradient of the posterior probability density and solve the ordinary differential equations based on the Hamiltonian while automatically tuning the step size and path length for the solver. As a result, manual implementation requires a significant amount of coding. With Pyro, however, the implementation becomes quite simple and straightforward; essentially, it is only necessary to specify the priors and desired number of samples.

BoTorch facilitates implementation of GPs and acquisition functions, computation of posterior GP, optimisation of marginal likelihood, and optimisation of acquisition functions. While in FBO optimisation of the acquisition function takes place using the sample average approximation (Eq.4), BoTorch is designed to handle this operation efficiently.

Both platforms are built upon PyTorch and in particular take advantage of its automatic differentiation functionality, which facilitates computing the gradient in optimisation by BoTorch and Hamiltonian Monte Carlo by Pyro. Moreover, both platforms automatically implement parallel processing and utilise multiple cores of CPU/GPU if available.

4.3 Results and Discussion

Figure 1 illustrates the results by plotting the regret at each evaluation budget $N \in {1, \dots, 30}$ for both ML-II and FBO in terms of median and region formed by the 10th and 90th percentiles. The regret distribution at $N = 0$ , which is identical between ML-II and FBO, reflects the distribution of function evaluations $f (x_{0})$ at 101 random initial $x_{0}$ . This initial distribution largely remains unchanged at $N = 1$ and $N = 2$ . Then, indicated by the downward-sloped median curves, performances of both approaches start to improve at $N = 3$ and continue until $N = 30$ , although the rate of improvement slows down after $N \approx 10$ . Throughout all the evaluation budgets, the median performances and downward dispersion from the medians are very similar between ML-II and FBO. In contrast, as $N$ increases, two distributions in terms of upward dispersion increasingly diverge. While the upward dispersion for FBO continues to shrink, the one for ML-II remains large, indicating the greater number of failures. For more details, Figure 2 and 3 plot histograms for $N = 20$ and $30$ respectively.

Figure 1: The results from 101 experiments for each number of evaluation budgets (N=1,…,30). 101 experiments are distinguished from each other by 101 distinct initial samples $x_{0}$ and random seeds. The solid lines represent the median values. The shaded regions are formed by the 10th and 90th percentiles.

Figure 2: Regret distribution of 101 samples after 20 evaluations.

Figure 3: Regret distribution of 101 samples after 30 evaluations.

Given the strong similarity in median and downward dispersion, the shrinking upward dispersion indicates that FBO has clear advantage over ML-II in terms of statistical robustness. However, bear in mind that the advantage is only statistical, and FBO certainly can fail as indicated by the occurrence of large regrets in Figure 2 and 3. First of all, since FBO uses sampling of random hyperparameters, poor choices of output-scale and length-scale can happen. However, the influence of these unlucky samples can be reduced by increasing the number of Monte Carlo samples ( $M$ in Eq.4).

More fundamentally, even if using the exact marginalised acquisition function (Eq.2), FBO with EI (or any other practical acquisition function) is still myopic and may fail when $x_{0}$ is unfavourable. As analysed in the literature, such unfavourable initial samples “fool” the interpretation of data and lead to “deceptive” GPs with ill-updated hyperparameters \citepforrester_global_2008, benassi_robust_2011, which in turn result in uninformative acquisition functions and undermine the effectiveness of BO. In real-world applications where function evaluations are expensive, researchers may not have enough information about problems and have to choose initial samples more of less randomly. Consequently, the failures caused by random $x_{0}$ in the experiments are deemed a relevant warning to researchers.

What FBO can achieve is to dilute the harm by adopting a continuum of data interpretations and averaging them out according to the posterior distribution of hyperparameters. Even 10 years ago, without restrictive assumptions, FBO might not have been a practical option. Now, thanks to the recent development of both software and hardware technologies, it can be a viable option for many practical problems. Given the ease of implementation, the price of the robustness advantage of FBO is only additional computation, which is approximately 10 times greater than for ML-II. Considering the expensiveness of experiments, however, the extra computing time is relatively trivial in many applications. Thus, use of FBO is a simple way to guard against potential failures that end up wasting precious research resources.

5 Conclusion

While small-sample performance is the main reason to use Bayesian optimisation when function evaluations are expensive, the standard approach based on type II maximum likelihood may fail in many practical applications. The paper made the case for fully Bayesian optimisation as a more robust and viable alternative. Thanks to the modern software and hardware technologies, the implementation is simple, and the execution is fast enough to be practical in many cases. Since the benefits seem to outweigh the costs, researchers should consider adopting fully Bayesian optimisation for their applications so that they can guard against potential failures that end up wasting precious research resources.

\printbibliography