Bézier Gaussian Processes
for Tall and Wide Data

Martin Jørgensen and
Michael A. Osborne

Abstract.

Modern approximations to Gaussian processes are suitable for “tall data”, with a cost that scales well in the number of observations, but under-perform on “wide data”, scaling poorly in the number of input features. That is, as the number of input features grows, good predictive performance requires the number of summarising variables, and their associated cost, to grow rapidly. We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features, but requires only linear cost in both the number of observations and the number of input features. This scaling is achieved through our introduction of the Bézier buttress, which allows approximate inference without computing matrix inverses or determinants. We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression, and empirically demonstrate the kernel’s ability to scale to both tall and wide datasets.

Martin Jørgensen, University of Oxford, martinj@robots.ox.ac.uk

Michael A. Osborne, University of Oxford, mosb@robots.ox.ac.uk

Gaussian processes (GPs) are a probabilistic approach to modelling functions that permit tractable Bayesian inference. They are, however, notorious for their poor scalability. In recent decades, this criticism has been challenged. Several approximate methods now allow GPs to scale to millions of data points. Yet, scalability in the number of data points is merely one challenge of big data. There are still problems associated with the input dimensionality – one aspect of the famed curse of dimensionality. burt2020gpviconv analysed the most studied approximation, the so-called sparse inducing points methods, and showed it to be accurate for low dimensional inputs. Alarmingly, exponentially many inducing points are still needed in high-dimensional input spaces, that is, for problems with a large number of features. As such, despite modern GP approximations scaling to tall data, they are still discounted when concerning wide data.

In response to this, there exist GP approximations built on simplices or grid-structures in the input space (wilson2015kernel; pmlr-v84-gardner18a; kapoor2021skiing). These take advantage of attractive fast linear algebra, but are often limited by memory in higher dimensions. Their advantage is the ability to fill the input space with structured points, so all observations have a close neighbour.

We propose a new kernel for GP regression that requires neither matrix inversion nor determinant calculation – GPs’ two core computational sinners. Additionally, we cover the input space with exponentially many points, but introduce an approximation that grows only linearly in computational complexity. That is, our method scales linearly in both the number of data points and the number of input dimensions, whilst being space-filling in the input domain. A limiting assumption of our kernel is its restriction to a box-bounded domain in the input space.

GPs are indispensable to fields where uncertainty is a driver in decision-making mechanisms. Such fields include Bayesian optimisation, active learning and reinforcement learning. The critical decision mechanism is the exploration-exploitation trade-off. One ability useful in such fields is to assign high uncertainty to unexplored regions, just as an exact GP does. We show that our proposed model also assigns high uncertainty to unexplored regions, suggesting that our model is well-suited to decision-making problems.

1. Background

Bézier curves and surfaces are parametrised geometric objects that have found great usage in computer-aided design and robotics (prautzsch2002bezier). The simplest Bézier curve is the linear interpolation of two points $p_{0}$ and $p_{1}$ in $R^{D}$ ; the Bézier curve of order $1$

c (t) = (1 - t) p_{0} + t p_{1}, t \in [0, 1] .

(1.1)

Higher order curves are generalised in the following way: the order- $ν$ Bézier curve is defined as

c (t) = \sum_{i = 0}^{ν} B_{i}^{ν} (t) p_{i}, t \in [0, 1] .

(1.2)

In Bézier terms, $p_{i}$ are referred to as control points. Notice an order- $ν$ Bézier curve has $ν + 1$ control points. $B_{i}^{ν}$ denotes the $i$ th Bernstein polynomial of order $ν$ . They are defined as

B_{i}^{ν} (t) = \frac{ν!}{i! (ν - i)!} t^{i} (1 - t)^{ν - i} .

(1.3)
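As a minimal sketch of Eqs. (1.2) and (1.3), the Bernstein basis and a Bézier curve can be evaluated as follows. The function names `bernstein_basis` and `bezier_curve`, and the control points `p`, are illustrative assumptions and do not come from the paper.

```python
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Return the nu + 1 Bernstein polynomials of order nu at a scalar t, Eq. (1.3)."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def bezier_curve(control_points, t):
    """Evaluate an order-nu Bezier curve, Eq. (1.2).

    control_points has shape (nu + 1, D): one row per control point p_i.
    """
    nu = control_points.shape[0] - 1
    return bernstein_basis(nu, t) @ control_points


# Hypothetical control points in R^2 for an order-2 curve.
p = np.array([[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]])
start, middle, end = bezier_curve(p, 0.0), bezier_curve(p, 0.5), bezier_curve(p, 1.0)
```

The curve interpolates the first and last control points, and the basis sums to one for every $t \in [0, 1]$, so each curve point is a convex combination of the control points.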

By going from curves to surfaces we wish to extend from the scalar $t$ to a spatial input $x \in [0, 1]^{d}$ , for $d > 1$ . Here, we can define Bézier $d$ -surfaces as

c_{d} (x) = \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} B_{i_{1}}^{ν_{1}} (x_{1}) \dots B_{i_{d}}^{ν_{d}} (x_{d}) p_{i_{1}, \dots, i_{d}}, x = (x_{1}, x_{2}, \dots, x_{d}) \in [0, 1]^{d} .

(1.4)

Figure 1 gives a visual illustration of a $2$ -dimensional surface embedded in $R^{3}$ . In the literature, it is difficult to find any studies of $d$ -surfaces for $d > 2$ . This paper especially targets this high-dimensional input case. We restrict our output dimension to $1$ , the regression problem, but the methods naturally extend to multidimensional outputs. The red points in Figure 1 show how each control point has an associated location in the input space. They are placed on a grid-like structure, and the order of each dimension determines how fine the mesh-grid of the hypercube is, i.e. how densely the input space is filled.
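The tensor-product sum in Eq. (1.4) can be sketched by contracting one input dimension at a time. This illustration assumes scalar outputs and stores every control point explicitly in an array `control` (a hypothetical name), so its memory cost grows with the full grid; it is meant only to make the definition concrete.

```python
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Bernstein polynomials of order nu evaluated at a scalar t."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def bezier_surface(control, x):
    """Evaluate Eq. (1.4) for scalar control points.

    control has shape (nu_1 + 1, ..., nu_d + 1) and x has length d.
    """
    out = control
    for x_g in x:
        nu_g = out.shape[0] - 1
        # Contract the leading axis with that dimension's Bernstein weights;
        # the next dimension then becomes the leading axis.
        out = np.tensordot(bernstein_basis(nu_g, x_g), out, axes=(0, 0))
    return float(out)


rng = np.random.default_rng(0)
control = rng.normal(size=(3, 4))  # orders nu_1 = 2 and nu_2 = 3
corner_value = bezier_surface(control, [0.0, 1.0])
```

At a corner of the hypercube only one basis function per dimension is non-zero, so the surface passes through the corresponding corner control point.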

Gaussian processes (GPs) are meticulously studied in probability and statistics (williams2006gaussian). They provide a way to define a probability distribution over functions. This makes them useful, as priors, to build structures for quantifying uncertainty in prediction. They are defined as a probability measure over functions $f : X \to R$ , such that any collection of elements $(x_{1}, x_{2}, \dots, x_{n})$ in $X$ have their associated output $(f (x_{1}), \dots, f (x_{n}))$ following a joint Gaussian distribution. This distribution is fully determined by a mean function $m : X \to R$ and a positive semi-definite kernel function $k : X \times X \to R$ . They admit exact Bayesian inference. However, exact inference comes at a prohibitive worst-case computational cost of $O (n^{3})$ , where $n$ is the number of training points, due to computing inverse and determinant of the kernel matrix.

Sparse Gaussian processes (snelson2005sparse) overcome this burden by conditioning on $m$ inducing points, reducing complexity to $O (n m^{2})$ , where usually $m \ll n$ . The inducing points, denoted $u$ , are then marginalised to obtain an approximate posterior of $f$ . The variational posterior mean and variance at a location $x^{*}$ are then given by

	$E [f (x^{*})]$	$= k (x^{*}, Z) k (Z, Z)^{- 1} μ_{u},$		(1.5)
	$Var (f (x^{*}))$	$= k (x^{*}, x^{*}) - k (x^{*}, Z) k (Z, Z)^{- 1} (k (Z, Z) - Σ_{u}) k (Z, Z)^{- 1} k (Z, x^{*}),$		(1.6)

under the assumption of a constant zero prior mean function. This assumption is easily relaxed if needed. Here $Z$ denotes the inducing locations in the input space, i.e. $f (Z) = u \sim N (μ_{u}, Σ_{u}) .$ Under the further assumption of Gaussian observation noise $ϵ$ , i.e. $y^{*} = f (x^{*}) + ϵ$ , pmlr-v5-titsias09a showed that the optimal $μ_{u}$ and $Σ_{u}$ are known analytically. As an analogy to Figure 1, $Z$ would correspond to the red points and $u$ to the orange ones.
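A minimal sketch of Eqs. (1.5) and (1.6) for one-dimensional inputs is given below. The squared-exponential kernel `rbf`, its lengthscale, and the `jitter` term are assumptions made for illustration; the equations themselves hold for any kernel $k$ .

```python
import numpy as np


def rbf(a, b, lengthscale=0.2):
    """Squared-exponential kernel between 1-D input arrays (an assumed kernel choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)


def sparse_gp_predict(x_star, Z, mu_u, Sigma_u, jitter=1e-8):
    """Variational sparse GP mean and variance at x_star, Eqs. (1.5) and (1.6)."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))  # jitter is a numerical safeguard
    Ksz = rbf(x_star, Z)
    A = np.linalg.solve(Kzz, Ksz.T).T  # k(x*, Z) k(Z, Z)^{-1}
    mean = A @ mu_u
    # k(x*, x*) equals 1 for this kernel; only the diagonal of the quadratic form is needed.
    var = 1.0 - np.sum((A @ (Kzz - Sigma_u)) * A, axis=1)
    return mean, var


Z = np.linspace(0.0, 1.0, 5)
x_star = np.array([0.1, 0.7])
# With q(u) equal to the prior over u, the prediction falls back to the prior of f.
mean, var = sparse_gp_predict(x_star, Z, np.zeros(5), rbf(Z, Z))
```

The linear solve against $k (Z, Z)$ is the step whose cost grows cubically in the number of inducing points.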

2. Bézier Gaussian Processes

Inspired by Bézier surfaces, we construct a Gaussian process $f : [0, 1]^{d} \to R$ as

f (x) = \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} B_{i_{1}}^{ν_{1}} (x_{1}) \dots B_{i_{d}}^{ν_{d}} (x_{d}) P_{i_{1}, \dots, i_{d}},

(2.1)

where $P_{i_{1}, i_{2}, \dots, i_{d}} \sim N (ϑ_{i_{1}, \dots, i_{d}}, Σ_{i_{1}, \dots, i_{d}})$ are Gaussian variables and $x = (x_{1}, \dots, x_{d})$ . Here, $x_{γ} \in [0, 1]$ for $γ = 1, \dots, d$ . It is easy to verify that $f$ satisfies the definition of a GP since it, for any $x$ , is a scaled sum of Gaussians. We assume that all $P_{i_{1}, \dots, i_{d}}$ are fully independent. With that assumption, we can make the following observation for the mean and kernel function

	$μ (x)$	$:= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} B_{i_{1}}^{ν_{1}} (x_{1}) \dots B_{i_{d}}^{ν_{d}} (x_{d}) ϑ_{i_{1}, \dots, i_{d}}, \text{ and}$		(2.2)
	$k (x, z)$	$:= Cov (f (x), f (z)) = \sum_{i_{1} = 0}^{ν_{1}} \dots \sum_{i_{d} = 0}^{ν_{d}} B_{i_{1}}^{ν_{1}} (x_{1}) \dots B_{i_{d}}^{ν_{d}} (x_{d}) Σ_{i_{1}, \dots, i_{d}} B_{i_{1}}^{ν_{1}} (z_{1}) \dots B_{i_{d}}^{ν_{d}} (z_{d}) .$		(2.3)

The Bernstein polynomials can approximate any continuous function given the order is large enough, thus they make a good basis for GP regression (hildebrandt1933linear). Naturally, selecting a prior over $f$ comes down to selecting a prior over the random control points $P$ . The most common prior mean in GP regression, the constant zero function, is then easily obtained by $ϑ_{i_{1}, \dots, i_{d}} = 0$ for all $i$ . By construction, the choice of $Σ_{i_{1}, \dots, i_{d}}$ needs consideration to yield a convenient prior over $f$ . Mindlessly setting $Σ_{i_{1}, \dots, i_{d}} = 1$ would make $Var (f (x))$ collapse to zero quickly in the central region of the domain, especially as dimensions $d$ grow. This, of course, gives a much too narrow prior over $f$ .

Figure 2 (middle) shows that in the central region the standard deviation of $f$ is smaller due to the nature of the Bernstein polynomials. If we consider instead a two-dimensional input, the standard deviation would collapse even more, as we would then see the shrinking effect in both dimensions and multiply them. We can, however, adjust for this.
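The collapse can be checked numerically. The sketch below evaluates $Var (f (x))$ from Eq. (2.3) with $z = x$ and all $Σ_{i_{1}, \dots, i_{d}} = 1$ , in which case the sum factorises over dimensions. The helper name `unit_prior_variance` and the assumption of equal orders in every dimension are ours.

```python
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Bernstein polynomials of order nu evaluated at a scalar t."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def unit_prior_variance(nu, x):
    """Var(f(x)) when every control point has unit variance.

    With unit variances, Eq. (2.3) is a product over dimensions of
    sum_i B_i(x_gamma)^2, so no sum over the full grid is required.
    """
    per_dim = [np.sum(bernstein_basis(nu, x_g) ** 2) for x_g in x]
    return float(np.prod(per_dim))


edge_1d = unit_prior_variance(20, [0.0])
centre_1d = unit_prior_variance(20, [0.5])
centre_2d = unit_prior_variance(20, [0.5, 0.5])
```

For order $20$ the variance at the centre of $[0, 1]$ is $\binom{40}{20} / 2^{40} \approx 0.125$ , against $1$ at the boundary, and in two dimensions the centre value is the square of that.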

We define the inverse squared Bernstein adjusted prior to counter this effect. In all dimensions $γ = 1, \dots, d$ , let

ς_{γ} = A_{γ}^{- 1} 1_{ν_{γ} + 1}, where (A_{γ})_{i, j} = (B_{j}^{ν_{γ}} (i / ν_{γ}))^{2},

(2.4)

and $ν_{γ}$ denotes the order of the dimension $γ$ . Then setting $Σ_{i_{1}, \dots, i_{d}} = \prod_{γ = 1}^{d} ς_{γ} (i_{γ})$ ensures that $Var (f (x)) \approx 1$ over the entire domain $[0, 1]^{d}$ . Eq. 2.4 solves a linear system, such that $Var (f (i / ν_{γ})) = 1$ , for $i = 0, \dots, ν_{γ}$ . This means a prior hardly distinguishable from standard stationary ones such as the RBF kernel. Visual representation of this prior is shown in Figure 2 (right). This adjustment works up to $ν_{γ} = 25$ , after which negative values occur.
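A small sketch of Eq. (2.4) for a single dimension follows: it builds the matrix of squared Bernstein polynomials at the grid nodes $i / ν$ and solves the linear system. The function name `adjusted_prior_variances` is an assumption, and `np.linalg.solve` is used here instead of forming the inverse explicitly.

```python
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Bernstein polynomials of order nu evaluated at a scalar t."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def adjusted_prior_variances(nu):
    """Solve A s = 1 with A[i, j] = B_j(i / nu)^2, as in Eq. (2.4)."""
    nodes = np.arange(nu + 1) / nu
    A = np.stack([bernstein_basis(nu, t) ** 2 for t in nodes])
    return np.linalg.solve(A, np.ones(nu + 1))


# Order 2 can be checked by hand: the nodes are 0, 0.5 and 1, and the middle
# equation reads 1/16 + s_1 / 4 + 1/16 = 1, giving s_1 = 3.5.
s2 = adjusted_prior_variances(2)
s5 = adjusted_prior_variances(5)
```

The central control points receive larger prior variances than those at the boundary, which compensates for the shrinkage of the squared Bernstein polynomials in the middle of the domain.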

Figure 2. Left: The $21$ Bernstein polynomials that make up the basis of a Bézier GP of order $20$ . We observe how they each impact a ‘local’ region along the [0,1] domain. Middle: If not accounting for the behaviour of the Bernstein polynomials, the variance of $f$ is smaller in the central region. Right: By adjusting, see Eq. 2.4, we can enforce a uniform variance over the domain.

Summarising, we introduced a kernel based on Bézier surfaces. An alternative viewpoint is that $f$ is a polynomial GP, but with a Bernstein basis rather than the canonical basis. We remark that $f$ is defined outside the domain $[0, 1]^{d}$ ; we do not consider any intuition about the prior there, and we will not investigate data points outside this domain. Of course, for practical purposes this domain generalises to any rectangular domain $[a_{1}, b_{1}] \times \dots \times [a_{d}, b_{d}]$ . For presentation purposes we keep writing $[0, 1]^{d}$ . Next, we show how we infer an approximate posterior given data.

2.1. Variational inference

Let $P$ denote the set of all $P_{i_{1}, \dots, i_{d}}$ . As the prior of $f$ is fully determined by the random control points $P$ , the posterior of $f$ is determined by the posterior of these. As per above, we set the prior

p (P) = \prod_{i_{1} = 0}^{ν_{1}} \prod_{i_{2} = 0}^{ν_{2}} \dots \prod_{i_{d} = 0}^{ν_{d}} p (P_{i_{1}, \dots, i_{d}}) := \prod_{i_{1} = 0}^{ν_{1}} \prod_{i_{2} = 0}^{ν_{2}} \dots \prod_{i_{d} = 0}^{ν_{d}} N (0, Σ_{i_{1}, \dots, i_{d}}),

(2.5)

where $Σ_{i_{1}, \dots, i_{d}} = \prod_{γ = 1}^{d} ς_{γ} (i_{γ})$ . We utilise variational inference to approximate the posterior of the control points, and hence $f$ . This means we introduce variational control points. We assume they are fully independent (usually called the mean-field assumption), and have free parameters for the mean and variance, such that $P_{i_{1}, \dots, i_{d}} \sim N (\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}, \hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}})$ .

Assume we have observed data $D = \{x_{j}, y_{j}\}_{j = 1}^{n}$ . The key quantity in variational inference is the Kullback-Leibler divergence between the true posterior $p (P | y)$ and the variational approximation, which we denote $q (P)$ . The smaller the divergence, the better the approximation of the true posterior. Without access to the true posterior, the quantity is not computable. However, it has been shown that this divergence is equal to the slack in Jensen’s inequality applied to the log-marginal likelihood $log p (y)$ :

	$log p (y)$	$= log \int p (y | P) p (P) d P \geq \int log (\frac{p (y | P) p (P)}{q (P)}) q (P) d P$		(2.6)
		$= E_{q (P)} [log p (y | P)] - KL (q (P) ∥ p (P)) .$		(2.7)

Knowing this, we can approximate the true posterior with $q (P)$ by maximising Eq. (2.7). This is the evidence lower bound, and it is maximised with respect to the variational parameters $\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}$ and $\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}$ . This is fully analytical when the variational parameters and $σ^{2}$ are known.

We assume our observation model is disturbed with additive Gaussian noise, which in other words means our likelihood is Gaussian

p (y_{j} | P) := N (y_{j} | f (x_{j}), σ^{2}), σ^{2} > 0,

(2.8)

for each $j = 1, \dots, n$ and we assume they are independent conditioned on $P$ . With these assumptions the first term in Eq. (2.7) becomes

E_{q (P)} [log p (y | P)] = - \frac{1}{2} \sum_{j = 1}^{n} \left[ log (2 π) + log (σ^{2}) + \frac{(y_{j} - E_{q (P)} [f (x_{j})])^{2} + {Var}_{q (P)} (f (x_{j}))}{σ^{2}} \right],

(2.9)

where $E_{q (P)} [f (x_{j})]$ and ${Var}_{q (P)} (f (x_{j}))$ are given as Eq. (2.2) and (2.3) respectively, but with the variational parameters $\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}$ and $\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}$ used.
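The data term of Eq. (2.9) only needs the variational mean and variance of $f$ at each observed input. A minimal sketch, assuming these have already been computed and are passed in as arrays (the argument names are ours):

```python
import numpy as np


def expected_log_likelihood(y, mean_f, var_f, noise_var):
    """Expected Gaussian log-likelihood under q, Eq. (2.9).

    y, mean_f and var_f are arrays of length n holding the targets and the
    variational mean and variance of f at the inputs; noise_var is sigma^2.
    """
    per_point = (
        np.log(2.0 * np.pi)
        + np.log(noise_var)
        + ((y - mean_f) ** 2 + var_f) / noise_var
    )
    return -0.5 * np.sum(per_point)


value = expected_log_likelihood(np.array([0.0]), np.array([0.0]), np.array([0.0]), 1.0)
```

Because the expression is a plain sum over observations, a mini-batch estimate is obtained by summing over a subset of points and rescaling by the ratio of dataset size to batch size.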

The second term in Eq. (2.7) enjoys the independence of control points to split into sums

	$KL (q (P) ∥ p (P))$	$= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} KL (q (P_{i_{1}, \dots, i_{d}}) ∥ p (P_{i_{1}, \dots, i_{d}}))$		(2.10)
		$= \frac{1}{2} \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \left\{ \frac{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}} - 1 + \frac{\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}^{2}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}} + log \frac{Σ_{i_{1}, i_{2}, \dots, i_{d}}}{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}} \right\} .$		(2.11)
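For a small number of control points, the mean-field KL term can be evaluated directly. The sketch below uses the standard closed form for the divergence between univariate Gaussians with a zero-mean prior, including its factor of one half. It stores every control point explicitly, so it illustrates the quantity being computed and not the scalable route taken later in the paper.

```python
import numpy as np


def kl_mean_field(q_mean, q_var, p_var):
    """Sum of KL(N(q_mean, q_var) || N(0, p_var)) over all control points.

    The three arrays share one shape, e.g. (nu_1 + 1, ..., nu_d + 1);
    q_var and p_var hold variances and must be positive.
    """
    terms = q_var / p_var - 1.0 + q_mean**2 / p_var + np.log(p_var / q_var)
    return 0.5 * np.sum(terms)


prior_var = np.array([1.0, 3.5, 1.0])  # hypothetical prior variances for order 2
kl_at_prior = kl_mean_field(np.zeros(3), prior_var, prior_var)
kl_shifted = kl_mean_field(np.array([1.0, 0.0, 0.0]), prior_var, prior_var)
```

The divergence vanishes when the variational distribution equals the prior, which is what pulls control points without nearby data back towards the prior.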

When inspecting the evidence lower bound, Eq. (2.7), we see it has a data term forcing control points to fit the data, and a KL-term making control points revert to the prior. Knowing how control points are allocated in the input domain, we expect control points in regions of no data to revert to the prior. This is similar to what stationary kernels do in said regions. We verify this visually in Section 4.

All together, Bézier GPs can be adjusted to have priors similar to stationary GPs, and have analogous posterior behaviour, which is favourable to many practitioners. But Bézier GPs scale. None of the terms in the evidence lower bound require matrix inversions or determinants. It is simple to mini-batch over the data points, utilising stochastic variational inference (hoffman2013stochastic), to scale it to large $n$ . However, nearly all terms require evaluations of huge sums if the input dimension is high. The next section is aimed at this problem.

2.2. Scalability with the Bézier buttress

Until this point, we have omitted addressing the number of random control points needed for Bézier GPs. Let us denote this number $τ$ . It can quickly be checked that $τ = \prod_{γ = 1}^{d} (ν_{γ} + 1)$ . This implies that to evaluate $f$ we must sum over $τ$ summands, which as $d$ increases, quickly becomes computationally cumbersome. $τ$ increases exponentially with $d$ . It is justifiable to view the random control points as inducing points; after all, they are Gaussian variables in the output space. Thus, it would be extremely valuable to manage exponentially many of them.
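To give a feeling for how quickly $τ$ grows, the count can be computed directly from the orders. The helper name `num_control_points` is an assumption; the two settings below mirror a small grid with three dimensions of order $3$ and a $17$ -dimensional input with order $20$ .

```python
from math import prod


def num_control_points(orders):
    """tau = prod_gamma (nu_gamma + 1) for a list of per-dimension orders."""
    return prod(nu + 1 for nu in orders)


small = num_control_points([3, 3, 3])    # three dimensions of order 3
large = num_control_points([20] * 17)    # seventeen dimensions of order 20
digits_in_large = len(str(large))        # the count has more than twenty digits
```

Summing explicitly over this many terms for every observation is clearly infeasible in the second setting.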

To overcome this, we introduce the Bézier buttress (a buttress is an architectural structure that provides support to a building). We assume the parameters of the random control points, say $ϑ$ , can be parametrised as $ϑ_{i_{1}, i_{2}, \dots, i_{d}} = \prod_{γ = 1}^{d} w_{i_{γ - 1}, i_{γ}, γ}$ , where $w_{0, i_{1}, 1} := w_{i_{1}, 1}$ . This assumption is the key to the Bézier buttress. Figure 3 provides a visualisation. It shows a source-sink graph, where each unique path from source to sink represents one unique control point with the above parametrisation. The cyan highlighted path represents $ϑ_{1, 2, 3} = w_{1, 1} w_{1, 2, 2} w_{2, 3, 3}$ , where we multiply the values along the path from source to sink. Notice that the last edges have value $1$ .

Figure 3. A Bézier buttress visualised. Here, the input dimension is $3$ , hence $3$ layers. Each layer has $4$ nodes, since each dimension has order $3$ . There exist $4^{3}$ paths from source (left square) to sink (right square), each path represents one unique control point. We can sum over all control points by sequential matrix multiplication from source to sink.

In the Bézier buttress there are $d$ layers, one for each input dimension, and $ν_{γ} + 1$ nodes in each layer $γ = 1, \dots, d$ . Borrowing from neural network terminology, a forward pass is a sequential series of matrix multiplications which are element-wise warped with non-linearities, such as $tanh$ or ReLU. If we let our sequence of matrices be $w_{1}, w_{2}, \dots, w_{d}$ , where $w_{γ}$ is the matrix with entries $\{w_{i, k, γ}\}_{i, k}$ , then fixing ‘the input’ to $1_{ν_{1} + 1}^{⊤}$ and the last matrix to $1_{ν_{d} + 1}$ , a forward pass is

1_{ν_{1} + 1}^{⊤} w_{1} w_{2} \dots w_{d} 1_{ν_{d} + 1} = \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} ϑ_{i_{1}, i_{2}, \dots, i_{d}} .

(2.12)

We see this forward pass computes a sum over all control point means. Naturally, we construct a Bézier buttress for summing over variances too, which must be restricted to positive weights. These sums formed our bottleneck; in this parametrisation, each is just a sequence of matrix products.
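The identity in Eq. (2.12) can be checked numerically on a small buttress. The sketch below assumes one particular storage layout: the first layer is kept as a vector holding $w_{i_{1}, 1}$ , and every later layer as a matrix. The names `buttress_sum` and `brute_force_sum` are illustrative.

```python
import itertools
import numpy as np


def buttress_sum(w_first, w_rest):
    """Sum over all control-point means by a chain of matrix products.

    w_first: vector of length nu_1 + 1 (first-layer weights).
    w_rest:  list of d - 1 matrices; entry [i, k] of the matrix for layer g
             links node i of the previous layer to node k of layer g.
    """
    state = w_first
    for w in w_rest:
        state = state @ w
    return float(state.sum())  # multiplication by the all-ones vector


def brute_force_sum(w_first, w_rest):
    """Enumerate every source-to-sink path and add up the path products."""
    sizes = [len(w_first)] + [w.shape[1] for w in w_rest]
    total = 0.0
    for idx in itertools.product(*(range(s) for s in sizes)):
        theta = w_first[idx[0]]
        for g, w in enumerate(w_rest):
            theta = theta * w[idx[g], idx[g + 1]]
        total += theta
    return total


rng = np.random.default_rng(0)
w_first = rng.normal(size=4)
w_rest = [rng.normal(size=(4, 4)) for _ in range(2)]  # three layers of four nodes
fast, slow = buttress_sum(w_first, w_rest), brute_force_sum(w_first, w_rest)
```

The brute-force loop visits all $4^{3}$ paths, whereas the chained product touches each weight once.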

What about computing $f$ ? It comes down to a use of ‘non-linearities’ in the buttress. Multiplying element-wise with the Bernstein polynomials as seen in Figure 3 (visualised only on the $3$ rd layer), a forward pass computes either $E [f (x)]$ , or $Var (f (x))$ if using squared Bernstein polynomials. Each control point is then multiplied by exactly the correct polynomials on its way from source to sink. Notice that the ‘input’ is fixed to $1$ in the source, but the observed $x$ appears via the Bernstein polynomials along the way. We can write this too as a sequence of matrix products

E [f (x)] = \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} B_{i_{1}}^{ν_{1}} (x_{1}) \dots B_{i_{d}}^{ν_{d}} (x_{d}) ϑ_{i_{1}, \dots, i_{d}} = 1_{ν_{1} + 1}^{⊤} w_{1} B_{x_{1}} \dots w_{d} B_{x_{d}} 1_{ν_{d} + 1},

(2.13)

where $B_{x_{γ}}$ is a diagonal matrix with the $ν_{γ} + 1$ Bernstein polynomials on its diagonal. For the variance of $f (x)$ there exists a similar expression, of course with squared polynomials, and with the positive weights associated with the variance Bézier buttress.
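A sketch of Eq. (2.13) follows, again under the assumed layout of a first-layer weight vector and later-layer weight matrices. Multiplying by the diagonal matrix $B_{x_{γ}}$ is implemented as an element-wise product, and the result is compared against the explicit sum over all control points. All function names are hypothetical.

```python
import itertools
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Bernstein polynomials of order nu evaluated at a scalar t."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def buttress_mean(w_first, w_rest, x):
    """E[f(x)] as a chain of products; orders are read off the weight shapes."""
    state = w_first * bernstein_basis(len(w_first) - 1, x[0])
    for w, x_g in zip(w_rest, x[1:]):
        state = (state @ w) * bernstein_basis(w.shape[1] - 1, x_g)
    return float(state.sum())


def brute_force_mean(w_first, w_rest, x):
    """Explicit sum over every control point, for verification only."""
    sizes = [len(w_first)] + [w.shape[1] for w in w_rest]
    basis = [bernstein_basis(s - 1, x_g) for s, x_g in zip(sizes, x)]
    total = 0.0
    for idx in itertools.product(*(range(s) for s in sizes)):
        theta = w_first[idx[0]]
        for g, w in enumerate(w_rest):
            theta = theta * w[idx[g], idx[g + 1]]
        weight = np.prod([basis[g][i] for g, i in enumerate(idx)])
        total += weight * theta
    return total


rng = np.random.default_rng(1)
w_first = rng.normal(size=4)
w_rest = [rng.normal(size=(4, 3)), rng.normal(size=(3, 5))]  # orders 3, 2 and 4
x = np.array([0.2, 0.5, 0.9])
fast, slow = buttress_mean(w_first, w_rest, x), brute_force_mean(w_first, w_rest, x)
```

For the variance, the same chain would be run with squared Bernstein values and positive weights, for instance weights stored on a log scale and exponentiated, which is one possible choice rather than something the text prescribes.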

On this inspection, all terms needed to compute Eq. (2.9) are available. All terms for the KL divergences are algebraic manipulations of Eq. (2.12) – these are explicit in the supplementary material. The key takeaway for Bézier buttress is that we parametrise each random control point as a product of weights. This can be seen as an amortisation, and we do inference on these weights rather than the control points themselves. Hence, no matrix inversions are needed to backpropagate through our objective, and a forward pass is only a sequence of matrix products.

2.3. Marginalising matrix commutativity

Matrix multiplication is not commutative. This implies the ordering of the matrices in Eq. (2.12) matters, which in turn implies that how we order the input dimensions in the Bézier buttress is of importance. This is the price for the computational benefit this parametrisation gives. An ordering of the input dimensions is somewhat unnatural for spatial regression, so we present a way to overcome this in an approximate Bayesian manner. Define $f = f_{1} + f_{2} + \dots + f_{r}$ , where each individual $f_{k}$ , $k \in \{1, \dots, r\}$ , is a Bézier GP with a random permutation of the ordering in the associated Bézier buttress. In other words, we let $f$ be an ensemble of Bézier GPs to (approximately) marginalise over all possible orderings. The number of possible orderings quickly becomes too large to account for; in practice, we set $r$ to a feasible value, say $20$ , which gives satisfactory empirical performance.

Another lens on this is that each control point is a sum of $r$ control points – each control point’s standard deviation is scaled by $r^{- 1}$ , to obtain the same prior as so far discussed. Oddly, we have then circumvented the problem of too many control points by introducing even more control points. That is, each control point mean is parametrised as $ϑ_{i_{1}, i_{2}, \dots, i_{d}} = \sum_{k = 1}^{r} \prod_{γ = 1}^{d} w_{i_{t_{k} (γ) - 1}, i_{t_{k} (γ)}, t_{k} (γ)}$ , where $t_{k}$ denotes a random permutation of $(1, \dots, d)$ . We remind the reader that similar expressions exist for control point variances, restricted to positive weights.

As remarked, inference comes down to a forward pass in the Bézier buttress, a sequence of $d$ matrix multiplications. Assume all dimensions are of order $ν$ ; then the computational complexity of one forward pass is $O (d (ν + 1)^{2})$ . We need $r$ forward passes to marginalise the ordering, and $n$ forward passes, one for each observation, giving a final complexity of $O (n r d ν^{2})$ , which is linear in $n$ and $d$ .
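One way to read the ensemble is sketched below: each of the $r$ members owns its own buttress and chains the input dimensions in its own random order, and the member means are summed. The data layout (a list of `(w_first, w_rest)` pairs plus a list of index permutations) is an assumption made for this illustration, as is the choice of equal order in every dimension.

```python
import numpy as np
from math import comb


def bernstein_basis(nu, t):
    """Bernstein polynomials of order nu evaluated at a scalar t."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * t**i * (1.0 - t) ** (nu - i)


def buttress_mean(w_first, w_rest, x):
    """One forward pass: d steps, each costing O((nu + 1)^2)."""
    state = w_first * bernstein_basis(len(w_first) - 1, x[0])
    for w, x_g in zip(w_rest, x[1:]):
        state = (state @ w) * bernstein_basis(w.shape[1] - 1, x_g)
    return float(state.sum())


def ensemble_mean(buttresses, perms, x):
    """Sum of r forward passes, each visiting the dimensions in its own order."""
    x = np.asarray(x, dtype=float)
    total = 0.0
    for (w_first, w_rest), perm in zip(buttresses, perms):
        total += buttress_mean(w_first, w_rest, x[perm])
    return total


rng = np.random.default_rng(0)
d, nu, r = 4, 3, 5
perms = [rng.permutation(d) for _ in range(r)]
buttresses = [
    (rng.normal(size=nu + 1), [rng.normal(size=(nu + 1, nu + 1)) for _ in range(d - 1)])
    for _ in range(r)
]
x0 = rng.uniform(size=d)
value = ensemble_mean(buttresses, perms, x0)
```

Evaluating one input costs $r$ forward passes, so a batch of $n$ inputs costs on the order of $n r d (ν + 1)^{2}$ operations, in line with the complexity stated above.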

3. Related Work

Variational inference in GPs was initially considered by csato1999efficient and gibbs2000variational. In recent times, the focus has shifted to scalable solutions to accommodate the big data era. In this respect, pmlr-v5-titsias09a took a variational approach to the inducing points methods (quinonero2005unifying); later hensman2013 further enhanced scalability to allow for mini-batching and a wider class of likelihoods. Still, the need for more inducing points remains, especially as the number of input features grows.

The response to this has mostly revolved around exploiting some structure of the kernel. wilson2015kernel and wu2021hierarchical exploit specific structure that allows for fast linear algebra methods; similar to our method, inducing locations tend to lie on a grid. These grids expand fast as the input dimension increases, as also pointed out earlier in the article. kapoor2021skiing remedy this by considering the permutohedral lattice instead of a rectangular grid, such that each observation only embeds to $d + 1$ neighbours instead of $2^{d}$ , as in (wilson2015kernel).

Another approach to allowing for more inducing points is incorporating nearest neighbour search in the approximation (kim2005analyzing; nguyen2008local). tran2021sparse introduced sparse-within-sparse, where many inducing points are kept in memory, but for any observation only the $k$ -nearest ones are searched for and used. The remaining ones are discarded as they have little to no influence on the observation. wu2022variational made a variational counterpart to this method.

Lastly, when dealing with high-dimensional inputs it is worth mentioning feature learning, that is, learning lower-dimensional features on which GPs perform better. The success of deep models has been transported to GPs by damianou2013deep and later scaled to large datasets in (salimbeni2017doubly). Another approach is Deep Kernel Learning (wilson2016deep; bradshaw2017adversarial), where feature extraction happens inside the kernel function; lately ober2021promises has investigated the limitations and benefits of these models.

We have treated structured control points as our version of inducing points; and by parametrising them with a Bézier buttress, we limit the expansion of grids to linear growth in parameters. We are not the first to consider the Bernstein polynomials as a basis for learning functions. petrone1999random used them for kernel estimates of probability density functions, with several follow-up works (petrone1999bayesian; petrone2002consistency). hug2020introducing recently introduced Bézier GPs, but with a focus on time series (hug2022b). Our emphasis has been on spatial input; even the Bézier surface literature contains close to nothing on surfaces of more than $2$ dimensions.

4. Evaluation

We split our evaluation into four parts. First, we visually inspect the posterior on a one dimensional toy dataset to show how the control points behave, and indicate that there indeed is a stationary-like behaviour on the domain of the hypercube. Next, we test empirically on some standard UCI Benchmark datasets, to give insight into when Bézier GPs are applicable. After that, we switch to tall and wide data, large both in the input dimension and in the number of data points. These experiments give certainty that the method delivers on its key promise: scalability. Lastly, we turn our eyes to the method itself and investigate how performance is influenced by the ordering of dimensions.

Care is needed in optimising a Bézier GP – not all parameters are born equal. We split optimisation into two phases. First, we optimise all variational parameters, keeping the likelihood variance $σ^{2}$ fixed as $τ^{- 1}$ , with $τ$ being the number of control points. After this initial phase, we optimise $σ^{2}$ with all variational parameters fixed. We let both phases run for 10000 iterations with a mini-batch size of 500, for all datasets. Both phases use the Adam optimiser (kingma2015adam), the first phase with learning rate $0.001$ , and the second with learning rate $0.01$ . Without such a bespoke training scheme, we see a tendency for the posterior to revert to the prior, because the KL-term dominates too strongly initially. This training scheme is designed for the Gaussian likelihood, but we wish to emphasise that, in principle, the loss function is accurate for any choice of likelihood.

4.1. One dimensional visual inspection

We hypothesised the objective function, Eq. (2.7), would ensure, within the hypercube domain, that $f$ reverts to its prior in regions where data is scarce. To verify this we construct a small dataset to inspect. We generate one-dimensional inputs uniformly in the regions $[0, 0.33]$ and $[0.66, 1]$ ; we sample 20 observations in each region. The response variable is generated as $y (x) = 3 sin (16 x)$ . According to the hypothesis, $f$ should, in the region $[0.33, 0.66]$ , tend towards zero in mean and increase its variance. We use a BézierGP of order $20$ to model the observations, since they are highly non-linear. Figure 4 shows the posterior distribution of $f$ to the left; we observe $f$ tends towards the prior in the middle region. The middle plot illustrates the distribution, both prior and posterior, of the $21$ control points. There is a clear tendency for the central-most points to have aligned prior and posterior, enforcing this behaviour in $f$ . The non-equal priors are due to the inverse-squared Bernstein adjusted prior, which ensures a uniform variance in $f$ over the domain, see Figure 2. The plot to the right in Figure 4 shows the behaviour foundational to practitioners of Bayesian optimisation, active learning, etc.: the variance increases away from data regions.
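The toy dataset described above can be generated in a few lines. The seed and the function name `make_toy_data` are assumptions, and no observation noise is added because the text specifies none.

```python
import numpy as np


def make_toy_data(n_per_region=20, seed=0):
    """Inputs uniform on [0, 0.33] and [0.66, 1]; targets y = 3 sin(16 x)."""
    rng = np.random.default_rng(seed)  # the seed is an arbitrary choice
    left = rng.uniform(0.0, 0.33, size=n_per_region)
    right = rng.uniform(0.66, 1.0, size=n_per_region)
    x = np.concatenate([left, right])
    y = 3.0 * np.sin(16.0 * x)
    return x, y


x, y = make_toy_data()
gap = (x > 0.33) & (x < 0.66)  # no inputs fall in the middle third
```

The empty middle region is where the posterior is expected to fall back to the prior.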

Figure 4. Left: Posterior distribution of $f$ . Middle: Posterior and prior distribution of control points. Right: Posterior variance as function over the domain. Variance increases in scarce data regions.

Method	Setting	power	protein	energy	boston	bike	keggdirected	concrete	elevators
	$n$	$9568$	$45730$	$768$	$506$	$17379$	$48837$	$1030$	$16599$
	$d$	$4$	$9$	$8$	$13$	$17$	$20$	$8$	$18$
		Test log-likelihood
SGPR	$m = 100$	$- 2.7789 \pm 0.04$	$- 2.9307 \pm 0.01$	$- 1.6450 \pm 0.07$	$- 2.4993 \pm 0.27$	$- 0.6661 \pm 0.03$	$0.6498 \pm 0.03$	$- 3.1839 \pm 0.07$	$0.9167 \pm 0.02$
SGPR	$m = 500$	$- 2.7440 \pm 0.04$	$- 2.8479 \pm 0.04$	$- 0.7707 \pm 0.13$	$- 2.4592 \pm 0.33$	$- 0.4680 \pm 0.04$	$0.7133 \pm 0.03$	$- 3.0658 \pm 0.08$	$0.9373 \pm 0.02$
SimplexGP		$- 3.4416 \pm 0.06$	$- 3.2745 \pm 0.04$	NA	NA	$- 1.0932 \pm 0.19$	$- 2.3241 \pm 5.13$	$- 4.0338 \pm 0.02$	$- 0.2633 \pm 0.01$
BézierGP	$ν = 5$	$- 2.8015 \pm 0.04$	$- 2.9585 \pm 0.01$	$- 0.6197 \pm 0.25$	$- 45.479 \pm 18.6$	$0.4724 \pm 0.14$	$0.6589 \pm 0.06$	$- 3.4612 \pm 0.62$	$0.9534 \pm 0.02$
BézierGP	$ν = 10$	$- 2.7736 \pm 0.04$	$- 2.9257 \pm 0.01$	$- 0.7163 \pm 0.46$	$- 197.23 \pm 140.5$	$0.7020 \pm 0.18$	$0.6789 \pm 0.06$	$- 4.5613 \pm 1.20$	$0.9565 \pm 0.02$
BézierGP	$ν = 20$	$- 2.7391 \pm 0.05$	$- 2.8902 \pm 0.01$	$- 0.6504 \pm 0.62$	$- 139.42 \pm 52.7$	$0.7475 \pm 0.32$	$0.6939 \pm 0.06$	$- 7.7174 \pm 2.70$	$0.9058 \pm 0.03$
		Test RMSE
SGPR	$m = 100$	$3.8806 \pm 0.15$	$4.5272 \pm 0.04$	$1.1562 \pm 0.11$	$2.9372 \pm 0.65$	$0.4665 \pm 0.02$	$0.1279 \pm 0.00$	$5.8980 \pm 0.54$	$0.0968 \pm 0.00$
SGPR	$m = 500$	$3.7383 \pm 0.16$	$4.1755 \pm 0.16$	$0.5664 \pm 0.12$	$2.8233 \pm 0.65$	$0.3829 \pm 0.02$	$0.1212 \pm 0.01$	$5.3522 \pm 0.80$	$0.0948 \pm 0.00$
SimplexGP		$3.1147 \pm 0.26$	$4.1271 \pm 0.13$	NA	NA	$0.2876 \pm 0.07$	$2.6439 \pm 2.29$	$5.4457 \pm 0.75$	$0.1256 \pm 0.01$
BézierGP	$ν = 5$	$3.9750 \pm 0.15$	$4.6620 \pm 0.04$	$0.4348 \pm 0.08$	$4.7007 \pm 0.91$	$0.1474 \pm 0.02$	$0.1319 \pm 0.03$	$4.8127 \pm 0.86$	$0.2939 \pm 0.85$
BézierGP	$ν = 10$	$3.8675 \pm 0.15$	$4.5112 \pm 0.04$	$0.4157 \pm 0.11$	$5.6712 \pm 1.76$	$0.1100 \pm 0.01$	$0.1735 \pm 0.22$	$4.7917 \pm 0.94$	$1023.5 \pm 4 e 3$
BézierGP	$ν = 20$	$3.7427 \pm 0.18$	$4.3538 \pm 0.04$	$0.3573 \pm 0.11$	$4.9404 \pm 2.93$	$0.0821 \pm 0.01$	$80.05 \pm 348.4$	$4.9829 \pm 0.95$	$7 e 11 \pm 3 e 12$

Table 1. Results on eight standard UCI Benchmark datasets. We list test-set log-likelihood and RMSE; both are averages over $20$ train/test splits. On top are test log-likelihoods (higher is better), and at the bottom is test RMSE (lower is better). NA indicates Cholesky error.

4.2. UCI Benchmark

We evaluate on eight small to mid-size real world datasets commonly used to benchmark regression (pmlr-v37-hernandez-lobatoc15). We split each dataset into a train/test split with the ratio $90 / 10$ . We do this over $20$ random splits and report the average and standard deviation of test-set RMSE and log-likelihood over splits. As baselines we choose SGPR, following the method from pmlr-v5-titsias09a, with both $100$ and $500$ inducing variables. SimplexGP is another baseline; its authors suggest the approximation is beneficial for semi-high dimensional inputs (between $3$ and $20$ ) (kapoor2021skiing), which makes it an obvious baseline. SimplexGP usually uses a validation set to choose the final model. This is due to a highly noisy optimisation scheme using a high error-tolerance ( $1.0$ ) for conjugate gradients. We remedy this by setting the error-tolerance to $0.01$ , which harms scalability, but lets us omit the validation set for better comparability. wang2019exact recommend this error-tolerance, but remark it is more stable for RMSE than for log-likelihood. For BézierGP, we fix the number of permutations to $r = 20$ , and vary the order in $ν = 5, 10, 20$ . The order is identical over input dimensions. The inputs are pre-processed such that the training set is contained in $[0, 1]^{d}$ .
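The text does not say how the inputs are mapped into $[0, 1]^{d}$ ; per-feature min-max scaling fitted on the training split is one simple possibility, sketched here as an assumption. The helper names are hypothetical, and the sketch assumes that no training feature is constant.

```python
import numpy as np


def fit_box(x_train):
    """Per-feature bounds of the training inputs, shape (d,) each."""
    return x_train.min(axis=0), x_train.max(axis=0)


def to_unit_box(x, lo, hi):
    """Affine map sending the training bounding box onto [0, 1]^d."""
    return (x - lo) / (hi - lo)


def inside_unit_box(x_unit):
    """Boolean mask of rows that lie inside [0, 1]^d."""
    return np.all((x_unit >= 0.0) & (x_unit <= 1.0), axis=1)


rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
lo, hi = fit_box(x_train)
train_unit = to_unit_box(x_train, lo, hi)

# A test point beyond the training range in its first feature maps outside the box.
x_test = np.vstack([x_train[0], hi + np.array([1.0, 0.0, 0.0])])
mask = inside_unit_box(to_unit_box(x_test, lo, hi))
```

A mask of this kind identifies test points that fall outside the domain on a given split.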

Table 1 contains the results of this experiment. We make the following observations about our presented BézierGP. On keggdirected and elevators there are test points outside the defined domain on some splits, which causes the RMSE to be extreme, but the likelihood is more forgiving. This highlights the constraint of our model: it needs a box-bounded domain to be known a priori. We could not reproduce the results from kapoor2021skiing on keggdirected; the optimisation was too noisy without the use of a validation set. On concrete, boston and energy we see overfitting tendencies. Even though BézierGP is the optimal choice on energy, there is a mismatch between train and test error. On concrete this shows in a better test RMSE than the baselines, but the variance is overfitted, yielding a non-optimal likelihood. We conjecture this happens because the $n / d$ -ratio is low, which makes it more likely to overfit the control points, especially for higher orders $ν$ . Knowing these weaknesses of the model, we observe that BézierGP outperforms the baselines on multiple datasets, most notably the $17$ -dimensional bike dataset.

4.3. Large scale regression

Figure 5. Test *negative* log-likelihood (lower is better) and RMSE (lower is better) for large scale regression. Colours indicate the origin of numbers: red from salimbeni2017doubly, purple from wang2019exact, and cyan from kapoor2021skiing. Orange is ours. We observe our BézierGP is highly competitive on large datasets.

Figure 5 shows the results of regression tasks in high-dimensional regimes, and of one task with a high number of observations. Here, we follow exactly the experimental setup of either salimbeni2017doubly or wang2019exact. In the latter case, we use their validation split as training data. Our optimisation scheme for BézierGP is consistent with the above, except for slice, where the first training phase runs for $30000$ iterations. We discard test points that are not in the $[0, 1]^{d}$ domain; in no situation did this remove more than $0.001\%$ of the test set. The number after DGP denotes the number of hidden layers in a Deep GP (salimbeni2017doubly); after SGPR and SVGP it denotes the number of inducing points. SVGP refers to the method of hensman2013. After B, the number denotes the order used in BézierGP.
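The filtering step mentioned above, discarding test points outside $[0, 1]^{d}$ , can be sketched as below. The function name `inside_unit_box` and the toy test matrix are assumptions for illustration; the inputs are taken to be already scaled with the training-set statistics.

```python
import numpy as np

def inside_unit_box(X):
    """Boolean mask of rows whose coordinates all lie in [0, 1]."""
    return np.all((X >= 0.0) & (X <= 1.0), axis=1)

# Toy scaled test inputs (hypothetical values).
X_test = np.array([
    [0.20, 0.90],   # inside
    [1.05, 0.40],   # first coordinate above 1
    [0.00, 1.00],   # on the boundary, kept
    [0.50, -0.01],  # second coordinate below 0
])

keep = inside_unit_box(X_test)
X_kept = X_test[keep]
removed_fraction = 1.0 - keep.mean()  # share of the test set that is discarded
```

Reporting `removed_fraction` alongside the metrics makes it explicit how much of the test set the box constraint excludes.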

On year, we observe that our (non-deep) model is on par with $2$ -layered Deep GPs, and closer to $3$ -layered ones in RMSE. The highest-dimensional dataset, slice, puts us in the low $n / d$ -ratio regime again, and we are again faced with a too flexible model. This is why we report results for orders $3$ and $5$ rather than $20$ , where overfitting kicks in. Even for these small orders, the test log-likelihood has high variance and under-performs relative to the RMSE. With respect to RMSE, BézierGP is the top performer, again signalling that it overfits the variance. On the remaining two datasets BézierGP performs best among the compared methods.

4.4. Influence of number of permutations

All experiments so far used $r = 20$ ; that is, $20$ random permutations of the ordering of dimensions used in the Bézier buttress. For a problem with input dimension $d$ , there exist $d!$ possible permutations. Table 2 shows results with varying $r$ ; for each dataset, the results are over the same train/test split ( $0.9 / 0.1$ ). We fixed $ν = 20$ . Up to some noise in the optimisation phase, we see that for the two highest-dimensional datasets, protein and bike, performance improves with higher $r$ . Bike has over $50\%$ reduction in RMSE from $r = 1$ to $r = 50$ .

	power ( $24$ )		protein ( $362880$ )		bike ( $3 \times 10^{14}$ )
	RMSE	LL	RMSE	LL	RMSE	LL
$r = 1$	$4.1476$	$- 2.8418$	$4.8416$	$- 2.9962$	$0.1477$	$0.4813$
$r = 10$	$3.8028$	$- 2.7549$	$4.4306$	$- 2.9080$	$0.0926$	$0.7275$
$r = 20$	$4.2514$	$- 2.8826$	$4.3713$	$- 2.8943$	$0.0853$	$0.5454$
$r = 30$	$3.8257$	$- 2.7551$	$4.3377$	$- 2.8867$	$0.0775$	$0.8372$
$r = 40$	$3.6612$	$- 2.7190$	$4.3449$	$- 2.8886$	$0.0848$	$0.5024$
$r = 50$	$3.7841$	$- 2.7503$	$4.2614$	$- 2.8686$	$0.0718$	$0.9009$

Table 2. Performance in test RMSE and log-likelihood against the number of permutations, $r$ , on three datasets. The number of possible permutations of the input dimensions is shown in parentheses. Higher-dimensional datasets show improving performance with increasing $r$ .

Table 2 emphasises that the results we have presented are not optimised over the hyperparameters $r$ and $ν$ . It also illustrates an interesting direction for future research: optimising these hyperparameters. We chose permutations by random sampling, but choosing them in a principled, deliberate manner could yield good performance with a computationally manageable $r$ . This result indicates that at least protein and bike would see increased performance in Table 1 from better (or simply more) permutations.
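To make the size of the permutation space concrete, the sketch below computes $d!$ and draws $r$ random orderings. The dimensions $4$ and $9$ for power and protein are inferred from the counts $24$ and $362880$ in Table 2, not stated in the text; `sample_permutations` is a hypothetical helper, and sampling independently (so duplicates are possible) is an assumption, since the text only says the permutations were sampled at random.

```python
import math
import numpy as np

# Input dimensions: bike is stated as 17-dimensional; the other two are
# inferred from the permutation counts in Table 2 (assumption).
dims = {"power": 4, "protein": 9, "bike": 17}
counts = {name: math.factorial(d) for name, d in dims.items()}

def sample_permutations(d, r, seed=0):
    """Draw r random orderings of the d input dimensions (duplicates possible)."""
    rng = np.random.default_rng(seed)
    return np.stack([rng.permutation(d) for _ in range(r)])

perms = sample_permutations(dims["bike"], r=20)  # one ordering per buttress
```

For bike, $r = 50$ orderings cover a vanishing fraction of the $17!$ possibilities, which is why a principled selection could matter.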

5. Discussion

We introduced the Bézier Gaussian Process: a GP with a polynomial kernel in the Bernstein basis that scales to a large number of observations and remains space-filling for a high number of input features (limited to a box-bounded domain). We illustrated that, with slight adjustments, the prior and posterior behave similarly to ‘usual’ stationary kernels. We presented the Bézier buttress, a weighted graph to which we amortise the inference, rather than inferring the control points themselves. The Bézier buttress allows GP inference without any matrix inversion. Using the Bézier buttress, we inferred $6^{385}$ control points for the high-dimensional slice dataset.
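A rough sense of scale for the $6^{385}$ figure can be obtained with the arithmetic below. It assumes that $6^{385}$ corresponds to order $5$ (six control points per dimension) over $385$ input dimensions, and that one buttress is a layered graph with a single source node and fully connected consecutive layers; both are assumptions made for this sketch.

```python
order, d = 5, 385  # assumed order and input dimension for slice

# Number of control points represented implicitly by one buttress.
n_control_points = (order + 1) ** d
n_digits = len(str(n_control_points))

# Number of edge weights actually stored, assuming a single source node
# followed by d fully connected layers of (order + 1) nodes each.
layer_sizes = [1] + [order + 1] * d
n_edges = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
```

Under these assumptions, a number of control points with hundreds of digits is parametrised by a number of weights that grows linearly in $d$ .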

We highlighted weaknesses of the proposed model, most crucially the tendency to overfit when the $n / d$ -ratio is low. The results demonstrate scalability in both $n$ and $d$ , but do not solve the short-but-wide problem. The paper did not optimise over the hyperparameters of the proposed kernel, namely $ν$ and $r$ , but it showed briefly that doing so might enhance BézierGPs empirically; smart selection of the permutations in particular is an interesting direction for future research.

Acknowledgements

MJ is supported by the Carlsberg Foundation.

References

Appendix A Computations in the Bézier Buttress

This section explains how the KL-divergence is computed using the Bézier buttress, and describes the parametrisation of the architecture in more detail. For completeness, we first give the forward pass that computes $\mathrm{Var} (f (x))$ :

$\mathrm{Var} (f (x)) = 1_{ν_{1} + 1}^{⊤} w_{1} B_{x_{1}}^{2} \dots w_{d} B_{x_{d}}^{2} 1_{ν_{d} + 1},$	(A.1)

where we choose ${w_{γ}}_{i, j} := \exp (v_{i, j}) {ς_{γ}}_{i}$ . This ensures positive weights and hence a positive output for the variance of $f$ . The factor $ς_{γ}$ comes from the inverse squared Bernstein adjusted prior (see Section 2), and $v$ are free parameters to be inferred in the variational posterior. This parametrisation makes the KL terms easier to compute.
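The forward pass in (A.1) can be checked numerically on a small example. The sketch below is an illustration under assumptions, not the authors' implementation: each layer $γ$ is given a positive weight matrix of shape $(ν_{γ - 1} + 1) × (ν_{γ} + 1)$ with $ν_{0} := 0$ (so the leading all-ones vector has length one here), $B_{x_{γ}}^{2}$ is taken to be the diagonal matrix of squared Bernstein basis values, and random weights stand in for learned ones. The chain product is compared against an explicit sum over all control points.

```python
import itertools
from math import comb
import numpy as np

def bernstein(nu, x):
    """Bernstein basis polynomials of order nu at scalar x in [0, 1]."""
    i = np.arange(nu + 1)
    coeff = np.array([comb(nu, k) for k in i], dtype=float)
    return coeff * x**i * (1.0 - x) ** (nu - i)

def variance_forward(weights, x):
    """Chain product of (A.1): cost linear in the number of layers."""
    state = np.ones(1)  # single source node, since nu_0 := 0 (assumption)
    for w, x_g in zip(weights, x):
        nu = w.shape[1] - 1
        # Multiply by w_gamma, then by the diagonal of squared basis values.
        state = (state @ w) * bernstein(nu, x_g) ** 2
    return float(state.sum())

def variance_bruteforce(weights, x):
    """Explicit sum over every path (control point); exponential in d."""
    orders = [w.shape[1] - 1 for w in weights]
    total = 0.0
    for path in itertools.product(*(range(nu + 1) for nu in orders)):
        prev, term = 0, 1.0
        for g, i in enumerate(path):
            term *= weights[g][prev, i] * bernstein(orders[g], x[g])[i] ** 2
            prev = i
        total += term
    return total

rng = np.random.default_rng(0)
shapes = [(1, 3), (3, 4), (4, 3)]  # orders (2, 3, 2), hypothetical
weights = [np.exp(rng.normal(size=s)) for s in shapes]  # positive weights
x = np.array([0.2, 0.7, 0.5])

var_fast = variance_forward(weights, x)
var_slow = variance_bruteforce(weights, x)
```

The two values agree up to floating-point error, while only the brute-force version enumerates the control points.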

We remark that all the following calculations are for a single Bézier buttress. If there are multiple Bézier buttresses, with different orderings of the layers, the computations are equivalent for each of them.

To compute the KL, we first recall from the main paper that

	$\mathrm{KL} (q (P) ∥ p (P))$	$= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \mathrm{KL} (q (P_{i_{1}, \dots, i_{d}}) ∥ p (P_{i_{1}, \dots, i_{d}}))$		(A.2)
		$= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \left\{ \frac{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}} - 1 + \frac{\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}^{2}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}} + \log \frac{Σ_{i_{1}, i_{2}, \dots, i_{d}}}{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}} \right\} .$		(A.3)

For easier reference we declare

$S_{1}$	$:= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \frac{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}},$	(A.4)
$S_{2}$	$:= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \frac{\hat{ϑ}_{i_{1}, i_{2}, \dots, i_{d}}^{2}}{Σ_{i_{1}, i_{2}, \dots, i_{d}}},$	(A.5)
$S_{3}$	$:= \sum_{i_{1} = 0}^{ν_{1}} \sum_{i_{2} = 0}^{ν_{2}} \dots \sum_{i_{d} = 0}^{ν_{d}} \log \frac{Σ_{i_{1}, i_{2}, \dots, i_{d}}}{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}} .$	(A.6)

We recall that “hat” notation refers to parameters of the variational posterior $q$ , and that the prior variance is given by $Σ_{i_{1}, \dots, i_{d}} = \prod_{γ = 1}^{d} ς_{γ} (i_{γ})$ . Because we have included $ς$ in the parametrisation of the posterior variance $\hat{Σ}_{i_{1}, \dots, i_{d}}$ , the prior factors cancel in the expressions for $S_{1}$ and $S_{3}$ . We get

$S_{1} = 1_{ν_{1} + 1}^{⊤} \exp v_{1} \dots \exp v_{d} 1_{ν_{d} + 1},$	(A.7)

where $\exp$ is applied element-wise to the matrices.

For $S_{3}$ we make the observation, based again on $ς$ cancelling out in the fraction, that

$\log \frac{Σ_{i_{1}, i_{2}, \dots, i_{d}}}{\hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}} = - \log \prod_{γ = 1}^{d} \exp v_{i_{γ} - 1, i_{γ}, γ} = - \sum_{γ = 1}^{d} v_{i_{γ} - 1, i_{γ}, γ} .$	(A.8)

That is, summing over $\log \hat{Σ}_{i_{1}, i_{2}, \dots, i_{d}}$ amounts to counting how many paths (i.e. control points) use $v_{i_{γ} - 1, i_{γ}, γ}$ . This count is $ψ_{γ} = \frac{τ}{(ν_{γ - 1} + 1) (ν_{γ} + 1)}$ , where $ν_{0} := 0$ . Hence,

$- S_{3} = \sum_{γ = 1}^{d} ψ_{γ} ⨁ v_{γ},$	(A.9)

where $⨁$ denotes summing the elements in the matrix.

Notice how the variational parametrisation of ${w_{γ}}_{i, j} := exp (v_{i, j}) {ς_{γ}}_{i}$ was carefully chosen for easily computing $S_{1}$ and $S_{3}$ .

$S_{2}$ is closer to what is described in the main paper. We simply need to square all the weights and correct with the prior variance, that is, with $1 / ς_{γ}$ . Hence,

$S_{2} = 1_{ν_{1} + 1}^{⊤} w_{1}^{2} ς_{1}^{- 1} \dots w_{d}^{2} ς_{d}^{- 1} 1_{ν_{d} + 1},$	(A.10)

where $ς_{γ}$ here denotes the diagonal matrix with $ς_{γ_{i}}$ along its diagonal, for $i = 1, \dots, ν_{γ}$ . Note further that here $w$ denotes the weights of the mean Bézier buttress.

Now

$\mathrm{KL} (q (P) ∥ p (P)) = S_{1} - τ + S_{2} + S_{3},$	(A.11)

all of which are computed in a single forward pass in the Bézier buttress. $τ$ is the number of all control points (in one buttress).
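The decomposition in (A.11) can be verified on a toy buttress by comparing the chain products with an explicit sum of the per-control-point terms in (A.3), taken exactly as printed there. The sketch below rests on assumptions made for illustration: layer $γ$ has matrices of shape $(ν_{γ - 1} + 1) × (ν_{γ} + 1)$ with $ν_{0} := 0$ , the prior factor $ς_{γ}$ is indexed by the node in layer $γ$ , `v` holds the free variance parameters, `m` holds the mean-buttress weights, and all values are random placeholders.

```python
import itertools
import numpy as np

def kl_buttress(v, m, prior):
    """KL via S1 - tau + S2 + S3, using chain products only."""
    tau = int(np.prod([len(s) for s in prior]))  # number of control points
    s1_state, s2_state, s3 = np.ones(1), np.ones(1), 0.0
    for v_g, m_g, s_g in zip(v, m, prior):
        s1_state = s1_state @ np.exp(v_g)         # (A.7)
        s2_state = (s2_state @ m_g**2) / s_g      # (A.10), columns scaled by 1/prior
        psi = tau / v_g.size                      # paths through each edge
        s3 -= psi * v_g.sum()                     # (A.9)
    return float(s1_state.sum() - tau + s2_state.sum() + s3)

def kl_bruteforce(v, m, prior):
    """Sum the summand of (A.3) over every control point."""
    total = 0.0
    for path in itertools.product(*(range(len(s)) for s in prior)):
        prev, var_q, var_p, mean_q = 0, 1.0, 1.0, 1.0
        for g, i in enumerate(path):
            var_q *= np.exp(v[g][prev, i]) * prior[g][i]
            var_p *= prior[g][i]
            mean_q *= m[g][prev, i]
            prev = i
        total += var_q / var_p - 1.0 + mean_q**2 / var_p + np.log(var_p / var_q)
    return float(total)

rng = np.random.default_rng(1)
shapes = [(1, 3), (3, 4), (4, 2)]  # hypothetical orders (2, 3, 1)
v = [0.3 * rng.normal(size=s) for s in shapes]
m = [rng.normal(size=s) for s in shapes]
prior = [rng.uniform(0.5, 2.0, size=s[1]) for s in shapes]

kl_fast = kl_buttress(v, m, prior)
kl_slow = kl_bruteforce(v, m, prior)
```

The agreement of the two values reflects the cancellation of the prior factors in $S_{1}$ and $S_{3}$ ; the brute-force loop visits all $τ$ control points, whereas the chain products touch each edge weight once.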

Appendix B Numerical results

For reproducibility, Table 3 gives the values used to generate Figure 4.

year	buzz	houseelectric	slice
Test log-likelihood
B20: $- 3.6209 \pm 0.00$	B20: $- 0.0832 \pm 0.01$	B20: $1.5987 \pm 0.00$	B3: $- 0.5321 \pm 1.36$ B5: $- 2.7831 \pm 1.49$
Test RMSE
B20: $9.0461 \pm 0.01$	B20: $0.2629 \pm 0.00$	B20: $0.0489 \pm 0.00$	B3: $0.0761 \pm 0.01$ B5: $0.0880 \pm 0.02$

Table 3. Numerical values used to create Figure 4. Listed are the average and standard deviation over 3 splits. On year, the test set was not standardised, to allow comparison with the baselines there.
