A Survey of Open Source Automation Tools for Data Science Predictions

Nicholas Hoell

Abstract

We present an expository overview of technical and cultural challenges to the development and adoption of automation at various stages in the data science prediction lifecycle, restricting focus to supervised learning with structured datasets. In addition, we review popular open source Python tools implementing common solution patterns for the automation challenges and highlight gaps where we feel progress still demands to be made.

\affiliation

Research Technology
WorldQuant Predictive \emailAddnicholas.hoell@wqpredictive.com

List of Figures

1 Data science automation components

List of Tables

1 Automated feature engineering components in various packages.
2 Automated model engineering package comparison.

1 Overview

As the demand for robust predictive models has increased over the past decades, so too has the demand for tools data scientists can use to automate their workflow. This can be advantageous for a number of reasons, including the following:

Time savings: Automated systems offer the potential for tremendous speed-up in a number of areas that data scientists spend their time. Some common patterns that often have reusable components are data cleansing/standardizing, feature generation, and model tuning.
Benchmarking: The fast production of independent benchmarks to outperform and test against improves the quality of competitor models.
Expanded userbase: The more automation is built into data science solutions, the less specialized skills are required of users. At the extreme, this offers the possibility of business users specifying a small number of high-level decisions and letting automation drive the subsequent pipelining (and, in some cases, even the reporting, e.g., [89]).

While there are other important areas where automation has impacted the data science lifecycle, the focus of this survey is specifically on open source Python tools used for various steps in the automated machine learning pipeline for supervised problems pertaining to structured data. A recent study [39] that compared the adoption of automated tools for machine learning by organizational type found a plurality of respondents using automated tools only partially with significant variance by sector.

This paper will present challenges faced in each of the three main stages of automation in a data science pipeline, together with an overview of some of the common concepts, terminology, algorithms, and open source implementations for addressing them. This is by no means meant to be considered an exhaustive study of all techniques (for this, we highly recommend the recent [165] for AutoML and [86] for feature engineering topics) or solutions but rather, its hopeful uniqueness lies in the following:

The utility for the uninitiated reader to quickly gain an understanding of many of the algorithms being used in this active area
An overview of common open source Python tools available for practitioners and their potential trade-offs
A deeper understanding of common relevant design patterns, especially those concerning hyperparameter optimization
At-a-glance feature comparisons for a collection of AutoML tools
Extensive references for the reader looking to delve deeper

Our intended audience for this survey is data scientists with solid knowledge of major concepts in data science (clustering, linear models, etc) including some familiarity with deep learning, as well as a basic foundation in statistics. We believe this survey can help to fill the current knowledge gap between data scientists primarily working in a business context and potentially lacking experience in AutoML concepts and packages and the daunting list of rapidly expanding open source options available. As we pay particular attention to the utility of the packages and algorithms for the practicing data scientist, we also include discussion relevant to time series modelling tools.

We will begin, in Section 1, by introducing the three main components in a fully automated data science workflow, as first (we believe) described in [140]. Then, in each of Sections 1.2, 2, and 3, we will go in depth through each of these components, highlighting both popular algorithms as well as popular open source Python packages implementing them. As the space of both algorithms and packages is quite large, not all will be covered and we refer the interested reader to [14] for a well-maintained (as of this writing) compendium of model engineering packages.

From a packages perspective, our survey will cover sklearn, Auto-WEKA, auto-sklearn and its variants, MLBox, hyperopt-sklearn, EvalML, featuretools, autofeat, AutoKeras, AlphaD3M, AutoGluon-Tabular, Auto-PyTorch, TPOT, tsfresh, AutoTS, FLAML, Compose, Prophet, PyCaret, OptimalFlow, OBOE, GluonTS, and AdaNet.

We present some open challenges remaining in Section 4. For completeness, we include a self-contained primer on Bayesian optimization, including Gaussian Processes, in a closing Appendix as this plays a large role in guiding the hyperparameter optimization and model selection process in several of the AutoML packages we discuss, and may not be something practicing data scientists have experience with.

1.1 The main components

There are several options for engineering automation in a data science pipeline, consisting of highly customizable levers. We recommend [140] as a good overview of the general building blocks in such a pipeline, from which we shall borrow naming conventions in the list below.

Given a dataset, $D$ , consisting, potentially, of multiple tables, the building blocks underlying automated supervised learning are as follows¹¹1The terminology is taken from [140]:

Automated prediction engineering (autoPE): This is the automated generation of training examples from $D$ producing tuples of inputs and their associated target labels. This may additionally include the most recent relevant time at which the observation occurred as well as other potentially important inputs²²2This is slightly more general than the definition given in [140]..
Automated feature engineering (autoFE): This is the creation of new input variables associated to each training example.
Automated model engineering (autoME): This is the creation of a model function, $m$ , mapping feature vectors to predicted outcomes. This stage is commonly thought of as AutoML, though in many systems AutoML includes both autoME and autoFE, as feature engineering can involve learning algorithms.

Figure 1: Overview of the modular parts of the data science pipeline as first described in [140], showing some of the internal levers/options for each component discussed in this article.

Each of these building blocks has its own substructure with its own unique challenges and common solutions. We should note that the above presumes a specification of a prediction problem in the first place, which is generally proposed prior to modelling (see Section 4 for more on this issue).

After the autoPE stage, the dataset can be assumed to consist of labelled pairs $D = {(x^{(i)}, y^{(i)})}_{i = 1}^{M}$ , with $x^{(i)}$ the base feature vectors (assumed to be drawn from a data-generating distribution) and $y^{(i)}$ the associated target values or labels. The general data science prediction problem, generically, is to find a model $m$ for which

m (x^{(i)}) \sim y^{(i)}, i = 1, . . ., M

approximately holds³³3The sense in which the relationship is approximate is determined by an appropriate loss function, measuring, essentially, goodness of fit on the entire dataset or batch thereof.. In other words, the model should approximately capture the relationship between the distribution of target values and the data-generating distribution.

One important stage in the overall data science pipeline is general data preprocessing which may consist of imputation of missing values, assessment and/or removal of outliers, normalization, handling duplicate observations, or others. Strictly speaking, these tasks can be considered as data preparation preceding even the prediction engineering stage and for this reason, we omit a fuller discussion of these techniques other than to mention that many AutoML packages discussed below (for instance, EvalML, MLBox, OptimalFlow, and auto-sklearn to name a few) perform some of these preprocessing steps as part of the data preparation pipelining stages. To the author’s knowledge no currently available open source Python AutoML tooling include advanced imputation strategies such as multivariate imputation by chained equations (MICE, see e.g. [16]), nearest-neighbour based imputation (see, e.g. [112]) or autoencoder based techniques (see, e.g. [80, 97]), although some, for instance PyCaret, employ iterative imputation strategies⁴⁴4PyCaret uses Light Gradient Boosting Machine (LightGBM) algorithm.. As these preprocessing steps can consume a significant amount of practicing data scientists’ time, this may be an area for impactful future developments.

As the autoPE process involves the least amount of technical complications and has the least open source libraries devoted to it, we briefly present an overview of it in the following before delving into the more involved autoFE and autoME in sections and 3, respectively.

1.2 Automated prediction engineering

In the article [153], the term prediction engineering was first introduced, formally, as

“ the process that transforms time-driven relational data into feature vectors and labels that can be used for machine learning”

The authors then describe a general label-generating procedure based around two principles: a labeling function and a data traversal algorithm. The labeling function’s responsibility is to create labels from time slices of the dataset while the traversal algorithm orchestrates the aggregation, time-partitioning and application of the labelling function on appropriate subsets of the data. These ideas were implemented in the open source Compose package (see [32, 64]) by Innovation Labs at Alteryx.

Although the labelling process is automated by Compose, the implementation of it does require a bit more technical than some “one-button” AutoML approaches as this step is responsible for a formal specification of the target instances. To the authors’ knowledge, Compose is the only open source Python tool currently dedicated to prediction engineering automation. Additionally, currently available automated open source tooling does not seem to deal with automated stratification of samples or, more generally, processing that aims to find “representative” examples from the data. We feel this is an area for potential advances in the automated pipelining tools.

2 Automated feature engineering

The space of available features for a given prediction task, we denote by $L$ , the so-called feature landscape or feature space for the given problem. It is easy to see that even in the simplest case of a single target and a single base feature this space contains infinitely many elements. The exploration of $L$ consists of successive application of feature transformations which result in a finite number of derived features, $F \subset L$ , from which a final subset $~ F \subseteq F$ must be chosen.

The feature engineering process thus consists of the following two stages:

Feature generation: This stage consists of exploration of the feature landscape by construction of derived features, usually in an iterative manner. The laws of combinatorics generally conspire to offer an explosion of derived features even for relatively few iterations. This stage is often called feature extraction⁵⁵5This holds as well for techniques that combine generation and selection, such as PCA, as mentioned in the subsequent commentary..
Feature selection: This stage uses a fitness criteria to prune the newly-generated feature set $D$ to a manageable size⁶⁶6We depart slightly from some traditional definitions, see e.g. [107], in order to focus attention on different conceptual stages where possible.. This general problem, depending on exact specifications, is known to be NP-hard (see, e.g., [28, 56]) and can be seen as a special case of approximating an optimal subset selection from $F$ ’s $2^{# F}$ subsets. Assuming the hyperparameter $# ~ F$ to be fixed still yields a search problem among $(\frac{# F}{# ~ F})$ allowable feature subsets.

The boundary between these two steps can be blurred in practice. For example, principal component analysis (PCA) is often used as a dimension reduction technique, constructing derived features that combine the input derived features and selecting those corresponding to the largest singular values of the empirical covariance matrix, and thus maximizing the variance of the original features in the lower dimensional space. In other words, PCA often refers both to the generation of new features and to the selection of the new derived features. While this line may be blurry in practice, the above division between generation and selection is a good, if not hard, cleaving of responsibilities in the feature engineering process.

During an iterative feature generation cycle, the pruning process can take place ‘‘mid-stream”, i.e. concurrent with the feature generation, or ‘‘downstream”, i.e. after the full cycle of feature generation iterations has concluded⁷⁷7In practice there may be multiple iterations of the feature generation/selection cycle.. The advantage to placing the selection process downstream of the generation is the generation algorithm has the opportunity to explore more of the feature landscape that may not have been reached had feature pruning eliminated paths to the most informative/powerful features. The trade-off is the enormous cost of storing the intermediate features required to generate the ultimately powerful end features.

The advantage of using mid-stream selection algorithms is the potential speed-up in exploration and the avoidance of “blind alleys” in the feature landscape. This approach may not offer actual speed-up, depending on details of how the selection is implemented. As well, these techniques suffer form the usual performance risks of greedy algorithms: by eliminating likely “blind alleys” there is a risk of also eliminating paths through feature space that result in highly powerful features in the long run.

In the following we outline common techniques in feature engineering and its automation in various Python packages. Our coverage is far from complete but we hope the reader gets a sense of the main concepts and tricks of trade. We also present, in Table 1, the methods used by packages that perform automated feature engineering discussed in this survey.

2.1 Generation techniques

In the following we cover some of the common methodologies and open source implementations used in the automated generation of derived features for data science problems

2.1.1 Elementary techniques

We begin by very briefly covering a few of the more basic techniques for generating derived features, where possible highlighting open source Python implementations.

Interactions.

Feature interactions are obtained from simple products of existing features. Here, we use the term more broadly and define basic interactions between features to be sums, differences, products, and ratios between numerical features. The PyCaret library (see next section) offers automated generation of products and ratios of existing features as well as of multivariate polynomials generated by base features. The autofeat and featuretools packages also offer a simple interface for quickly generating basic interaction features.

Encoding.

One of the most common feature generation techniques is one-hot encoding. This refers to transforming a categorical feature vector into a collection of bit vectors, each row of which containing a $1$ if and only if the given observation falls into the associated category. One-hot encoding is part of the automated feature engineering stages of several packages including EvalML, PyCaret, and auto-sklearn.

Binning.

The PyCaret package automates the binning of continuous variables, converting them into categories on the basis of applying one dimensional $K$ -means clustering, with number of clusters determined using Sturges’ rule for optimal choice of bins. The package also allows for automatically combining infrequent categories in features with large numbers of categorical values, as reducing the number of categories reduces the resultant number of features obtained when applying one-hot encoding.

Clustering.

Clustering refers to unsupervised partitioning of the dataset examples into a fixed number of clusters. The derived feature from the clustering can be the categorical cluster label, the distance to the cluster representative (cluster centroid, say), or another derived quantity (see [121] for a thorough overview). A host of clustering techniques exist, with $k$ -means (having an scikit-learn implementation) and the density-based HDBSCAN ([63]) algorithms being two popular choices for data scientists.

2.1.2 Some more sophisticated techniques

Here we present a handful of techniques that go beyond some of the more basic techniques for feature generation discussed above and include some ideas or packages that may be novel to practicing data scientists.

Genetic programming.

Genetic programming is a common type of genetic algorithm (see, e.g. [99] for a fuller description), a class of algorithm inspired by principles from evolution in the natural world which incorporates fitness selection, genetic recombination (sexual reproduction, although asexual techniques exist as well), and random mutation on a population of solutions to a given problem. These solution populations undergo iterated evolution guiding them towards a superior solution.

Genetic programming is a genetic algorithm whose populations are symbolic expression trees, with intermediate degree $k$ nodes given by primitive $k$ -ary operators and with leaves given by symbols representing constants. In the context of feature engineering, each expression tree represents a formula that can be applied to existing features to produce new derived features. The recombination and mutation act on the feature space to produce a robust set of expressions with a fitness function selecting for an optimal property along the way (for instance, small linear correlations and non-redundant expressions,). A popular open source implementation of genetic algorithms, including genetic programming is the DEAP package [37]., though we remark that DEAP may not be straightforward to implement into an automated pipeline as it is not particularly low-code.

Deep feature synthesis.

An algorithm called deep feature synthesis was presented in [75] which formed the basis for the featuretools Python library. Deep feature synthesis was designed to work with data held in relational databases though reduced forms of it can work on single-table data. The operators used to construct derived features are aggregations and transformations. The former produces the deep part of deep feature synthesis by synthesizing data held in different tables via prescribed relationships among their unique keys and recursively traversing a graph made up of available primitive operations on existing features. The latter consist of more tabular operations within a single table.

The featuretools package implementing this algorithm offers very good graphical and even natural language representations of how the derived features are obtained, making it a particularly excellent choice for data scientists needing to speed up reporting. The PyCaret package offers a somewhat related, though stripped down, version of synthesis by performing descriptive statistics on passed in groups of features as part of its feature engineering pipeline stage.

Synthetic features from random forests.

In [70] the authors use random forests (see [24]) to create synthetic features to feed into a downstream forest. The authors rely on interpreting the terminal node size hyperparameter of a random forest as a form of smoothness control and generate synthetic features by using the predicted values from a collection of random forests obtained with different node sizes. For classification problems, these features are $# c l a s s e s$ -dimensional vectors of estimated probability for each class, from each random forest. The synthetic features are combined with the original base features to feed as input to a downstream random forest. This general paradigm of creating synthetic features, extended to include other classifier models, is used by the AutoML system TPOT and has a straightforward implementation using scikit-learn’s ensemble module as well.

Time series.

There are often cases where datasets in question have a natural ordering to them, whether time-based or positional. In the case where there in a natural one-dimensional ordering to the observational data, a host of common features present themselves as reasonable and informative. We consider the case where each training example corresponds to various non-sequential features in addition to one or more time series. An input example, the $i$ ’th say, may therefore take the form

x^{(i)} = (f_{1}^{(i)}, . . ., f_{n}^{(i)}, {X_{1}^{(i)} (t)}_{t \in [t_{0} (X_{1}^{(i)}), t_{f i n a l} (X_{1}^{(i)})]}, . . ., {X_{m}^{(i)} (t)}_{t \in [t_{0} (X_{m}^{(i)}), t_{f i n a l} (X_{m}^{(i)})]})

where in the above, the $f_{j}^{(i)} \in R, j = 1, . . ., n$ and ${X_{k}^{(i)} (t)}_{t \in [t_{0} (X_{k}^{(i)}), t_{f i n a l} (X_{k}^{(i)})]}$ for $k = 1, . . ., m$ each represent time series, at possibly different cadence, length, and even type (categorical or numeric) associated to the observation $x^{(i)}$ . In this case, the feature generation stage consists of “flattening” the dataset by reducing each time series associated to an observation to a collection of single value, unordered, derived features.

The tsfresh (Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests, see [94]) Python package is a common library for generating a host of traditional features for ordered datasets. It computes 794 different features from 63 characterization methods. While deep feature synthesis can handle time series aggregations, the specificity of the methods used in tsfresh, including spectral features, make it extremely well suited for extracting informative derived features quickly.

We close this subsection by referring the interested reader to the article [95] which covers some of the additional automated feature generation tools available as part of automated forecasting libraries. It presents a discussion of some of the general classes of features these tools can obtain.

2.2 Filters, Wrappers, and All That

As mentioned earlier, the process of selecting a good subset $~ F$ of derived features from the $2^{# F}$ possible subsets of derived features can be NP-hard depending on the optimality criteria. Although there are numerous possibilities for this search problem, two methods tend to occur in practice more often than others (see, e.g. [109]): filters and wrappers. Filter methods are those that use only properties intrinsic to the data, i.e. they are model-independent. These are often considered a form of preprocessing as they do not involve the predictive model at all. Wrapper methods rely on using a machine learning model to evaluate the appropriateness of a proposed derived feature subset (see [76]). In addition, there are hybrid methods that combine filter and wrapper (see, e.g. [59]). A third class is embedded techniques in which the selection process is incorporated directly into the modelling, for instance $l_{1}$ -regularized regression models. In this article we only focus on modular components and therefore restrict our attention to the filter and wrapper techniques below.

The articles [48] and [120] give good, short surveys of some of the main concepts and techniques in the field. Our presentation in the below is far from exhaustive; we aim to present a few popular options for the feature selection stage with open source Python implementations. Interested readers may also find [69] particularly valuable.

2.2.1 Filters

In the following, we cover some of the main ideas behind common approaches to filter feature selection algorithms as well as discuss some of the open source packages that offer user-friendly implementations of them. Some well-known filtering techniques we omit for brevity are FOCUS, Relief, and the method of Cardie (see [26]), each of which is discussed in [56]. Our presentation divides the techniques into three non-exclusive categories: those based on classical dimension reduction, those based heavily on information or correlation among features, and those based on statistical methods. While these categories are somewhat artificial, we hope it clarifies some of the differences in rationale behind the approaches.

Dimension reduction.

While traditional dimension reduction techniques were often not aimed at increasing the discriminatory power of downstream models, they have become an increasingly important part of the feature engineering stages of data science pipelines. These methods combine both feature generation and selection; the input feature vectors are not simply reduced in quantity, they are transformed as well. The transformation and selection are fundamentally entangled and not independently modularizable in a straightforward way.

The mutual information, $I (X; Y)$ , between two random variables $X$ and $Y$ , with distribution functions $p_{X}$ , $p_{Y}$ , respectively, and joint distribution $p_{X Y}$ , is given by

E log (\frac{p_{X Y}}{p_{X} p_{Y}})

which measures the dependency of variables $X$ , $Y$ . In the above, the expectation is taken with respect to the joint distribution of $(X, Y)$ . Among other properties, it is straightforward to show that $I (X; Y)$ is zero if and only if the variables $X$ and $Y$ are statistically independent. Information-based filtering are techniques that, at their core, aim to reduce features based on higher mutual information. Mutual information can be estimated using the scikit-learn package which is based on algorithms discussed in [131, 78] on entropy estimations. Mutual information has a well-known bias in favour of larger numbers of values, see, e.g. [126, 61].

Principal feature analysis [90] is a technique to select a subset of the derived features with low mutual information. The method is based on the observation that the absolute value vectors produced by taking the rows of the (truncated) eigenvector matrix obtained from the eigendecomposition of the covariance matrix are in close Euclidean proximity for features with high mutual information. Based on this observation, $k$ -means clustering in the lower-dimensional space produced from the row span of these absolute value vectors will find clusters with similar informational content, and the original features corresponding to the best representatives from each cluster will be the ultimately selected features. While the author could not find an open source Python package implementing principal feature analysis, this algorithm is a straightforward chaining of PCA and clustering found within the scikit-learn package. Of course, a clear disadvantage of a straightforward implementation is the dependence on the nondeterministic clustering.

While mutual information can directly measure statistical independence, correlations can, at times, be a reasonable proxy (as zero correlation is a necessary condition of independence and a sufficient condition for linear independence) and thus falls into this class of filtering. This is highly intuitive: the smallest set of feature vectors spanning a $d$ -dimensional subspace of the feature space will necessarily be linearly independent and, therefore, are uncorrelated. PCA, as discussed in the previous section, falls into this category of selection as the principal components are orthonormal eigenvectors of the empirical covariance matrix and so are uncorrelated. PCA is automated in the PyCaret package (see next section) as part of its feature engineering stage.

In [136] a generalization of PCA, kernel PCA, is introduced in which base features are embedded into a (higher-dimensional) new feature space in which eigenproblems for the empirical covariance matrix in the embedded space can be solved to get projections onto dominant eigenvectors, as in usual PCA (see also [130]). The high dimensional embedding allows for the process to efficiently handle potential nonlinearities in the dataset as explicit embeddings are not actually required to compute the projections, only inner products among embedded vectors are required for the procedure. As with linear PCA, projections can be truncated to projections on the most dominant (in terms of sizes of eigenvalues) eigenvectors. Both PCA and kernel PCA rely on covariance matrices and all techniques reliant on covariance estimation can be sensitive to outliers (see [103] and references therein). Kernel PCA is automated as part of preprocessing in some AutoML packages, such as Auto-PyTorch (see Section 3).

PCA and kernel PCA are often viewed not just as feature engineering steps but also as dimension reduction techniques and derive their power essentially from scale imbalances in the distribution of points along suitable coordinates. These differences in scale imply a reasonable approximation of the data should exist in a lower-dimensional space. A different approach, based on the idea that the true dimension of the dataset is lower than the number of features, is given by locally linear embedding (LLE) [132, 135]. In LLE, data from an $n$ -dimensional space is assumed to be drawn from a distribution of points on a $d < n$ dimensional manifold. Because smooth manifolds are locally flat (see [142]), the low-dimensional manifold hypothesis of LLE implies that a suitable linear representation exists of data points in terms of their near neighbours. Once this linear map is obtained, projections to lower dimensions follow by minimizing an objective that penalizes variation from the linear relationship in the high-dimensional space holding in the lower-dimensional space, enforcing as much of the local geometry to hold in the lower dimensional space as possible (see also [27] for another take on using LLE in feature selection). Both kernel PCA and LLE have straightforward implementation in sklearn.

Finally, Independent Component Analysis (ICA) is a technique for uncovering latent, non-Gaussian, independent, features from a dataset. Namely, observations are assumed to be generated by independent, non-Gaussian components, viz. $x^{(i)} = A s^{(i)}$ where $A$ is a mixing matrix and the entries of $s^{(i)}$ represent statistically independent latent components. The independent components can be approximately determined by a projection pursuit algorithm, FactICA (introduced in [1]), that finds directions to project $x^{(i)}$ onto that result in maximizing the kurtosis of the result⁸⁸8Recall that for a random variable $X$ , the kurtosis can be defined by $E X^{4} - 3 (E X^{2})^{2}$ and is zero for normal distributions (see [1]).. In [139], the authors rank-order the induced latent features obtained from ICA by their kurtosis to obtain a feature subset. As with other approaches in this section, this approach combines both generation and selection. ICA is built into the automated preprocessing portion of the auto-sklearn package.

Information and correlation.

In the below, we outline a few methods for filtering that use information explicitly, or use correlation as a proxy.

A somewhat related method, not reliant upon statistical testing, is to rank features by their correlation (or covariance) with the target variable. In the case of a univariate linear least squares regression model this gives the square root of the coefficient of determination, namely the fraction of the variance in the response explained by the variance in the given feature. Extensions of this approach, and its relation to traditional statistical tests, is further discussed in [69].

Another simple example of this kind of filtering is the removal of all features whose correlation with other features exceeds a threshold. Linear correlations can easily be estimated from the values of the features and do not require direct distribution estimation. This property makes them a desirable proxy for information between random variables in some instances. A related issue is multicollinearity, wherein one feature may have linear dependence on possibly several others. The package PyCaret, for example, automates the handling of multicollinear features as part of the overall AutoML pipeline.

The article [59] introduces a hybrid approach whose filter stage uses a unique multi-objective optimization to prune features. For a given feature subset $~ F \subset F$ , three objectives are calculated: the (negative of) average mutual information between features in $~ F$ and target $Y$ , the average mutual information between features in $~ F$ , and $# ~ F$ . The Pareto frontier⁹⁹9Due to the classic analogy of optimizing for location and price of coastal hotels, this frontier is commonly called a skyline (see, e.g., the seminal article [23]). for a multi-objective optimization problem is a set of Pareto efficient (i.e. non-dominated¹⁰¹⁰10In this context, these are solutions for which no subset of features performs at least as well on 2 objectives and better on the remaining one. ) solutions. The authors use a forward selection strategy based on finding an approximation of the Pareto frontier by growing the size of examined subsets while finding subsets that are non-dominated according to the first two (information-based) objectives. To the authors knowledge, Pareto optimization is not performed in any open source automated tooling, however the DEAP package includes implementations of popular multi-objective optimizations such as NSGA-II (see [38]).

In the doctoral thesis [61], Mark Hall outlines a method for feature selection, correlation-based feature selection (CFS), for classification problems, based on the following idea:

“A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other”

Stated this way, CFS aims to find a trade-off between relevancy (dependence of the target on the response, as above) and redundancy (correlations among the derived features)¹¹¹¹11The article [69] provides several helpful cautionary tales about naive application of ranking schemes. Among other things, they show that identically distributed variates may not be immediately redundant, irrelevant features may combine to give relevance as a group, and high correlation does not imply uselessness in separating classes.¹²¹²12As discussed in [76], some machine learning algorithms have highly robust performance in the face of irrelevant variables but degrade when given highly correlated (redundant) relevant features.. Features are retained based on their high relevance, low redundancy, and ability to predict classes in areas of the features space that are not predicted well by existing features. For a classification prediction on targets $C$ , a proposed subset, $~ F \subseteq F$ is scored according to

s (~ F) ≐ \frac{\sum_{f \in ~ F} ρ_{c f}}{\sqrt{# ~ F + 2 \sum_{f, g \in ~ F} ρ_{f g}}}

(1)

where $ρ_{f g}$ is the correlation coefficient between features $f,$ and $g$ , and $ρ_{c f}$ is the correlation coefficient¹³¹³13These coefficients are not the usual linear Pearson coefficients, but represent a more general sense of the word “correlation”. In Hall’s presentation the coefficients can be symmetrical versions of relief, minimum description length, or symmetric uncertainty. See [61] for full details. between feature $f$ and target values $c$ . The numerator in (1) indicates how predictive the features are, while the denominator penalizes non-parsimony and redundancy. The pyautoweka package, which implements the Java Auto-WEKA application [124, 148] (described in more detail below) performs CFS in addition to informational and entropy-based filtering as part of its general automated feature and model selection algorithm.

Statistical Pruning.

A related technique to the informational approaches above are statistical filtering techniques. The popular scikit-learn package (see [117, 137]) offers a host of basic statistical feature selection techniques through its feature_selection module [117]. These include methods such as SelectKBest which uses scores based on $p$ -values of passed hypothesis tests of the features, VarianceThreshold, which makes sure that variance in a feature exceeds a minimal limit and removes those features below, as well as others. Some of these methods are automated in the OptimalFlow, EvalML and TPOT (see next section) feature selection pipeline stage., while some implement their own takes on these patterns (such as MLBox’s automated implementation of variance thresholding).

A very good example of automated statistical filtering is outlined in [30], as implemented in tsfresh, wherein for each derived feature, a $p$ -value is computed to determine whether the derived feature is relevant for the prediction of the target. Here, irrelevance is equivalent to zero mutual information between the given feature and the target. Each univariate feature has its $p$ -value chosen (using potentially different hypothesis tests) resulting in a vector of $p$ -values. From here, a Benjamini-Yekutieli procedure is applied to avoid false positives (meaningless features making it through this round of filtering) among these $p$ -values. This procedure is based on a method for multiple comparison testing that controls the so called false discovery rate, $F D R$ , defined by

F D R = E [\frac{# {incorrect rejections of null hypothesis}}{# {rejections of null hypothesis}}]

Feature reduction then takes place based on the $p$ -values while keeping bounds on the $F D R$ . Of course, other techniques, like the simple Bonferroni correction or other multiple comparison solutions may be adopted in place of $F D R$ -controlled methods. In addition to the above, there can be overlap with the information pruning techniques mentioned previously when including the target variable in the information pruning basket.

A final remark on relevance.

We close this discussion of filtering by noting that feature relevance is not at all an easy determination as there are many different, incompatible, definitions of relevance in the literature. For examples of these definitions, see Kohavi and John’s article [76] from which we quote:

‘‘...it is important to realize that relevance according to these definitions does not imply membership in the optimal feature subset, and that irrelevance does not imply that a feature cannot be in the optimal feature subset.”¹⁴¹⁴14Optimality is given a precise meaning in their paper.

2.2.2 Wrappers

In general, wrapper methods treat the predictive model as a black box, and assess fitness of a subset of features by looking at performance on that subset. Although there are many approaches to wrappers (for instance, type of model, whether evaluation uses cross-validation or holdout, etc) many of the methods fall into two types: forward selection and backward elimination. Forward selection wrappers begin with an initial set of features (the empty set, say) and proceed to add in features one at a time, while backward elimination rests on winnowing down the initial set of features by removing redundant or irrelevant features. Basic implementations of this wrapper paradigm are implemented in the scikit-learn model_selection module. Of course, the search problem can be carried out with different algorithms and there are reasonable reasons for each approach (see [76, 69] for in-depth discussions on this topic).

Feature importance.

The term feature importance can refer to a handful of different techniques. A simple case is the case of linear models, where model coefficients can be viewed as feature importances when the features are appropriately normalized. In the classic article [24], permutation feature importances are explored in the context of random forests. This amounts to looking at the resultant decrease in performance (misclassification rate, say) when each feature undergoes a random permutation of its values, keep all other features fixed. EvalML (see next section) and scikit-learn (which can be used with non-native variants such as XGBoost) include this form of importance as part of their basic suites.

Sticking with forests, the determination of splits, the Gini index¹⁵¹⁵15If $π_{i} (t)$ is the estimated probability of class $i$ in node $t$ , $\sum_{i \leq # c l a s s e s} π_{i} (t) (1 - π_{i} (t))$ denotes the Gini index for $t$ , see [73]. (or impurity) is often used for classification problems and sum of squares for regression (see, e.g. [144] for more details). XGBoost offers multiple additional options for measuring feature importances, such as the number of times a given feature was used to provide a splitting of the data across all trees in the forest as well as the gain from each feature used in model construction/pruning, among others (see [40] for full details). Fully automated random forest impurity based feature importance is implemented by MLBox, EvalML, and other autoML packages.

Building on work in [145] for the case of filters, in [98] a wrapper algorithm and R package, Boruta, are presented which are based around random forest feature importances and meant to find all relevant features for a given classification problem. Each feature in the dataset is used to create a “shadow” feature whose values are permutations of the base feature. Then, the enlarged dataset with the original and shadow features is used to perform a prediction and feature importances are computed. The maximum $Z$ score is calculated among all shadow attributes (MZSA)¹⁶¹⁶16The $Z$ score of a variable $X$ is $\frac{X - E X}{\sqrt{v a r (X)}}$ . In this case, the importances are random variables. and those features with $Z$ scores testing significantly less than the MZSA are removed, while those testing significantly above are counted as important. This process repeats until a stopping criteria is met. Boruta has an open source Python implementation (see [22]) as well as a Python extension BoostARoota [21] which uses a variant of the Boruta algorithm with XGBoost for increased speed relative to the traditional random forest based Boruta. Boruta is part of PyCaret’s automated feature selection module.

Shapley additive explanations (SHAP, see [92]) can provide an alternative feature importance paradigm. Traditional Shapley values can be defined by

ϕ (f_{i}) = \sum S \subset F ∖ {f_{i}} c (S) (m_{S \cup {f_{i}}} (x_{S \cup {f_{i}}}) - m_{S} (x_{S}))

(2)

where $f_{i}$ is the $i$ ’th feature, $c (S) = \frac{# S! (# F - # S - 1)!}{# F!}$ , $m_{B}$ denotes the predictive model obtained by training on the subset $B \subset F$ features and $x_{B}$ the corresponding inputs from the feature subset, $B$ . Morally speaking, the SHAP values are a variation of (2) wherein $m_{B} (x_{B})$ is approximated by a conditional expectation $E [m (z) ∣ z_{B}]$ where $z_{B}$ is a binary vector indicating which features are selected, and $m$ is the full predictive model. As the sum in (2) extends over subsets of a feature set, sampling can be performed to give a tractable approximate solution, among other approaches (see [92] and [8] for more details). SHAP values are implemented in, e.g., EvalML as part of model explainability.

Genetic algorithms.

A genetic algorithm based wrapper technique is to use a performance based fitness function during the evolutionary cycle. A clear example of this general idea in the context of regression problems with categorical features, using the DEAP package, is found in [108]. Similar ideas were explored in [60].

In [84], an additional feature subset selector pipeline was added to the basic TPOT package (see Section 3.1.2 for more details on the TPOT package) that performs removal of input features as a pipeline module. The TPOT evolved model then can be used as a wrapper to select the best performing feature subset. Expanding on the TPOT framework, in the article [114] a method inspired from genetics is added, paring down the full feature set to a restricted number of parameterized feature pairs. Each pair is then assessed using its accuracy on a custom tree built just on those two features. This method presumes the possible strong interaction between two features as can happen between two genes.

Lasso.

The least absolute shrinkage and selection operator (LASSO) [88, 149] is an $ℓ^{1}$ -regularized variant on the linear least squares regression problem, that tries to solve

a r g m i n θ | | ⎡ ⎢ ⎢ ⎣ \begin{matrix} 1 & (x^{(1)})^{T} ⋮ 1 & (x^{(m)})^{T} \end{matrix} ⎤ ⎥ ⎥ ⎦ θ - y | | + λ \sum i \leq n + 1 | θ_{i} |

(3)

The first term in the above is a fidelity term penalizing poorness of fit to a linear mode, while the second term helps promote sparsity of the resultant $θ$ ; the gradient of (3) will produce terms like $\frac{\partial}{\partial θ_{i}} | θ_{i} | = \frac{| θ_{i} |}{θ_{i}} = s g n (θ_{i})$ when $θ_{i} \neq 0$ so that gradient descent will result in components of $θ$ not improving fidelity tending to zero. LASSO on its own, as a predictive model, is an embedded technique rather than a wrapper, but it can be repurposed as part of a larger wrapper method.

The autofeat Python package [65] offers a multi-step feature reduction technique, consisting of removal of highly correlated features, followed by division of data into chunks on which $ℓ^{1}$ -regularized models are trained to guide feature selection. This process repeats through several rounds of subsampling. In addition, the package implements a form of symbolic pruning, utilizing the SymPy library to simplify and reduce redundant symbolic expressions. This eliminates redundant but complex features without evaluating their associated expressions. The popular autoML package MLBox also offers automated LASSO feature selection as part of its preprocessing pipeline optimization.

Package	Low-Code	Preprocessing	Feature engineering
MLBox	Yes	Cleaning, duplicate handling, typecasting, imputation, encoding	Variance thresholding, $L 1$ , random forest feature importances
OptimalFlow	Yes	Imputation, standardization, outlier winsorizing, encoding	Backward elimination, statistical pruning
hyperopt-sklearn	No	Encoding, scaling, time series lag selection, tf-idf, restricted Boltzmann machines	PCA, column $K$ -means
AutoGluon-Tabular	Yes	Encoding, datetime handling, imputation, duplicate handling	None
AlphaD3M	Yes	Imputation, encoding, scaling	PCA, tf-idf, variance thresholding, statistical pruning
Prophet	Yes	None	None
PyCaret	Yes	Imputation, encoding, SMOTE, outlier removal	Interactions, trigonometric features, binning, clustering, permutation importance, correlation with target, removal of collinear features, PCA, Boruta
EvalML	Yes	None	None
featuretools	Yes	Data typing	Deep feature synthesis. Customizable aggregations and transformations. Basic selection based on low information or constant as well as removal of highly correlated features
OBOE	Yes	Imputation, encoding, standardization	None
TPOT	Yes	Scaling, encoding, normalization	Binning, FastICA, synthetic feature generation, feature agglomeration, PCA, polynomial features, forward selection, statistical pruning, variance thresholding, elimination
AdaNet	No	None	Implicit neural features
FLAML	Yes	Encoding, scaling, class balancing, stratified sampling	None
tsfresh	Yes	Imputation	Statistical pruning (false discovery rate minimization)
AutoTS	YEs	Cleaning and shaping, outlier handling, detrending, shifting, filtering, transforming	Not pipeline-indpendent
Auto-Keras	Yes	Typecasting, encoding	Implicit neural features
Auto-WEKA family	Yes	None	Forward selection, CFS subset selection, information gain evaluation, principal components evaluation, symmetrical uncertainty, gain ratio, Pearson correlation, 1-R
auto-sklearn family	Yes	Encoding, imputation, class balancing, rescaling	Extremely random trees preprocessing, fastICA, feature agglomeration, kernel PCA, random kitchen sinks, linear SVM preprocessing, nystroem sampler, PCA, polynomials, random trees embedding, percentile selection, select rates
autofeat	Yes	Encoding	Buckingham Pi theorem, interactions, recursive application of combinations of ratios, sums, trigonometric functions, absolute value and others. Wrapper selection based on higher performance over corrupted version.

Table 1: Feature engineering and preprocessing comparison of packages discussed in this survey. For the purposes of this survey a package is deemed “Low-Code” if it demands a very low learning curve from a user already familiar with scikit-learn, which is an industry standard for non-automated machine learning.

Stability.

A distinct kind of wrapper technique is one based on stability, which can be interpreted in a number of different ways¹⁷¹⁷17In the literature, stability generally refers to what we call here dataset stability. We use the term stability in a looser, more general, way.. In [59], the authors use an approach based on classification stability: for each subset $~ F \subset F$ of the feature set, two scores is computed based on classification rates for a handful of classifiers. The first is $A (~ F) = {max}_{f \in M} R (f, ~ F) - {min}_{f \in M} R (f, ~ F)$ , where $M$ is a set of classifiers and $R (f, ~ F))$ represents the rate of correct classification with model $f$ using feature subset $~ F$ , and measures the overall worst variation in performance with a given subset. The other objective is simply the average $R (f, ~ F)$ over the classifier set. Subsets can be retained based on minimization of the former and maximization of the latter objectives. This multi-objective optimization can help ensure both reliability and performance of the final feature subset on a wide range of models.

In [91], a modification to forward selection is proposed, which we will refer to as dataset stability. The algorithm rests on the fact that error rates are random variables, as they depend on the data-generating distribution and therefore are influenced by sampling. Since wrappers generally assess feature subsets on the basis of these random variables (error rates), the sequence of feature sets obtained during a single run of forward selection, is also a random variable, and ultimately dependent upon the training set used in the selection process. The proposed algorithm works by performing $K$ runs of forward selection, each producing a sequence of obtained feature sets,

S_{i} = {~ F}_{1}^{(i)}, . . ., {~ F}_{# F}^{(i)}, i = 1, . . ., K

Evaluation of runs involves a new training/testing split of the dataset, and therefore the sequences $S_{1}, . . ., S_{K}$ generally will have feature variety. The set of feature sequences, $S = {S_{1}, . . . ., S_{K}}$ can then be given an index,

I_{S} (k) = \frac{2}{K (K - 1)} \sum 1 \leq i < j \leq K I_{C} (S_{i} (k), S_{j} (k))

where, in the above, $S_{i} (k)$ denotes the first $k$ features selected by the $i$ ’th step of forward selection and $I_{C} (S_{i} (k), S_{j} (k))$ is a measure of consistency between two feature subsets¹⁸¹⁸18Specifically, we have for equal cardinality feature subsets ${~ F}_{1}, {~ F}_{2}$ . With a stability index for the set of feature sequences, a final reduced feature subset can then be obtained by selecting an optimal subsequence and truncating its size based on the lowest cost (see [120] for full details). The main advantage of this approach is that the returned feature subset should be fairly robust with respect to data samples. A fuller discussion of stability, with differing notions of measuring it, is found in [120], to which we refer the interested reader. We remark that, to the author’s knowledge, no current open source automated feature engineering tools implement the stability-based approaches discussed here. We include them as they may make a promising addition to future systems.

3 Automated model engineering

Model engineering (as described earlier, a subset of autoML) can be thought of as consisting of at least two stages. The first is the choice of a given model, $m$ , from a candidate class of models, $M$ . The second is the appropriate selection of hyperparameters $λ$ , chosen from a set of hyperparameters $Λ$ . As outlined in [148, 81], the hyperparameter space $Λ$ is necessarily hierarchical and often conditional; choices of a given hyperparameter $λ_{j} \in λ$ may highly influence other hyperparameters, $λ_{i}, i \neq j$ . For example, architectural modifications (limitations of depth or breadth in a neural network) may render other hyperparameters in changed network regions meaningless. In fact, the second of these stages may be seen as a superset of the first; selection of which model to use may be cast as a choice of hyperparameter specification, though we may wish to keep them conceptually distinct. Similar remarks of course hold for any preprocessing steps that are included in the pipeline; see [77].

A common trend in automated model engineering packages is the pipeline: an object joining multiple, conceptually independent processing steps into a cohesive unit able to be independently trained. These pipeline components sometimes include aspects of model engineering and feature engineering as well as more banal data cleaning or imputation tasks.

The ultimate goal of automated model engineering is to eliminate the necessity of human involvement in the design stage of the model engineering process. Users should be expected only to have to make very high-level choices, despite the complexity of the search space involved (both model and hyperparameter space can be effectively infinite). In the below, we present some of the common concepts, algorithms, and open source Python package implementations used for automating the model engineering work of data scientists. For a fuller, more detailed history, the recent [165], [53], and [160] are excellent starting points covering many of the packages considered below. We refer the reader to Table 2 for a comparison of models and methods of the automated model engineering packages we discuss in the following sections.

3.1 Model selection and optimization

In this section, we will be looking at the first-order issues involved with automatic hyperparameter and/or model selection, paying attention to some of the challenges unique to the overall model engineering problem and introduce some popular frameworks for solving them. Many of the packages discussed rely on techniques from Bayesian optimization and we refer the interested reader to Appendix A for explanation of some of the main concepts.

3.1.1 Ensemble selection

Perhaps the first automated system for tuning hyperparamters and selecting a model appeared in the form of ensemble selection introduced in [127]. This technique consists of the generation of a large number of candidate models (originally, around 2000), obtained by variation of hyperparameters, without regard to performance quality. In an iterative and greedy, way models are evaluated to be added to an ensemble set based on how well the proposed model performs, when averaged with the existing ensembles, on a validation set. In order to allow for continued performance gains, models are selected from the model population with replacement. As well, ensembles can be initialized with a fixed number of top performers to avoid potential overfit. Finally, the authors also explore ensemble bagging; constructing ensembles from a smaller random subset of the models to avoid overfit, repeating this process and finally averaging over all bagged ensembles.

3.1.2 Genetic techniques

In [114], the authors introduce Tree-Based Pipeline Optimization (TPOT, see [150]), an algorithm that may have been the first to implement a full automated pipeline engineering solution that included feature transformation and selection. The authors used the DEAP package to implement a genetic programming (see section 2) algorithm generating trees of pipeline operators, which form the genetic population at any given stage of evolution. The algorithm evolves the sequence of operators as well as the hyperparameters associated with each one and uses performance score on the learning task as a fitness function.

The details of the basic TPOT algorithm allow for multiple copies of the data set to be propagated through a given pipeline before finally being used by an end stage classifier. As well, longer pipelines can be achieved than with packages like auto-sklearn which prohibit multiple preprocessing and model stages. In the article [113], the authors extend the basic TPOT framework, using Pareto optimization to regularize their model, penalizing needlessly lengthy pipelines. In [84] the addition of a feature subset selector and optional specification of linear or path-graph pipelines is explored, further enhancing TPOT’s abilities.

3.1.3 Combined algorithm selection and hyperparameter optimization

In this section we review packages that implement solutions to a common formulation of the automated model engineering problem.

auto-WEKA.

In [148], the authors introduce a combined algorithm selection and hyperparameter optimization (CASH) scheme to find

a r g m i n m_{j} \in M, λ \in Λ^{(j)} \frac{1}{k} k \sum i = 1 G (m_{j}^{λ}, D_{t r a i n}^{(i)}, D_{v a l i d a t i o n}^{(i)})

(4)

where $G (m_{j}^{λ}, D_{t r a i n}^{(i)}, D_{v a l i d a t i o n}^{(i)})$ denotes the incurred loss when training model $m_{j}^{λ}$ , having parameters $λ$ that can take on values in a specified range $Λ^{(j)}$ , on training subset $D_{t r a i n}^{(i)} \subset D ∖ D_{v a l i d a t i o n}^{(i)}$ and evaluating loss on validation set $D_{v a l i d a t i o n}^{(i)} \subset D$ .

The authors’ implement a Sequential Model-Based Optimization (SMBO, see [138] for open source solutions) algorithm called Sequential Model-Based Algorithm Configuration (SMAC, see Appendix A.4 for more details) to solve (4). SMBO is a general Bayesian technique for solving algorithm configuration problems (see [46]), attempting to maximize the expected loss improvement at a given hyperparameter choice with respect to a model-specific conditional probability. Their implementation of this resulted in the auto-WEKA Java application which has Python implementations (see, e.g. [124]). Since the original release of auto-WEKA, auto-WEKA 2.0 (see [82]) introduced several improvements on the original package, including support for regression models and a higher level of parallelism. In Table 2 we refer to the various iterations of auto-WEKA, as the auto-WEKA family .

auto-sklearn.

The auto-sklearn package [50, 49] builds on the CASH formulation of auto-WEKA but incorporates two enhancements: meta-learning and ensembling. The meta-learning entailed first establishing meta-features and model performance on 140 datasets in the repository OpenML [151], and using this information to assign a new dataset $D$ a similarity score to the known datasets, and using the associated model metadata for top performing models on the nearest 25 datasets to seed the Bayesian optimization portion of the CASH solver. Evaluated meta-features are numerous and include things like numbers of instances/classes, ratio of categorical to numeric columns, some simple descriptive statistics, prediction performance with simple models, and others (see [11]).

The ensembling portion of the auto-sklearn algorithm keeps track of models during the Bayesian CASH optimization and then applies ensemble selection as described above. The auto-sklearn package is capable of performing end-to-end modelling, including various preprocessing (imputation, balancing, etc) routines.

In the article [49] the authors extend the original auto-sklearn package in a few ways. On the prosaic end, they expand on datasets used for the meta-learning portion. On the more challenging side, they consider a time-constrained version of the cash optimization (4). The main novelty used to satisfy the given time-constraint is to no longer use the meta-datasets to make a $K$ -nearest dataset (KND) score on each new dataset given at runtime, but to use meta-datasets to construct a pipeline portfolio of pipeline configurations (hierarchical hyperparameters) that perform well in general, in terms of generalization error on validation subsets of the meta-dataset. The portfolio obtained serves as a robust warmstart set for the Bayesian optimization used for solving CASH on a new dataset. The authors introduced a successive halving technique for reducing the portfolio optimization time: train pipelines for a low given budget, select the best performing pipelines, increase their time budget in proportion to the number selected, and repeat. This technique offers significant speedup on the time-constrained portfolio optimization. The described variant of auto-sklearn is called PoSH Auto-sklearn. We refer to auto-sklearn and its various variants as the auto-sklearn family in Table 2.

3.1.4 Other methods

Finally, we present a few patterns for automated model engineering that do not fit neatly into any of the previous categories.

Flaml.

In [154] the authors introduce the FLAML (Fast and Lightweight AutoML) library (see [52, 51]) which implements a few novelties in the algorithm selection and hyperparameter optimization space. In FLAML a learning configuration is specified by a tuple $(m_{j}^{λ}, λ, r, s)$ where, as earlier, $m_{j}^{λ}$ is a machine learning model as usual, $λ \in Λ^{(j)}$ , the associated hyperparameters for that model, $r$ is a binary variable indicating whether to use cross-validation or a holdout set (effectively determining how much of the training data is used in assessing performance) known as resampling, and $s$ is the sample size to use during training. The main novelty here is the decoupling of the sample size, resampling, hyperparameters and choice of model from each other allowing the potential exploitation of several heuristics¹⁹¹⁹19Some of these heuristics are the relationships between the resampling strategy and computation time, the relationships between smaller samples and need for regularization, etc. See [154] for the full discussion of observations and properties..

The method works according to the following. First a resampling strategy is decided based upon the size of the training data, with cross-validation preferred for smaller sized datasets. After this, a machine learning model $m_{j}^{λ}$ is selected. This selection is probabilistic with probability density determined based on the expected cost of improvement, a measure of the expected computation time to find a configuration of the given model with a lower validation error than the current lowest validation error among all models used thus far. Models with a lower expected cost of improvement are expected to spend less time improving the overall performance and therefore are given a higher probability of being proposed. After the model is proposed, hyperparameters are selected using an algorithm described in [158] which uses a randomized stepping updates to $λ$ based on observed validation error. For a given model $m_{j}^{λ}$ , starting with an initially small sample size, decisions are made about whether to increase the size with given choice of hyperparameters, or move to new hyperparameters with the current sample size.

Although FLAML does not offer built-in ensembling as many of its competitors do, it manages to achieve solid performance. Additionally it is highly extensible, offering users the ability to easily add in models, requiring very simple compatibility requirements.

hyperopt-sklearn.

In [18] the authors introduce the hyperopt package [66] for optimization in a variety of algorithm configuration problems. Building on this framework, the hyperopt-sklearn package [67] uses hyperopt to performing pipeline selection from preprocessing and model modules implemented in scikit-learn. The hyperopt-sklearn package offers users a high level of control over the method used for hyperparameter optimization, supporting a number of popular optimization algorithms such as random search (see, e.g., [20]) and annealing. The package is unique in allowing users a bit more control over the probabilistic aspects of the final estimators (probability of certain models being used, as well as distributions of hyperparameters).

MLBox.

The MLBox package [101, 100] provides a fully automated, low-code wrapper for hyperopt and scikit-learn that includes multiple forms of feature selection (feature importances, LASSO, and others) as well as performing basic ensemble stacking.

PyCaret

The PyCaret package (see [125]) provides a very low-code, automated wrapper for scikit-learn, hyperopt, and others. Heavily targeted at business case focussed data scientists, it offers some unique additions over competing packages, such as dedicated anomaly detection, classification, and natural language processing modules, robust preprocessing, as well as very good plotting and visualization tools with very minimal human input.

OptimalFlow.

In [85] the OptimalFlow automated pipeline ensembling package is introduced (see also [115]). Unlike some of the other packages discussed so far, OptimalFlow used random search ([20]) or grid search rather than Bayesian optimization to solve the CASH problem. OptimalFlow constructs a collection of pipelines to optimize, each with modularizable components, and then searches through the collection to obtain an optimal baseline in a technique the authors call Pipeline Cluster Traversal Experiments.

Package	Models	Ensembling	Optimization
MLBox	LightGBM, random forest, extra-trees, AdaBoost, linear	Stacking	Tree Parzen estimators
OptimalFlow	Logistic regression, SVC/SVR and linear SVC/SVR, linear, LassoCV, and ridge regression, SGDRegression/Classification, Huber, random forest, AdaBoost, kNN, MLP, XGBoost, HistGradientBoostingRegressor	No	Pipeline cluster traversal experiments, Grid search, random search
hyperopt-sklearn	Multiple SVCs/SVRs, kNN, AdaBoost, random forest, decision tree, multinomial and Gaussian naive Bayes, linear and quadratic discriminant analysis, one-v-rest/one, SGD, gradient boosting.	No	hyperopt (annealing, tree Parzen estimators)
AutoGluon-Tabular	LightGBM, CatBoost, XGBoost, random forest, extra-trees, neural networks, Vowpal Wabbit, DeepAR, Prophet, MQCNNEstimator, MLP time series estimator, untested extensibility of full GluonTS models.	Multi-layer stacking	Fixed defaults, ensembling is optimization
AlphaD3M	AdaBoost, XGBoost, gradient boosting, random forest, logistic and least angle regression, nearest centroid, SVC and linear SVC, passive aggressive, kernel ridge, extra-trees	No	Monte Carlo tree search
Prophet	Additive regression Prophet model	No	Bayesian MAP estimation
PyCaret	Logistic and ridge regression, KNN, naive Bayes, linear and radial SVM, Gaussian process classifier, MLP, LightGBM, random forest, AdaBoost, linear and quadratic discriminant analysis, extra-trees, extreme gradient boosting, CatBoost	Stacking	Random search, Bayesian optimization
EvalML	KNN, random forest, XGBoost, LightGBM, CatBoost, extra-trees, decision trees, exponential smoothing, ARIMA, Prophet, Vowpal Wabbit	Stacking	Grid search, random search, Bayesian optimization
OBOE	LASSO and ridge regression, elastic net, KNN, decision tree, random forest, gradient boosting trees, AdaBoost tree, SVM, perceptron, Gaussian naive Bayes, extra-trees	Ensemble selection	Experiment design and collaborative filtering
TPOT	Gaussian, multinomial and Bernoulli naive Bayes,extra-trees, decision tree, kNN, XGBoost, logistic regression, SGD regression/classification, ElasticNetCV, AdaBoost, LASSO least angle regression, ridgeCV, MLP	Implicit stacking from synthetic features	Genetic
AdaNet	Neural networks	Complexity-regularized subnetwork ensemble	Boosting-style coordinate descent
FLAML	Random forest, extra-trees, LightGBM, XGBoost, ARIMA, SARIMAX, Prophet	No	Probabilistic expected cost of improvement and randomized direct search
AutoTS	Multiple naive models (zeroes, average value, seasonal, last value), GLS, GLM, exponential smoothing, UCM, ARIMA, VARMAX, DynamicFactor, DynamicFactorMQ, VECM, VAR, Theta, ARDL, Prophet, GluonTS, RollingRegression, WindowRegression, MultivariateRegression, UnivariateRegression, Univariate/MultivariateMotif, SectionalMotif, NVAR, NeuralProphet, GreyKite, MotifSimulation, TensorflowSTS, TFPRegression, ComponentAnalysis	Simple, distance, horizontal, mosaic	Genetic
Auto-Keras	Neural networks	No	Bayesian optimization with neural morphisms
Auto-WEKA family	BayesNet, DecisionStump, DecisionTable, GaussianProcesses, IBk, J48, JRip, Star, LinearRegression, LMT, Logistic, M5P, M5Rules, MLP, naive and multinomial naive Bayes, OneR, RandomSubspace, PART, random forest, RandomTree, REPTree, SGD, SMO & SMOreg, VotedPerceptron, ZeroR, LWL, AdaBoostM1, AdditiveRegression, AttributeSelectedClassifier, RandomCommittee,	Stacking, voting	SMAC, Tree Parzen estimators. Also supports grid and random search
auto-sklearn family	Adaboost, decision tree, extremely random trees, Bernoulli, multinomial and Gaussian naive Bayes, gradient boosting, kNN, linear and quadratic discriminant analysis, linear and kernel SVM, passive aggressive, random forest, SGD linear classifier	Ensemble selection during search	Bayesian optimization warm-started with meta-learning

Table 2: Automated model engineering packages discussed in this survey. Some models have been omitted, though most were included. Bayesian optimization, SMAC, and Tree Parzen estimators are discussed in Appendix A. Included as well are time series based models, when appropriate.

3.2 Second-order methods

In this section we discuss extensions and enhancements of the CASH formulation described in Section 3.1.3. These ideas in some cases touch on “second-order” issues that arise in automating model engineering, touching at issues at a layer of abstraction higher than the basic CASH optimization (4) problem. For a simple example of “higher order” complications, while hyperparameters appear in machine learning models and require optimization, many AutoML systems address hyper-hyperparameter optimization (see, e.g. [159] for this parameter hierarchy). Each of the below packages and algorithms touch on higher-order abstractions in the automated model engineering problem.

Auto-sklearn 2.0.

We begin by remarking that the PoSH Auto-sklearn algorithm and package described above introduced additional meta-parameters (for instance, whether to run successive-halving in the optimization or not, and if so which parameters to set). The authors thus further extend PoSH Auto-sklearn by introducing a learned policy selector, which we now briefly describe (our notation will differ somewhat from [49]).

The estimated generalization error of a model $m^{λ}$ , on the meta-learning datasets $D_{m e t a} = \cup_{j} {{~ D}_{j}}$ , is denoted $ˆ E r r (m^{λ}, D_{m e t a})$ and is given by

\frac{1}{# D_{m e t a}} \sum j ˆ E r r (m^{λ}, {~ D}_{j})

where the error $ˆ E r r (m^{λ}, {~ D}_{j})$ is estimated on a holdout set in the usual way. If $π$ denotes an optimization policy choice, from an allowable set of such choices $Π$ , then the generalization error includes a dependence on the policy decision function $ξ : D_{m e t a} \to Π$ viz,

ˆ E r r (m^{λ}, ξ, D_{m e t a}) ≐ \frac{1}{# D_{m e t a}} \sum j \leq # D_{m e t a} ˆ E r r (m^{λ} (ξ ({~ D}_{j})), {~ D}_{j})

(5)

with $m^{λ} (ξ ({~ D}_{j}))$ denoting the model $m^{λ}$ with optimization policy on ${~ D}_{j}$ determined by $ξ$ . The second order learning process then involves optimizing over the policy space in addition to the model space. Cross-validation of the meta-data sets themselves is performed, building each policy on subsets of $D_{m e t a}$ and validating on a holdout subset, in order that the estimated performance of a policy on any given dataset be reliable. The authors of [49] use a combination of techniques to then determine an appropriate $ξ$ . The resulting open source implementation of this enhancement to Auto-sklearn is the Auto-sklearn 2.0 package, available as a module within the Auto-sklearn package [11].

AutoGluon-Tabular.

In the article [105] the authors introduce the AutoGluon-Tabular fully automated model engineering algorithm and package (see [12]) capable of extremely low-code generation of machine learning models. Data is passed through both model-agnostic and model-dependent preprocessing stages prior to being input to a number of sequentially trained models chosen from a broad class, including Extremely Randomized Trees and neural networks. The model training is scheduled to allow good results to obtain even under heavy time-constraints.

An important part of AutoGluon-Tabular is the use of a multi-layer stacking technique for model ensembling (see also [157]). While packages like EvalML and MLBox use a common stacking ensemble technique consisting of training a final “stacker” model, whose inputs are the predictions of the “base” learners in the candidate model space, on a holdout set, in AutoGluon-Tabular stacker models are layered, with outputs of a given set of stacker models serving as inputs to the next layer of stackers. The higher-level stackers are trained, for a given base model, only on the holdout predictions, never on in-fold predictions, as data is partitioned into $k$ disjoint chunks. To reduce potential overfit, this $k$ -fold bagging process is repeated a number of times and the average of the out-of-fold predictions is the final estimator at a given level of stacking.

AlphaD3M.

In the article [42], a team from New York University introduced AlphaD3M, a pipeline selection and hyperparameter optimization package [5] built around DARPAs Data Driven Discovery of Models (D3M) program [35]. As apparent in the name, AlphaD3M takes heavy algorithmic inspiration from AlphaZero, a program developed by DeepMind that masters several two-player games by learning through self-play. Training examples for AlphaD3M are created by the system during self-play, and model pipeline primitives (feature extraction, inference, etc) were taken as fundamental modular units in the process. A long short-term memory (LSTM) network is trained to predict action probabilities corresponding to a pipeline distribution as well as estimated performance. A regularized cross-entropy loss on the predicted pipeline sequence (again, viewed as drawn from a distribution) and a pipeline found from Monte Carlo Tree Search (MCTS) together with squared errors from expected and obtained performance is then used to optimize the network, separately penalizing cumbersome pipelines and network weights.

In [41] the authors extend the basic AlphaD3M framework discussed above by incorporating a context-free grammar over for valid pipeline construction, based on practitioner knowledge about pipelines. With this pipeline grammar, the MCTS computation time is significantly decreased. This improvement showed that even a small amount of expert knowledge can have profound gains for autoML systems. The authors additionally showcased the impact that pre-trained models can make on AlphaD3M.

EvalML.

The EvalML package (see [45]) developed by Innovation Labs at Alteryx performs full model engineering and is designed to be highly compatible with Compose and featuretools. Aside from doing automated pipeline optimization, ensemble stacking, and excellent graphical pipeline visualization options, EvalML also offers explainability modules. These allow users to explore local interpretable model-agnostic explanations (LIME, see [93]) and SHAP values to gain insight into the decisive features in a selected prediction. These techniques assess the relative impact of features in a particular decision, as opposed to the feature importances, which measure the relative importance of a feature in the overall model. This, together with the natural language explanations offered by featuretools make an end-to-end featuretools/EvalML pipeline an attractive choice when explainability and transparency are important considerations (e.g. use cases in highly regulated industries).

Collaborative filtering.

In [159] a method based on collaborative filtering (see [55] for a very good explanation of this class of problem) is proposed for solving the algorithm selection and hyperparameter configuration problem resulting in an algorithm and package called OBOE²⁰²⁰20The name derives from the woodwind instrument orchestras use to provide tuning. (see [111]). The main idea can be summarized as follows below.

Let ${{~ D}_{i}}_{i \leq m}$ and ${m_{j}}_{j \leq n}$ be a collection of meta-datasets and models, respectively. A matrix $M$ can be constructed such that

M_{i j} = cross-validation error of m_{j} using {~ D}_{i}

The matrix $M$ can then be approximated by a low rank matrix comprised of latent meta-features, i.e. $M_{i j} \approx w_{i} \cdot z_{j}$ for $w_{i}, z_{j} \in R^{k}$ , $k < min (m, n)$ , latent feature vectors of meta-datasets and models, respectively, which can be computed, essentially, using singular value decomposition. When provided a new, previously unseen, dataset $D_{n e w}$ , some models are run on it and associated cross-validation errors are computed, effectively appending a new, partially filled, row to $M$ . This occupies the majority of the online runtime in OBOE. Latent meta-features for $D_{n e w}$ are then obtained via

w^{(n e w)} = a r g m i n w \in R^{k} \sum j (e_{j} - w \cdot z_{j})^{2}

where the sum is taken over the $j$ ’s corresponding to models for which errors were calculated for $D_{n e w}$ . The performance of (unobserved) model $l$ on $D_{n e w}$ can then be predicted using $w^{(n e w)} \cdot z_{l}$ . Additionally, the authors regress model runtimes on sample size and features in each of their meta-datasets allowing for surprisingly accurate estimation of unobserved model runtimes. Notice that the method used by OBOE assumes a bilinear relationship between model performance and meta-features and models.

The only piece missing from above is how the models to run on a new dataset are chosen. For this, the authors use a $D$ -optimal (see, e.g. [128] for related problems) method inspired by experiment design, in which a constrained (with total estimated runtime constraints) convex optimization problem is solved to produce the subset of models to use in order to estimate $w^{(n e w)}$ . Once the model performances on $D_{n e w}$ have been either observed or predicted, an ensemble selection process (see 3.1.1) builds an ensemble learner on the top observed performers²¹²¹21A priority queue is used, based on predicted performance, to decide which models to actually observe..

While OBOE models the model/feature to performance relationship linearly, in [106], the authors instead allow for a nonlinear relationship by probabilistic matrix factorization (PMF, see [83]). For this, the authors look to model $M$ using an explicitly nonlinear relationship, $M_{i j} = g_{j} (w_{i}) + ϵ$ , where $ϵ$ is a Gaussian distributed random variable with variance $σ^{2}$ and $g_{j}$ is a nonlinear function modelled using a Gaussian process prior (see Appendix A.2). In this case

(g_{1}, . . . ., g_{n}) ∣ ⎡ ⎢ ⎢ ⎣ \begin{matrix} w_{1} ⋮ w_{m} \end{matrix} ⎤ ⎥ ⎥ ⎦ \sim N (0, K)

for covariance matrix $K$ ’s entries given by kernel function $k (w_{i}, w_{j}) = K_{i j}$ and $N$ a multivariate Gaussian distribution. Putting all parameters associated with kernel function $k$ in $θ$ gives the following likelihood function for the performance matrix $M$ ,

L (W, θ, σ^{2}) ≐ P (M ∣ W, θ, σ^{2}) = Π_{i \leq n} N (M_{:, i} ∣ 0, K + σ^{2} I)

(6)

where $M_{:, i}$ denotes the $i$ ’th column of $M$ and $W$ is a matrix made up of the latent features ${w_{i}}_{i}$ as columns. The likelihood function in (6) can be optimized using stochastic gradient descent to reveal the latent features in $W$ . In this framework, predictions according to new models will also follow a normal distribution and depend on previously evaluated models. The selection of subsequent models to evaluate is done via maximizing the expected (performance) improvement acquisition function. An implementation of this technique is found in [122]. Despite the seeming increase in expressiveness provided by PMF described here, comparisons between OBOE and PMF (see [159]) indicate that the low-rank, linear assumptions of OBOE provide solid competition against PMF based model selection.

3.3 Automated forecasting

As with feature engineering, time series present their own unique challenges when it comes to automating the model engineering process. We can only briefly touch on some of the additional complications involved with automated forecasting; for a recent, thorough, review of some of the challenges and solutions in this space, we refer the reader to [95]. The unique challenges with time series forecasting we wish to mention arise from a number of sources including the following:

Model Variety. A large number of available models is not an issue unique to time series, however time series have their own classes of models designed specifically for forecasting purposes. These are generally naive models which rely on simple techniques, statistical methods which include autoregressive and smoothing models, and machine learning models (see [161]) which train regression models using more traditional machine learning techniques not directly inspired from classical time series modelling.
Assumptions. Classical time series models generally have their own types of assumptions which may include different forms of stationarity of the series as well as ergodicity (see, e.g. [119]). This means that automated forecasting tools should be able to detect and, potentially correct (as part of an overall pipeline) non-stationarity.
Forecasting. Time series forecasting generally requires more than single value prediction and, rather, usually involves prediction of a sequence of values into a future horizon. Forecasting models therefore must be able to reliably capture more than just simple responses and must also capture the order-dependency of the predicted values. In addition, it may be desirable to forecast probabilistically by providing either forward-looking confidence bounds or future distributions.

Each of the issues mentioned above can be addressed by their own selection of different methods. For instance, there are several techniques available for detecting and addressing nonstationarity in time series²²²²22The article [95] gives a nice overview of many of the techniques available for this. We just remark that ergodicity is, in general, not something easily verified.. Each available option of technique for the above issues can represent a pipeline component, potentially carrying their own hyperparameters just as we saw in non-time series prediction problems²³²³23For instance the choice of $p$ and $q$ in an $A R M A (p, q)$ model, the backwards horizon period over which to train a machine learning model, or choice of distributional parameters in a Bayesian model to name just a few examples. .

Prophet.

Inspired by generalized additive models (see [62]), in [147], the authors introduce the Prophet model for automated time series forecasting (see [123]) attempting to provide a scaling of the tools in this space in the sense of expanded userbase, variety of forecasting problems, and efficient evaluation of large numbers of forecasts. The model they introduce,

y (t) = g (t) + s (t) + h (t) + ϵ_{t}

(7)

is made up of a nonlinear growth term, $g$ , a seasonality term, $s$ , holidays, $h$ , and a Gaussian noise term, $ϵ_{t}$ . The growth term the authors use is a modified logistic model with nonconstant carrying capacity and piecewise constant growth rates²⁴²⁴24Their framework also allows for unbounded growth models., specified by rate changepoints that use Laplace-distributed priors. The seasonality term in (7) is modelled on a truncated Fourier series, and takes the form $s (t) = X (t) β$ with $X (t)$ a vector of Fourier basis elements and $β \sim N (0, σ^{2} I)$ . Holidays are treated as independent shock events and models as $h (t) = (1_{t \in h o l i d a y_{1}} (t), . . . ., 1_{t \in h o l i d a y_{k}} (t)) κ$ for $κ \sim N (0, ν^{2} I)$ where $h o l i d a y_{j}$ is a set of dates for the $j$ ’th holiday. With these priors on each component in the additive model (7) a maximal a posteriori estimate can be used to determine parameters and forecast or a generative posterior can be used to forecast with uncertainty. The implementation of the Prophet model [123] is highly low-code but also allows interested users control over lower-level parameters.

AutoTS.

The AutoTS package (see [15]) is a robust, genetic algorithm based automated forecasting library that focuses on providing highly low-code time series models. The model families used by AutoTS are large and include naive, statistical and deep learning members. The package allows for a variety of validation techniques as well as the ability to handle multiple time series, probabilistic forecasts, extensive control over the scoring of models via weighted combinations of metrics²⁵²⁵25As the authors of the package note, this composite scoring technique (which works well in the genetic algorithm paradigm since differentiation is not needed) allows for AutoTS to balance competing, potentially exclusive performance criteria in an elegant way. See the extended example in [15] for a fuller discussion., and control over the overall speed of learning by restricting the allowable model types. In addition, AutoTS offers several forms of ensembling: simple ensembling is a basic averaging approach, distance is the splicing of the forecasted data into separate forecast regions, one for each model projection, horizontal in which models and parameters are independent across the multiple time series, and mosaic, which extends horizontal ensembling by having each forecast period for each series get its own model. Since the horizontal ensembles involve lower level models, they can involve recursively simple or distance ensembled forecasts.

GluonTS.

In the article [4] (see also [3]), the authors introduce the time series modelling package GluonTS [57], which implements a host of probabilistic and deep learning models built in MXNet and PyTorch, among others. The models in GluonTS are, in the main, divided into generative and discriminative types²⁶²⁶26The particular meaning of these terms in [4] differs somewhat from the usually encountered meanings, as in [9]. with Deep State Space Models (see [146]) being one example of the former and Transformers (see [152]) an example of the latter. The AutoGluon library has a forecasting module incorporating several deep learning models from GluonTS including convolutional encoder-decoder network (see [155] or [58]) and potential (untested) ability to add any GluonTS estimator object into the AutoGluon framework.

We close this subsection on time series by noting that, in addition to the above mentioned packages, PyCaret, EvalML, FLAML and Auto-Keras (see next section) offer the users ability to do forecasting as part of their general model engineering suite. The models offered within these packages are outlined in Table 2.

3.4 Deep AutoML

The automated model engineering packages considered thus far have concentrated on more classical models (decision trees, linear regression, support vector machines, etc) as opposed to neural networks. Because of the complexity of even basic feedforward neural networks²⁷²⁷27In a standard fully-connected feedforward network with architecture defined by layers $L_{0}, . . . ., L_{f}$ , with $L_{0}$ being the input layer and $L_{f}$ being the output layer size, and corresponding depths $d_{0}, . . ., d_{f}$ has $\sum_{i = 1}^{f} (d_{i} d_{i - 1} + d_{i})$ parameters, excluding any hyperparameters associated with initialization of regularization. For common input vector sizes and just a single hidden layer, this can be quite a lot of parameters., automatic generation of a network for a specific task presents its own set of unique algorithmic challenges, not least of which is the increased time costs associated with training candidate models. In the following we cover a few open source packages and their underlying algorithms for tackling this complex problem which is sometimes known as AutoDL (e.g. [164]). While some of these packages present users with “only” neural networks as candidate models, universal approximation theorems (e.g. [34]) ensure the profound robustness of the resulting space. We refer the reader to [58] for more thorough coverage of the complexities involved with neural networks more generally.

AutoKeras.

In the article [74] the authors introduce an algorithm for neural architecture search (NAS, see [44]) to automatically optimize the architecture of a neural network as well as introduce the AutoKeras package implementing this algorithm [13].

The particular approach to the NAS problem taken by AutoKeras relies on neural morphisms (see, e.g. [29]), transformations between networks that preserve functionality, as an essential part in the Bayesian optimization. The search space in the optimization routine is effectively a tree whose edges are morphisms between nodes representing architectures. In order to use Bayesian optimization, the authors circumvent the need for Euclidean representation for the underlying Gaussian Process and instead produce a kernel directly, obtained from an embedding of an approximation of the edit-distance between two proposed networks (see [163]). The acquisition function optimization is handled by a bespoke version of $A^{*}$ search (see [133]), returning a path through the tree to obtain an architecture (i.e. a sequence of morphisms to apply). An additional challenge was enforcing shape consistency constraints.

AdaNet

In [33] an algorithm for NAS is presented, whose implementation, AdaNet (see [2]) uses, as implied by its name, an adaptive technique for generating new architectures from previous ones. The architecture space includes any neural network that can be represented as a directed acyclic graph (DAG), and thus includes skip-connections and more exotic layouts.

The main idea behind the AdaNet algorithm is to iteratively modify the architecture in a way that optimizes a very specially designed objective. To wit, an objective is used that consists of a convex surrogate loss with penalization on complexity (specifically, Rademacher complexity²⁸²⁸28Given a data sample $D = {x_{i} = X (ω_{i})}_{i}$ , $g$ in a function class $G$ , and ${σ_{i}}$ , ${- 1, + 1} -$ valued identically distributed Rademacher variables, i.e. $P (σ_{i} = 1) = P (σ_{i} = - 1)$ , for all $i$ , the Rademacher complexity of the class $G$ is defined by

where the expectations are over the Rademacher distribution and the data generating distribution. The Rademacher complexity measures the ability of a function class to model noise as it is the most expected correlations with noise a given function class should be able to obtain. In other words, it captures how robust the function class is to random labelling of data in a binary classification problem. See, e.g.,[118] for a fuller description. ). At each iteration of the algorithm, candidate weak learners (potential subnetworks) are generated by a module, and the network is augmented by a candidate if it causes a reduction in the objective. In this way, the final network evolves from the accumulation of multiple weak learners, with augmentations happening in a way that allows new candidates to learn from features generated by prior iterations.

Auto-PyTorch

In [96] the authors introduce a system, Auto-Net, instantiated in two packages, Auto-Net 1.0, a SMAC-based solution to (4) based on auto-WEKA and Auto-Net 2.0, or Auto-PyTorch (see [10]), which uses a combination of Bayesian optimization and the bandit-based HyperBand (HB, see [47, 79]) algorithm, known as BOHB [134]. Auto-PyTorch works with four different neural network types; multi-layer perceptrons with dropout [143], residual networks [162], in both shaped and nonshaped versions, with different learning schedulers and optimizers available.

BOHB uses kernel density estimation to estimate opportunistic regions in the algorithmic conifiguration space (both architectural and associated hyperparameters) and has the advantage of being easily parallelizable. It works on so-called multi-fidelity optimization problems in which multiple budget (e.g. runtime or epochs) granularities can be explored with performance-based promotion to increased budget size. Smaller-budget configurations then help guide the search and using sampling from the kernel density estimator promotes variety in the resultant configurations.

In a followup article, [164], a refinement and specialization of Auto-PyTorch, Auto-PyTorch Tabular is introduced, focussing specifically on structured data most relevant in traditional data science applications. The resulting algorithm includes a similar ensembling strategy as used by auto-sklearn, as discussed in Section 3.1.3, with the additional ability of being able to incorporate other classes of models into the ensembler. As well, warmstart for the BOHB is provided by configurations selected for performance on meta-training sets by a portfolio optimization strategy similar to that used in auto-sklearn PoSH.

4 Remaining challenges

While there has been tremendous progress made in recent years on increasing automation into the general data science workflow, there still remain challenges that are not, in our opinion, adequately solved by existing open source libraries. We highlight a few areas where we feel progress can still be made that would result in significant gains for the data science community.

Problem framing.

As noted in Section 1 the three automation steps (autoPE, autoFE, and autoME) presuppose a (precise) formulation of a given prediction problem. Not all problems lend themselves to unambiguously specified formulations. We consider here two very simple examples:

First, we consider customer churn (see, e.g. [31]). The time period over which a customer may be considered lost or retained will vary considerably on a case by case basis. As well, segmentation of customers will often be an important step needed to have a reliably meaningful churn rate (see [110] for this and a discussion other problems with simple churn rates). Both the time period involved and the segmentation are, essentially, determined by human judgment and/or business logic and therefore not easily automated.
Consider next the issue of predicting promotions within an organization. In a large enough organization, it may not be a straightforward task to determine what constitutes a promotion. Suppose the following holds: department $X$ has titles Entry, Junior, Senior and Executive while department $Y$ has the titles Entry, Associate, Lead and Executive. All pay bands are higher (and non-overlapping) in department $Y$ than their counterparts in department $X$ due to geographic or product-focussed reasons²⁹²⁹29The human resources department may not make such distinctions and just assign the levels to numbers 1–4 or similar. Whether they do this or not, both raise additional potential complexities.. Bob in department $X$ at Junior level is internally transferred (as a result of merit) to an Associate position in department $Y$ while Alice moves from a Lead role in department $Y$ to an Executive role in department $X$ (resulting in a pay decrease). Also, Carol left department $X$ in a Senior role in January only to be rehired as an Executive in $X$ in February. The precise logic as to whether Alice, Bob, and Carol were or were not promoted does not seem separable from specific business knowledge and therefore not easily automated.

As these basic examples illustrate, even very simply-stated prediction problems necessarily entail value judgments, business or domain knowledge, and sometimes arbitrary decisions to formulate in precise terms. These preclude immediate progress on automating this first step in the data science problem lifecycle.

Early stages.

As discussed in Section 1 there are stages close to the data ingestion stage where data scientists often spend significant time for which automated open source tools do not seem readily available. These include class balancing by sampling or synthetic generation (e.g. SMOTE [104]), more advanced methods of imputation of missing values, detection and/or removal of outliers, stratification of the observations, detection of drift. Some of these tools may be automated and embedded in a larger ingestion or cleansing pipeline upstream of automated data science predictions, however this is often not the case and data scientists may spend significant time on several of these steps within a given project. While some of the AutoML tools we’ve investigated address some of these issues (most handle simple univariate imputation and normalization, for instance) none seem to address the more sophisticated techniques in the data scientist’s general arsenal for tackling these preliminary preprocessing tasks.

Scalability.

Machine learning models can form the primary analytical engine of data science prediction pipelines, however, they are still, in many cases, a small part of the overall system. How well a machine learning model fits into the existing production infrastructure is a critical factor in determining adoption (see [156] for a nice summary of the issues involved). This may involve some of the following considerations: latency, built-on data-checks and data monitoring, expected computation resources, uncomplicated compatibility with data ingestion protocols, ease of incorporation into A/B testing paradigms, flexibility of model types and objectives used, native parallelism and/or ability to work easily in a distributed compute environment, transparency and interpretability of predictions (see [72] for more on this topic). Integrating more seamlessly into existing technological pipelines can be, depending on the context, as important as predictive performance.

While the preceding issues are critical for any automated prediction engine, they are particularly important in the open source libraries considered in the article as there do exist enterprise-level AutoML platforms devoted entirely to helping address various of these concerns and thereby increasing the odds of AutoML making it into a deployed solution. While these closed source solutions are not part of our focus here, some common ones are [7, 36, 17, 68] and [6].

Usability.

Many of the packages explored in the preceding are low-code (e.g. PyCaret, MLBox, and EvalML are particularly good on this front), however, most generally do require some technical expertise on the part of the user. Installation is not always a straightforward task, depending on the platform and package, this being particularly true with Auto-WEKA, h2o, and auto-sklearn. As discussed in [87], many AutoML search spaces may not be adequate for many real-world applications.

Conversely, more advanced users may wish to have more fine-grained control over some aspects of the automation than are offered by the package. The AutoML packages discussed in the preceding offer users differing degrees of control over parameters that may users may wish to adjust or maintain.

Reusability and modularity.

Most of the AutoML systems discussed in this survey exhibit limitations in terms of their general reuse. For example, expert or domain knowledge can be a crucial part of feature engineering and, as a consequence, automating the feature generation and selection process may preclude pipeline reuse being effective for a wider variety of data science contexts. An additional potential impediment for having more robust systems is the lack of modularity of specific pipeline components. Isolating the feature selection stage from one automated system, say, might require running the entire AutoML process, which may be prohibitive in addition to unnecessary.

Rigidity.

Lastly, data scientists often work on a variety of different use cases. In some cases, they handle data in multiple forms, work on varying timelines, face a range of reporting requirements, and even program in different programming languages and environments. Many times, this variety takes place within a single project. Because of this³⁰³⁰30And, surely, other unnamed reasons., data scientists often adopt idiosyncratic working styles, developed based on subjective criteria such as familiarity, adoption by a larger team, perceived learning curve and time investment to replicate existing functionality, lack of organizational investment in or commitment to promoting growth (see, e.g. [54]), and other forms of psychological inertia or rationalization³¹³¹31The rationalization may or may not be justified. It may well be the case that performance, for instance, will be improved with more human involvement and less automation, on any given project. Whether the performance gains justify the extra costs is highly context-specific.. Done right, data science reinforces deep scientific skepticism (about new datasets, new trends, new models, new deadlines, new technology, etc) and can foster caution about new, untested adjustments to existing patterns. Automation plays directly against these forces by removing the control and specificity provided by carefully designed or optimized workflows developed through hard-won experience and continuous iteration.

5 Acknowledgments

I am indebted to the helpful suggestions and improvements offered by Dávid Parázs, Thiam Lee, and Hao Chang during their careful review of an earlier draft of this report. Figure 1 was created by Tatiana Avaeva.

References

[1] Aapo Hyvarinen and Erkki Oja (2000) Independent component analysis: algorithms and applications. Neural Networks 13 (4-5). Cited by: §2.2.1, footnote 8.
[2] AdaNet documentation. Note: https://adanet.readthedocs.io/en/v0.9.0/ Cited by: §3.4.
[3] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D. C. Maddix, S. Rangapuram, D. Salinas, J. Schulz, L. Stella, A. C. Türkmen, and Y. Wang (2020) GluonTS: Probabilistic and Neural Time Series Modeling in Python. Journal of Machine Learning Research 21 (116), pp. 1–6. External Links: Link Cited by: §3.3.
[4] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D. C. Maddix, S. S. Rangapuram, D. Salinas, J. Schulz, L. Stella, A. C. Türkmen, and Y. Wang (2019) GluonTS: probabilistic time series models in python. CoRR abs/1906.05264. External Links: 1906.05264, Link Cited by: §3.3, footnote 26.
[5] AlphaD3M GitLab repository. Note: https://gitlab.com/ViDA-NYU/d3m/alphad3m Cited by: §3.2.
[6] Alteryx machine learning platform. Note: https://www.alteryx.com/products/alteryx-machine-learning Cited by: §4.
[7] Amazon SageMaker. Note: https://aws.amazon.com/sagemaker/ Cited by: §4.
[8] An introduction to explainable AI with Shapley values. Note: https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html Cited by: §2.2.2.
[9] Andrew Ng and Michael I. Jordan (2002) On discriminative vs. generative classi ers: A comparison of logistic regression and naive Bayes.. T. G. Dietterich, S. Becker, and Z. Ghahramani (Ed.), Vol. 14, pp. 841–848. Cited by: footnote 26.
[10] Auto-PyTorch GitHub repository. Note: https://github.com/automl/Auto-PyTorch Cited by: §3.4.
[11] auto-sklearn GitHub repository. Note: https://github.com/automl/auto-sklearn Cited by: §3.1.3, §3.2.
[12] AutoGluon GitHub repository. Note: https://github.com/awslabs/autogluon Cited by: §3.2.
[13] AutoKeras API docs. Note: https://autokeras.com/ Cited by: §3.4.
[14] Automatic Machine Learning Website. Note: https://mlpapers.org/automl/ Cited by: §1.
[15] AutoTS GitHub repository. Note: https://github.com/winedarksea/AutoTS Cited by: §3.3, footnote 25.
[16] Azur MJ, Stuart EA, Frangakis C, Leaf PJ. (2011) Multiple imputation by chained equations: what is it and how does it work?. J Methods Psychiatr Res 20 (1), pp. 40–49. Cited by: §1.1.
[17] Azure Automated machinje learning. Note: https://azure.microsoft.com/en-ca/services/machine-learning/automatedml/#features Cited by: §4.
[18] Bergstra, J., Komer, B., Eliasmith, C., Yamins, D., and Cox, D. D. (2013) Hyperopt: A Python library for model selection and hyperparameter optimization. In Proc of the 12th Python in Science Conf, Cited by: §3.1.4.
[19] Bergstra, James and Bardenet, Rémi and Bengio, Yoshua and Kégl, Balázs (2011) Algorithms for Hyper-Parameter Optimization. In Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), Vol. 24. External Links: Link Cited by: 4th item, §A.3, §A.4, §A.4.
[20] Bergstra, James and Bengio, Yoshua (2012) Random Search for Hyper-Parameter Optimization. Journal of Machine Learning Research 13, pp. 281–305. Cited by: §3.1.4, §3.1.4.
[21] BoostARoota GitHub repository. Note: https://github.com/chasedehan/BoostARoota Cited by: §2.2.2.
[22] BorutaPy GitHub repository. Note: https://github.com/scikit-learn-contrib/boruta_py Cited by: §2.2.2.
[23] S. Borzsony, D. Kossmann, and K. Stocker (2001) The Skyline operator. In Proceedings 17th International Conference on Data Engineering, pp. 421–430. External Links: Document Cited by: footnote 9.
[24] L. Breiman (2001) Random Forests. Machine Learning (45), pp. 5–32. Cited by: §2.1.2, §2.2.2.
[25] E. Brochu, V. M. Cora, and N. de Freitas (2010) A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. External Links: 1012.2599 Cited by: §A.3, Appendix A.
[26] C. Cardie (1993) Using decision trees to improve case-based learning. In In Proceedings of the Tenth International Conference on Machine Learning, pp. 25–32. Cited by: §2.2.1.
[27] Chao Yao, Ya-Feng Liu, Bo Jiang, Jungong Han, Junwei Han (2017) LLE Score: A New Filter-Based Unsupervised Feature Selection Method Based on Nonlinear Manifold Embedding and Its Application to Image Recognition. IEEE Trans Image Process 26 (11), pp. 5257–5269. Cited by: §2.2.1.
[28] B. Chen, J. Hong, and Y. Wang (1997) The minimum feature subset selection problem. Journal of Computer Science and Technology 12 (2), pp. 145–153. External Links: Document, ISBN 1860-4749, Link Cited by: item 2.
[29] T. Chen, I. Goodfellow, and J. Shlens (2016) Net2Net: Accelerating Learning via Knowledge Transfer. External Links: 1511.05641 Cited by: §3.4.
[30] M. Christ, A. Kempa-Liehr, and M. Feindt (2016/10/24) Distributed and parallel time series feature extraction for industrial big data applications. Cited by: §2.2.1.
[31] Chrun rate at investopedia. Note: https://www.investopedia.com/terms/c/churnrate.asp Cited by: 1st item.
[32] Compose GitHub repository. Note: https://github.com/alteryx/compose Cited by: §1.2.
[33] C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang (2017-06–11 Aug) AdaNet: adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 874–883. External Links: Link Cited by: §3.4.
[34] G. Cybenko (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314. External Links: Document, ISBN 1435-568X, Link Cited by: §3.4.
[35] Data-Driven Discovery of Models. Note: https://datadrivendiscovery.org/ Cited by: §3.2.
[36] Datarobot Automated Machine Learning. Note: https://www.datarobot.com/platform/automated-machine-learning/ Cited by: §4.
[37] DEAP API. Note: https://deap.readthedocs.io/en/master/ Cited by: §2.1.2.
[38] K. Deb (2002) A fast elitist non-dominated sorting genetic algorithm for multi-objective optimization: NSGA-II. IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION 6 (2). Cited by: §2.2.1.
[39] K. V. der Blom, A. Serban, H. Hoos, and J. Visser (2021) AutoML Adoption in ML Software. In 8th ICML Workshop on Automated Machine Learning (AutoML), External Links: Link Cited by: §1.
[40] dmlc XGBoost Documentation. Note: https://xgboost.readthedocs.io/en/latest/ Cited by: §2.2.2.
[41] I. Drori, Y. Krishnamurthy, R. Lourenco, R. Rampin, K. Cho, C. Silva, and J. Freire (2019) Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar. External Links: 1905.10345 Cited by: §3.2.
[42] I. Drori, Y. Krishnamurthy, R. Rampin, R. de Paula Lourenco, J. P. Ono, K. Cho, C. Silva, and J. Freire (2021) AlphaD3M: Machine Learning Pipeline Synthesis. External Links: 2111.02508 Cited by: §3.2.
[43] Eggensperger, K., Feurer, M., Hutter, F., Bergstra, J., Snoek, J., Hoos, H., Leyton-Brown, K (2013) Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. NIPS Workshop on Bayesian Optimization in Theory and Practice (BayesOpt’13). Cited by: §A.4.
[44] Elsken, T., Metzen, J. H., and Hutter, F (2019) Neural architecture search: A survey. Journal of Machine Learning Research 20 (55), pp. 1–21. Cited by: §3.4.
[45] EvalML API docs. Note: https://evalml.alteryx.com/en/stable/index.html Cited by: §3.2.
[46] F. Hutter, H. Hoos, and K. Leyton-Brown. (2011) Sequential model-based optimization for general algorithm configuration. In Proc. of LION-5, pp. 507–523. Cited by: §A.4, §3.1.3.
[47] Falkner, S., Klein, A., Hutter, F (2017) Combining hyperband and bayesian optimization.. In: NIPS 2017 Bayesian Optimization Workshop. Cited by: §3.4.
[48] Feature selection on Wikipedia. Note: https://en.wikipedia.org/wiki/Feature_selection Cited by: §2.2.
[49] M. Feurer, K. Eggensperger, S. Falkner, M. Lindauer, and F. Hutter (2021) Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. External Links: 2007.04074 Cited by: §3.1.3, §3.1.3, §3.2, §3.2.
[50] Feurer, Matthias and Klein, Aaron and Eggensperger, Katharina and Springenberg, Jost and Blum, Manuel and Hutter, Frank (2015) Efficient and Robust Automated Machine Learning. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. External Links: Link Cited by: §3.1.3.
[51] FLAML API docs. Note: https://microsoft.github.io/FLAML/ Cited by: §3.1.4.
[52] FLAML GitHub repository. Note: https://github.com/microsoft/FLAML Cited by: §3.1.4.
[53] Frank Hutter, Lars Kotthoff, Joaquin Vanschioren (Ed.) (2019) Automated Machine Learning: Methods, Systems, Challenges. Springer Series on Challenges in Machine Learning, Springer. Cited by: §3.
[54] Frank-Jurgen Richter and Gunjan Sinha (2020) Why Do Your Employees Resist New Tech?. Harvard Business Review. Note: https://hbr.org/2020/08/why-do-your-employees-resist-new-tech Cited by: §4.
[55] S. Funk (2006-12) Netflix update: try this at home. Note: https://sifter.org/simon/journal/20061211.html Cited by: §3.2.
[56] K. P. George H. John (1994) Irrelevant Features and the Subset Selection Problem. In Proceedings of the Eleventh International Conference, Rutgers University, Machine Learning Proceedings, pp. 121–129. Cited by: item 2, §2.2.1.
[57] GluonTS GitHub repository. Note: https://github.com/awslabs/gluon-ts Cited by: §3.3.
[58] I. J. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press, Cambridge, MA, USA. Note: http://www.deeplearningbook.org Cited by: §3.3, §3.4.
[59] Grandchamp, Enguerran and Abadi, Mohamed and Alata, Olivier (2016-01) A Pareto Front Approach for Feature Selection. pp. 334–342. External Links: Document Cited by: §2.2.1, §2.2.2, §2.2.
[60] H. Vafai and K. De Jong, (1992) Genetic algorithms as a tool for feature selection in machine learning. Proceedings 4th International Conference on Tools with Artificial Intelligence. Cited by: §2.2.2.
[61] M. A. Hall (1998) Correlation-based Feature Subset Selection for Machine Learning. Ph.D. Thesis, University of Waikato, Hamilton, New Zealand. Cited by: §2.2.1, §2.2.1, footnote 13.
[62] Hastie, T. Tibshirani, R (1987) Generalized Additive Models: Some APplications”. Journal of the Americal Statistical Association. Cited by: §3.3.
[63] HDBSCAN docs. Note: https://hdbscan.readthedocs.io/en/latest/index.html Cited by: §2.1.1.
[64] J. Hernandez Introducing Compose for Prediction Engineering. Note: https://innovation.alteryx.com/introducing-compose-for-prediction-engineering/ Cited by: §1.2.
[65] F. Horn, R. Pack, and M. Rieger (2020/03/28) The autofeat Python Library for Automated Feature Engineering and Selection. Machine Learning and Knowledge Discovery in Databases, pp. 111–120. External Links: Document, ISBN 978-3-030-43822-7 Cited by: §2.2.2.
[66] Hyperopt GitHub repository. Note: https://github.com/hyperopt/hyperopt Cited by: §3.1.4.
[67] hyperopt-sklearn API docs. Note: https://hyperopt.github.io/hyperopt-sklearn/ Cited by: §3.1.4.
[68] IBM Watson Studio. Note: https://www.ibm.com/cloud/watson-studio/ Cited by: §4.
[69] Isabelle Guyon, André Elissee (2003) An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3, pp. 1157–1182. Cited by: §2.2.1, §2.2.2, §2.2, footnote 11.
[70] Ishwaran H, Malley JD. (2014) Synthetic learning machines. BioData Min. 7 (1). Cited by: §2.1.2.
[71] J. Mockus, V. Tiesis, and A. Zilinskas. (1978) The applicatoin of bayesian methods for seeking the extremum. Towards Global Optimization, Elsevier. Cited by: §A.1.
[72] Jaimie Drozdal, Justin Weisz, Dakuo Wang, Gaurav Dass, Bingsheng Yao, Changruo Zhao, Michael Muller, Lin Ju, Hui Su (2020) Trust in AutoML: Exploring Information Needs for Establishing Trust in Automated Machine Learning Systems. In IUI ’20: Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 297–307. Cited by: §4.
[73] G. James, D. Witten, T. Hastie, and R. Tibshirani (2013) An introduction to statistical learning: with applications in r. Springer. External Links: Link Cited by: footnote 15.
[74] H. Jin, Q. Song, and X. Hu (2019) Auto-keras: an efficient neural architecture search system. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1946–1956. Cited by: §3.4.
[75] J. M. Kanter and K. Veeramachaneni (2015) Deep feature synthesis: Towards automating data science endeavors. In 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 1–10. External Links: Document Cited by: §2.1.2.
[76] J. G. Kohavi (1997) Wrappers for Feature Subset Selection. Artificial Intelligence (97), pp. 273–324. Cited by: §2.2.1, §2.2.2, §2.2, footnote 12.
[77] Korner, Brent, Bergstra, James, and Eliasmith, James (2014) Hyperopt-Sklearn: Automatic Hyperparameter Configuration for Scikit-Learn. In Proc of the 13th Python in science conf., Cited by: §3.
[78] A. Kraskov, H. Stögbauer, and P. Grassberger (2004-06) Estimating mutual information. Phys. Rev. E 69, pp. 066138. External Links: Document, Link Cited by: §2.2.1.
[79] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar (2018) Hyperband: A novel bandit-based approach to hyperparameter optimization.. Journal of Machine Learning Research 18 (185), pp. 1–52. Cited by: §3.4.
[80] R. Lall and T. Robinson (2021) The MIDAS Touch: Accurate and Scalable Missing-Data Imputation with Deep Learning. Political Analysis. Cited by: §1.1.
[81] Lars Kotthof, Chris Thornton, Holger H. Hoos, Frank Hutter, Kevin Leyton-Brown (2016) Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 17, pp. 1–5. Cited by: §3.
[82] Lars Kotthoff, Chris Thornton, Holger Hoos, Frank Hutter, Kevin Leyton-Brown (2016) Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 17, pp. 1–5. Cited by: §3.1.3.
[83] N. Lawrence and R. Urtasun (2019) Non-linear matrix factorization with Gaussian processes. ICML. Cited by: §3.2.
[84] T. T. Le, W. Fu, and J. H. Moore (2020) Scaling tree-based automated machine learning to biomedical big data with a feature set selector. Bioinformatics 36 (1), pp. 250–256. Cited by: §2.2.2, §3.1.2.
[85] Lei Dong, Syed Khader (2020) OptimalFlow: Omni-Ensemble and Scalable Automated Machine Learning. In GENPACT GVector Aygmented Intelligence Conference, Cited by: §3.1.4.
[86] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, and H. Liu (2018-11) Feature Selection. ACM Computing Surveys 50 (6), pp. 1–45. External Links: Document, ISSN 1557-7341, Link Cited by: §1.
[87] Y. Li, Y. Shen, W. Zhang, J. Jiang, B. Ding, Y. Li, J. Zhou, Z. Yang, W. Wu, C. Zhang, and B. Cui (2021-07) VolcanoML. Proceedings of the VLDB Endowment 14 (11), pp. 2167–2176. External Links: Document, ISSN 2150-8097, Link Cited by: §4.
[88] Linear inversion of band-limited reflection seismograms (1986) Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing. SIAM. 7 (4). Cited by: §2.2.2.
[89] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani (2014) Automatic Construction and Natural-Language Description of Nonparametric Regression Models. External Links: 1402.4304 Cited by: 3rd item.
[90] Y. Lu, I. Cohen, X. Zhou, and Q. Tian (2007/09/29) Feature selection using principal feature analysis. Proceedings of the 15th international conference on Multimedia, MUL-TIMEDIA ’07, pp. 301–304. External Links: Document Cited by: §2.2.1.
[91] Ludmila I. Kuncheva (2007) A stability index for feature selection. Proceedings of the 25th IASTED International Mutii-Conference. Cited by: §2.2.2.
[92] S. M. Lundberg and S. Lee (2017) A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: Link Cited by: §2.2.2.
[93] C. G. Marco Tulio Ribeiro (2016) ”Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1135–1144. Cited by: §3.2.
[94] Maximilian Christ, Nils Braun, Julius Neuffer, Andreas W. Kempa-Liehr (2018) Time Series FeatuRe Extraction on basis of Scalable Hypothesis tests (tsfresh – A Python package). Neurocomputing 307, pp. 72–77. Cited by: §2.1.2.
[95] S. Meisenbacher, M. Turowski, K. Phipps, M. Rätz, D. Müller, V. Hagenmeyer, and R. Mikut (2022) Review of automated time series forecasting pipelines. arXiv. External Links: Document, Link Cited by: §2.1.2, §3.3, footnote 22.
[96] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, M. Urban, M. Burkart, M. Dippel, M. Lindauer, and F. Hutter (2018-12) Towards automatically-tuned deep neural networks. In AutoML: Methods, Sytems, Challenges, F. Hutter, L. Kotthoff, and J. Vanschoren (Eds.), pp. 141–156. Cited by: §3.4.
[97] MIDASpy GitHub repository. Note: https://github.com/MIDASverse/MIDASpy Cited by: §1.1.
[98] W. R. Miron Hursa (2010) Feature Slection with the Boruta Package. Journal of Statistical Software 36 (11). Cited by: §2.2.2.
[99] M. Mitchell (1998) An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, USA. External Links: ISBN 0262631857 Cited by: §2.1.2.
[100] MLBox API docs. Note: https://mlbox.readthedocs.io/en/latest/index.html Cited by: §3.1.4.
[101] MLBox GitHub repository. Note: https://github.com/AxeldeRomblay/MLBox Cited by: §3.1.4.
[102] J. Mockus (1972) On a bayesian method for seeking an extremum. Automatika i vychislitelnaja tekhnika. Cited by: §A.1.
[103] S. Mutalib, S. Satari, and W. Yusoff (2019-11) A new robust estimator to detect outliers for multivariate data. Journal of Physics: Conference Series 1366, pp. 012104. External Links: Document Cited by: §2.2.1.
[104] N. V. Chawla, K. W. Bowyer, L. O.Hall, W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, pp. 321–357. Cited by: §4.
[105] Nick Erickson and Jonas Mueller and Alexander Shirkov and Hang Zhang and Pedro Larroy and Mu Li and Alexander Smola (2020) AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. External Links: 2003.06505 Cited by: §3.2.
[106] M. E. Nicolo Fusi (2018) Probabilistic matrix factorization for automated machine learning. NeurIPS. Cited by: §3.2.
[107] Nilsson R, Peña J, Bjorkegren J, Tegnér J (2007) Consistent Feature Selection for Pattern Recognition in Polynomial Time. Journal of Machine Learning Research. Cited by: footnote 6.
[108] V. Ninja Feature selection using genetic algorithm (DEAP package) in Python. An approach used for solving Kaggle Earthquake Prediction Challenge.. Note: https://viktorsapozhok.github.io/deap-genetic-algorithm/ External Links: Link Cited by: §2.2.2.
[109] N. Nnamoko, F. Arshad, D. England, J. Vora, and J. Norman (2014-06) Evaluation of Filter and Wrapper Methods for Feature Selection in Supervised Machine Learning. Vol. The 15th Annual Postgraduate Symposium on the convergence of Telecommunication, Networking and Broadcasting. Cited by: §2.2.
[110] S. H. Noble Defining Churn Rate (no really, this actually requires an entire blog post). Note: https://shopify.engineering/defining-churn-rate-no-really-this-actually-requires-an-entire-blog-post Cited by: 1st item.
[111] Oboe GitHub repository. Note: https://github.com/udellgroup/oboe/tree/master/examples Cited by: §3.2.
[112] Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman (2001) Missing value estimation methods for dna microarrays. Bioinformatics 17 (6), pp. 520–525. Cited by: §1.1.
[113] R. S. Olson, N. Bartley, R. J. Urbanowicz, and J. H. Moore (2016) Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science. In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16, New York, NY, USA, pp. 485–492. External Links: Document, ISBN 978-1-4503-4206-3, Link Cited by: §3.1.2.
[114] R. S. Olson, R. J. Urbanowicz, P. C. Andrews, N. A. Lavender, L. C. Kidd, and J. H. Moore (2016) Applications of Evolutionary Computation: 19th European Conference, EvoApplications 2016, Porto, Portugal, March 30 – April 1, 2016, Proceedings, Part I. G. Squillero and P. Burelli (Eds.), pp. 123–137. External Links: Document, ISBN 978-3-319-31204-0, Link Cited by: §2.2.2, §3.1.2.
[115] OptimalFlow documentation. Note: https://optimal-flow.readthedocs.io/en/latest/index.html Cited by: §3.1.4.
[116] E. Parzen (1962) Estimation of a Probability Density Function and Mode. Annals of Mathematical Statistics 33 (3), pp. 1065–1076. Cited by: footnote 37.
[117] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011) Scikit-learn: Machine learning in python. Journal of Machine Learning Research, pp. 2825–2830. Cited by: §2.2.1.
[118] Peter Bartlett, Shahar Mendelson (2002) Rademacher and Gaussian Complexities: Risk Bounds and Structural Results. Journal of Machine Learning Research 3, pp. 463–482. Cited by: footnote 28.
[119] Peter J. Brockwell and Richard A. Davis (1991) Time Series: Theory and Methods. Springer. Cited by: 2nd item.
[120] Petr Somol, Jana Novovicova and Pavel PudilA. Herout (Ed.) (2010) Efficient Feature Subset Selection and Subset Size Optimization. InTech Online. Cited by: §2.2.2, §2.2.
[121] M. Piernik and T. Morzy (2021) A study on using data clustering for feature extraction to improve the quality of classification. Knowledge and Information Systems 63 (7), pp. 1771–1805. External Links: Document, ISBN 0219-3116, Link Cited by: §2.1.1.
[122] pmf-automl GitHub repository. Note: https://github.com/rsheth80/pmf-automl Cited by: §3.2.
[123] Prophet GitHub repository. Note: https://github.com/facebook/prophet Cited by: §3.3.
[124] pyautoweka GitHub repository. Note: https://github.com/automl/pyautoweka Cited by: §2.2.1, §3.1.3.
[125] PyCaret API docs. Note: https://pycaret.gitbook.io/docs/ Cited by: §3.1.4.
[126] J.R. Quinlan (1986) Induction of Decision Trees. Machine Learning 1, pp. 81–106. Cited by: §2.2.1.
[127] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes (2004) Ensemble selection from libraries of models. In Proc. of ICML’04, Cited by: §3.1.1.
[128] R.C. St. John (1975) D-Optimality for Regression Designs: A Review. Technometrics 17 (1). Cited by: §3.2.
[129] C. Rasmussen and C. Williams (2006) Gaussian Processes for Machine Learning. MIT Press. Cited by: §A.2.
[130] R. Rosipal, M. Girolami, L. J. Trejo, and A. Cichocki (2001) Kernel pca for feature extraction and de-noising in nonlinear regression. Neural Computing & Applications 10 (3), pp. 231–243. External Links: Document, ISBN 1433-3058, Link Cited by: §2.2.1.
[131] B. C. Ross (2014) Mutual Information between Discrete and Continuous Data Sets. PLoS ONE 9 (2). Cited by: §2.2.1.
[132] S. T. Roweis and L. K. Lawrence (2000) Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science (290.5500), pp. 2323–2326. Cited by: §2.2.1.
[133] S. J. Russell and P. Norvig (2009) Artificial intelligence: a modern approach. 3 edition, Pearson. Cited by: §3.4.
[134] S. Falkner, A. Klein, and F. Hutter (2018) BOHB: Robust and Efficient Hyperparameter Optimization at Scale.. Proceedings of the international conference on machine learning (ICML). Cited by: §3.4.
[135] L. K. Saul and S. T. Roweis (2001) An introduction to locally linear embedding. Cited by: §2.2.1.
[136] Schölkopf B., Smola A., Müller KR (1997) Kernel Principal Component Analysis. Lecture Notes in Computer Science 1327. Cited by: §2.2.1.
[137] scikit-learn API. Note: https://scikit-learn.org/stable/modules/feature_selection.html Cited by: §2.2.1.
[138] scikit-optimize API docs. Note: https://scikit-optimize.github.io/stable/ Cited by: §3.1.3.
[139] Seiichi Ozawa and Manabu Kotani (2000) A Study of Feature Extraction and Selection Using Independent Component Analysis. Proc. of Int. Conf. on Neural Information Processing. Cited by: §2.2.1.
[140] G. Sheni, B. Schreck, R. Wedge, J. M. Kanter, and K. Veeramachaneni (2018) Prediction Factory: automated development and collaborative evaluation of predictive models. External Links: 1811.11960 Cited by: Figure 1, §1.1, §1, footnote 1, footnote 2.
[141] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), Vol. 25. External Links: Link Cited by: §A.3, Appendix A.
[142] M. Spivak (1970) A Comprehensive Introduction to Differential Geometry. Vol. 1, Publish of Perish, Inc.. Cited by: §2.2.1.
[143] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.4.
[144] M. N. W. Stefano Nembrini (2018) The revival of the Gini importance?. Bioinformatics 34 (21), pp. 3711–3718. Cited by: §2.2.2.
[145] Stoppiglia H, Dreyfus G, Dubois R, Oussar Y (2003) Ranking a Random Feature for Variable and Feature Selection. Journal of Machine Learning Research 3, pp. 1399–1414. Cited by: §2.2.2.
[146] Syama Sundar Rangapuram, Matthias Seeger, Jan Gasthaus, Lorenzo Stella, Yuyang Wang, and Tim Januschowski. (2018) Deep state space models for time series forecasting. Advances in Neural Information Processing Systems. Cited by: §3.3.
[147] Cited by: §3.3.
[148] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. External Links: 1208.3719 Cited by: §A.4, §A.4, §2.2.1, §3.1.3, §3.
[149] R. Tibshirani (1996) Regression Shrinkage and Selection via the lasso. Journal of the Royal Statistical Society. Series B 58 (1). Cited by: §2.2.2.
[150] TPOT Github repo. Note: https://github.com/EpistasisLab/tpot Cited by: §3.1.2.
[151] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2014-06) OpenML. ACM SIGKDD Explorations Newsletter 15 (2), pp. 49–60. External Links: Document, ISSN 1931-0153, Link Cited by: §3.1.3.
[152] Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin (2017) Attention is all you need. Advances in Neural Information Processing Systems. Cited by: §3.3.
[153] J. M. K. O. G. K. Veeramachaneni (2016) Label, Segment, Featurize: A Cross Domain Framework for Prediction Engineering. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Cited by: §1.2.
[154] C. Wang, Q. Wu, M. Weimer, and E. Zhu (2021) FLAML: a fast and lightweight automl library. In MLSys, Cited by: §3.1.4, footnote 19.
[155] R. Wen, K. Torkkola, B. Narayanaswamy, and D. Madeka (2017) A multi-horizon quantile recurrent forecaster. arXiv. External Links: Document, Link Cited by: §3.3.
[156] Why AutoML Is Not Enough for Scaling AI. Note: https://turintech.ai/insights/why-automl-is-not-enough-for-scaling-ai/ Cited by: §4.
[157] D. H. Wolpert (1992) Stacked generalization. Neural Networks 5 (2), pp. 241–259. Cited by: §3.2.
[158] Wu, Q., Wang, C., and Huang, S. (2021) Frugal optimization for cost-related hyperparameters. AAAI’21. Cited by: §3.1.4.
[159] C. Yang, Y. Akimoto, D. W. Kim, and M. Udell (2019-07) OBOE: Collaborative Filtering for AutoML Model Selectio. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1173–1183. External Links: Document, Link Cited by: §3.2, §3.2, §3.2.
[160] L. Yang and A. Shami (2020-11) On hyperparameter optimization of machine learning algorithms: theory and practice. Neurocomputing 415, pp. 295–316. External Links: Document, ISSN 0925-2312, Link Cited by: §3.
[161] Z. Han, J. Zhao, H. Leung, K. F. Ma, andW.Wang (2019) A review of deep learning models for time series prediction. IEEE Sensors Journel 21 (6), pp. 7833–7848. Cited by: 1st item.
[162] Zagoruyko, S., Komodakis, N (2016) Wide residual networks. CoRR abs/1605.07146. Cited by: §3.4.
[163] Z. Zeng, A. K. H. Tung, J. Wang, J. Feng, and L. Zhou (2009-08) Comparing stars: on approximating graph edit distance. Proc. VLDB Endow. 2 (1), pp. 25–36. External Links: Document, ISSN 2150-8097, Link Cited by: §3.4.
[164] L. Zimmer, M. Lindauer, and F. Hutter (2021) Auto-pytorch tabular: multi-fidelity metalearning for efficient and robust autodl. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 3079 – 3090. Note: also available under https://arxiv.org/abs/2006.13799 Cited by: §3.4, §3.4.
[165] Zöller, Marc-André and Huber, Macro F (2021) Benchmark and Survey of Automated Machine Learning Frameworks. Journal of Artificial Intelligence Research 70, pp. 409–474. Cited by: §1, §3.

Appendix A Bayesian optimization primer

Because Bayesian optimization plays such a large role in several of the automated model engineering packages discussed in this article, we provide the interested reader with a self-contained, though superficial, review of the basics of the approach, together with the specific variants relevant to the automation process considered above. We recommend [141] and [25] for further reading on Bayesian optimization more generally.

a.1 The black box optimization problem

Bayesian optimization is concerned with the classical objective optimization problem

a r g m a x x \in Ω f (x)

where the objective function³²³²32As in equation (4) minimization problems can similarly be cast into this framework by using $- f$ in place of $f$ ., $f : Ω \subset R^{n} \to R$ . Specifically, it was designed to handle cases were the functions fail to have properties expected of more traditional techniques; the objectives may be non-convex, difficult to find closed form expressions for, non-smooth, expensive to evaluate (as will be the case when the inputs are hyperparameters for a machine learning model that needs to be trained), or otherwise unconventional. For this reason, Bayesian optimization is often referred to as a black box optimization technique since it does not require explicit (analytic) expressions of the objective.

To deal with these complexities, in [71, 102] the authors describe a broad class of optimization problems based around viewing the objective function as stochastic; namely, $f (x)$ may be viewed as shorthand for $f (x, ω)$ where $ω$ is in a $σ$ -algebra. The function $f$ then carries with it an associated distribution function

F_{x^{(1)} . . . x^{(m)}} (y_{1}, . . ., y_{m}) ≐ P (f (x^{(1)}) \leq y_{1}, . . ., f (x^{(m)}) \leq y_{m})

induced by observations ${x^{(i)}}_{i = 1}^{m}$ . Defining $z_{k} = \cup_{i \leq k} {x^{(i)}, f (x^{(i)})}$ ³³³³33Strictly speaking, the observations may be noisy, and therefore of the form $\cup_{i \leq k} {x^{(i)}, N (f (x^{(i)}), ξ)}$ for a choice of $ξ \in R$ . to be a set of $k$ observations of the function’s input/output pairs, Bayes’ formula becomes the following

P (f ∣ z_{k}) = \frac{P (z_{k} ∣ f) P (f)}{P (z_{k})} \propto P (z_{k} ∣ f) P (f)

(8)

The left-hand side of equation (8) represents the posterior distribution of the function $f$ , given the $k$ observations in $z_{k}$ . The prior distribution, $P (f)$ , represents initial assumptions about the objective (smoothness, convexity, etc) and influences the likelihood function, $P (z_{k} ∣ f)$ .

a.2 Detour on Gaussian processes

It is common in Bayesian optimization to use a Gaussian process prior distribution for $f$ . A Gaussian process, ${X_{t}}_{t \in T} = {X_{t} (ω)}_{t \in T}$ with $T$ an (possibly multidimensional) index set, is a stochastic process with the property that any finite subset of the process is multivariate normally distributed (see, e.g. [129]). As multivariate normal distributions are fully characterized by their mean vectors and covariance matrices, the same is true of a Gaussian process. Namely, ${X_{t}}_{t \in T}$ represents a Gaussian process if and only if ${X_{t}}_{t \in I}$ , with $# I < \infty$ , satisfies that $(X_{σ (I_{1})}, . . ., X_{σ (I_{# I})}) \sim N (μ, Σ)$ , for some $μ$ and symmetric positive-definite $Σ$ and for all permutations $σ$ on sets with $# I$ elements. Viewing each such vector as ordered samples from a function, it is convenient to say that a Gaussian process $g : R^{n} \to R$ is a function for which any finite sample of its values represents draws from a multivariate normal distribution. In that case, we have that

	$μ (x)$	$= E [g (x)]$
	$k (x, ~ x)$	$= E [(g (x) - μ (x)) (g (~ x) - μ (~ x))] = C o v (g (x), g (~ x))$

define the mean and kernel, $k$ ³⁴³⁴34The kernel function induces spatial correlation on $R^{n}$ via $g$ . , of a Gaussian process, respectively, and write

g \sim G P (μ, k)

A few remarks are in order before concluding this subsection:

If a Gaussian process is restricted to having a finite index set, it follows from the definition that the process is simply a multivariate Gaussian distribution.
Fixing a single spatial point, $x$ , $g \sim G P (μ, k)$ means that $g (x)$ is normally distributed. Therefore, by varying $x$ , we see the assumption of $g \sim G P (μ, k)$ amounts to viewing the Gaussian process as supplying a “random function”, its expected magnitude determined by $μ$ , and its unpredictability determined by $k$ .
Often the kernel functions, $k$ , will have hyperparameters associated with them which may be associated with correlation range or anisotropies. Therefore, writing $k_{θ}$ will make explicit the dependence of the kernel on hyperparameters $θ$ .
Gaussian processes are closed under sampling; if $f \sim G P (0, k)$ then it can be shown that $f ∣ z_{k} \sim G P (μ, ~ k)$ with $μ, ~ k$ having definite analytic expressions (see [19]).

a.3 The Bayesian algorithm

As mentioned, a common Bayesian optimization prior is then to assume that the objective function $f$ is an unknown Gaussian process, namely $f \sim G P (μ, k)$ for given choice of mean and kernel function. Sampling $f$ helps to remove uncertainties in the associated Gaussian Process model via Bayesian updates.

In the following, we will assume $μ = 0$ (a common simplifying assumption, see e.g. [19]) and follow [25] closely, By an abuse of notation, denote $f_{k} = (f (x^{(1)}), . . ., f (x^{(k)}))^{T}$ , then $f_{k} \sim N (0, K)$ with $K_{i j} = k (x^{(i)}, x^{(j)})$ . For a new point, $x^{(k + 1)}$ , we then have

[\begin{matrix} f_{k} f (x^{(k + 1)}) \end{matrix}] \sim N (0, [\begin{matrix} K & k k^{T} & k (x^{(k + 1)}, x^{(k + 1)}) \end{matrix}])

with $k_{i} = k (x^{(k + 1)}, x^{(i)})$ for $i = 1, . . ., k$ . From this, it can be shown that

f (x^{(k + 1)}) ∣ z_{k}, x^{(k + 1)} \sim N (k^{T} K^{- 1} f_{k}, k (x^{(k + 1)}, x^{(k + 1)}) - k^{T} K^{- 1} k)

So, once the kernel function is chosen, we have an explicit formula for the posterior distribution of $f$ with explicit predictive mean and covariance functions dependent upon prior observations and the kernel function.

In addition to modelling $f$ as a random process, the Bayesian optimization paradigm uses an acquisition function (decision function) $a : Ω \to R$ to propose new inputs to sample. The job of the acquisition function is to balance exploitation (pursuit of regions where the function is likely to be optimal) and exploration (generating draws in regions where $f$ may have a high degree of uncertainty) as well as to keep down the total number of evaluations of $f$ .

Once an acquisition function is selected, subsequent points are proposed via

x_{k + 1} = a r g m a x x a (x ∣ z_{k}, θ)

where, above, we use the notation $a (x ∣ z_{k}, θ)$ to denote the dependence of the acquisition function on the already sampled points and values together with any hyperparameters in the Gaussian Process representing $f$ (see [141]). When the Gaussian Process has a normally distributed posterior, popular acquisition functions have explicit dependence on the mean and variance of the posterior.

Two acquisition functions worth mentioning are the expected improvement and confidence bound. If is the input for the best observed $f$ value thus far, then the improvement is³⁵³⁵35In the context of the CASH problem (4) the improvement is $max (0, c (λ_{-}) - c (λ)))$ where $c$ is a given error rate dependent on conditional hierarchical hyperparameters $λ$ and $λ_{-}$ is the best performing currently observed hyperparameter selection.

I (x) = max (0, f^{(k + 1)} (x) - f (x_{+}))

where $f^{(k + 1)} (x)$ is shorthand for $f (x^{(k + 1)}) ∣ z_{k}, x^{(k + 1)}$ , i.e. the current posterior for $f$ . While this has an explicit dependence on $f (x)$ , which we wish to avoid having to evaluate, the expected improvement acquisition function given by

a_{E I} (x) = E_{f ∣ x} I (x)

where the expectation is over the distribution of $f ∣ z_{k}$ , namely the current Gaussian process posterior. This is expressible purely in terms of the predictive mean and variance of $f^{(k + 1)}$ and $f (x_{+})$ . The upper confidence bound acquisition function is given by

a_{U C B} (x ∣ z_{k}, θ) = μ (x ∣ z_{k}, θ) + κ σ (x ∣ z_{k}, θ)

where $μ, σ$ are the predictive distribution’s mean and standard deviation, respectively. In each of these acquisition functions, optimization to select the next point at which to sample $f$ is fairly straightforward.

Putting the above considerations together, to perform a Gaussian process-based Bayesian optimization one repeatedly optimizes the acquisition function to generate new points to sample, then evaluates $f$ on the new point³⁶³⁶36Of course, in the machine learning hyperparameter context, the evaluation of $f$ may require training a model from scratch. , then updates the posterior of the associated Gaussian Process model for $f$ to obtain newer predictive mean and variance. This process repeats until a suitable optima is located within acceptable tolerances.

a.4 Sequential Model-Based Optimizations

Given a dataset, let $c = c (λ)$ denote the loss function for a given machine learning model in terms of hyperparameters $λ$ . We note that $λ$ , as described in Section 3 is hierarchical and conditional, and for this reason is tree-like. Sequential Model-Based Optimization (SMBO) is a general optimization paradigm, similar to the Bayesian optimization approach described above, wherein the a surrogate to the objective function $f$ is optimized to find a proposed new point $x^{(n e w)}$ to evaluate, upon which a new model for $f$ is constructed (see [19]). For hyperparameter optimization $c$ plays the role of $f$ and $λ$ plays the role of $x$ . We briefly discuss two approaches to the SMBO problem.

Sequential Model-Based Algorithm Configuration (SMAC, see [46]) is an approach to SMBO using a variety of models for $c ∣ λ$ , including Gaussian processes and random forests. In [148] the authors explore a SMAC paradigm using random forests. Essentially, they model

c ∣ λ \sim N (μ_{λ}, σ_{λ}^{2})

with $μ_{λ}$ and $σ_{λ}^{2}$ determined by predictive average and variance across individual trees in a random forest. Under the assumption of normality for $c ∣ λ$ , SMAC offers a closed form expression for the expected improvement (see previous section) acquisition function. Additionally, SMAC enforces robustness by keeping alternating between hyperparameter configurations $λ^{(n e w)}$ suggested from the acquisition function, and those selected at random, in keeping with an exploitation/exploration paradigm.

A separate approach, discussed In [19, 43] and [148], is using Tree-structured Parzen Estimators (TPEs) to model the variables $λ ∣ c$ and $λ$ to obtain $c ∣ λ$ . In this approach

where $c^{*}$ is a quantile-based loss threshold and $1$ is an indicator variable and $ℓ, g$ are empirical densities constructed on the associated observations. The density $ℓ$ , intuitively, should capture a distribution of well-performing hyperparameter values. The surrogate expected improvement function then is proportional to $\frac{1}{γ + \frac{g (λ)}{ℓ (λ)} (1 - γ)}$ , where $γ$ is the quantile defining $c^{*}$ , thus leading to a simple optimization for the surrogate.

One-dimensional Parzen estimators³⁷³⁷37Parzen estimators are a standard technique for kernel density estimation. See [116]. are then used on to estimate the density of each component of $λ$ during a tree traversal along paths whose nodes consist of active hyperparameters (i.e. components of $λ$ ). The resultant one-dimensional estimates can then be combined to yield estimates of the multivariate densities $ℓ (λ)$ and $g (λ)$ , from which $c ∣ λ$ can be estimated.