An Empirical Study on the Usage of Automated Machine Learning Tools

Forough Majidi forough.majidi@polymtl.ca Polytechnique MontréalMontrealCanada , Moses Openja openja.moses@polymtl.ca Polytechnique Montréal1 Thørväld CircleMontrealCanada , Foutse Khomh foutse.khomh@polymtl.ca Polytechnique Montréal1 Thørväld CircleMontrealCanada and Heng Li heng.li@polymtl.ca Polytechnique Montréal1 Thørväld CircleMontrealCanada

2018

Abstract.

^†^†copyright: acmcopyright^†^†journalyear: 2018^†^†doi: XXXXXXX.XXXXXXX^†^†conference: Make sure to enter the correct conference title from your rights confirmation emai; June 03–05, 2018; Woodstock, NY^†^†price: 15.00^†^†isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Due to the growth of available data, the usage of machine learning (ML) has increased dramatically over the past few years and a lot of business communities and researchers are using ML to analyze their data and reach their goals (elshawi2019:automated). ML practitioners usually have to follow a set of repetitive tasks including data collection and data preprocessing, feature engineering, hyperparameter optimization, model training, evaluating the performance of the resulting models, and deploying and monitoring the models in the production environment. As a result, a new field of automated machine learning (AutoML) has emerged over the past few years to automate and reduce the effort and time consumed by these repetitive tasks (truong2019:towards), such as model training/ retraining and hyperparameter optimization in the ML pipeline. For example, Tpot is an AutoML tool that uses genetic algorithm to optimize the ML pipeline. In addition, a lot of AutoML tools such as Autosklearn (feurer-neurips15a; feurer-arxiv20a), Autokeras (jin2019auto), Snorkel (snorkel:2021), and Optuna (akiba2019:optuna) have been introduced for automating model management, data labeling, and optimizing hyperparameters.

The popularity of AutoML tools has attracted the attention of researchers (xin2021:whither; wang2021:much; drozdal2020:trust). For instance, Xin et al (xin2021:whither) conducted a qualitative study to understand the use of AutoML tools in practice by examining users’ experience when using AutoML tools, the integration of AutoML features in ML workflows and the challenges faced when using the features of the AutoML tools in ML workflow. They suggested that AutoML tools should support the interactive user interface design to explore the intermediate execution (i.e., “human-in-loop”). Also, through a qualitative study, Wang et al. (wang2021:much) studied how much automation is required by data scientists and reported that designing a “human-in-the-loop” AutoML tool is better than a tool that Completely automates the ML works (wang2021:much). Drozdal et al. (drozdal2020:trust) studied what impacts the trust in the ML models built from using AutoML tools, through qualitative studies. They found that the non-functional requirements such as model performance metrics, visualization, and transparency increase the confidence of data scientists in AutoML tools.

Although prior works performed qualitative studies to understand ML practitioners’ experiences of using AutoML tools, it remains unclear how real-world projects use AutoML tools in their ML pipelines. Our study attempts to complement the existing studies through large-scale analysis of real-world open-source projects that use AutoML tools, aiming to get new empirical evidence and insights on the use of AutoML tools. Specifically, we extracted the information of the projects that use AutoML tools from GitHub and followed a mixture of both qualitative and quantitative analyses. We organize our study through the following research questions (RQs):

$∙$ RQ1: What are the most used AutoML tools? We examined the usage of AutoML tools in GitHub projects and we identified the top 10 most used AutoML tools: Optuna, HyperOpt, Skopt, Featuretools, Tpot, Bayes_opt, Autokeras, Auto-sklearn, AX, and Snorkel. Then, we analyzed the characteristics of the projects that use these tools, including their development history and popularity. ML practitioners can learn from our findings about which AutoML tools are mostly used and their characteristics when searching for the right tools for their ML pipeline. Besides, our findings can help AutoML tool developers gain insights into the projects that use their tools, which could allow them to make better decisions to further improve their tools.

$∙$ RQ2: How do ML practitioners use AutoML tools? Through manual analysis, we examined the purposes of using AutoML tools and the stages of the ML pipeline where they are used. We observed that AutoML tools are used mainly during the Hyperparameter optimization, Model training, and Model evaluation stages of the ML pipeline. We also observed 10 main purposes of using AutoML tools, such as Hyperparameter optimization, Model management, Data management, and Visualization. ML practitioners can save time and effort by using AutoML tools to automate their ML pipeline tasks (truong2019:towards) for similar purposes in similar stages of ML pipeline.

$∙$ RQ3: Are different AutoML tools used together? We analyzed the source code of the projects using AutoML tools to examine whether AutoML tools are used together in the same source code files. We found that, among the top 10 AutoML tools, only a few of them were used together, including Hyperopt being used together with other tools, such as Tpot, Skopt, Optuna. Our findings can help ML practitioners with the needs for heterogeneous AutoML features choose the combination of tools that are mostly used together by other ML practitioners. In addition, developers of AutoML tools can leverage our findings to provide APIs to make the collaboration between their tool and other co-used tools easier.

Organization: The rest of the paper is organized as follows. In Section \rom2, the background about the AutoML tools and ML pipeline is provided. In Section \rom3, we provided a summary of the important related works. In Section \rom4, the experimental setup of the study is explained. The results of the research questions are presented in Section \rom5. A discussion of these results is presented in Section \rom6. Section \rom7 discusses threats to the validity of our study. In Section \rom8, we conclude the paper and outline some avenues for future works.

2. Background

AutoML tools are used to automate one or more stages (e.g., hyperparameter optimization) in the ML pipeline. This section provides background information on a typical ML pipeline and AutoML tools. In this article, the word “ML practitioner” refers to anyone who uses AutoML or ML in their project or work. An ML practitioner can be a developer, a researcher, a data scientist, or an ML engineer.

2.1. Machine learning pipeline

Figure 1 shows the nine stages of a typical ML pipeline for building and deploying ML software systems used at Microsoft, described by Amershi et al. (amershi2019:software). We consider it as a reference ML pipeline in our study. Similar ML pipelines are also adopted by other commercial companies, such as Google (MLOps:Google:2021), or IBM (IBM:MLOps:20210). In the following, we describe the different stages of the ML pipeline mentioned in (amershi2019:software).

The model requirements stage is where ML practitioners select the type of ML model that fits their problem and products. In data collection stage, ML practitioners identify the data from different sources to collect, such as database systems or file storage. The stage of data cleaning includes cleaning the data to remove any anomalies that would likely hinder the training of the ML model. The stage of data labeling refers to assigning the proper labels to the data, such as assigning labels to a set of images reflecting the name of the objects present in the image.

In the stage of feature engineering, ML practitioners prepare (e.g., transforming or normalizing the features/data) and validate the features that should be used in the model training phase. The model training phase refers to training the ML models using the prepared features and the different implemented ML algorithms. The result from this stage is a trained ML model used to predict the label of new incoming data. In the model evaluation stage, the predictive performance of the trained ML models is evaluated on a validation dataset.

The model deployment stage is where the evaluated ML model and the entire workflow are exported for deployment. Finally, The model monitoring stage is where the model behavior is observed in the production environment to ensure that the model is doing what is expected to do.

Figure 1. The stages of the ML pipeline based on (amershi2019:software).

2.2. Automated machine learning (AutoML)

The primary goal of automation in ML started from reducing human effort in the loop, specifically during the hyperparameter tuning and model selection stage (yao2018:taking). However, the success of hyperparameter tuning and model selection also depends on other steps, such as data cleaning and preparation or feature engineering (wang2019:human; yao2018:taking; zoller2021:benchmark; crisan2021:fits). More recently, AutoML has been broadly used in automating multiple steps of the machine learning workflow, from data collection and preparation to the deployment and monitoring of the ML model in production.

A variety of AutoML tools have been proposed to automate the computational work of building an ML pipeline. Some of these tools are built upon widely used ML frameworks. For example, Auto-Sklearn (feurer2020:auto; feurer2015:efficient), and TPOT (olson2016:tpot; olson2016:evaluation) are built on scikit-learn (pedregosa2011:scikit). These AutoML tools focus largely on supporting data preparation (lee2020:human; yao2018:taking; zoller2021:benchmark), feature engineering (kanter2015deep), hyperparameter tuning (akiba2019optuna), model selection (yao2018:taking; zoller2021:benchmark; fernandez2014:we), and end-to-end development support for machine learning software systems (li2021:volcanoml).

3. Experimental Setup

This section detailed the methodology that we followed to conduct this study to answer the proposed research questions. Our study design consist a mixture of qualitative and quantitative (i.e., a sequential mixed-methods (ivankova:2006:using)) approaches. Figure 2 show an overview of our study design. Each step is explained in the following paragraphs.

Figure 2. An overview of our study design. [Heng says: It would be good to match the exacting wording of the steps with the titles of the paragraphs/subsections, so that people can better locate each step.] [Heng says: There is not Step 4]

\circled

1 Identifying AutoML tools:

In this initial step of our methodology, the AutoML tools were identified from two sources: research papers and Google search. The two sources allow us to cover a broader range of AutoML tools proposed in the research publications and industry. In the following, we describe the identification steps from each data source.

$∙$ AutoML tools from the research papers:

To extract the list of AutoML tools mentioned in research papers. We used the search keywords “automated” AND “machine learning” to extract the scholarly literature related to AutoML on the Google scholar (Googlescholar:2022) and Engineering Village (using Inspect and Compendex databases) (Engineeringvillage:2022) platforms. We limited our search to the literature published between 2010 to 2021. Then, we examine the first 10 research papers from each of the two sources to collect the list of AutoML tools, as the search results of both sources are sorted based on the papers’ relevance with the search keywords. Specifically, we were interested in the open-source AutoML tools that host their source codes on GitHub to allow us to study in detail their usage. In total, we extracted 31 open-source AutoML tools from the research papers.

$∙$ AutoML tools from Google search:

We used Google search to identify the names of open-source AutoML tools mentioned across different websites. We tried different keywords and finally settled for the keywords: “(automl OR “automated machine learning”) AND tools AND (GitHub OR list OR curated)” and extracted all the possible AutoML tools from the first 100 resulting websites. We created a list of open-source AutoML tools that have a GitHub repository in the next step. One of the results was the awesomeopensource (awesomeopensourceo) website that contains 426 AutoML open-source projects. However, extracting all the possible AutoML tools listed on the website requires significant manual effort. Therefore, to get the most out of these tools listed, we limited our criteria to the most popular AutoML tools listed on the awesomeopensource (awesomeopensourceo) website by considering the AutoML tools with more than 1,000 stars. In total, we collected the names of 95 AutoML tools from Google search.

\circled

2 Filtering AutoML tools:

After extracting the lists of names of AutoML tools from research papers and Google search results, we merged the two lists and removed the duplication. Then, using GitHub API (github:api) we extracted the repository details of each of these AutoML tools on November 24, 2021. For each AutoML tool’s repository, we collected the following metadata: the number of contributors, the number of releases, description, number of forks, size of the repositories, creation date, last update, fork status, programming languages, number of stars, number of commits, and last commit date. We removed the tools with only one contributor (8 tools removed) or zero stars (1 tool removed). Also, using the last update date of the tools, we removed the tools that have not been active since 2019 or earlier (12 tools removed) because we are interested in the tools that are still under development due to the growth of the technology. Also, we manually checked all the remaining tools to verify that they are real AutoML tools. At the end of this step, we remained with 57 AutoML tools.

\circled

3 Collecting GitHub projects using AutoML tools:

To find the projects that use open-source AutoML tools, we collected the configurations of each AutoML tool identified in the previous step by examining their documentation, GitHub repository, and sample source code. For example, ‘import+optuna’ and ‘from+optuna’ are the configurations that we found for Optuna. We only focused on the tools for which the configurations are in the Python programming language because Python is known as one of the main programming languages for building ML models and ML software projects (Python_language1:2017; Python_language2:2021). Therefore, we removed the tools for which the configurations are not written in Python and the tools for which we could not find the configurations in their documentations (3 tools removed).

Next, using the AutoML configurations as keywords, we searched GitHub API for the source codes and the respective GitHub projects that import the configuration. For example, we searched for the keywords ‘from+tpot’ and ‘import+tpot’ in GitHub API to find repositories and the corresponding source codes that use the Tpot AutoML tool. Then, we collected the paths to 97,815 source code files that contain at least one configuration of an AutoML tool and the name of the corresponding repositories (33,568 repository names collected) on December 4th and 5th, 2021. In the third step, we removed duplication from the resulting repositories (10,985 duplicates removed) and the source code files (39,194 duplicates removed).

Some repositories that use AutoML tools are not real projects, such as the course assignment repositories. We are not interested in these projects because they may not reflect the actual use of AutoML tools in real projects. Therefore, we identified the keywords that may reflect non-real projects. To do so, we randomly sampled 378 repositories from our dataset with a 95% confidence level and 5% confidence interval. We manually analyzed these repositorires and identified keywords includeing “assignment”, “book”, “chapter”, “tutorial”, and “course” (case insensitive). We used these keywords and removed the repositories with at least one of these keywords appearing in their name (602 repositories removed). Also, we removed the corresponding source code files path of the not-real projects (2,568 source code file paths removed). Furthermore, following the best practices established by previous studies (munaiah:2017:curating; Businge:2018:ICSME; Businge:2019:SANER; businge2022:reuse; openja2022:docker), we removed the repositories and the corresponding source code file paths of the forked projects. This step ensures that the studied projects are not copies of another project but the mainline projects (190 source code file paths removed). Finally, 21,981 projects and 55,863 source code file paths are retained in our dataset. Then, using GitHub API, we extracted each repository’s metadata, such as the number of contributors, number of commits, description, forks, repository size, creation date, update date, last commit date, fork status, and number of stars. We collected the repositories’ details on December 5th and 6th. Finally, we downloaded the source code files using the collected paths on December 8, 2021, for further analysis.

The rest of the data analysis is described in the results Section 4. We share the replication package of this study in (replication:AutoML-empirical-study:2022).

3.1. getting the imports and function calls

we got the imports and function calls of the first 10 tools that have the most number of stars.
while converting the python 2 to python 3 we removed the lines that starts with !,@, %, $, and ”print”

[Heng says: Seems this subsection is not complete. The labelling process can be described in RQ2 approach as it is only relevant to RQ2] Data labeling

Basic Concepts: We use the terms study and trial as follows: Study: optimization based on an objective function Trial: a single execution of the objective function
We labeled all the visualization as data cleaning, because feature is type of data
We added the stage of utility on its own
We considered all data analysis
write description of each labels and how we set this labels and what feature each function have. what is the attribute of each label.
In CRUD optimization, the letter ”c” means create/load study.
reading/writing from storage is considered as part of storage management.
model define contain sub purpose of ”add layer” and ”Normalize batch”
we considered the label ”search for best model and hyperparams” as a part of ”model training/fitting/refitting label”
after doing the second round of labeling, we merged or removed some labels and came up with final labels.
the label ”Combine data” is subpart od clean/filter data
”get data distribution” is part of ”data analysis”.
the label ”hyper parameter search space management” has the sub label of ”define search space”
the label ”check distributions compatibility” is changed to ”data analysis”
”Batch storage management” is removed from the data management and chenged with storage management.
We removed the label ”deploy ML model”.
the label of ”pipeline management” had sub labels of ”Get pipeline complexity”, ”Pipeline optimization”, ”Score pipeline”, ”Pipeline export”.
the label ”Manage multiple/single objective function” refers to the bothe multiple and single objective function.
The label of ”Execute optimization” contains the sublabel of ”improve optimization efficiency”
we merged ”get loss” and ”get trial” label with ”Evaluating an objective function”.
we merged the label ”create/add trial”,”Append trial”, ”Enqueue trial” with the ”execute optimization”.
trial management is mostly part of execute optimization.
we set ”trial setup” as part of ”execute optimization” label.
I try to give example for each of the labels
in the forth round we added the label ”Parallelize searching space”
Do not forget to explain from which sources I extracteed the informations about each functions. like offical documentation, stackoverflow, websites, the Q&A websites.
Most visualization is optuna are related to parameter importance and parameter relationship. For example optuna.visualization.plot_parallel_coordinate is about parameter relationship during HPO. optuna.visualization.plot_param_importances is about parameter importance.

4. Results

4.1. RQ1: What are the most used AutoML tools?

4.1.1. Motivation

Although AutoML tools help ML experts and non-expert practitioners save effort and time by automating the repetitive tasks (truong2019:towards), choosing the right AutoML tool is still a challenging task (xin2021:whither; passi2018trust; wang2019:human; drozdal2020:trust). The goal of RQ1 is to highlight the most used AutoML tools and provide detailed information about the projects using these tools. This information can help ML practitioners select appropriate AutoML tools for their projects. Besides, the insights gained from the projects using the tools can help AutoML tool’s developers improve their tools in future releases.

4.1.2. Approach

To find the most used AutoML tools, we collected the AutoML function calls in the GitHub projects. Also, we analyzed the characteristics of the GitHub projects that are using AutoML tools.

Collecting function calls of AutoML tools in GitHub projects: To understand the popularity of the AutoML tools, we analyzed the source code files of the GitHub projects that use AutoML tools and collected the calls that invoke the functions of each AutoML tool. Usually, a function is designed to perform a specific kind of task, whereby the name of that function often reflects that task. Therefore, calling such a function within a code invokes the tasks specified in the function. Invoking more functions of an AutoML tool indicates the tool is more engaged in a project. Thus we use the number of function calls to represent the popularity of an AutoML tool. The following steps are taken in order to extract the function calls from the source code files: Step 1) All Python notebooks files are converted to python file using the nbconvert (nbconvert:2022) and nbformat (nbformat:2022) Python libraries. Step 2) The Python2 source code files are converted to Python3. Step 3) The syntax errors are resolved by removing the lines of source code files that start with !, @, %, $, install, pip, conda. Step 4) The imported AutoML tools’ libraries and their corresponding imported functions are extracted from each source code file using the abstract syntax tree (AST) Python library (ast:2022). Step 5) The AST Python library is used to extract the function calls related to the imported AutoML tools’ libraries and the corresponding imported functions from the source code files. Step 6) The functions that are:a) directly mentioned in the code, b) called in the right side of an assignment, c) called in the if, while, for loop, try, with, list, assert, return, tuple, unary operation statements, d) super functions of a class in which the parent is an AutoML tool are extracted. After collecting the function calls from all source code files, we counted the total number of calls of the functions of each AutoML tool. Then, we sorted the AutoML tools based on the number of total function calls among all projects in our dataset.

Analyzing the characteristics of GitHub projects using AutoML tools: To understand the characteristics of the projects that use AutoML tools, we analyzed the details of the projects that use AutoML tools, such as the number of contributors, number of forks, size, age, star, and commits.

4.1.3. Results

Figure 3 shows the total number of function calls for each AutoML tool. As seen in Figure 3, the top 10 AutoML tools with the most number of function calls are specified in green color. The percentage of function calls that are associated with the top 10 most used tools (i.e., 18% of the 57 selected AutoML tools) is 68%; these top tools include Optuna, HyperOpt, Skopt, Featuretools, Tpot, Bayes_opt, Autokeras, Auto-sklearn, AX, and Snorkel. The details of each of these top 10 most used AutoML tools are provided in Table 1. As seen in Table 1, the most used AutoML tools are popular based on the number of forks and number of stars. Also, they are 67.1 months old in average, which means that they are not young AutoML tools and it took them years to become popular.

{tcolorbox}

The 18% (10 out of 57) most used AutoML tools (e.g., Optuna) account for 68% of all function calls.

Furthermore, the profile of the projects that are using AutoML tools is shown in Figure 4. For clear visualization, we removed the outliers within the last 5 percent of data in each metric and showed the distribution of 95% percent of the data in Figure 4. As seen in Figure 4, 75% of the repositories that are using AutoML tools are less than five months old and have fewer than 65 commits, two contributors, two forks, 35,753 lines of code, two stars. On the other hand, the projects that use AutoML tools also include some projects that are very mature (up to 10 years in age), popular (with up to 102,983 stars), with many contributors (with up to 433 contributors), and large in size (with up to 17,453,577 lines of code).

{tcolorbox}

The most used AutoML tools are popular tools according to their number of stars and forks. Furthermore, most of the projects that use AutoML tools are young and not popular (based on the number of stars).

Figure 3. AutoML tools and the number of function calls.

{adjustbox}

width=1center AutoML repository (short name) Function S-code Project Size Stars Age Forks Release Descriptions 1 optuna/optuna (Optuna) 34,916 5,482 3,140 13,577 5,553 45 602 41 A software framework for automatic hyperarameter optimization(Optuna:2022) 2 hyperopt/hyperopt (Hyperopt) 20,903 5,271 4,432 6,059 5,955 122 925 0 A library to optimize hyperparameter(Hyperopt:2021) 3 scikit-optimize/scikit-optimize (Skopt) 20,313 2,807 1,830 9,423 2,238 68 424 23 ”Sequential model-based optimization” (Scikit-optimize:2021) 4 Featuretools/featuretools (Featuretools) 14,786 1,053 548 5,888 5,864 50 769 94 A library to automate feature engineering(Featuretools:2022) 5 EpistasisLab/tpot (Tpot) 13,725 2,893 1,796 7,9030 8,349 72 1,437 27 A library to optimize machine learning pipeline(tpot:2021) 6 fmfn/BayesianOptimization (Bayes_opt) 11,582 3,128 2,605 23,290 5,543 89 1,222 7 ”Global optimization with gaussian processes” (BayesianOptimization:2020) 7 keras-team/autokeras (Autokeras) 10,331 1,259 550 44,487 8,241 48 1,334 53 ”An AutoML system based on Keras”(autokeras:2022) 8 automl/auto-sklearn (Auto-sklearn) 8,909 974 543 7,4993 5,870 76 1,092 29 An AutoML toolkit based on sklearn (auto-sklearn:2022) 9 facebook/Ax (AX) 8,442 722 213 283,337 1,649 33 178 23 A adaptive platform for adaptive experiments (ax:2022) 10 snorkel-team/snorkel (Snorkel) 8,237 853 264 292,468 4,922 68 797 14 A tool for generating training data (snorkel:2021)

Table 1. Summary statistics of the top 10 AutoML tools.
Function: Number of function calls, S-code: Number of unique source files importing the tool. Project: Number of unique projects importing the tool, Size: Size of the tool. Stars: Total number of stars. Age: Age of the tool (i.e., the difference between latest commit date and the creation date). Forks: Number of forks associated to the tool. Releases: The total number of releases associated to the tool. Descriptions: The description of the tool.

Figure 4. The summary characteristics of the projects that use AutoML tools in terms of their total Age in months (i.e., the difference between the creation date and the latest commit date), number of commits, number of contributors, number of forks, project’s size based on lines of code, and the number of stars in each project.

4.2. RQ2 : How do ML practitioners use AutoML tools?

4.2.1. Motivation

ML practitioners may use AutoML tools for different purposes across their ML pipeline. We analyzed the stages of the ML pipeline in which AutoML tools are used and the purposes of using AutoML tools to help AutoML tools’ developers gain more insights on the usage of their tool and enable them to improve their tools in a more efficient way.

Understanding the stages of the ML pipeline in which the AutoML tools are most popular can help developers of the tools find the shortcomings of their tools and help improve their tools efficiently. Besides, our results can provide insights for ML practitioners to better leverage AutoML tools in different stages of their ML pipeline and for different purposes.

4.2.2. Approach

To understand the purposes of using AutoML tools and the stages of the ML pipeline where they are being used, we used the extracted function calls in Section 4.1.3 and labeled them. To reduce the required time and effort for manual labeling, we only focused on the function calls of the top 10 most used AutoML tools (identified in RQ1). Besides, we focused on the function calls that are most commonly used in different projects. Thus, from the set of all unique function calls of the selected AutoML tools, we compute their total frequency and Shannon entropy value, then choose the functions that have more than 80% of 1) the Shannon entropy value and 2) the total frequency for manual labeling. In the following, we explain 1) the process of selecting functions based on Shannon entropy and usage frequency and 2) the process of manual labeling.

Selecting functions based on Shannon entropy and usage frequency: We used the normalized Shannon’s entropy (shannon:2001mathematical; khomh:2011entropy) to compute the distribution of the function calls across the projects:

(1)

H_{n} (F) = - i = n \sum i = 1 p_{i} * {log}_{n} (p_{i})

In Equation (1), F is a function that we want to measure its Shannon’s entropy value, n is the total number of unique projects, i is the unique number of projects, $p_{i}$ is the probability of the use of function $F (p_{i} \geq 0$ , and $\sum_{i = 1}^{i = n} p_{i} = 1)$ in a given project $i$ . $p_{i}$ is calculated by dividing the total number of occurrence of function F in the project i by the total number of occurrence of function F in all the projects. If all the unique projects have the same probability of using function F (i.e., $p_{i} = \frac{1}{n}, \forall i \in 1, 2, . ., n$ ), the Shannon entropy of F is maximal (i.e., $H_{n} (F) = 1$ ). Thus, the function F with higher Shannon entropy is more likely to be used more by ML practitioners compared to the functions with less Shannon entropy.

The functions with $H_{n} (F) > 0.8$ and with their respective counts of function calls above 80% percentage (sorted in descending order) were chosen for manual labeling. A total of 619 functions from the top 10 AutoML tools were selected in this step.

Manual labeling: Here we try to understand why AutoML tools are used by ML practitioners and what they are used for. To this end, we manually labeled the selected functions in the previous step and formed categories and sub-categories of purposes of using AutoML tools. We also assigned a separate label representing the stages of the ML pipeline for which the respective functions are called, following the stages presented in Figure 1. The first two authors were involved in the initial labeling and construction of the taxonomy. Both individuals are graduate students with strong knowledge in machine learning and software engineering. The primary reference during the labeling is the official documentation of the functions from the tool websites and the publications associated with the AutoML tools. For example, the functions ‘autokeras.TextClassifier.evaluate’ (autokeras3:2022) and ‘autokeras.ImageRegressor.evaluate’ (autokeras2:2022) of AutoKeras are used to evaluate the best model and their purposes were labeled as ‘Model evaluation’. The functions related to data formatting, generating statistics of the input data, or cleaning is assigned the stage of ‘Data cleaning’.

During the first iteration of labeling, the first author labeled the functions of Optuna, Skopt, Featuretools, and Ax, and the second author labeled the functions of the remaining six tools. In the Second iteration, all the initial labels were combined, and the first two authors underwent through each label together and subsequently further discussed the labels that were not agreed upon until a consensus was achieved. A random sample of 238 functions was selected with a confidence level of 95% and a confidence interval of 5% for further validation by the other two authors (not involved in the initial labeling). They agreed on 98% of the stages of the functions and 96% of the purposes of the functions.

4.2.3. Results

Table 2 details the stage of the ML pipeline (i.e., the table columns) in which ML practitioners use AutoML tools. Each cell value in Table 2 represents the percentage of the total number of calls of the functions from the respective AutoML tool corresponding to the stages of the ML pipeline. As shown in Table 2, the AutoML tools are mostly used for hyperparameter optimization (accounting for 40.9% of the function calls). One of the possible reasons is that hyperparameter optimization involves multiple repetitive tasks and requires a lot of effort and time (truong2019:towards); similarly, there might be numerous different sub-tasks involved compared to other stages of the ML pipeline.

{tcolorbox}

AutoML tools are mainly used ( $40.9 %$ ) during the Hyperparameter Optimization stage of the ML pipeline, followed by Model training ( $11.5 %$ ), Model evaluation ( $11.2 %$ ), and Feature Engineering ( $10.1 %$ ). AutoML tools mostly focus on providing functions for training ML models effectively, selecting a model, and evaluating the performance of the resulting model.

{adjustbox}

width=1center AutoML Data collection Data cleaning Dala labeling Feature engineering Hyperparameter optimization Model training Model evaluation Model deployment Model monitor Others Optuna 1.4 2.9 0 0 72.9 2.1 0.7 0.7 12.1 6.4 hyperopt 11.5 1.9 0 5.8 75 0 0 0 3.8 1.9 Skopt 0 0 0 0 90.6 3.1 3.1 0 0 3.1 Featuretools 12.8 7.7 0 69.2 0 0 0 0 2.6 7.7 Tpot 0 11.6 0 9.3 9.3 23.3 18.6 18.6 4.7 4.7 Bayes-opt 0 0 0 0 78.9 0 0 0 5.3 15.8 AutoKeras 3.6 3.6 0 8.3 8.3 40.5 25 7.1 0 3.6 Auto-sklearn 2.7 5.4 1.4 5.4 4.1 25.7 44.6 1.4 4.1 5.4 Facebook-Ax 0 4.2 0 0 68.8 4.2 4.2 2.1 4.2 12.5 Snorkel 9.5 21.6 27 2.7 1.4 16.2 16.2 1.4 0 4.1 AVERAGE 4.2 5.9 2.8 10.1 40.9 11.5 11.2 3.1 3.7 6.5

Table 2. The percentage usage of AutoML functions across the ML workflow.

Figure 5 presents high-level (in the grey background) and low-level (in white background) categories of the purposes for which ML practitioners use AutoML tools. As shown in Figure 5, the high-level categories of purposes are in the order: Hyperparameter Optimization (30.6%), Model management (26.5%), Data management (15%), Visualization (7.4%), Logging (3.4%), Utility functions (3.2%), Feature Engineering (3.1%), Storage management (2.1%), Pipeline management (2%) and User Interface (0.5%). The following section describes some of the categories and sub-categories of purposes.

Figure 5. The taxonomy of the purposes of using AutoML tools. The high level categories are highlighted in grey, while the sub-categories are in white boxes.

$∙$ Hyperparameter optimization (HPO): This category is about using AutoML to determine the right combination of hyperparameters that maximizes the model performance, usually through running multiple experiments in a single training process. Automating HPO is indeed crucial, particularly for Deep Neural Networks (DNN) (feurer2019:hyperparameter; akiba2019:optuna), which depend on a wide range of hyperparameter choices about the function tuning or regularization, parameter sampling, the network’s architecture, and optimization. In Figure 5, we reported $16$ different activities involved during the HPO process, from executing hyperparameter optimization and parameter sampling to HPO documentation.

$∙$ Model management: This category is about using AutoML to handle the different tasks related to training ML algorithms with some input data and the resulting model that can generate predictions using patterns in the input data. In Figure 5 we highlighted the different sub-tasks where ML practitioners use AutoML in the ML models management, using AutoML for model evaluation and error-analysis, model training/testing, model definition, model monitoring, and optimizing model training performance.

$∙$ Data management: In this category, AutoML is used to handle the tasks related to the data used for building ML models. This process includes a set of steps involving data, including data generation and processing steps, managing data flow to ML models, working with multiple datasets, and other tasks throughout the data pipeline. In particular, we summarized $12$ different sub-tasks related to data management in Figure 5 including data transformation, data analysis, data loading, data labeling, and data export.

$∙$ Utility functions: These are functions provided by specific AutoML tools to perform some common functionally and are often reusable in the ML pipeline. The Utility functions are particularly important to reduce the pipeline complexity and increase the readability of the source code. Examples of the tasks in this category include session management (e.g., during model training), handling of input/output operations of ML models, showing training or optimization progress, and handling of thread execution.

$∙$ Feature engineering: This category is about using AutoML tools to automate the extraction of features from the given data, usually using domain knowledge of the data and the model being built. The resulting features are used as input to some algorithms to train the ML models. We reported in Figure 5 two main sub-categories of Feature engineering, including feature analysis and feature generation/extraction.

$∙$ Pipeline management: This category is about using AutoML to automate the end-to-end management of an ML model or a set of multiple models, including the orchestration of data that flows into or out of the model.

$∙$ Visualization: This category is about the graphical representation of data and information, such as visualizing the feature importance using a chart for interpretability or visual representation of data distribution during data analysis.

$∙$ Logging: This category involves using AutoML tools to track and store records related to the execution events, data inputs, processes, data outputs, resulting in model performance and other related tasks when developing and deploying ML models. ML practitioners use the resulting information to identify any suspicious activities in ML software projects’ privileged operations and assess the impact of state transformations on performance.

$∙$ Storage management: In this category, AutoML tools are used for managing the data storage resources aiming at improving and maximizing the efficiency of data storage resources, often in terms of performance, availability, recoverability, and capacity.

$∙$ User interface: This category is about the element of AutoML tools that allows the ML practitioners to easily interact with the AutoML features, using the concepts such as visual and interactive design and information architecture (e.g., providing web-based interactive interface for displaying live code).

We used the label Others for the function calls which do not fall into any of the categories.

{tcolorbox}

We derived 10 high-level categories of the purposes of using AutoML tools, including Hyperparameter optimization (30.6%), Model management (26.5%), Data management (15%), Visualization (7.4%), and Logging (3.4%). ML practitioners can save time and effort by using AutoML tools to automate their ML pipeline tasks (truong2019:towards) for similar purposes.

In Figure 6 we provide the breakdown of the major categories of purposes (x-axis) of using AutoML provided in the top 10 AutoML tools (y-axis). The circle size represents the composition of using the respective AutoML tools for that purpose, computed as the percentage of total function calls defining that purpose to the total number of function calls from the individual AutoML tool.

Figure 6. The percentage composition of the high-level categories for purposes of using AutoML across the top 10 most used tools.

We can clearly see in Figure 6 that the functions for Hyperparameter optimization, Model management, and Data management are used the most for most of the top 10 AutoML tools. Specifically, Hyperparameter optimization and Model management dominate at least 40% of the AutoML tools. In contrast, some of the purposes of using AutoML tools are specific to certain tools. For example, we can observe that the function calls for Feature engineering ( $45 %$ ) and Pipeline management ( $22.7 %$ ) tasks are dominantly used from Featuretools and Tpot, respectively, while a minimal percentage ( $1.3 %$ ) of function calls related to User Interface comes from Snorkel and Optuna.

4.3. RQ3: Are different AutoML tools used together?

4.3.1. Motivation

From Table 2 we observe that at least two of the AutoML tools can be used to automate the functionality in each stage of the ML pipeline. This question aims to understand whether the different AutoML tools are used together in the same source code file. For instance, ML practitioners may use AutoSklearn for model training and tune the hyperparameters using the functionality provided by Optuna within the same code. Answering this question could help ML practitioners choose the AutoML tools used together for different purposes. Similarly, the information can help the developer of the AutoML tools improve the integration procedure of the AutoML tools used together, such as providing an API to facilitate the usage of the other AutoML tools while getting input or sending output.

4.3.2. Approach

We compared the composition of the source code files (of the projects using AutoML tool) where different AutoML tools are used together as follows. For every source code files ${T s_{1}, T s_{2}, T s_{3}, . . . T s_{n}}$ importing and using the functions from a given AutoML tool, we computed the percentage composition of the shared source files as:

(2)

\frac{T s_{i \cdot j} * 100}{T s_{i N}}

where $T s_{i \cdot j}$ is the total number of source files using the functions of two AutoML tools $T_{i}$ and $T_{j}$ in the same file, and $T s_{i N}$ is the total number of files using at least one function from the AutoML tool $T_{i}$ . We define the results of the equation as the correlation between two AutoML tools.

In addition, we also calculated the distribution of the projects that use the different number of AutoML tools to understand how these AutoML tools are used together in the same projects.

4.3.3. Results

Table 3 shows the distribution of the projects and source codes that use a different number of AutoML tools. We focus on the top 10 AutoML tools and the projects that use these tools. We observe that the vast majority of the projects use only one AutoML tool. There are only fewer than 8% of the projects that use two or more different AutoML tools in the same project.

Figure 7 shows the correlation matrix of the percentage composition of shared source files using different top AutoML tools together, computed using Equation (2). We only reported the correlation of the top 10 most used AutoML tools for clear visualization.

{adjustbox}

width=0.95center No. AutoML 1 2 3 4 5 $\geq 6$ % No.of Projects 92.52 5.85 1.28 0.26 0.053 0.045 % No.of code files 98.77 0.90 0.32 0.004 0 0

Table 3. The percentage number of GitHub projects and source code files using different AutoML tools together.

{adjustbox}

width=0.95center Tools combination Example function hyperopt, tpot hyperopt.fmin’, ’tpot.TPOTClassifier.fit’, ’tpot.TPOTClassifier.score’ hyperopt, optuna ’hyperopt.fmin’, ’hyperopt.hp.choice’, ’hyperopt.hp.uniform’, ’optuna.create_study’, ’optuna.create_study.optimize’ bayes_opt, skopt ’bayes_opt.BayesianOptimization’, ’bayes_opt.BayesianOptimization. maximize’, ’skopt.BayesSearchCV’, ’skopt.BayesSearchCV.fit’ tpot, hyperopt, optuna ’hyperopt.Trials’, ’hyperopt.fmin’, ’optuna.create_study’, ’optuna.create_study.optimize’, ’tpot.TPOTClassifier.fit’, ’tpot.TPOTClassifier.score’

Table 4. Examples of the AutoML functions used together .

Figure 7. The composition of project’s source code files using different AutoML tools.

From Figure 7 we observe that, across the top 10 AutoML tools, only a few of the source files ( $< 2 %$ ) uses the functions from more than one AutoML tool. This result implies that most of these tools are used independently. Among the most used AutoML tools, the few AutoML tools used together are Tpot with Hyperopt ( $0.69 %$ ), Optuna with Hyperopt ( $0.29 %$ ), Skopt with Hyperopt ( $0.25 %$ ) and Bayes-Opt with Skopt ( $0.20 %$ ). In Table 4, we provide examples of the functions used together in the same source code file. For instance, both Hyperopt and Optuna provide rich functions for optimizing hyperparameters according to Figure 6; however, optuna also offers other features that are useful during hyperparameter optimization, such as Visualization and Logging. This implies that ML practitioners using HyperOpt with Optuna together have more options for HPO while also utilizing the visualization and logging functions from Optuna. Similarly, Tpot provides few functions for optimizing hyperparameters but offers more functions such as pipeline management and model management; the ML practitioners can simultaneously automate these tasks when using Tpot with Hyperopt.

{tcolorbox}

Less than 8% of projects use two or more different AutoML tools. Different AutoML tools are rarely used in the same source code files. Among the few AutoML tools used together, Hyperopt is used with other tools, such as Tpot, Skopt, and Optuna. ML practitioners can consider using the combination of functions from different AutoML tools together to help automate multiple tasks across the ML pipeline. Similarly, the AutoML developers can propose an efficient API to integrate the AutoML tools that are used together.

5. Threats to validity

We now discuss threats to the validity of our study.
Internal validity threats: We manually studied a sample of AutoML function calls to understand ML practitioners’ purposes for using AutoML tools. However, our labeling results may not truly reflect the actual purposes of the developers. Future work can perform developer surveys to confirm our results. Besides, the process of manual labeling affects the results of this study. To reduce the impact of biased labels, two authors labeled the functions, and the other two authors reviewed the labels separately to validate them. Another possible threat is related to our use of function calls to determine the popularity of the AutoML tools. As different numbers of calls may combine to achieve the same goal across different AutoML tools, which may affect the results of this study. However, a function is usually designed to perform a specific task, as reflected by the name of that function. Therefore, we believe that this threat is minimal. In the future, we will analyze the relation cardinality between functions of different AutoML tools to understand how the studied tools handle the same situation with the same or different number of function calls.

External validity threats: The number of selected AutoML tools for manual labeling affects the generalizability of the results. To minimize this effect, we selected the 10 most used AutoML tools, which account for 68% of the usage in the GitHub projects. We assumed that our findings derived from these 10 AutoML tools can represent how ML practitioners use AutoML tools and their main purposes for using them.

Construct validity threats: To extract the function calls of AutoML tools in GitHub projects, we followed the official documentation of AST library. However, it’s still possible that we may have missed some function calls that are rarely used and hard to recognize. We provide all our scripts and data in our replication package (replication:AutoML-empirical-study:2022), to allow replicating our results.

6. Related Works

The related research papers are categorized into two groups and discussed in this section.

6.1. Qualitative studies on the usage of AutoML tools

Prior work performs qualitative studies (e.g., through interviews) to understand practitioners’ experiences of using AutoML tools (xin2021:whither; wang2019:human; crisan2021fits; wang2021:much; drozdal2020:trust; wang2019atmseer; xanthopoulos2020putting). For example, Xin et al. (xin2021:whither) investigated the use of AutoML tools through semi-structured interviews with AutoML users of different categories and expertise. They argued that although AutoML tools increase productivity for experts and make ML accessible to beginners, designing completely automated ML tools is impractical, and designers of AutoML tools should focus on improving the collaboration between ML practitioners and AutoML tools. Similarly, Wang et al. (wang2021:much) argued that there is no need for tools that are fully automated, instead, developing AutoML tools that are explainable and that integrate people in the ML workflow loop is more important. Crisan et al. (crisan2021fits) observed that using AutoML tools in real practices is not easy and suggested that one way to facilitate the interaction between humans and AutoML tools is data visualization. Similarly, Drozdal et al. (drozdal2020:trust) found that visualization and inclusion of transparency features build trust in users. Also, they recommended that AutoML tools should be easily customized for enabling users to apply their preferences. Xanthopoulos et al. (xanthopoulos2020putting) defined some criteria to evaluate the features of AutoML tools. They concluded that the performance does not guarantee the success of an AutoML tool because there are other important factors that should be taken into consideration, such as the ease of use and the interpretability of the results.

6.2. Comparison of AutoML tools

Ferreira et al.(ferreira2021:comparison) studied eight AutoML tools and compared them by conducting computational experiments based on General Machine Learning (GML), deep learning, and XGboost scenarios. The results show that current GML AutoML tools can produce close or even better predictive results than human-created ML models. Truong et al. (truong2019:towards) reported the different stages of the ML pipeline covered by some specific AutoML tools. Then, they examined the performance of the AutoML tools on different datasets. They found that none of the studied AutoML tools perform better than others. They suggest that AutoML tool developers should consider leveraging ideas from other AutoML tools to improve their tools.

The previous works studied 1) the usage of AutoML tools by conducting qualitative studies such as interviews, 2) the comparison between AutoML tools. However, no work studied how ML practitioners use AutoML tools in real-world projects at a large scale. In this work, we attempt to fill this gap by investigating the usage of AutoML tools in real-world GitHub projects. We aim to provide insights for ML practitioners and AutoML tool providers to improve the usage and design of AutoML tools.

change automl to Auto-ML

An Empirical Study on the Usage of Automated Machine Learning Tools

Abstract.

1. Introduction

2. Background

2.1. Machine learning pipeline

2.2. Automated machine learning (AutoML)

3. Experimental Setup

3.1. getting the imports and function calls

4. Results

4.1. RQ1: What are the most used AutoML tools?

4.1.1. Motivation

4.1.2. Approach

4.1.3. Results

4.2. RQ2 : How do ML practitioners use AutoML tools?

4.2.1. Motivation

4.2.2. Approach

4.2.3. Results

4.3. RQ3: Are different AutoML tools used together?

4.3.1. Motivation

4.3.2. Approach

4.3.3. Results

5. Threats to validity

6. Related Works

6.1. Qualitative studies on the usage of AutoML tools

6.2. Comparison of AutoML tools

References