Towards Unsupervised HPO for Outlier Detection

Yue Zhao, Leman Akoglu
Abstract

Given an unsupervised outlier detection (OD) algorithm, how can we optimize its hyperparameter(s) (HP) on a new dataset, without any labels? In this work, we address this challenging hyperparameter optimization for unsupervised OD problem, and propose the first systematic approach called HPOD that is based on meta-learning. HPOD capitalizes on the prior performance of a large collection of HPs on existing OD benchmark datasets, and transfers this information to enable HP evaluation on a new dataset without labels. Moreover, HPOD adapts (originally supervised) sequential model-based optimization to identify promising HPs efficiently. Extensive experiments show that HPOD works with both deep (e.g., Robust AutoEncoder) and shallow (e.g., Local Outlier Factor (LOF) and Isolation Forest (iForest)) OD algorithms on both discrete and continuous HP spaces, and outperforms a wide range of baselines with on average 58% and 66% performance improvement over the default HPs of LOF and iForest.

\nocopyright\affiliations

Carnegie Mellon University
zhaoy@cmu.edu, lakoglu@andrew.cmu.edu

1 Introduction

Although a long list of unsupervised outlier detection (OD) algorithms have been proposed in the last few decades Aggarwal2013a; Campos2016evaluation; pang2021deep; han2022adbench, how to optimize their hyperparameter(s) (HP) remains underexplored. Without hyperparameter optimization (HPO) methods, practitioners often use the default HP of an OD algorithm, which is hardly optimal given many OD algorithms are sensitive to HPs. For example, a recent study by Zhao et al. reports that by varying the number of nearest neighbors in local outlier factor (LOF) Breunig2000a while fixing other conditions, up to 10 performance difference is observed in some datasets zhao2021automatic. The literature also shows that HP sensitivity is exacerbated for deep OD models with more ‘knobs’ (e.g., HPs and architectures) ding2022hyperparameter, which we also observe in this study—deep robust autoencoder (RAE) zhou2017anomaly exhibits up to 37 performance variation under different HPs in several datasets.

\subfloat

[(left) large perf. variation over 384 HPs for RAE on Thyroid—random&min. loss HPs are sub-par; (right) HPOD (H) outperforms all.] \subfloat[(left) HP sensitivity of LOF on Vowels—the default HP is far from optimal; (right) HPOD (H) outperforms all baselines with +58% AP rank over random HP.]\subfloat[Perf. comparison on ensemble iForest; HPOD (H) outperforms all baselines markedly.]

Figure 1: (a) (left) Demo of HP sensitivity in deep RAE on Thyroid; (right) HPOD outperforms all baselines on a 39-dataset database (see §4.2.1), with a higher avg. performance rank (y-axis) as well as comparable to better top q% HP settings in the meta-HP set (x-axis) (b) (left) HP sensitivity in LOF on Vowels; (right) HPOD outperforms all baselines with great performance edge (e.g., normalized AP rank) over the default HP (§4.2.2) and (c) for iForest, HPOD is also the best with huge improvement (e.g., normalized AP rank) over the default HP (§4.2.2). See detailed experiment results in §4.

In supervised learning, one can use ground truth labels to evaluate the performance of an HP, leading to a simple grid and random search bergstra2012random and more advanced Sequential Model-based Bayesian Optimization (SMBO) DBLP:journals/jgo/JonesSW98. Unlike the simple methods, SMBO builds a cheap regression model (called the surrogate function) of the expensive objective function (which often requires ground truth labels), and uses it to iteratively select the next promising HP for the objective function to evaluate. Notably, learning-based SMBO outperforms simple, non-learnable methods in many applications thornton2013auto; falkner2018bohb.

However, unsupervised OD algorithms face evaluation challenges—they do not have access to (external) ground truth labels, and most of them (e.g., LOF and Isolation Forest (iForest) liu2008isolation) do not have an (internal) objective function to guide the learning either. Even for the OD algorithms with an internal objective (e.g., reconstruction loss in RAE), its value does not necessarily correlate with the actual detection performance ding2022hyperparameter. Thus, HPO for unsupervised OD remains underexplored, where the key is reliable model evaluation.

Moving beyond proposing another detection algorithm, we study this important HyperParameter Optimization for unsupervised OD problem, and introduce (to our best knowledge) the first solution called HPOD. In a nutshell, HPOD leverages meta-learning to enable (originally supervised) SMBO for unsupervised OD hyperparameter optimization. To overcome the infeasibility of evaluation in unsupervised OD, HPOD leverages meta-learning that carries over past experience on prior datasets/tasks to more efficient learning on a new task. To that end, we build a meta-database with historical performances of a large collection of HPs on an extensive corpus of existing OD benchmark datasets, and train a proxy performance evaluator (PPE) to evaluate HPs on a new dataset without labels. By plugging PPE into SMBO, HPOD can iteratively and efficiently identify promising HPs to evaluate and output the best. Additionally, we use meta-learning to facilitate SMBO in HPOD by initializing and transferring knowledge from similar historical tasks to the surrogate function of the new task. We find it important to remark that HPOD is strictly an HPO technique other than a new detection algorithm.

Performance. Fig. 1 (left) shows the huge performance variation (up to 37) for a set of 384 deep RAE HPs on Thyroid data, where HPOD is significantly better than expectation (i.e. random selection), as well as selection by min. reconstruction loss (MinLoss); ours is one of the top HPs. In Fig. 1 (right), we show that HPOD is significantly better than a group of diverse and competitive baselines (see Table 1) on a 39 dataset database. We also demonstrate HPOD’s generality on non-deep OD algorithm LOF with both discrete and continuous HP spaces in Fig. 1, as well as the popular iForest in Fig. 1. For all three OD algorithms, HPOD is statistically better than (most) baselines, including the default HPs of LOF and iForest which are widely used by practitioners. In fact, being an ensemble, iForest has been shown to be robust to HPs and outperform many other detectors in the literature emmott2015meta. As such, improving over its default setting by our HPOD is remarkable.

In summary, the key contributions of this work include:

  • First HPO framework for Unsupervised Outlier Detection. We introduce HPOD, a novel meta-learning approach that capitalizes on historical OD tasks with labels to select effective HPs for a new task without any labels.

  • Continuous HP search. Superior to baselines in Table 1, HPOD works with both discrete and continuous HPs.

  • Generality and Effectiveness. Extensive results on 39-datasets with (a) deep method RAE and classical methods (b) LOF and (c) iForest show that HPOD is markedly better than leading baselines, with on avg. 58% and 66% improvement over the default HPs of LOF and iForest.

Reproducibility. To foster future work on HPO for OD, we open-source all code and env. at https://tinyurl.com/hpod22.

2 Related Work

Category Default Random MinLoss HE GB ISAC AS MetaOD HPOD_0 HPOD
meta-learning
discrete HP
continuous HP
Table 1: HPOD and baselines for comparison with categorization by (first row) whether it uses meta-learning and (second & third row) whether it supports discrete and continuous HPO. Only HPOD and HPOD_0 leverage meta-learning and support continuous HPO. See details in §4.1.

2.1 Hyperparameter Optimization (HPO) for OD

We can categorize the short list of HPO methods for OD into two. The first group of methods require a hold-out set with ground truth labels for evaluation bahri2022automl, including AutoOD li2021autood, TODS lai2021revisiting, and PyODDS li2020pyodds, which do not apply to unsupervised OD scenarios. The second group of methods uses the default HP in the paper, randomly picking an HP, selecting the HP by internal objective/evaluation goix2016evaluate; marques2020internal, or averaging the outputs of randomly sampled HPs wenzel2020hyperparameter; we include them as baselines (see col. 2-5 of Table 1) with empirical results in §4.2.

2.2 Hyperparameter Opt. and Meta-Learning

HPO gains great attention due to its advantages in searching and optimizing through complex HP spaces, where learning tasks are costly karmaker2021automl. Existing methods include simple grid and random search bergstra2012random and more advanced Sequential Model-based Bayesian Optimization (SMBO) DBLP:journals/jgo/JonesSW98. Notably, SMBO builds a cheap regression model (called the surrogate function) of the expensive objective function, and uses it to iteratively select the next promising HPs to be evaluated by the objective function (see Appx. A). We cannot directly use these supervised methods, rather, HPOD leverages meta-learning to enable SMBO for HPO for OD.

Meta-learning aims to facilitate the new task learning by leveraging and transferring knowledge from prior/historical tasks vanschoren2018meta, which has been used in warm-starting SMBO feurer2014using and transferring surrogate in SMBO yogatama2014efficient; wistuba2016two; wistuba2018scalable. Recently, meta-learning has also been applied to unsupervised outlier model selection (UOMS), where Zhao et al. proposed MetaOD zhao2021automatic with comparison to baselines including global best (GB), ISAC conf/ecai/KadiogluMST10, and ARGOSMART (AS) nikolic2013simple. We adapt these UOMS methods as baselines for HPO for OD (see col. 6-9 of Table 1). Although these leverage meta-learning, they cannot tackle continuous HPO. HPOD outperforms them in all experiments (see §4.2).

3 Hpod: Hyperparameter Optimization for Unsupervised Outlier Detection

3.1 Problem Statement

We consider the hyperparameter optimization (HPO) problem for unsupervised outlier detection (OD), referred to as HPO for OD hereafter. Given a new dataset without any labels and an OD algorithm with the HP space , the goal is to identify a HP setting so that model (i.e., detector with HP ) achieves the highest performance111In this paper, we use the area under the precision-recall curve (AUCPR, a.k.a. Average Precision or AP) as the performance metric, which can be substituted with any other metric of interest.. HPs can be discrete and continuous, leading to an infinite number of candidate HP configurations. For instance, given hyperparameters , with domains , the hyperparameter space of is a subset of the cross product of these domains: . Eq. (1) presents the goal formally.

(1)
Problem 1 (HPO for OD)

Given a new input dataset (i.e., detection task222Throughout text, we use task and dataset interchangeably.) without any labels, pick a hyperparameter setting for a given detection algorithm to employ on to maximize its performance.

It is infeasible to evaluate an infinite number of configurations with continuous HP domain(s), and thus a key challenge is efficiently searching the space. Further, since HPO for OD does not have access to ground truth labels , HP performance (perf.) cannot be evaluated directly.

3.2 Overview of Hpod

In HPOD, we use meta-learning to enable (originally supervised) SMBO for HPO for OD, where the key idea is to transfer useful information from historical tasks to a new test task. As such, HPOD takes as input a collection of historical tasks , namely, a meta-train database with ground-truth labels where . Given an OD algorithm for HPO, we define a finite meta-HP set by discretizing continuous HP domains (if any) to get their cross-product, i.e., . We use to denote detector with the -th HP setting . HPOD uses to compute:

  • the historical output scores of each detector on each meta-train dataset , where refers to the output outlier scores using the -th HP setting for the points in the -th meta-train dataset ; and

  • the historical performance matrix of each detector , where is ’s performance\footreffootnote:metric on meta-train dataset .

HPOD consists of two phases. During the (offline) meta-learning, it leverages meta-train database with labels to build a proxy performance evaluator (PPE), which can predict HP performance on a new task without labels. Also, it trains a meta-surrogate function (MSF) for each meta-train dataset to facilitate later HPO on a new dataset. In the (online) HP Optimization for a new task, HPOD uses PPE to predict its HPs’ performance without using any labels, under the SMBO framework to identify promising HP settings in iteration effectively. Also, we improve the surrogate function in SMBO by transferring knowledge from similar meta-train datasets. An outline of HPOD is given in Algorithm 1, where we elaborate on the details of offline meta-training and online HPO for OD in §3.3 and §3.4, respectively.

0:  (Offline) meta-train database , OD algorithm , meta-HP set , performance evaluation ; (Online) new OD dataset (no labels), number of iterations
0:  (Offline) HPOD meta-learners; (Online) the selected hyperparameter setting for   (Offline) Meta-train: Learn functions for HP performance prediction 3.3)
1:  Train detector with HP on each of to get outlier scores ,
2:  Evaluate each against ground-truth labels to get performance matrix , where
3:  Extract meta-features (MF) per task,
4:  Compute internal perf. measures (IPM),
5:  Train proxy performance evaluator (PPE) to predict the performance from the respective {HP setting , meta-features , IPMs }                                    §3.3.1
6:  Train each meta-surrogate function (MSF) per meta-train dataset to predict the perf. from only the respective {HP setting }                            §3.3.2
7:  Save MF extractor , IPM extractor , PPE , and MSF   (Online) HPO on a new task: Iteratively identify promising HP settings and output the best one 3.4)
8:  Extract meta-features of the test dataset
9:  Initialize surrogate function and the evaluation set by the meta-train database and PPE \hfill §3.4.1
10:  for  to  do \hfill §3.4.2
11:     Transfer meta-surrogate functions to surrogate by performance similarity to meta-train tasks \hfill §3.5.2
12:     Get the next promising HP to evaluate by EI on surrogates’ prediction,
13:     Build w/ , and get outlier scores and IPMs
14:     Predict perf. of by ,
15:     Add to the evaluation set
16:     Update to with new pairs of information
17:  end for
18:  Output with the highest predicted perf. \hfill Eq. (2)
Algorithm 1 HPOD: Offline and Online Phases

3.3 (Offline) Meta-training

In principle, meta-learning carries over the prior experience of historical (meta-train) tasks to do better on a new task, given the latter at least resembles some of the historical tasks. Due to the lack of ground truth labels and/or a reliable internal objective function, the key challenge in HPO for OD is to evaluate the performance of HP settings. Thus, the core of HPOD’s meta-learning is learning the mapping from HP settings onto ground-truth performance by the supervision from the meta-train database. The first part (lines 1-7) of Algorithm 1 describes the core steps, and we further discuss on how to learn this mapping (§3.3.1) and transfer additional info. for a new task (§3.3.2) in the following.

3.3.1 Proxy Performance Evaluator (PPE).

In HPOD, we learn a regressor across all meta-train datasets, named Proxy Performance Evaluator, that maps their {HP settings, data characteristics, additional signals} onto ground truth performance. If only uses HP settings as the input feature, it fails to capture the performance variation of an HP given different datasets. Thus, we need additional input features to enable for quantifying dataset similarity, so that can make similar HP performance predictions on similar datasets, and vice versa.

How can we capture dataset similarity in OD? Recent work by (zhao2021automatic) introduced specialized OD meta-features (MF) to describe general characteristics of OD datasets; e.g., number of samples, basic feature statistics, etc. With the meta-feature extractor, both meta-train datasets and (later) the test dataset can be expressed as fixed-length vectors, and thus any similarity measure applies, e.g., Euclidean distance. To build , we extract meta-features from each meta-train dataset as , where is the extraction module, and is the number of MF.

Although meta-features describe general characteristics of OD datasets, their similarity does not necessarily correlate with the actual performance. Thus, we enrich the input features of with internal performance measures (IPMs) (ma2021large), which are more “performance-driven”. IPMs are noisy/weak unsupervised signals that are solely based on the input samples and/or a given model’s output (e.g., outlier scores) that can be used to compare two models goix2016evaluate; marques2020internal. In HPOD, we make the best use of these weak signals by learning in to regress the IPMs of a given HP setting (along with other signals) onto its true performance with supervision. To build , we extract IPMs of each detector with HP setting on each meta-train dataset , where refers to the IPMs using the -th HP setting for the -th meta-train dataset, and is the IPM extractor. We defer the details on IPMs to Appx. B.1.

Putting these together, we build Proxy Performance Evaluator to map {HP setting, meta-features, IPMs} of HP on the -th meta-train dataset onto its ground truth performance, i.e., . See training details of in Appx. B.2.

We find it critical to remark that provided , , and the trained at test time, predicting the detection performance of HP settings becomes possible for the new task without using any ground-truth labels.

3.3.2 Meta-Surrogate Functions (MSF).

Different from that trains on all meta-train datasets and leverages rich input features (i.e., HPs, MFs, and IPMs) to predict HP performance, we also train independent regressors with only HPs as input, . That is, for each meta-train dataset , we train a regressor that simply maps the -th HP setting to its detection performance on the -th meta-train dataset, i.e., .

Since these independent regressors only use HP settings as input, they can be readily transferred to the online model selection phase to improve HP performance evaluation on . We defer their specific usage to §3.4.2 and §3.5.2.

3.4 (Online) HPO on a New OD Task

After the meta-training phase, HPOD is ready to optimize HPs for a new dataset. In short, it outputs the HP with the highest predicted performance by , the trained performance evaluator (§3.4.1). To explore better HPs efficiently, HPOD leverages Sequential Model-based Optimization to iteratively select promising HPs for evaluation (§3.4.2). The second part (lines 8-18) of Algorithm 1 shows the core steps.

3.4.1 Hyperparameter Optimization via Proxy Performance Evaluator.

Given a new dataset , we can sample a set of HPs (termed as the evaluation set ), and use the proxy performance evaluator from meta-training to predict their performance, based on which we can output the one with the highest predicted value as follows.

(2)

By setting to some randomly sampled HPs and plugging it into Eq. (2), we have the “version 0” of HPOD, referred as HPOD_0. However, needs IPMs (i.e., in Eq. (2)) as part of the input, requiring detector building at test time. Thus, we should construct carefully to ensure it captures promising HPs, where random sampling is insufficient. Thus we ask: how can we efficiently identify promising HPs for model building and evaluation at test time?

3.4.2  Identifying Promising HPs by Sequential Model-based Optimization (SMBO).

As we briefly described in §2, SMBO can iteratively optimize an expensive objective (hutter2011sequential), and has been widely used in supervised model selection and HPO (bergstra2015hyperopt). Other than sampling HPs randomly, learning-based SMBO shows better efficiency in finding promising HPs to evaluate in iterations. In short, SMBO constructs a cheap regression model (called surrogate function ) and uses it for identifying the promising HPs to be evaluated by the (expensive) true objective function. It then iterates between fitting the surrogate function with newly evaluated HP information and gathering new information based on the surrogate function. We provide the pseudo-code of the supervised HPO by SMBO in Appx. Algorithm A1, and note that it does not directly apply to HPO for OD as perf. cannot be evaluated without ground truth labels (line 4).

We enable (originally supervised) SMBO for unsupervised outlier detection HPO by plugging the PPE from the meta-train in place of HP performance evaluation. The key steps of online HPO by HPOD are presented below.

Surrogate Function and Initialization. As an approximation of the expensive objection function, surrogate function only takes HP settings as input, aiming for fast performance evaluation on a large collection of sampled HPs. For the new task without access to true performance evaluation, HPOD lets learn a mapping from an HP to its predicted performance, i.e., . To enable on , HPOD needs one-time computation for the corresponding meta-features as . We want to remark that differs from in two aspects. First, can make fast performance predictions on HPs as it only needs HPs as input, while is more costly since IPMs require model building. Second, is a regression model that can measure both the predicted performance of HP settings and uncertainty (potential) around the prediction simultaneously. A popular choice for is the Gaussian Process (GP) (williams1995gaussian)333We use GP in HPOD; one may use any regressor with prediction uncertainty estimation, e.g., random forests (breiman2001random)..

To initialize , we train it on a small number of HPs. More specifically, we train with pairs of HPs and their corresponding predicted performance by on , and also initialize the evaluation set to these HPs. Although we can randomly sample the initial HPs, we propose to set them to top-performing HPs from similar meta-train tasks. Consequently, our initial is more accurate in predicting likely well-performing HPs on . We defer the details of this meta-learning-based surrogate initialization to §3.5.1.

Iteration: Identifying Promising HPs. Although we can already output an HP from with the highest predicted performance after initialization, we aim to use to identify “better and better” HPs within the time budget.

Under the SMBO framework, in each iteration we use to predict the performance (denoted as ) and the uncertainty around the prediction (denoted as ) of sampled 444 is a finite HP candidate set that is randomly sampled from the full (continuous) HP space (see details in Appx. C.1). Since can make fast predictions, ’s size can be large, e.g., 10,000 as in hutter2011sequential. , and then select the most promising one to be “evaluated” by . Intuitively, we would like to evaluate the HPs with both high predicted performance (i.e., exploitation) and high potential/prediction uncertainty (i.e., exploration), which is widely known as “exploitation-exploration trade-off” shahriari2015taking. Too much exploitation (i.e., always evaluating the similar HPs) will fail to identify promising HPs, while too much exploration (i.e., only considering high uncertainty HPs) may lead to low-performance HPs. Also, note that the quality of identified HPs depends on the prediction accuracy of , where we propose to transfer knowledge from meta-surrogate functions by performance similarity (see details in §3.5.2.)

How can we effectively balance the trade-off between exploitation and exploration in HPOD? As an SMBO method, we use the acquisition function to factor in the trade-off and pick a promising HP setting based on the outputs of the surrogate function. The acquisition function quantifies the “expected utility” of HPs by balancing their predicted performance and the uncertainty. Thus, we output the most promising HP to evaluate by maximizing :

(3)

One of the most prominent choices of is Expected Improvement (EI) (DBLP:journals/jgo/JonesSW98), which is used in HPOD and can be replaced by other choices. EI has a closed-form expression under the Gaussian assumption, and the EI value of HP setting is shown below.

(4)

In the above, and respectively denote the cumulative distribution and the probability density functions of a standard Normal distribution, and are the predicted performance and the uncertainty around the prediction of by the surrogate function , and is the highest predicted performance by on so far. We compare EI with other selection criteria in §4.4.1.

At the -th iteration, we plug the surrogate into Eq. (3), which returns to evaluate. Next, we train the OD model with to get its outlier scores and IPMs , and then predict its performance by as

(5)

Finally, we add to the evaluation set , and update the surrogate function to with newly evaluated HP information .

To sum up, HPOD alternates between (i) identifying the next promising HP by the surrogate function and (ii) updating based on newly evaluated HP.

Continuous HP search. Recall that an outstanding property of HPOD compared to the baselines is its capability for continuous HP search (Table 1). \footreffootnote:sample can be any subset of the full HP space and not restricted to the discrete .

Time Budget. As a SMBO method, HPOD is an anytime algorithm: at any time the user asks for a result, it can always output the HP with the highest predicted performance in the evaluation set at the current iteration (Eq. (2)). HPOD uses to denote the max number of iterations.

3.5 Details of (Online) HPO

3.5.1 Meta-learning-based Surrogate Initialization.

Other than initializing on a small set of randomly sampled HPs, we design a meta-learning initialization strategy for the surrogate function (see §3.4.2), based on the similarity between the test and meta-train datasets. Intuitively, we hope the surrogate function can make accurate predictions on the top-performing HPs of the test dataset, while the accuracy of the under-performing HP regions is less important. To this end, we use the meta-features (see §3.3.1) to calculate the similarity between the test dataset to each meta-train task , and initialize with the top performing HPs from the most similar meta-train dataset; these HPs may yield good performance on the test dataset as well. See the comparison to random initialization in §4.4.2.

3.5.2 Surrogate Transfer by Performance Similarity.

As introduced in §2.2, meta-learning can be used to improve the surrogate function in SMBO by transferring knowledge from meta-train datasets. In HPOD, we train independent Meta-Surrogate Functions (see §3.3.2) for meta-train datasets, ; each with the same regression function as (i.e., Gaussian Process). To this end, we identify the most similar meta-train dataset in iteration , and use the test surrogate and ’s meta-surrogate together to predict the performance of HP , i.e.,

(6)

where is the similarity between the test dataset and measured in iteration . Of course, we could use meta-features to measure the dataset similarity, but its value does not change in iterations and always finds the same meta-train dataset to transfer (even for different OD algorithms).

Instead, we (re)calculate the performance similarity every iteration based on the HPs in between each meta-train task and the test dataset, and dynamically transfer the most similar meta-train dataset’s meta-surrogate function. Specifically, HPOD computes a rank-based similarity by weighted Kendall tau shieh1998weighted) between each meta-train dataset’s ground truth performance and the test dataset’s predicted performance by on (which gets updated in every iteration). See the effect of surrogate transfer in §4.4.3.

4 Experiments

4.1 Experiment Setting

OD Algorithms and Testbed. We show the results of HPO on (a) deep RAE in §4.2.1 and (b) LOF and (c) iForest in §4.2.2. Each OD algorithm is evaluated on a 39-dataset testbed (Appx. §D.2, Table D1). Details of each algorithm’s HP spaces and the meta-HP set is provided in Appx. D.3. All experiments are conducted on an AMD 5900x@3.7GhZ, 64GB RAM workstation with NVIDIA RTX A6000 GPUs.

Baselines. Table 1 summarizes the baselines with categorization. We include (i) Simple methods: (1) Default always employs the same default/popular HP setting (only if specified in the literature) (2) Random choice of HPs and (3) MinLoss outputs the HP with the lowest internal loss (only applicable to the algorithms with an objective/loss function) and (ii) Complex methods: (4) HyperEnsemble (HyperEns or HE) that averages the results of randomly sampled HPs (wenzel2020hyperparameter) (5) Global Best (GB) selects the best performing HP on meta-train database on average (6) ISAC  conf/ecai/KadiogluMST10 (7) ARGOSMART (AS)  nikolic2013simple and (8) MetaOD  zhao2021automatic. Additionally, we include (9) HPOD_0, a variant of HPOD that directly uses to choose from randomly sampled HPs (see §3.4.1). Note that the unsupervised OD model selection baselines (5)-(8) are not for HPO for OD, i.e. they are infeasible with continuous HP spaces. We adapt them for HPO by selecting from the discrete meta-HP set in §3.2. More details of the baselines are in Appx. D.4.

Evaluation. We split the meta-train/test by leave-one-out cross-validation (LOOCV). Each time we use one dataset as the input dataset for HPO, and the remaining datasets as meta-train. We run five independent trials and report the average for the baselines with randomness. We use Average Precision (AP) as the performance measure, while it can be substituted with any other measure\footreffootnote:metric. As the raw performance like AP is not comparable across datasets with varying magnitude, we report the normalized AP rank of an HP, ranging from 1 (the best) to 0 (the worst)—thus higher the better. Also, we provide an additional metric called “top q%”, denoting that an HP’s performance has no statistical difference from the top q% HP from the meta HP-set, ranging from 0 (the best) to 1 (the worst)—thus lower the better. To compare two methods, we use the paired Wilcoxon signed rank test across all 39 datasets (significance level ). We give the full performance results in Appx. D.6.

Hyperparameters. The hyperparameters of HPOD are chosen by LOOCV on the meta-train set. To that end, we initialize the surrogate function with 10 HPs. Also, we report the results within a 30-minute budget for deep RAE and 10-minute budget for LOF and iForest, leading to =40 iters.

4.2 Experiment Results

4.2.1 Results on (Deep) Robust Autoencoder.

Fig. 1 (right) and 2 show that HPOD outperforms all baselines w.r.t. both the best avg. normalized AP rank and the top q% value. Furthermore, HPOD is also statistically better than all baselines as shown in Table 1(a), including strong meta-learning baseline MetaOD (). Its advantages can be credited to two reasons. First, meta-learning-based HPOD leverages prior knowledge on similar historical tasks to predict HP perf. on the new dataset, whereas simple baselines like Random and MinLoss cannot. Second, only HPOD and HPOD_0 can select HPs from continuous spaces, while other meta-learning baselines are limited to finite discrete HPs as specified for the meta-train datasets which are too few to capture optimal HPs (especially for deep models with huge HP spaces).

HPO by the internal objective(s) is insufficient. Fig. 2 shows that selecting HP by minimal reconstruction loss (i.e., MinLoss) has the worst performance for RAE, even if it can work with continuous spaces. This suggests that internal loss does not necessarily correlate with external performance. On avg., HPOD has 37% higher normalized AP rank, showing the benefit of transferring supervision via meta-learning.

\subfloat

[Results on RAE]

\subfloat

[Results on LOF]

\subfloat

[Results on iForest]

Figure 2: Comparison of avg. rank (lower is better) of algorithm performance across datasets on three OD algorithms. HPOD outperforms all w/ the lowest avg. algo. rank. The numbers on each line are the top q% value (lower is better) of the employed HP (or the avg.) by each method. HPOD shows the best performance in all three HPO experiments.
Ours baseline p-value Ours baseline p-value
HPOD AS 0.0309 HPOD ISAC 0.0028
HPOD Random 0.0014 HPOD MetaOD 0.0398
HPOD HyperEns 0.0382 HPOD MinLoss 0.0003
HPOD GB 0.0002 HPOD HPOD_0 0.0201
(a) On RAE, HPOD is statistically better than all baselines.
Ours baseline p-value Ours baseline p-value
HPOD AS 0.0023 HPOD ISAC 0.0246
HPOD Random 0.0001 HPOD MetaOD 0.0088
HPOD HyperEns 0.0607 HPOD Default 0.0029
HPOD GB 0.0017 HPOD HPOD_0 0.0016
(b) On LOF, HPOD is statistically better than all (except HyperEnsemble (HE)), including the default HP setting.
Ours baseline p-value Ours baseline p-value
HPOD AS 0.0055 HPOD ISAC 0.0088
HPOD Random 0.0003 HPOD MetaOD 0.0289
HPOD HyperEns 0.0484 HPOD Default 0.0013
HPOD GB 0.0027 HPOD HPOD_0 0.003
(c) On iForest, HPOD is statistically better than all baselines, including the default HP setting.
Table 2: Pairwise statistical test results between HPOD and baselines by Wilcoxon signed rank test. Statistically better method shown in bold (both marked bold if no significance).
# Iter Dist. metric # Neighbors AP Value
1 Manhattan 23 0.2866
6 Cosine 42 0.327
10 Cosine 55 0.3438
20 Chebyshev 72 0.3569
30 Chebyshev 73 0.357
Optimal Chebyshev 79 0.3609
Table 3: Trace of HPOD on Cardiotocography dataset. Over iterations (col. 1), HPOD gradually identifies better HPs (col. 2 &3), with higher AP (col. 4). The optimal HP on from the meta-HP set is {’Chebyshev’, 79}, which HPOD gets closer to the optimal HP during its adaptive search.

4.2.2  Results on LOF and iForest.

In addition to deep RAE, HPOD shows generality on diverse OD algorithms, including non-deep LOF (Fig. 1 and 2) with mixed HP spaces (see details in Appx. Table 1(b)) as well as ensemble-based iForest (Fig. 1 and 2). HPOD achieves the best performance in both with the best norm. AP rank and top q%.

HPOD is statistically better than the default HPs of LOF () and iForest () (see Table 1(b) and 1(c)). More specifically, we find that HPOD provides and performance (i.e., normalized AP rank) improvement over using the default HPs of LOF (Fig. 1 (right)) and iForest (Fig. 1). In fact, note that the default HPs rank the lowest for both LOF (Fig. 2) and iForest (Fig. 2), justifying the importance of HPO methods in unsupervised OD.

HyperEns that averages outlier scores from randomly samples HPs yield reasonable performance, which agrees with the observations in the literature ding2022hyperparameter. However, HyperEns has a higher inference cost as it needs to save and use all base models, not ideal for time-critical applications. Using a single model with the selected HPs by HPOD offers not only better detection accuracy but also efficiency.

4.3 Case Study

It is interesting to trace how HPOD identifies better HPs over iterations given a dataset. We show an example of tuning LOF on Cardiotocography dataset in Table 3. Among 200 candidate HP settings (see Appx. Table 1(b)), the optimal HP setting is {‘Chebyshev’, 79} with AP=0.3609. In 30 iterations, HPOD gradually identifies better HPs (closer to optimal), i.e., {‘Chebyshev’, 73}. Its AP improves from 0.2866 (1-st iteration) to 0.357 (30-th iteration). See full trace in Appx. D.5, Table D3.

4.4 Ablation Studies and Other Analysis

4.4.1 The Choices of Acquisition Function.

HPOD uses the EI acquisition to select an HP based on the surrogate function’s prediction (see §3.4.2). We compare it with random and greedy acquisition (latter picks the HP with the highest predicted performance, ignoring uncertainty) in Fig. 3, where EI-based acquisition shows the best performance.

Figure 3: Ablation of our EI (med.=0.893) vs. the greedy (med.=0.870) and random acquisition (med.=0.851).

4.4.2 Surrogate Initialization.

HPOD uses meta-learning to initialize the surrogate (see §3.5.1). Fig. 4 shows its advantage over random init. with higher performance.

Figure 4: Ablation of meta-initialization (med.=0.893) vs. random initialization (med. =0.874) of surrogate function.

4.4.3 The Effect of Surrogate Transfer.

To improve the prediction performance of the surrogate function, HPOD transfers meta-surrogate functions from similar meta-train tasks (see §3.5.2). Fig. 5 shows that this transfer helps find better HPs, demonstrating the added value of meta-learning besides PPE training of and surrogate initialization.

Figure 5: Ablation of our w/ surrogate transfer (med.=0.893) vs. without surrogate transfer (med. =0.872).

5 Conclusion

We introduce (to our knowledge) the first systematic hyperparameter optimization (HPO) approach for unsupervised outlier detection (OD). The proposed HPOD is a meta-learner, and builds on an extensive pool of historical OD benchmark datasets based on which it trains a performance predictor (offline). Given a new task without labels (online), it capitalizes on the performance predictor to enable (originally supervised) sequential model-based optimization for identifying promising HPs iteratively. Notably, HPOD stands out from all prior work on HPO for OD in being capable of handling both discrete and continuous HP search. Extensive experiments on three (including both deep and shallow) OD algorithms show its generality, where it significantly outperforms a diverse set of baselines. Future work will consider joint algorithm selection and continuous hyperparameter optimization for unsupervised OD.

References

Supplementary Material for Hpod

Details on algorithm design and experiments.

Appendix A Details on Supervised SMBO

Algorithm A1 shows the pseudo-code of classical SMBO for HPO. In each iteration, the surrogate function predicts the performance and uncertainty of a group of sampled HP settings, where the acquisition function selects the best one to be evaluated next by the objective function . With the newly evaluated pair of HP settings and the objective value, the surrogate function is updated to be more accurate in iteration.

0:  learning algorithm , surrogate function , input task , objective function , number of iterations
0:  selected hyperparameter setting for  
1:  Initialize surrogate function
2:  for  to  do
3:     
4:      \hfill infeasible for HPO for OD
5:     Update to with new information
6:  end for
7:  Output with the highest evaluated objective function values
Algorithm A1 SMBO for Supervised Hyperparameter Optimization

Clearly, the classical SMBO does not apply to HPO for OD directly since the objective function cannot be evaluated without ground truth labels (line 4). The proposed HPOD uses meta-learning to train a regressor to predict the performance of an HP on the new dataset without any labels (§3.3), and thus enables (originally supervised) SMBO for HPO for OD3.4).

Appendix B (Offline) Meta-training Details

b.1 Internal Performance Measures (IPMs)

As described in §3.3.1, IPMs are used as part of the input features of the proxy performance evaluator . In HPOD, we use three consensus-based IPMs (i.e., MC, SELECT, and HITS) which carry useful signals in unsupervised OD model selection ma2021large. More specifically, consensus-based IPMs consider the resemblance to the overall consensus of outlier scores as a sign of a better (performance of) model. Because of this, a group of models are needed to compute these IPMs for consensus measure.

In ma2021large, they use all models in for building IPMs (i.e., by pairing detector with each HP in meta-HP set ), leading to high cost in generating outlier scores and then IPMs. To reduce the cost, we instead identify a small subset of representative models called the anchor set (i.e., ), for calculating IPMs. That is, we generate the IPMs of a model with regarding to its consensus to rather than , for both meta-train database and the input dataset. The construction of the anchor set is by cross-validation in a forward selection way.

b.2 Building Proxy Performance Evaluator (PPE)

As outlined in §3.3.1, we build Proxy Performance Evaluator to map {HP setting, meta-features, IPMs} of HP on the -th meta-train dataset onto its ground truth performance, i.e., . Given we have meta-train datasets and the meta-HP set with HP settings, is trained on samples by pairing with meta-train datasets.

To construct the training samples of , we first extract meta-features from each meta-train dataset as , where is the extraction module, and is the length of meta-features.

We also need to extract IPMs of each detector with HP setting on each meta-train dataset , where refers to the IPMs using the -th HP setting the -th meta-train dataset and is the extractor.

Putting these together, we train with samples. In implementation we use LightGBM DBLP:conf/nips/KeMFWCMYL17 for , while it is flexible to choose any other.

b.3 Meta-surrogate Functions (MSF)

As described in §3.3.2, we also train independent regressors with only HPs as input, . That is, for each meta-train dataset , we train a regressor that simply maps the -th HP setting to its detection performance on the -th meta-train dataset, i.e., . Thus, only trains on the HP settings’ performance on the -th meta-train dataset.

In implementation, we use Gaussian Process (GP) williams1995gaussian for MSF \footreffootnote:gp, and we suggest using the same regressor as the surrogate in §3.4 for easy knowledge transfer in §3.5.2.

Appendix C (Online) Model Selection Details

c.1 Sampling Range

Given the PPE is trained on the meta-HP set of the meta-train database, it is more accurate in predicting the HPs from a similar range for the new dataset. Thus, HPOD samples HPs within the range of in SMBO (see §3.4.2). For instance, given the meta-HP set of iForest shown in Appx. Table 1(c), we sample HPs in range of: (i) n_estimators in (ii) max_samples in and (iii) max_features in for . We provide more details on the fast simulation of sampling in Appx. D.3.

Appendix D Additional Exp. Settings and Results

d.1 Code and Reproducibility

We foster future research by fully releasing the code and the testbed at anonymous repo: https://tinyurl.com/hpod22. Upon acceptance, we will release it on GitHub.

d.2 Datasets

In Table D1, we describe the details of the 39 benchmark datasets used in the experiments—it is composed by 18 datasets from DAMI library Campos2016evaluation and 21 datasets from ODDS library Rayana2016b.

Dataset Source #Samples #Dims %Outlier
1_ALOI DAMI 49534 27 3.04
2_Annthyroid DAMI 7129 21 7.49
3_Arrhythmia DAMI 450 259 45.78
4_Cardiotocography DAMI 2114 21 22.04
5_Glass DAMI 214 7 4.21
6_HeartDisease DAMI 270 13 44.44
7_InternetAds DAMI 1966 1555 18.72
8_PageBlocks DAMI 5393 10 9.46
9_PenDigits DAMI 9868 16 0.2
10_Pima DAMI 768 8 34.9
11_Shuttle DAMI 1013 9 1.28
12_SpamBase DAMI 4207 57 39.91
13_Stamps DAMI 340 9 9.12
14_Waveform DAMI 3443 21 2.9
15_WBC DAMI 223 9 4.48
16_WDBC DAMI 367 30 2.72
17_Wilt DAMI 4819 5 5.33
18_WPBC DAMI 198 33 23.74
19_annthyroid ODDS 7200 6 7.42
20_arrhythmia ODDS 452 274 14.6
21_breastw ODDS 683 9 34.99
22_glass ODDS 214 9 4.21
23_ionosphere ODDS 351 33 35.9
24_letter ODDS 1600 32 6.25
25_lympho ODDS 148 18 4.05
26_mammography ODDS 11183 6 2.32
27_mnist ODDS 7603 100 9.21
28_musk ODDS 3062 166 3.17
29_optdigits ODDS 5216 64 2.88
30_pendigits ODDS 6870 16 2.27
31_pima ODDS 768 8 34.9
32_satellite ODDS 6435 36 31.64
33_satimage-2 ODDS 5803 36 1.22
34_speech ODDS 3686 400 1.65
35_thyroid ODDS 3772 6 2.47
36_vertebral ODDS 240 6 12.5
37_vowels ODDS 1456 12 3.43
38_wbc ODDS 378 30 5.56
39_wine ODDS 129 13 7.75
Table D1: Testbed composed by 18 datasets from DAMI library Campos2016evaluation and 21 datasets from ODDS library Rayana2016b.

d.3 OD Algorithms and Hyperparameter Spaces

We demonstrate the HPOD effectiveness on three OD algorithms, namely RAE, LOF, and iForest. For RAE, we adapt the author’s code with seven key HPs. For LOF and iForest, we use the implementation from Python Outlier Detection (PyOD) library zhao2019pyod. For fast simulation, we also precompute the outlier scores and IPMs for the inner-HP set (denoted as ), which is within the range of the meta-HP set and serving as additional HPs sampled from continuous HP spaces. In the experiment, HPOD sets , and uses to score all the HPs in that are not yet evaluated by yet (see §3.4), thus simulating the advantage of sampling from larger “continuous” HP spaces. Table D2 and the code show details of HP spaces, the meta-HP set, and the inner-HP set.

HP Name Type Meta-HP Set Inner-HP Set
1 # EncodeLayers int (continuous) {2,4} {2,4}
2 Lambda float (continuous) {5e-5, 5e-3, 5e-1} {1e-4, 1e-3, 1e-1}
3 Learning Rate float (continuous) {1e-3, 1e-2} {1e-3, 1e-2}
4 # Inner Epochs int (continuous) {20, 50} {30, 40}
5 # Outer Epochs int (continuous) {20, 50} {30, 40}
6 Shrinkage Decay int (continuous) {2,4} {2,4}
7 Dropout float (continuous) {0, 0.1, 0.3, 0.5} {0, 0.1, 0.2, 0.4}
(a) Key HPs optimized by HPOD for RAE. Both meta-HP set and inner-HP set include =388 HP settings.
HP Name Type Meta-HP Set Inner-HP Set
1 n_neighbors int (continuous) {1,3,5,,80} {2,4,6,,81}
2 distance metric str (categorical) \hfill
{’chebyshev’, ’minkowski’, ’cosine’, ’euclidean’,’manhattan’}
\hfill
Same
(b) Key HPs optimized by HPOD for LOF. Both meta-HP set and inner-HP set include =200 HP settings.
HP Name Type Meta-HP Set Inner-HP Set
1 n_estimators int (continuous) {10,20,30,40,50,75,100,150} {10,20,30,40,50,75,100,150}
2 max_samples float (continuous) {0.1, 0.2, , 0.9} {0.1, 0.2, , 0.9}
3 max_features float (continuous) {0.2, 0.4, 0.6, 0.8} {0.3, 0.5, 0.7, 0.75}
(c) Key HPs optimized by HPOD for iForest. Both meta-HP set and inner-HP set include =288 HP settings.
Table D2: Key HPs optimized by HPOD, and the meta-HP set and the inner-HP set used in this study.

d.4 Baselines

We provide the details of baselines presented in Table 1 and §4.1, namely simple methods and complex methods.

Simple methods:

  1. Default always employees the same default/popular HP setting of the underlying OD algorithm (only applicable to the algorithms with recommended HPs).

  2. Random denotes selecting HPs randomly.

  3. MinLoss outputs the HP with the lowest internal loss (only applicable to the algorithms with an internal objective/loss) from a group of random samples HPs.

Complex methods:

  1. Hyperensemble (HyperEns or HE) that averages the outlier scores of randomly sampled HPs (wenzel2020hyperparameter). Strictly speaking, HE is not an HPO method.

  2. Global Best (GB) selects the best performing HP on meta-train database on average.

  3. ISAC  conf/ecai/KadiogluMST10 first groups meta-train datasets into clusters, and assigns the best performing HP in the meta-HP set to each cluster. During the online HPO phase, it first assigns the new dataset to one of the clusters and uses the group-based HP for the new dataset.

  4. ARGOSMART (AS)  nikolic2013simple identifies the most similar meta-train dataset of the new task, and outputs the best performing HP on the meta-task for the new task.

  5. MetaOD  zhao2021automatic uses matrix factorization to capture the dataset similarity and HPs’ performance similarity, which is the SOTA unsupervised outlier model selection method.

Additionally, we include (9) HPOD_0, a variant of HPOD that directly uses to choose from randomly sampled HPs (see §3.4.1). Note that these unsupervised OD model selection baselines (5-8) are not original for HPO for OD, and could not work with continuous HP spaces. We adapt them for HPO by selecting an HP from the meta-HP set described in §3.2.

d.5 Full Trace of Case Study

# Iter Metrics # Neighbors Norm. AP Rank
1 Manhattan 23 0.2866
2 Manhattan 23 0.2866
3 Manhattan 23 0.2866
4 Manhattan 23 0.2866
5 Cosine 41 0.327
6 Cosine 42 0.327
7 Cosine 55 0.3438
8 Cosine 55 0.3438
9 Cosine 55 0.3438
10 Cosine 55 0.3438
11 Cosine 55 0.3438
12 Cosine 55 0.3438
13 Cosine 55 0.3438
14 Chebyshev 72 0.3569
15 Chebyshev 72 0.3569
16 Chebyshev 72 0.3569
17 Chebyshev 72 0.3569
18 Chebyshev 72 0.3569
19 Chebyshev 72 0.3569
20 Chebyshev 72 0.3569
21 Chebyshev 72 0.3569
22 Chebyshev 73 0.357
23 Chebyshev 73 0.357
24 Chebyshev 73 0.357
25 Chebyshev 73 0.357
26 Chebyshev 73 0.357
27 Chebyshev 73 0.357
28 Chebyshev 73 0.357
29 Chebyshev 73 0.357
30 Chebyshev 73 0.357
31 Chebyshev 73 0.357
32 Chebyshev 73 0.357
33 Chebyshev 73 0.357
34 Chebyshev 73 0.357
35 Chebyshev 73 0.357
36 Chebyshev 73 0.357
37 Chebyshev 73 0.357
38 Chebyshev 73 0.357
39 Chebyshev 73 0.357
40 Chebyshev 73 0.357
41 Chebyshev 73 0.357
42 Chebyshev 73 0.357
43 Chebyshev 73 0.357
44 Chebyshev 73 0.357
45 Chebyshev 73 0.357
46 Chebyshev 73 0.357
47 Chebyshev 73 0.357
48 Chebyshev 73 0.357
49 Chebyshev 73 0.357
50 Chebyshev 73 0.357
Optimal Chebyshev 79 0.3609
Table D3: Full trace of HPOD on Cardiotocography dataset. Over iterations (col. 1), HPOD gradually identifies better HPs (col. 2 &3), with higher AP (col. 4). The optimal HP on from the meta-HP set is {’Chebyshev’, 79}, which HPOD gets closer to the optimal HP during its adaptive search (i.e., finding {’Chebyshev’, 73} in 30 iterations).

d.6 Full Performance Results

In addition to the avg. rank plot in Fig. 2, we provide the full performance of RAE, LOF, and iForest in Table D4, D5, and D6, respectively.

datasets Random GB ISAC AS HyperEns MetaOD MinLoss HPOD_0 HPOD
1_ALOI 0.8234 (3) 0.6445 (4) 0.6445 (4) 0.0612 (8) 0.6286 (6) 0.0612 (8) 0.3747 (7) 0.9206 (2) 0.974 (1)
2_Annthyroid 0.5506 (5) 0.0404 (9) 0.4336 (6) 0.5508 (4) 0.2499 (8) 0.8659 (2) 0.5651 (3) 0.3211 (7) 0.9883 (1)
3_Arrhythmia 0.4688 (9) 0.8112 (5) 0.7435 (6) 0.9987 (1) 0.4836 (8) 0.9036 (3) 0.7013 (7) 0.9451 (2) 0.9036 (3)
4_Cardiotocography 0.6519 (5) 0.5951 (6) 0.819 (3) 0.1432 (9) 0.6727 (4) 0.2786 (8) 0.5544 (7) 0.949 (2) 0.9688 (1)
5_Glass 0.6403 (3) 0.5781 (6) 0.4036 (7) 0.1823 (9) 0.7065 (1) 0.6094 (5) 0.2651 (8) 0.6549 (2) 0.6354 (4)
6_HeartDisease 0.5156 (3) 0.3958 (6) 0.5156 (2) 0.056 (8) 0.5662 (1) 0.1901 (7) 0.4128 (5) 0.4487 (4) 0.056 (8)
7_InternetAds 0.6623 (2) 0.5156 (7) 0.5156 (7) 0.5221 (6) 0.6743 (1) 0.5404 (5) 0.3419 (9) 0.6549 (3) 0.651 (4)
8_PageBlocks 0.7584 (5) 0.7591 (3) 0.7591 (3) 0.9779 (1) 0.6078 (7) 0.3047 (8) 0.7544 (6) 0.3047 (8) 0.9505 (2)
9_PenDigits 0.4909 (4) 0.0391 (9) 0.1289 (7) 0.4909 (5) 0.8 (1) 0.4909 (5) 0.5247 (3) 0.619 (2) 0.1289 (7)
10_Pima 0.4117 (8) 0.8698 (3) 0.8698 (3) 0.9779 (1) 0.7792 (5) 0.4115 (9) 0.6865 (6) 0.5247 (7) 0.9779 (1)
11_Shuttle 0.3195 (2) 0.319 (3) 0.319 (3) 0.319 (3) 0.3849 (1) 0.319 (3) 0.2464 (7) 0.0784 (8) 0.056 (9)
12_SpamBase 0.3208 (6) 0.1276 (8) 0.1276 (8) 0.9323 (1) 0.4618 (4) 0.7214 (2) 0.5242 (3) 0.194 (7) 0.3364 (5)
13_Stamps 0.6727 (5) 0.6302 (6) 0.1133 (9) 0.5924 (7) 0.7777 (4) 0.9844 (1) 0.5318 (8) 0.9026 (3) 0.9714 (2)
14_Waveform 0.687 (8) 0.7018 (7) 0.8242 (5) 0.8932 (2) 0.5257 (9) 0.8932 (2) 0.7977 (6) 0.9513 (1) 0.8932 (2)
15_WBC 0.3403 (4) 0.3398 (5) 0.8242 (1) 0.1628 (8) 0.6966 (2) 0.6328 (3) 0.0648 (9) 0.2628 (6) 0.2435 (7)
16_WDBC 0.5896 (5) 0.3542 (7) 0.0378 (9) 0.9115 (2) 0.5849 (6) 0.3542 (7) 0.7099 (4) 0.7219 (3) 0.9935 (1)
17_Wilt 0.5013 (8) 0.5013 (1) 0.5013 (1) 0.5013 (1) 0.0026 (9) 0.5013 (1) 0.5013 (1) 0.5013 (1) 0.5013 (1)
18_WPBC 0.5039 (5) 0.2943 (8) 0.5846 (4) 0.737 (2) 0.2899 (9) 0.6354 (3) 0.7562 (1) 0.4445 (7) 0.5039 (5)
19_annthyroid 0.6351 (7) 0.3112 (8) 0.9922 (1) 0.9701 (2) 0.8379 (5) 0.8307 (6) 0.131 (9) 0.8703 (4) 0.9312 (3)
20_arrhythmia 0.4753 (9) 0.6432 (7) 0.9714 (2) 0.9596 (4) 0.5143 (8) 0.7344 (5) 0.6883 (6) 0.969 (3) 0.9844 (1)
21_breastw 0.139 (6) 0.0716 (9) 0.6719 (2) 0.138 (7) 0.9818 (1) 0.138 (7) 0.2052 (5) 0.6719 (2) 0.6714 (4)
22_glass 0.4584 (6) 0.6589 (3) 0.6693 (2) 0.375 (8) 0.5475 (4) 0.7865 (1) 0.1318 (9) 0.3802 (7) 0.5312 (5)
23_ionosphere 0.7039 (9) 0.8047 (6) 0.8216 (5) 0.9844 (1) 0.7143 (8) 0.7969 (7) 0.8307 (4) 0.9096 (3) 0.9351 (2)
24_letter 0.5013 (3) 0.0872 (7) 0.0182 (8) 0.0182 (8) 0.5273 (2) 0.918 (1) 0.4492 (4) 0.1643 (6) 0.2591 (5)
25_lympho 0.7506 (8) 0.7695 (7) 0.8802 (5) 0.9583 (3) 0.6545 (9) 0.974 (2) 0.7805 (6) 0.9177 (4) 1 (1)
26_mammography 0.4844 (6) 0.3607 (9) 0.4844 (7) 0.3841 (8) 0.5896 (4) 0.9349 (1) 0.7328 (3) 0.531 (5) 0.7865 (2)
27_mnist 0.6052 (6) 0.9818 (1) 0.7786 (3) 0.2513 (9) 0.5496 (7) 0.6406 (5) 0.4677 (8) 0.7279 (4) 0.9401 (2)
28_musk 0.5584 (8) 0.901 (2) 0.6289 (6) 0.9857 (1) 0.5922 (7) 0.7018 (5) 0.1414 (9) 0.8112 (4) 0.8516 (3)
29_optdigits 0.5286 (5) 0.1406 (8) 0.5977 (3) 0.2122 (7) 0.6509 (2) 0.5977 (3) 0.794 (1) 0.2721 (6) 0.0755 (9)
30_pendigits 0.626 (3) 0.5169 (6) 0.5299 (5) 0.1445 (8) 0.8119 (1) 0.0924 (9) 0.318 (7) 0.7096 (2) 0.613 (4)
31_pima 0.713 (4) 0.7135 (3) 0.569 (7) 0.569 (7) 0.6312 (6) 0.569 (7) 0.6779 (5) 0.9247 (2) 0.9987 (1)
32_satellite 0.8714 (4) 0.8503 (6) 0.8685 (5) 0.9909 (1) 0.7813 (7) 0.2865 (9) 0.7471 (8) 0.9471 (2) 0.9323 (3)
33_satimage-2 0.9143 (3) 0.8333 (6) 0.8737 (5) 0.987 (1) 0.7927 (7) 0.3997 (9) 0.481 (8) 0.9063 (4) 0.9506 (2)
34_speech 0.6299 (7) 0.9714 (3) 0.4154 (9) 0.9792 (2) 0.4842 (8) 0.8177 (4) 0.6773 (6) 0.699 (5) 0.9909 (1)
35_thyroid 0.6468 (8) 0.9245 (4) 0.8633 (6) 0.9453 (2) 0.9647 (1) 0.7305 (7) 0.2867 (9) 0.9164 (5) 0.9247 (3)
36_vertebral 0.6974 (8) 0.6979 (3) 0.6979 (3) 0.6979 (3) 0.4649 (9) 0.6979 (3) 0.7487 (2) 0.8133 (1) 0.6979 (3)
37_vowels 0.7506 (7) 0.9349 (4) 0.4648 (9) 0.9974 (1) 0.6987 (8) 0.8581 (6) 0.869 (5) 0.9846 (2) 0.9818 (3)
38_wbc 0.6195 (6) 0.9401 (1) 0.8815 (3) 0.5443 (7) 0.6566 (5) 0.5443 (7) 0.456 (9) 0.9117 (2) 0.8727 (4)
39_wine 0.4818 (4) 0.4818 (5) 0.4818 (5) 0.4818 (5) 0.9403 (2) 0.9714 (1) 0.5797 (3) 0.387 (9) 0.4818 (5)
Average 0.5821 (7) 0.567 (8) 0.5981 (6) 0.6047 (5) 0.6226 (3) 0.6082 (4) 0.5258 (9) 0.6622 (2) 0.7216 (1)
STD 0.1580 0.2884 0.2630 0.3478 0.1937 0.2647 0.2261 0.2718 0.3103
Avg. Rank 5.5641 5.4103 4.8462 4.4359 5.0513 4.7949 5.7949 4.0256 3.3333
Table D4: Method evaluation on RAE (normalized AP rank). The most performing method is highlighted in bold. The algo. rank is provided in parenthesis (lower ranks denote better performance). HPOD achieves the best performance among all baselines.
datasets Random GB ISAC AS HyperEns MetaOD Default HPOD_0 HPOD
1_ALOI 0.6418 (5) 0.6825 (4) 0.625 (6) 0.95 (1) 0.005 (9) 0.1319 (8) 0.905 (2) 0.8975 (3) 0.3433 (7)
2_Annthyroid 0.5224 (7) 0.6175 (5) 0.565 (6) 0.0275 (8) 0.8836 (3) 0.9236 (2) 0.97 (1) 0.0125 (9) 0.6841 (4)
3_Arrhythmia 0.3134 (8) 0.4125 (5) 0.4 (6) 0.98 (1) 0.8557 (3) 0.2188 (9) 0.335 (7) 0.92 (2) 0.7761 (4)
4_Cardiotocography 0.4776 (4) 0.31 (6) 0.745 (3) 0.19 (8) 1 (1) 0.4583 (5) 0.165 (9) 0.2225 (7) 0.985 (2)
5_Glass 0.5721 (3) 0.3975 (5) 0.725 (2) 0.02 (9) 0.2647 (7) 0.1684 (8) 0.95 (1) 0.3525 (6) 0.5498 (4)
6_HeartDisease 0.5473 (6) 0.7625 (3) 0.4825 (7) 0.035 (9) 1 (1) 0.6319 (4) 0.32 (8) 0.555 (5) 0.8507 (2)
7_InternetAds 0.5174 (5) 0.5675 (3) 1 (1) 0.39 (7) 0.4825 (6) 0.9132 (2) 0.02 (9) 0.2775 (8) 0.5572 (4)
8_PageBlocks 0.4129 (6) 0.4425 (5) 0.985 (1) 0.235 (8) 0.61 (3) 0.5903 (4) 0.24 (7) 0.155 (9) 0.8507 (2)
9_PenDigits 0.3333 (7) 0.3475 (6) 0.69 (4) 0.4 (5) 0.1333 (8) 0.8924 (3) 0.9 (1) 0.075 (9) 0.9 (1)
10_Pima 0.3184 (8) 0.535 (6) 0.9975 (1) 0.47 (7) 0.99 (2) 0.783 (4) 0.02 (9) 0.59 (5) 0.9125 (3)
11_Shuttle 0.5174 (6) 0.2725 (8) 0.52 (5) 0.6925 (3) 0.0229 (9) 0.309 (7) 0.895 (1) 0.7125 (2) 0.6925 (3)
12_SpamBase 0.403 (5) 0.4075 (4) 0.21 (6) 0.17 (8) 1 (1) 0.1823 (7) 0.155 (9) 0.715 (3) 0.8259 (2)
13_Stamps 0.408 (4) 0.4225 (3) 0.8 (2) 0.08 (9) 0.408 (4) 0.1406 (8) 0.195 (6) 0.1625 (7) 0.815 (1)
14_Waveform 0.7761 (3) 0.2025 (6) 0.3425 (5) 0.175 (7) 0.0338 (9) 0.6528 (4) 0.85 (2) 0.0575 (8) 0.975 (1)
15_WBC 0.5522 (5) 0.4175 (6) 0.99 (1) 0.01 (9) 0.9284 (3) 0.9653 (2) 0.195 (7) 0.1775 (8) 0.765 (4)
16_WDBC 0.1592 (8) 0.6575 (3) 0.755 (2) 0.6575 (3) 0.1095 (9) 0.8368 (1) 0.235 (6) 0.6575 (3) 0.2 (7)
17_Wilt 0.4378 (4) 0.645 (3) 0.185 (7) 1 (1) 0.005 (9) 0.4045 (5) 0.2 (6) 0.995 (2) 0.17 (8)
18_WPBC 0.2886 (8) 0.695 (4) 0.7575 (2) 0.54 (6) 0.2527 (9) 0.3385 (7) 0.7275 (3) 0.95 (1) 0.615 (5)
19_annthyroid 0.4428 (8) 0.2525 (9) 0.58 (6) 1 (1) 0.9343 (2) 0.7743 (3) 0.485 (7) 0.61 (5) 0.7475 (4)
20_arrhythmia 0.4328 (6) 0.4425 (5) 0.165 (9) 0.94 (2) 0.7055 (3) 0.6927 (4) 0.205 (7) 0.19 (8) 0.99 (1)
21_breastw 0.6318 (6) 0.0975 (8) 0.8 (3) 0.7025 (4) 1 (1) 0.0556 (9) 0.185 (7) 0.8675 (2) 0.7025 (4)
22_glass 0.5473 (3) 0.5175 (4) 0.105 (8) 0.325 (7) 0.0229 (9) 0.3524 (6) 0.91 (2) 0.9825 (1) 0.4428 (5)
23_ionosphere 0.3781 (5) 0.5125 (4) 0.985 (1) 0.77 (3) 0.205 (7) 0.0694 (8) 0.055 (9) 0.2825 (6) 0.9403 (2)
24_letter 0.5224 (5) 0.7425 (3) 0.055 (7) 0.005 (8) 0.005 (9) 0.9722 (1) 0.95 (2) 0.44 (6) 0.54 (4)
25_lympho 0.4428 (6) 0.8275 (2) 0.03 (9) 0.3 (7) 0.9721 (1) 0.8108 (3) 0.6375 (5) 0.2425 (8) 0.805 (4)
26_mammography 0.4129 (6) 0.4175 (5) 0.9075 (2) 0.215 (7) 1 (1) 0.6563 (4) 0.045 (9) 0.1825 (8) 0.6775 (3)
27_mnist 0.5025 (5) 0.4175 (6) 0.24 (8) 0.055 (9) 0.7522 (2) 0.6302 (3) 0.505 (4) 0.405 (7) 0.79 (1)
28_musk 0.7662 (2) 0.6475 (4) 0.59 (5) 0.155 (8) 1 (1) 0.066 (9) 0.675 (3) 0.3575 (7) 0.3881 (6)
29_optdigits 0.5572 (7) 0.8925 (3) 0.3225 (9) 1 (1) 0.6299 (6) 0.7465 (5) 0.765 (4) 0.905 (2) 0.45 (8)
30_pendigits 0.6219 (8) 0.6625 (6) 0.6275 (7) 0.1725 (9) 1 (1) 1 (1) 0.675 (5) 0.7525 (4) 0.77 (3)
31_pima 0.4776 (6) 0.4225 (7) 0.9525 (1) 0.92 (2) 0.3775 (8) 0.5729 (5) 0.05 (9) 0.8875 (3) 0.86 (4)
32_satellite 0.408 (4) 0.2725 (8) 0.36 (7) 0.905 (3) 1 (1) 0.4063 (5) 0.25 (9) 0.365 (6) 0.985 (2)
33_satimage-2 0.5771 (6) 0.3325 (8) 0.7725 (5) 0.98 (1) 0.855 (4) 0.5226 (7) 0.94 (2) 0.915 (3) 0.005 (9)
34_speech 0.8209 (3) 0.8275 (2) 0.57 (5) 0.3275 (8) 0.8945 (1) 0.349 (7) 0.025 (9) 0.42 (6) 0.6825 (4)
35_thyroid 0.4925 (7) 0.2925 (8) 0.925 (3) 1 (1) 0.999 (2) 0.5556 (6) 0.24 (9) 0.58 (5) 0.91 (4)
36_vertebral 0.3881 (6) 0.6975 (4) 0.045 (9) 0.135 (8) 0.2418 (7) 0.8715 (2) 0.7375 (3) 0.885 (1) 0.49 (5)
37_vowels 0.403 (5) 0.2875 (7) 0.385 (6) 0.63 (3) 0.0259 (9) 0.8403 (2) 0.055 (8) 0.515 (4) 0.9652 (1)
38_wbc 0.3731 (6) 0.5575 (4) 0.435 (5) 0.19 (9) 0.9771 (1) 0.7326 (3) 0.2 (8) 0.32 (7) 0.9478 (2)
39_wine 0.3632 (5) 0.5875 (2) 0.44 (4) 0.055 (8) 0.1512 (7) 0.6806 (1) 0.21 (6) 0.0425 (9) 0.5275 (3)
Average 0.4811 (7) 0.5001 (6) 0.5658 (3) 0.4565 (8) 0.5829 (2) 0.5615 (4) 0.4379 (9) 0.5034 (5) 0.6945 (1)
STD 0.1354 0.1911 0.3005 0.3659 0.3996 0.2903 0.3397 0.3122 0.2457
Avg. Rank 5.5641 4.9744 4.7692 5.5897 4.5897 4.7179 5.6667 5.2564 3.6667
Table D5: Method evaluation on LOF (normalized AP rank). The most performing method is highlighted in bold. The algo. rank is provided in parenthesis (lower ranks denote better performance). HPOD achieves the best performance among all baselines.
datasets Random GB ISAC AS HyperEns MetaOD Default HPOD_0 HPOD
1_ALOI 0.3979 (3) 0.0191 (8) 0.5208 (2) 0.0868 (7) 0.2789 (5) 0.1319 (6) 0.0035 (9) 0.359 (4) 0.7326 (1)
2_Annthyroid 0.654 (6) 0.9583 (1) 0.0764 (9) 0.7535 (5) 0.5405 (7) 0.9236 (3) 0.8789 (4) 0.5167 (8) 0.9358 (2)
3_Arrhythmia 0.4879 (6) 0.8924 (1) 0.4427 (7) 0.7517 (4) 0.8166 (2) 0.2188 (8) 0.1107 (9) 0.6146 (5) 0.8125 (3)
4_Cardiotocography 0.481 (6) 0.691 (2) 0.9115 (1) 0.6319 (3) 0.5765 (4) 0.4583 (7) 0.5433 (5) 0.4115 (9) 0.4444 (8)
5_Glass 0.6021 (3) 0.2517 (8) 0.4253 (6) 0.3993 (7) 0.5606 (4) 0.1684 (9) 0.9585 (1) 0.5396 (5) 0.7795 (2)
6_HeartDisease 0.526 (8) 0.8681 (3) 0.9028 (2) 0.6997 (5) 0.6042 (7) 0.6319 (6) 0.09 (9) 0.8326 (4) 0.941 (1)
7_InternetAds 0.4498 (7) 0.8438 (3) 0.2465 (8) 0.0174 (9) 0.872 (2) 0.9132 (1) 0.4602 (6) 0.6438 (5) 0.7543 (4)
8_PageBlocks 0.3183 (7) 0.0972 (8) 0.6649 (2) 0.3333 (6) 0.5066 (5) 0.5903 (4) 0.0138 (9) 0.6167 (3) 0.9688 (1)
9_PenDigits 0.481 (4) 0.026 (9) 0.2951 (6) 0.9878 (1) 0.4976 (3) 0.8924 (2) 0.0623 (8) 0.3108 (5) 0.1806 (7)
10_Pima 0.3599 (7) 0.0677 (9) 0.9583 (2) 0.5417 (6) 0.6083 (5) 0.783 (3) 0.1211 (8) 0.6215 (4) 0.9861 (1)
11_Shuttle 0.4844 (5) 0.092 (9) 0.5035 (4) 0.3472 (6) 0.5654 (3) 0.309 (7) 0.1886 (8) 0.7326 (1) 0.6817 (2)
12_SpamBase 0.5433 (6) 0.8958 (4) 0.9028 (3) 0.9479 (2) 0.5772 (5) 0.1823 (9) 1 (1) 0.2563 (7) 0.218 (8)
13_Stamps 0.6298 (5) 0.8941 (2) 0.434 (8) 0.7795 (4) 0.6007 (6) 0.1406 (9) 0.564 (7) 0.8868 (3) 0.9913 (1)
14_Waveform 0.4844 (6) 0.0799 (8) 0.9757 (1) 0.3212 (7) 0.546 (5) 0.6528 (4) 0.0138 (9) 0.8441 (3) 0.872 (2)
15_WBC 0.3737 (7) 0.1372 (9) 0.7569 (4) 0.7917 (2) 0.5273 (6) 0.9653 (1) 0.1592 (8) 0.7563 (5) 0.7917 (2)
16_WDBC 0.526 (8) 0.7188 (4) 0.6354 (7) 0.6389 (6) 0.5038 (9) 0.8368 (2) 0.6903 (5) 0.7955 (3) 0.901 (1)
17_Wilt 0.5467 (4) 0.2882 (8) 0.3264 (6) 0.3264 (6) 0.5827 (3) 0.4045 (5) 0.1125 (9) 0.8028 (1) 0.7889 (2)
18_WPBC 0.4983 (3) 0.0799 (8) 0.2656 (7) 0.0313 (9) 0.4997 (2) 0.3385 (6) 0.4273 (5) 0.4399 (4) 0.7014 (1)
19_annthyroid 0.5952 (8) 0.6528 (7) 0.7292 (3) 0.6667 (5) 0.5661 (9) 0.7743 (2) 0.6713 (4) 0.8278 (1) 0.6574 (6)
20_arrhythmia 0.4706 (6) 0.3958 (7) 0.5399 (5) 0.0174 (9) 0.6879 (3) 0.6927 (2) 0.2664 (8) 0.5816 (4) 0.8564 (1)
21_breastw 0.4152 (7) 0.599 (5) 0.7257 (2) 0.5104 (6) 0.8844 (1) 0.0556 (9) 0.6609 (3) 0.6003 (4) 0.4115 (8)
22_glass 0.564 (5) 0.1684 (8) 0.75 (4) 0.0729 (9) 0.517 (6) 0.3524 (7) 0.9308 (1) 0.7594 (3) 0.9201 (2)
23_ionosphere 0.3218 (5) 0.0451 (8) 0.1875 (6) 0.941 (1) 0.4637 (4) 0.0694 (7) 0.0035 (9) 0.85 (3) 0.941 (1)
24_letter 0.4983 (6) 0.224 (7) 0.1354 (8) 0.7778 (4) 0.5384 (5) 0.9722 (1) 0.0104 (9) 0.7812 (3) 0.8201 (2)
25_lympho 0.4152 (5) 0.0556 (9) 0.2951 (6) 0.6233 (4) 0.7772 (2) 0.8108 (1) 0.6522 (3) 0.2455 (7) 0.1597 (8)
26_mammography 0.5502 (7) 0.6458 (4) 0.7361 (1) 0.5521 (6) 0.5356 (8) 0.6563 (3) 0.7076 (2) 0.5267 (9) 0.5536 (5)
27_mnist 0.4187 (6) 0.1285 (8) 0.8507 (2) 0.2917 (7) 0.618 (5) 0.6302 (4) 0.0035 (9) 0.7674 (3) 0.9152 (1)
28_musk 0.2595 (8) 0.9253 (2) 0.9253 (1) 0.3368 (7) 0.7661 (5) 0.066 (9) 0.8443 (4) 0.6722 (6) 0.9239 (3)
29_optdigits 0.4879 (7) 0.6215 (4) 0.6597 (3) 0.4861 (8) 0.6118 (5) 0.7465 (2) 0.6003 (6) 0.4281 (9) 0.9325 (1)
30_pendigits 0.526 (5) 0.0764 (9) 0.9826 (2) 0.3611 (7) 0.6893 (4) 1 (1) 0.8166 (3) 0.2007 (8) 0.3837 (6)
31_pima 0.3737 (7) 0.3802 (6) 0.4514 (4) 0.8628 (1) 0.609 (2) 0.5729 (3) 0.1246 (9) 0.4038 (5) 0.2535 (8)
32_satellite 0.4533 (6) 0.9861 (2) 0.3247 (8) 0.2535 (9) 0.51 (5) 0.4063 (7) 0.9965 (1) 0.5608 (4) 0.7274 (3)
33_satimage-2 0.436 (3) 0.3854 (5) 0.0069 (9) 0.4132 (4) 0.7848 (1) 0.5226 (2) 0.0138 (8) 0.266 (6) 0.191 (7)
34_speech 0.6817 (3) 0.6458 (4) 0.5573 (5) 0.5573 (5) 0.9052 (1) 0.349 (7) 0.0069 (9) 0.259 (8) 0.7024 (2)
35_thyroid 0.5294 (7) 0.9549 (1) 0.2309 (9) 0.8646 (3) 0.4976 (8) 0.5556 (6) 0.7197 (5) 0.7222 (4) 0.8819 (2)
36_vertebral 0.6367 (7) 0.8333 (6) 0.9549 (3) 0.9757 (2) 0.3239 (8) 0.8715 (5) 0.0208 (9) 0.9125 (4) 0.9861 (1)
37_vowels 0.5606 (4) 0.2639 (8) 0.3038 (7) 0.8785 (1) 0.6692 (3) 0.8403 (2) 0.0208 (9) 0.4854 (6) 0.5382 (5)
38_wbc 0.4844 (6) 0.1389 (7) 0.559 (5) 0.1007 (8) 0.5599 (4) 0.7326 (3) 0.9308 (1) 0.8899 (2) 0.0313 (9)
39_wine 0.5917 (4) 0.9063 (1) 0.059 (8) 0.0104 (9) 0.501 (6) 0.6806 (2) 0.5709 (5) 0.6031 (3) 0.3875 (7)
Average 0.4901 (7) 0.4598 (8) 0.5438 (5) 0.5113 (6) 0.5969 (3) 0.5615 (4) 0.4095 (9) 0.5981 (2) 0.6835 (1)
STD 0.0960 0.3493 0.2904 0.3025 0.1364 0.2903 0.3613 0.2087 0.2790
Avg. Rank 5.7179 5.6923 4.7692 5.3846 4.5641 4.5385 6.0769 4.6410 3.5128
Table D6: Method evaluation on iForest (normalized AP rank). The most performing method is highlighted in bold. The algo. rank is provided in parenthesis (lower ranks denote better performance). HPOD achieves the best performance among all.