While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time already, the issue of statistical tests for comparisons of more algorithms on multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed ranks test for comparison of two classifiers and the Friedman test with the corresponding post-hoc tests for comparison of more classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced CD (critical difference) diagrams.
translated by 谷歌翻译
本文提出了一个贝叶斯模型,以比较任何度量的多个数据集上的多种算法。该模型基于Bradley-Terry模型,该模型计算出一种算法在不同数据集上的性能要好于另一个算法的次数。由于其贝叶斯基础,贝叶斯布拉德利·特里模型(BBT)的特征与经常主义的方法不同,可以比较多个数据集上的多种算法,例如Demsar(2006)对平均等级的测试,以及Benavoli等人。 (2016)多个成对的Wilcoxon测试,具有P-调整程序。特别是,贝叶斯的方法允许对算法发表更多细微的陈述,而不是声称差异是统计学意义的。贝叶斯的方法还允许定义何时出于实际目的或实际等效区域(绳索)等效的何时等效。与Benavoli等人提出的贝叶斯签名的等级比较程序不同。 (2017年),我们的方法可以为任何度量标准定义绳索,因为它基于概率声明,而不是基于该度量的差异。本文还提出了一个局部绳索概念,该概念评估了在某些交叉验证中对某些其他算法的平均值的平均度量之间的正差异是否应真正被视为基于效应大小的第一种算法比第二个算法更好。该局部绳索提案与贝叶斯的使用无关,可以根据等级的常见方式使用。可以使用实现BBT的R软件包和Python程序。
translated by 谷歌翻译
尽管在机器学习的方法论核心中是一个问题,但如何比较分类器仍未达成一致的共识。每个比较框架都面临着(至少)三个基本挑战:质量标准的多样性,数据集的多样性以及选择数据集选择的随机性/任意性。在本文中,我们通过采用决策理论的最新发展,为生动的辩论增添了新的观点。我们最终的框架基于所谓的偏好系统,通过广义的随机优势概念对分类器进行排名,该概念强大地绕过了繁琐的,甚至通常是自相矛盾的,对聚合的依赖。此外,我们表明,可以通过解决易于手柄的线性程序和通过适应的两样本观察随机化测试进行统计测试来实现广泛的随机优势。这确实产生了一个有力的框架,可以同时相对于多个质量标准进行分类器的统计比较。我们在模拟研究和标准基准数据集中说明和研究我们的框架。
translated by 谷歌翻译
评估模型性能的基准在机器学习中起重要作用。但是,没有确定的方法来描述和创建新的基准。此外,最常见的基准测试采用了具有多个限制的性能指标。例如,两个模型的性能差异没有概率的解释,没有参考点可以指示它们是否代表了显着的改进,并且比较数据集之间的此类差异是没有意义的。我们介绍了一种名为基于ELO的预测能力(EPP)的新的元评分评估,该评估构建在其他性能指标之上,并允许对模型进行可解释的比较。 EPP分数的差异具有概率的解释,可以直接比较数据集之间,此外,基于逻辑回归的设计允许根据偏差统计数据评估排名适应性。我们证明了EPP的数学属性,并通过30个分类数据集的大规模基准和视觉数据的现实基准测试的经验结果支持它们。此外,我们提出了一个统一的基准本体,用于对基准进行统一的描述。
translated by 谷歌翻译
本文介绍了分类器校准原理和实践的简介和详细概述。校准的分类器正确地量化了与其实例明智的预测相关的不确定性或信心水平。这对于关键应用,最佳决策,成本敏感的分类以及某些类型的上下文变化至关重要。校准研究具有丰富的历史,其中几十年来预测机器学习作为学术领域的诞生。然而,校准兴趣的最近增加导致了新的方法和从二进制到多种子体设置的扩展。需要考虑的选项和问题的空间很大,并导航它需要正确的概念和工具集。我们提供了主要概念和方法的介绍性材料和最新的技术细节,包括适当的评分规则和其他评估指标,可视化方法,全面陈述二进制和多字数分类的HOC校准方法,以及几个先进的话题。
translated by 谷歌翻译
Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
translated by 谷歌翻译
We introduce a family of interpretable machine learning models, with two broad additions: Linearised Additive Models (LAMs) which replace the ubiquitous logistic link function in General Additive Models (GAMs); and SubscaleHedge, an expert advice algorithm for combining base models trained on subsets of features called subscales. LAMs can augment any additive binary classification model equipped with a sigmoid link function. Moreover, they afford direct global and local attributions of additive components to the model output in probability space. We argue that LAMs and SubscaleHedge improve the interpretability of their base algorithms. Using rigorous null-hypothesis significance testing on a broad suite of financial modelling data, we show that our algorithms do not suffer from large performance penalties in terms of ROC-AUC and calibration.
translated by 谷歌翻译
尽管机器学习方法已在金融领域广泛使用,但在非常成功的学位上,这些方法仍然可以根据解释性,可比性和可重复性来定制特定研究和不透明。这项研究的主要目的是通过提供一种通用方法来阐明这一领域,该方法是调查 - 不合Snostic且可解释给金融市场从业人员,从而提高了其效率,降低了进入的障碍,并提高了实验的可重复性。提出的方法在两个自动交易平台组件上展示。也就是说,价格水平,众所周知的交易模式和一种新颖的2步特征提取方法。该方法依赖于假设检验,该假设检验在其他社会和科学学科中广泛应用,以有效地评估除简单分类准确性之外的具体结果。提出的主要假设是为了评估所选的交易模式是否适合在机器学习设置中使用。在整个实验中,我们发现在机器学习设置中使用所考虑的交易模式仅由统计数据得到部分支持,从而导致效果尺寸微不足道(反弹7- $ 0.64 \ pm 1.02 $,反弹11 $ 0.38 \ pm 0.98 $,并且篮板15- $ 1.05 \ pm 1.16 $),但允许拒绝零假设。我们展示了美国期货市场工具上的通用方法,并提供了证据表明,通过这种方法,我们可以轻松获得除传统绩效和盈利度指标之外的信息指标。这项工作是最早将这种严格的统计支持方法应用于金融市场领域的工作之一,我们希望这可能是更多研究的跳板。
translated by 谷歌翻译
在现实世界中,非确定性测量很常见:随机优化算法的性能或在混乱环境中加强学习代理人的总奖励只是两个例子,其中不可预测的结果很常见。这些度量可以建模为随机变量,并通过其预期值或更复杂的工具(例如原假设统计检验)相互比较。在本文中,我们提出了一个替代框架,以根据估计的累积分布函数在视觉上比较两个样本。首先,我们为两个随机变量引入了一个优势度量,该变量量化了一个随机变量之一的累积分布函数学术上主导另一个变量的比例。然后,我们提出了一种在分位数中分解的图形方法i)提出的优势度量和ii)一个随机变量之一比另一个变量较低的值的概率。出于说明性目的,我们通过提出的方法重新评估了已经发表的工作的实验,我们表明可以推断出其他结论(通过其余方法错过)。此外,将软件包rvCompare创建为应用和实验建议的框架的一种方便方法。
translated by 谷歌翻译
本文开发了新型的保形方法,以测试是否从与参考集相同的分布中采样了新的观察结果。以创新的方式将感应性和偏置的共形推断融合,所描述的方法可以以原则性的方式基于已知的分布式数据的依赖侧信息重新权重标准p值,并且可以自动利用最强大的优势来自任何一级和二进制分类器的模型。该解决方案可以通过样品分裂或通过新颖的转置交叉验证+方案来实现,该方案与现有的交叉验证方法相比,由于更严格的保证,这也可能在共形推理的其他应用中有用。在研究错误的发现率控制和在具有几个可能的离群值的多个测试框架内的虚假发现率控制和功率之后,提出的解决方案被证明通过模拟以及用于图像识别和表格数据的应用超过了标准的共形P值。
translated by 谷歌翻译
无论是在功能选择的领域还是可解释的AI领域,都有基于其重要性的“排名”功能的愿望。然后可以将这种功能重要的排名用于:(1)减少数据集大小或(2)解释机器学习模型。但是,在文献中,这种特征排名没有以系统的,一致的方式评估。许多论文都有不同的方式来争论哪些具有重要性排名最佳的特征。本文通过提出一种新的评估方法来填补这一空白。通过使用合成数据集,可以事先知道特征重要性得分,从而可以进行更系统的评估。为了促进使用新方法的大规模实验,在Python建造了一个名为FSEVAL的基准测定框架。该框架允许并行运行实验,并在HPC系统上的计算机上分布。通过与名为“权重和偏见”的在线平台集成,可以在实时仪表板上进行交互探索图表。该软件作为开源软件发布,并在PYPI平台上以包裹发行。该研究结束时,探索了一个这样的大规模实验,以在许多方面找到参与算法的优势和劣势。
translated by 谷歌翻译
Dataset scaling, also known as normalization, is an essential preprocessing step in a machine learning pipeline. It is aimed at adjusting attributes scales in a way that they all vary within the same range. This transformation is known to improve the performance of classification models, but there are several scaling techniques to choose from, and this choice is not generally done carefully. In this paper, we execute a broad experiment comparing the impact of 5 scaling techniques on the performances of 20 classification algorithms among monolithic and ensemble models, applying them to 82 publicly available datasets with varying imbalance ratios. Results show that the choice of scaling technique matters for classification performance, and the performance difference between the best and the worst scaling technique is relevant and statistically significant in most cases. They also indicate that choosing an inadequate technique can be more detrimental to classification performance than not scaling the data at all. We also show how the performance variation of an ensemble model, considering different scaling techniques, tends to be dictated by that of its base model. Finally, we discuss the relationship between a model's sensitivity to the choice of scaling technique and its performance and provide insights into its applicability on different model deployment scenarios. Full results and source code for the experiments in this paper are available in a GitHub repository.\footnote{https://github.com/amorimlb/scaling\_matters}
translated by 谷歌翻译
推断线性关系是许多实证研究的核心。线性依赖性的度量应正确评估关系的强度,并符合对人群的有意义。 Pearson的相关系数(PCC)是双变量关系的\ textit {De-facto}量度,这两个方面都缺乏。估计的强度$ r $可能是由于样本量有限和数据非正态而可能错误的。在统计显着性测试的背景下,将$ p $值作为后验概率的错误解释导致I型错误 - 这是一个具有显着性测试的一般问题,扩展到PCC。同时测试多个假设时,此类错误会加剧。为了解决这些问题,我们提出了一种基于机器学习的预测数据校准方法,从本质上讲,该方法在预期的线性关系上进行了研究。使用校准数据计算PCC会产生校准的$ P $值,可以将其解释为后验概率以及校准的$ r $估计值,这是其他方法未提供的所需结果。此外,随之而来的对每个测试的独立解释可能会消除对多次测试校正的需求。我们提供了使用多个模拟和对现实世界数据的应用,有利于提出的方法的经验证据。
translated by 谷歌翻译
这篇综述的目的是将读者介绍到图表内,以将其应用于化学信息学中的分类问题。图内核是使我们能够推断分子的化学特性的功能,可以帮助您完成诸如寻找适合药物设计的化合物等任务。内核方法的使用只是一种特殊的两种方式量化了图之间的相似性。我们将讨论限制在这种方法上,尽管近年来已经出现了流行的替代方法,但最著名的是图形神经网络。
translated by 谷歌翻译
In this paper we i n vestigate the use of receiver operating characteristic (ROC) curve f o r the evaluation of machine learning algorithms. In particular, we i n vestigate the use of the area under the ROC curve ( A UC) as a measure of classi er performance. The machine learning algorithms used are chosen to be representative of those in common use: two decision trees (C4.5 and Multiscale Classi er) two n e u r a l n e t works (Perceptron and Multi-layer Perceptron) and two statistical methods (K-Nearest Neighbours and a Quadratic Discriminant F unction).The evaluation is done using six, \real world," medical diagnostics data sets that contain a varying numbers of inputs and samples, but are primarily continuous input, binary classi cation problems. We i d e n tify three forms of bias that can a ect comparisons of this type (estimation, selection, and expert bias) and detail the methods used to avoid them. We compare and discuss the use of AUC with the conventional measure of classi er performance, overall accuracy (the probability of a correct response). It is found that AUC exhibits a number of desirable properties when compared to overall accuracy: increased sensitivity in Analysis of Variance (ANOVA) tests a standard error that decreased as both AUC and the number of test samples increased decision threshold independent invariant t o a priori class probabilities and it gives an indication of the amount o f \ w ork done" by a classi cation scheme, giving low scores to both random and \one class only" classi ers.It has been known for some time that AUC actually represents the probability that a randomly chosen positive example is correctly rated (ranked) with greater suspicion than a randomly chosen negative example. Moreover, this probability of correct ranking is the same quantity estimated by the non-parametric Wilcoxon statistic. We use this equivalence to show that the standard deviation of AUC, estimated using 10 fold cross validation, is a reliable estimator of the standard error estimated using the Wilcoxon test. The paper concludes with the recommendation that AUC be used in preference to overall accuracy when \single number" evaluation of machine learning algorithms is required.
translated by 谷歌翻译
特征选择是数据科学流水线的重要步骤,以减少与大型数据集相关的复杂性。虽然对本主题的研究侧重于优化预测性能,但很少研究在特征选择过程的上下文中调查稳定性。在这项研究中,我们介绍了重复的弹性网技术(租金)进行特色选择。租金使用具有弹性净正常化的广义线性模型的集合,每个训练都培训了训练数据的不同子集。该特征选择基于三个标准评估所有基本模型的重量分布。这一事实导致选择具有高稳定性的特征,从而提高最终模型的稳健性。此外,与已建立的特征选择器不同,租金提供了有关在训练期间难以预测的数据中难以预测的对象的模型解释的有价值信息。在我们的实验中,我们在八个多变量数据集中对六个已建立的特征选择器进行基准测试,用于二进制分类和回归。在实验比较中,租金在预测性能和稳定之间展示了均衡的权衡。最后,我们强调了租金的额外解释价值与医疗保健数据集的探索性后HOC分析。
translated by 谷歌翻译
现在通常用于高风险设置,如医疗诊断,如医疗诊断,那么需要不确定量化,以避免后续模型失败。无分发的不确定性量化(无分布UQ)是用户友好的范式,用于为这种预测创建统计上严格的置信区间/集合。批判性地,间隔/集合有效而不进行分布假设或模型假设,即使具有最多许多DataPoints也具有显式保证。此外,它们适应输入的难度;当输入示例很困难时,不确定性间隔/集很大,信号传达模型可能是错误的。在没有多大的工作和没有再培训的情况下,可以在任何潜在的算法(例如神经网络)上使用无分​​发方法,以产生置信度集,以便包含用户指定概率,例如90%。实际上,这些方法易于理解和一般,应用于计算机视觉,自然语言处理,深度加强学习等领域出现的许多现代预测问题。这种实践介绍是针对对无需统计学家的免费UQ的实际实施感兴趣的读者。我们通过实际的理论和无分发UQ的应用领导读者,从保形预测开始,并使无关的任何风险的分布控制,如虚假发现率,假阳性分布检测,等等。我们将包括Python中的许多解释性插图,示例和代码样本,具有Pytorch语法。目标是提供读者对无分配UQ的工作理解,使它们能够将置信间隔放在算法上,其中包含一个自包含的文档。
translated by 谷歌翻译
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g. confidences on labels.
translated by 谷歌翻译
The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often referred to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of attempts so far at handling uncertainty in general and formalizing this distinction in particular.
translated by 谷歌翻译
This paper proposes a new tree-based ensemble method for supervised classification and regression problems. It essentially consists of randomizing strongly both attribute and cut-point choice while splitting a tree node. In the extreme case, it builds totally randomized trees whose structures are independent of the output values of the learning sample. The strength of the randomization can be tuned to problem specifics by the appropriate choice of a parameter. We evaluate the robustness of the default choice of this parameter, and we also provide insight on how to adjust it in particular situations. Besides accuracy, the main strength of the resulting algorithm is computational efficiency. A bias/variance analysis of the Extra-Trees algorithm is also provided as well as a geometrical and a kernel characterization of the models induced.
translated by 谷歌翻译