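Explainable artificial intelligence (XAI) has so far focused mainly on static learning scenarios. We are interested in dynamic scenarios in which data is sampled progressively and learning is done incrementally rather than in batch mode. We seek efficient incremental algorithms for computing feature importance (FI) measures, specifically an incremental FI measure based on marginalizing absent features, similar to permutation feature importance (PFI). We propose an efficient, model-agnostic algorithm called IPFI to estimate this measure incrementally and under dynamic modeling conditions, including concept drift. We prove theoretical guarantees on the approximation quality in terms of expectation and variance. To validate our theoretical findings and the efficacy of our approach compared to traditional batch PFI, we conduct multiple experimental studies on benchmark data with and without concept drift.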
translated by Google Translate
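The idea of estimating a permutation-style importance incrementally can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `incremental_pfi`, the reservoir of past observations used as a source of replacement values, and the exponential smoothing rate `alpha` are all assumptions made for illustration. For each incoming sample, one feature at a time is replaced by a value drawn from previously seen data, and the resulting increase in loss is folded into a running estimate.

```python
import random


def incremental_pfi(stream, model, loss, n_features,
                    alpha=0.01, reservoir_size=100, seed=0):
    """Running permutation-style importance over a stream of (x, y) pairs.

    Hypothetical sketch: `model` maps a feature list to a prediction and
    `loss` maps (prediction, target) to a non-negative number.
    """
    rng = random.Random(seed)
    reservoir = []                      # uniform sample of past inputs
    importance = [0.0] * n_features     # exponentially smoothed estimates
    seen = 0
    for x, y in stream:
        seen += 1
        if reservoir:
            base = loss(model(x), y)
            for j in range(n_features):
                donor = rng.choice(reservoir)
                x_perm = list(x)
                x_perm[j] = donor[j]    # break the link between feature j and y
                delta = loss(model(x_perm), y) - base
                importance[j] = (1 - alpha) * importance[j] + alpha * delta
        # Classic reservoir sampling keeps a uniform sample of the stream.
        if len(reservoir) < reservoir_size:
            reservoir.append(list(x))
        else:
            k = rng.randrange(seen)
            if k < reservoir_size:
                reservoir[k] = list(x)
    return importance
```

Because the estimate is smoothed rather than averaged over all history, it can follow a change in the model or the data, at the cost of higher variance for larger `alpha`.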
Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, give an overview of the most representative, distinct and popular techniques and algorithms, discuss the evaluation methodology of adaptive algorithms, and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state of the art. Thus, it aims at providing a comprehensive introduction to concept drift adaptation for researchers, industry analysts and practitioners.
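As complex machine learning models are increasingly used in sensitive applications such as banking, trading or credit scoring, there is a growing demand for reliable explanation mechanisms. Local feature attribution methods have become a popular technique for post-hoc and model-agnostic explanations. However, attribution methods typically assume a stationary environment in which the predictive model has been trained and remains stable. As a result, it is often unclear how local attributions behave in realistic, constantly evolving settings such as streaming and online applications. In this paper, we discuss the impact of temporal change on local feature attributions. In particular, we show that local attributions can become obsolete each time the predictive model is updated or concept drift alters the data-generating distribution. Consequently, local feature attributions in data streams provide high explanatory power only when combined with a mechanism that allows us to detect and respond to local changes over time. To this end, we present CDLEEDS, a flexible and model-agnostic framework for detecting local change and concept drift. CDLEEDS serves as an intuitive extension of attribution-based explanation techniques to identify outdated local attributions and enable more targeted recalculations. In experiments, we also show that the proposed framework can reliably detect both local and global concept drift. Accordingly, our work contributes to more meaningful and robust explainability in online machine learning.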
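Feature attributions based on the Shapley value are popular for explaining machine learning models. However, their estimation is complex from both a theoretical and a computational standpoint. We disentangle this complexity into two factors: (1) the approach to removing feature information, and (2) the tractable estimation strategy. These two factors provide a natural lens through which we can better understand and compare 24 distinct algorithms. Based on the various feature removal approaches, we describe the multiple types of Shapley value feature attributions and the methods to calculate each one. Then, based on the tractable estimation strategies, we characterize two distinct families of approaches: model-agnostic and model-specific approximations. For the model-agnostic approximations, we benchmark a wide class of estimation approaches and tie them to alternative yet equivalent characterizations of the Shapley value. For the model-specific approximations, we clarify the assumptions crucial to each method's tractability for linear, tree, and deep models. Finally, we identify gaps in the literature and promising future research directions.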
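To make the two factors concrete, the following minimal sketch computes exact Shapley values by enumerating all coalitions, which is only feasible for a handful of features and is the baseline that tractable estimators approximate. The function name `shapley_values` and the toy value function are illustrative assumptions. The feature-removal approach lives entirely inside `value`: here, an absent feature is set to a baseline of zero.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n):
    """Exact Shapley values for a game with players 0..n-1.

    `value` maps a frozenset of present players to a real number.
    Runs in time exponential in n, so it only suits small n.
    """
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Probability that a random ordering puts exactly `size`
            # of the other players before player i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_value(present):
    """f(x) = 2*x0 + 3*x1 + x0*x1 at x = (1, 1); absent features are set to 0."""
    x0 = 1.0 if 0 in present else 0.0
    x1 = 1.0 if 1 in present else 0.0
    return 2 * x0 + 3 * x1 + x0 * x1


if __name__ == "__main__":
    print(shapley_values(toy_value, 2))
```

In this toy game the interaction term is split equally between the two features, and the attributions sum to the difference between the full prediction and the baseline prediction.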
The notion of concept drift refers to the phenomenon that the distribution generating the observed data changes over time. If drift is present, machine learning models may become inaccurate and need adjustment. Many technologies for learning with drift rely on the interleaved test-train error (ITTE) as a quantity that approximates the model generalization error and triggers drift detection and model updates. In this work, we investigate to what extent this procedure is mathematically justified. More precisely, we relate a change of the ITTE to the presence of real drift, i.e., a changed posterior, and to a change of the training result under the assumption of optimality. We support our theoretical findings with empirical evidence for several learning algorithms, models, and datasets.
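The interleaved test-train procedure itself is simple to state: each incoming sample is first used to test the current model and only afterwards to train it. The sketch below shows that loop. The learner `WindowMajority`, which predicts the majority label among its most recent observations, is a hypothetical stand-in chosen only to keep the example self-contained.

```python
from collections import deque


class WindowMajority:
    """Toy binary learner: predicts the majority label in a sliding memory."""

    def __init__(self, memory=20):
        self.labels = deque(maxlen=memory)

    def predict(self, x):
        # Ties and an empty memory both default to class 0.
        return 1 if 2 * sum(self.labels) > len(self.labels) else 0

    def learn(self, x, y):
        self.labels.append(y)


def interleaved_test_train_errors(stream, learner):
    """Return the 0/1 error of each sample, tested before it is learned."""
    errors = []
    for x, y in stream:
        errors.append(int(learner.predict(x) != y))  # test first
        learner.learn(x, y)                          # then train
    return errors


if __name__ == "__main__":
    # Abrupt change of the label distribution halfway through the stream.
    stream = [(None, 0)] * 200 + [(None, 1)] * 200
    errors = interleaved_test_train_errors(stream, WindowMajority())
    print(sum(errors[:200]), sum(errors[200:]))
```

On this stream the errors cluster right after the change point and vanish once the learner has adapted, which is the kind of signal a drift detector monitors.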
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
Classical asymptotic theory for statistical inference usually involves calibrating a statistic by fixing the dimension $d$ while letting the sample size $n$ increase to infinity. Recently, much effort has been dedicated towards understanding how these methods behave in high-dimensional settings, where $d$ and $n$ both increase to infinity together. This often leads to different inference procedures, depending on the assumptions about the dimensionality, leaving the practitioner in a bind: given a dataset with 100 samples in 20 dimensions, should they calibrate by assuming $n \gg d$, or $d/n \approx 0.2$? This paper considers the goal of dimension-agnostic inference: developing methods whose validity does not depend on any assumption on $d$ versus $n$. We introduce an approach that uses variational representations of existing test statistics along with sample splitting and self-normalization to produce a new test statistic with a Gaussian limiting distribution, regardless of how $d$ scales with $n$. The resulting statistic can be viewed as a careful modification of degenerate U-statistics, dropping diagonal blocks and retaining off-diagonal blocks. We exemplify our technique for some classical problems including one-sample mean and covariance testing, and show that our tests have minimax rate-optimal power against appropriate local alternatives. In most settings, our cross U-statistic matches the high-dimensional power of the corresponding (degenerate) U-statistic up to a $\sqrt{2}$ factor.
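A minimal sketch of how sample splitting and self-normalization can combine for one-sample mean testing is given below. It is an illustration under stated assumptions rather than the paper's exact procedure: the function name `cross_mean_statistic`, the even split, and the one-sided comparison against the standard normal quantile 1.645 are choices made here. The second half of the data estimates a direction (its sample mean), each point in the first half is projected onto that direction, and the projections are studentized.

```python
import random
from math import sqrt
from statistics import mean, stdev


def cross_mean_statistic(rows):
    """Studentized cross statistic for H0: the mean vector is zero.

    `rows` is a list of equal-length lists, assumed to be in random order.
    Only inner products between the two halves are used, so no
    within-half (diagonal) terms enter the statistic.
    """
    half = len(rows) // 2
    first, second = rows[:half], rows[half:]
    d = len(rows[0])
    # Direction estimated from the second half only.
    mu2 = [mean(r[k] for r in second) for k in range(d)]
    # One scalar per first-half observation: its projection onto mu2.
    h = [sum(r[k] * mu2[k] for k in range(d)) for r in first]
    return sqrt(len(h)) * mean(h) / stdev(h)


if __name__ == "__main__":
    rng = random.Random(0)
    shifted = [[1.0 + rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
    t = cross_mean_statistic(shifted)
    print("reject H0" if t > 1.645 else "do not reject H0")
```

Conditional on the second half, the projections are independent scalars, which is what allows a Gaussian calibration that does not refer to the dimension.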
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
Many scientific and engineering challenges, ranging from personalized medicine to customized marketing recommendations, require an understanding of treatment effect heterogeneity. In this paper, we develop a non-parametric causal forest for estimating heterogeneous treatment effects that extends Breiman's widely used random forest algorithm. In the potential outcomes framework with unconfoundedness, we show that causal forests are pointwise consistent for the true treatment effect, and have an asymptotically Gaussian and centered sampling distribution. We also discuss a practical method for constructing asymptotic confidence intervals for the true treatment effect that are centered at the causal forest estimates. Our theoretical results rely on a generic Gaussian theory for a large family of random forest algorithms. To our knowledge, this is the first set of results that allows any type of random forest, including classification and regression forests, to be used for provably valid statistical inference. In experiments, we find causal forests to be substantially more powerful than classical methods based on nearest-neighbor matching, especially in the presence of irrelevant covariates.
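Variable importance measures are the main tools for analyzing the black-box mechanisms of random forests. Although the mean decrease accuracy (MDA) is widely accepted as the most efficient variable importance measure for random forests, little is known about its statistical properties. In fact, the exact MDA definition varies across the main random forest software packages. In this article, our objective is to rigorously analyze the behavior of the main MDA implementations. Consequently, we mathematically formalize the various implemented MDA algorithms and then establish their limits as the sample size increases. In particular, we break down these limits into three components: the first is related to Sobol indices, which are well-defined measures of a covariate's contribution to the response variance and are widely used in the sensitivity analysis field, as opposed to the third term, whose value increases with the dependence among covariates. Thus, we theoretically demonstrate that the MDA does not target the right quantity when covariates are dependent, a fact that has already been observed experimentally. To address this issue, we define a new importance measure for random forests, the Sobol-MDA, which fixes the flaws of the original MDA. We prove the consistency of the Sobol-MDA and show that it empirically outperforms its competitors on both simulated and real data. An open-source implementation in R and C++ is available online.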
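Thanks to its outstanding empirical performance, random forests has been one of the most widely used machine learning methods of the past decade. Yet, because of its black-box nature, the results of random forests can be hard to interpret in many big data applications. Quantifying the usefulness of individual features in random forests learning can greatly enhance its interpretability. Existing studies have shown that some commonly used feature importance measures for random forests suffer from bias. In addition, comprehensive size and power analyses are lacking for most of these existing methods. In this paper, we approach the problem via hypothesis testing and propose a framework, the self-normalized feature-residual correlation test (FACT), for evaluating the significance of a given feature in the random forests model with a bias-resistance property, where the null hypothesis concerns whether the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by some recent developments on high-dimensional random forests consistency. The vanilla version of our FACT test can suffer from bias in the presence of feature dependency. We exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish that FACT can provide theoretically justified random forests feature p-values and enjoy appealing power through nonasymptotic analyses. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application related to COVID-19.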
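Deployed machine learning models are confronted with the problem of data that changes over time, a phenomenon also called concept drift. While existing approaches to concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. In many real-world application scenarios, such as the ones covered in this work, true labels are scarce and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drift without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique to the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change in the input data only (which can lead to unnecessary retraining). We show that UDD outperforms other state-of-the-art strategies on two synthetic and ten real-world data sets for both regression and classification tasks.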
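The detection step can be sketched as a change detector running on a stream of per-prediction uncertainty values. The code below is an illustration under assumptions, not the UDD implementation: it takes the uncertainty values as given (no network or Monte Carlo Dropout is involved), assumes they are scaled to [0, 1], and uses an exhaustive-split version of the ADWIN cut test without the bucket compression that makes the real ADWIN memory-efficient. The function name `adwin_style_detector` is hypothetical.

```python
from math import log, sqrt


def adwin_style_detector(values, delta=0.002):
    """Return the stream indices at which the adaptive window was cut.

    For every split of the current window into an older and a newer part,
    the two means are compared against the ADWIN threshold
    eps = sqrt(ln(4 * n / delta) / (2 * m)), with m the harmonic size term.
    """
    window = []
    drift_points = []
    for t, v in enumerate(values):
        window.append(v)
        shrunk = True
        while shrunk and len(window) > 1:
            shrunk = False
            n = len(window)
            total = sum(window)
            head = 0.0
            for n0 in range(1, n):
                head += window[n0 - 1]
                n1 = n - n0
                m = 1.0 / (1.0 / n0 + 1.0 / n1)
                eps = sqrt(log(4.0 * n / delta) / (2.0 * m))
                if abs(head / n0 - (total - head) / n1) >= eps:
                    window.pop(0)          # drop the oldest observation
                    if not drift_points or drift_points[-1] != t:
                        drift_points.append(t)
                    shrunk = True
                    break
    return drift_points


if __name__ == "__main__":
    # Hypothetical uncertainty stream: low at first, then persistently high.
    uncertainty = [0.1] * 100 + [0.9] * 100
    print(adwin_style_detector(uncertainty)[:1])
```

In a label-free setting of the kind the abstract describes, each reported index would be the trigger for collecting labels and retraining the model.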
A flexible method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this paper focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
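For readers unfamiliar with the sketching algorithm named above, here is a minimal count-min sketch. It shows only the data structure whose point estimates such confidence intervals would be built around, not the conformal method itself. The class name, the table sizes, and the use of SHA-256 with a per-row prefix as the hash family are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency sketch with one counter row per hash function."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a different hash per row by prefixing the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum over rows
        # is an upper bound on the true frequency.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in "the cat and the dog and the bird".split():
        sketch.add(word)
    print(sketch.estimate("the"), sketch.estimate("fish"))
```

The estimate never falls below the true count, which gives a deterministic one-sided bound; how far above it may lie depends on the data, and that is the gap a data-driven interval aims to quantify.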
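In this paper, we present a new explainability formalism designed to shed light on how each input variable of a test set impacts the predictions of machine learning models. Specifically, we propose a group explainability formalism for trained machine learning decision rules, based on their response to variability in the distribution of the input variables. To emphasize the impact of each input variable, this formalism uses an information-theoretic framework that quantifies the influence of all input-output observations based on entropic projections. It is thus the first unified and model-agnostic formalism that enables data scientists to interpret the dependence between the input variables, their impact on the prediction errors, and their influence on the output predictions. Convergence rates of the entropic projections are provided in the large-sample case. Most importantly, we prove that computing an explanation in our framework has low algorithmic complexity, making it scalable to large real-life datasets. We illustrate our strategy by explaining complex decision rules learned with XGBoost, random forest or deep neural network classifiers on various datasets such as Adult Income, MNIST, CelebA, Boston Housing and Iris, as well as synthetic ones. Finally, we clarify how it differs from the explainability strategies LIME and SHAP, which are based on single observations. Results can be reproduced using the freely distributed Python toolbox https://gems-ai.aniti.fr/.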
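As machine learning penetrates many industries, it brings companies a new source of benefits. However, within the life insurance industry, machine learning is not widely used in practice, because statistical models have shown their efficiency for risk assessment over the past years. Insurers may therefore find it difficult to assess the value of artificial intelligence. Focusing on how the life insurance industry has changed over time highlights what is at stake for insurers in using machine learning and the benefits it can bring by unlocking the value of data. This paper reviews traditional survival modeling methodologies and extends them with machine learning techniques. It points out differences from regular machine learning models and emphasizes the importance of specific implementations for handling censored data within the machine learning model family. As a complement to this article, a Python library has been developed. Different open-source machine learning algorithms have been adjusted to fit the specificities of life insurance data, namely censoring and truncation. Such models can be easily applied from this SCOR library to accurately model life insurance risks.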
Originating from cooperative game theory, Shapley values have become one of the most widely used measures for variable importance in applied Machine Learning. However, the statistical understanding of Shapley values is still limited. In this paper, we take a nonparametric (or smoothing) perspective by introducing Shapley curves as a local measure of variable importance. We propose two estimation strategies and derive the consistency and asymptotic normality both under independence and dependence among the features. This allows us to construct confidence intervals and conduct inference on the estimated Shapley curves. The asymptotic results are validated in extensive experiments. In an empirical application, we analyze which attributes drive the prices of vehicles.
This paper proposes a novel approach to explain the predictions made by data-driven methods. Since such predictions rely heavily on the data used for training, explanations that convey information about how the training data affects the predictions are useful. The paper proposes a novel approach to quantify how different data-clusters of the training data affect a prediction. The quantification is based on Shapley values, a concept which originates from coalitional game theory, developed to fairly distribute the payout among a set of cooperating players. A player's Shapley value is a measure of that player's contribution. Shapley values are often used to quantify feature importance, i.e., how features affect a prediction. This paper extends this to cluster importance, letting clusters of the training data act as players in a game where the predictions are the payouts. The novel methodology proposed in this paper lets us explore and investigate how different clusters of the training data affect the predictions made by any black-box model, allowing new aspects of the reasoning and inner workings of a prediction model to be conveyed to the users. The methodology is fundamentally different from existing explanation methods, providing insight which would not be available otherwise, and should complement existing explanation methods, including explanations based on feature importance.
The widely used 'Counterfactual' definition of Causal Effects was derived for unbiasedness and accuracy, not for generalizability. We propose a simple definition for the External Validity (EV) of Interventions and Counterfactuals. The definition leads to EV statistics for individual counterfactuals, and to non-parametric effect estimators for sets of counterfactuals (i.e., for samples). We use this new definition to discuss several issues that have baffled the original counterfactual formulation: out-of-sample validity, reliance on independence assumptions or estimation, concurrent estimation of multiple effects and full-models, bias-variance tradeoffs, statistical power, omitted variables, and connections to current predictive and explaining techniques. Methodologically, the definition also allows us to replace the parametric, and generally ill-posed, estimation problems that followed the counterfactual definition by combinatorial enumeration problems in non-experimental samples. We use this framework to generalize popular supervised, explaining, and causal-effect estimators, improving their performance across three dimensions (External Validity, Unconfoundedness and Accuracy) and enabling their use in non-i.i.d. samples. We demonstrate gains over the state-of-the-art in out-of-sample prediction, intervention effect prediction and causal effect estimation tasks. The COVID-19 pandemic highlighted the need for learning solutions that provide general predictions in small samples, often with missing variables. We also demonstrate applications to this pressing problem.
We introduce the XPER (eXplainable PERformance) methodology to measure the specific contribution of the input features to the predictive or economic performance of a model. Our methodology offers several advantages. First, it is both model-agnostic and performance metric-agnostic. Second, XPER is theoretically founded as it is based on Shapley values. Third, the interpretation of the benchmark, which is inherent in any Shapley value decomposition, is meaningful in our context. Fourth, XPER is not plagued by model specification error, as it does not require re-estimating the model. Fifth, it can be implemented either at the model level or at the individual level. In an application based on auto loans, we find that performance can be explained by a surprisingly small number of features. XPER decompositions are rather stable across metrics, yet some feature contributions switch sign across metrics. Our analysis also shows that explaining model forecasts and model performance are two distinct tasks.
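Shapley values are widely used as a model-agnostic explanation framework for complex predictive machine learning models. Shapley values have desirable theoretical properties and a sound mathematical foundation. Precise Shapley value estimates for dependent data rely on accurate modeling of the dependencies between all feature combinations. In this paper, we use a variational autoencoder with arbitrary conditioning (VAEAC) to model all feature dependencies simultaneously. We demonstrate through comprehensive simulation studies that VAEAC outperforms state-of-the-art methods across a wide range of settings for both continuous and mixed dependent features. Finally, we apply VAEAC to the Abalone data set from the UCI Machine Learning Repository.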