Mining data streams poses many challenges, including the continuous and non-stationary nature of the data, the massive volume of information to be processed, and the constraints put on computational resources. While a number of supervised solutions have been proposed for this problem in the literature, most of them assume that access to the ground truth (in the form of class labels) is unlimited and that such information can be used instantly when updating the learning system. This is far from realistic, as the underlying cost of acquiring labels must be taken into account. Therefore, solutions that can reduce the ground-truth requirements in streaming scenarios are needed. In this paper, we propose a novel framework for mining drifting data streams on a budget by combining information coming from active learning and self-labeling. We introduce several strategies that can take advantage of both intelligent instance selection and semi-supervised procedures, while taking into account the potential presence of concept drift. This hybrid approach allows for efficient exploration and exploitation of streaming data structures within realistic labeling budgets. Since our framework works as a wrapper, it can be applied to different learning algorithms. An experimental study, carried out on diverse real-world data streams with various types of concept drift, demonstrates the usefulness of the proposed strategies when dealing with highly limited access to class labels. The presented hybrid approach is especially feasible when one cannot increase the labeling budget or replace an inefficient classifier. We offer a set of recommendations regarding the areas of applicability of our strategies.
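To make the general idea concrete, the following is a minimal Python sketch of such a wrapper, not the paper's actual strategies: uncertain instances are sent to the oracle while the labeling budget allows, highly confident ones are self-labeled, and the rest are skipped. The class name, thresholds, the `oracle` callback, and the use of scikit-learn's SGDClassifier are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class BudgetedALSelfLabeling:
    """Illustrative wrapper: uncertainty-based active learning plus
    confidence-based self-labeling under a fixed labeling budget."""

    def __init__(self, classes, budget=0.1, query_thr=0.6, selflabel_thr=0.95):
        self.model = SGDClassifier(loss="log_loss")  # any partial_fit learner works
        self.classes = np.asarray(classes)
        self.budget = budget                # max fraction of instances sent to the oracle
        self.query_thr = query_thr          # below this confidence -> ask the oracle
        self.selflabel_thr = selflabel_thr  # above this confidence -> pseudo-label
        self.seen, self.labeled = 0, 0

    def process(self, x, oracle):
        """Handle one stream instance; `oracle(x)` returns the true label on request."""
        self.seen += 1
        x = np.asarray(x).reshape(1, -1)
        if self.labeled == 0:                       # cold start: must buy a label
            self._supervised_update(x, oracle(x))
            return
        proba = self.model.predict_proba(x)[0]
        confidence = proba.max()
        if confidence < self.query_thr and self.labeled / self.seen < self.budget:
            self._supervised_update(x, oracle(x))   # active query: spend budget
        elif confidence > self.selflabel_thr:
            y_hat = self.model.classes_[np.argmax(proba)]
            self.model.partial_fit(x, [y_hat])      # self-labeling: free pseudo-label
        # otherwise the instance is discarded without a label

    def _supervised_update(self, x, y):
        if self.labeled == 0:
            self.model.partial_fit(x, [y], classes=self.classes)
        else:
            self.model.partial_fit(x, [y])
        self.labeled += 1
```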
One of the important problems in streaming data classification is the occurrence of concept drift, which consists in changes of the probabilistic characteristics of the classification task. This phenomenon destabilizes the performance of the classification model and seriously degrades its quality. An appropriate strategy for counteracting this phenomenon is needed to adapt the classifier to the changing probabilistic characteristics. A significant problem in implementing such solutions is access to data labels, which is usually costly. To minimize the expenses related to this process, learning strategies based on semi-supervised learning have been proposed, e.g., employing active learning methods that indicate which of the incoming objects are worth labeling in order to improve the classifier's performance. This paper proposes a chunk-based method for non-stationary data streams based on classifier ensemble learning, together with an active learning strategy that takes a limited budget into account and can be successfully applied to any data stream classification algorithm. The proposed method has been evaluated through computer experiments using both real and generated data streams. The results confirm its high quality compared with state-of-the-art methods.
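As a rough illustration of the chunk-based setting (a sketch under assumed details, not the algorithm proposed above), the snippet below labels only the most uncertain fraction of each chunk, trains a new ensemble member on those instances, and discards the oldest member. The helper `oracle`, the budget value, the NumPy-array chunk format, and the choice of SGDClassifier are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def process_chunk(ensemble, X_chunk, oracle, classes, budget=0.2, max_members=10):
    """Illustrative chunk-based update under a labeling budget: query labels
    only for the most uncertain fraction of the chunk, train a new member on
    them, and keep a fixed-size ensemble so outdated members are forgotten."""
    k = max(1, int(budget * len(X_chunk)))
    if ensemble:
        # ensemble confidence = averaged maximum class probability
        proba = np.mean([m.predict_proba(X_chunk) for m in ensemble], axis=0)
        query_idx = np.argsort(proba.max(axis=1))[:k]     # least confident first
    else:
        query_idx = np.random.choice(len(X_chunk), size=k, replace=False)
    y_lab = oracle(query_idx)                             # labels we actually pay for
    member = SGDClassifier(loss="log_loss")
    member.partial_fit(X_chunk[query_idx], y_lab, classes=classes)
    ensemble.append(member)
    if len(ensemble) > max_members:                       # drop the oldest member
        ensemble.pop(0)
    return ensemble
```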
Concept drift primarily refers to an online supervised learning scenario in which the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning, in this paper we characterize the adaptive learning process, categorize existing strategies for handling concept drift, overview the most representative, distinct and popular techniques and algorithms, discuss evaluation methodology of adaptive algorithms, and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state-of-the-art. Thus, it aims at providing a comprehensive introduction to concept drift adaptation for researchers, industry analysts and practitioners.
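For reference, the joint-distribution view of concept drift that such surveys build on can be stated compactly; the notation below is the standard textbook formulation rather than a quotation from the survey.

```latex
% Concept drift between time points t_0 and t_1:
\exists X : \quad p_{t_0}(X, y) \neq p_{t_1}(X, y)

% Factoring the joint distribution, p_t(X, y) = p_t(X)\, p_t(y \mid X),
% separates the two usual cases:
%   real drift:     p_{t_0}(y \mid X) \neq p_{t_1}(y \mid X)
%   virtual drift:  p_{t_0}(X) \neq p_{t_1}(X) with p(y \mid X) unchanged
```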
Concept drift describes unforeseeable changes in the underlying distribution of streaming data over time. Concept drift research involves the development of methodologies and techniques for drift detection, understanding and adaptation. Data analysis has revealed that machine learning in a concept drift environment will result in poor learning results if the drift is not addressed. To help researchers identify which research topics are significant and how to apply related techniques in data analysis tasks, it is necessary that a high quality, instructive review of current research developments and trends in the concept drift field is conducted. In addition, due to the rapid development of concept drift in recent years, the methodologies of learning under concept drift have become noticeably systematic, unveiling a framework which has not been mentioned in literature. This paper reviews over 130 high quality publications in concept drift related research areas, analyzes up-to-date developments in methodologies and techniques, and establishes a framework of learning under concept drift including three main components: concept drift detection, concept drift understanding, and concept drift adaptation. This paper lists and discusses 10 popular synthetic datasets and 14 publicly available benchmark datasets used for evaluating the performance of learning algorithms aiming at handling concept drift. Also, concept drift related research directions are covered and discussed. By providing state-of-the-art knowledge, this survey will directly support researchers in their understanding of research developments in the field of learning under concept drift.
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
Data stream classification is an important problem in the field of machine learning. Due to the non-stationary nature of the data, whose underlying distribution changes over time (concept drift), the model needs to continuously adapt to the new data statistics. Stream-based active learning (AL) approaches address this problem by interactively querying a human expert to provide new data labels for the most recent samples within a limited budget. Existing AL strategies assume that labels are immediately available, whereas in real-world scenarios the expert requires time to provide the queried labels (verification latency), and by the time the requested labels arrive they may no longer be relevant. In this paper, we investigate the influence of finite, time-variable, and unknown verification latency on AL approaches in the presence of concept drift. We propose PRopagate (PR), a latency-independent utility estimator that also predicts the requested, but not yet known, labels. Furthermore, we propose a drift-dependent dynamic budget strategy that uses a variable distribution of the labeling budget over time after a detected drift. A thorough experimental evaluation is conducted on both synthetic and real-world non-stationary datasets, with different settings of verification latency and budget. We empirically show that the proposed method consistently outperforms the state of the art. In addition, we demonstrate that a time-dependent allocation of the budget can improve the performance of AL strategies without increasing the overall labeling budget.
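A drift-dependent budget of this kind can be pictured with a small stand-alone sketch (these mechanics and parameters are assumptions, not the authors' scheme): the instantaneous query probability is boosted when a drift is signalled and then decays back, while a global cap keeps the overall fraction of labeled instances at the nominal budget.

```python
import random

class DriftAwareBudget:
    """Illustrative drift-dependent dynamic budget: spend labels more
    aggressively right after a detected drift, less aggressively otherwise."""

    def __init__(self, base_budget=0.05, boost=4.0, decay=0.99, seed=0):
        self.base = base_budget
        self.rate = base_budget       # instantaneous labeling probability
        self.boost = boost
        self.decay = decay
        self.seen, self.spent = 0, 0
        self.rng = random.Random(seed)

    def on_drift_detected(self):
        self.rate = min(1.0, self.boost * self.base)

    def should_query(self):
        self.seen += 1
        # decay the boosted rate back towards the base rate between drifts
        self.rate = self.base + (self.rate - self.base) * self.decay
        if self.spent / self.seen >= self.base:   # never exceed the overall budget
            return False
        if self.rng.random() < self.rate:
            self.spent += 1
            return True
        return False
```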
In recent years, with the widespread deployment of sensors and smart devices, the rate of data generation in Internet of Things (IoT) systems has increased dramatically. In IoT systems, massive volumes of data must often be processed, transformed, and analyzed to enable various IoT services and functionalities. Machine learning (ML) approaches have demonstrated their capacity for IoT data analytics. However, applying ML models to IoT data analysis tasks still faces many difficulties and challenges, specifically effective model selection, design/tuning, and updating, which have created a huge demand for experienced data scientists. In addition, the dynamic nature of IoT data may introduce concept drift issues, leading to model performance degradation. To reduce human effort, Automated Machine Learning (AutoML) has become a popular field that aims to automatically select, construct, tune, and update machine learning models to achieve the best performance on specified tasks. In this paper, we review existing methods for the model selection, tuning, and updating procedures in the AutoML area to identify and summarize the optimal solutions for each step of applying ML algorithms to IoT data analytics. To justify our findings and help industrial users and researchers better implement AutoML approaches, a case study of applying AutoML to an IoT anomaly detection problem is presented in this work. Finally, we discuss and classify the challenges and research directions for this field.
Automated Machine Learning (AutoML) has been used successfully in settings where the learning task is assumed to be static. In many real-world scenarios, however, the data distribution will evolve over time, and it is yet to be shown whether AutoML techniques can effectively design online pipelines in dynamic environments. This study aims to automate pipeline design for online learning while continuously adapting to data drift. For this purpose, we design an adaptive Online Automated Machine Learning (OAML) system, searching the complete pipeline configuration space of online learners, including preprocessing algorithms and ensembling techniques. This system combines the inherent adaptation capabilities of online learners with the fast automated pipeline (re)optimization capabilities of AutoML. Focusing on optimization techniques that can adapt to evolving objectives, we evaluate asynchronous genetic programming and asynchronous successive halving to optimize these pipelines continually. We experiment on real and artificial data streams with varying types of concept drift to test the performance and adaptation capabilities of the proposed system. The results confirm the utility of OAML over popular online learning algorithms and underscore the benefits of continuous pipeline redesign in the presence of data drift.
Active learning (AL) attempts to maximize a model's performance gain while annotating the fewest samples possible. Deep learning (DL) is greedy for data and requires a large amount of data to optimize a massive number of parameters so that the model learns how to extract high-quality features. In recent years, due to the rapid development of internet technology, we are in an era of information abundance and have access to vast amounts of data. In this way, DL has attracted strong interest from researchers and has developed rapidly. Compared with DL, researchers' interest in AL has been relatively low. This is mainly because, before the rise of DL, traditional machine learning required relatively few labeled samples, and thus early AL struggled to demonstrate the value it deserves. Although DL has achieved breakthroughs in various fields, most of this success can be attributed to the availability of a large number of existing annotated datasets. However, the acquisition of large quantities of high-quality annotated data consumes a great deal of manpower, which is not feasible in fields that require high levels of expertise, such as speech recognition, information extraction, and medical imaging. Therefore, AL has gradually received the attention it deserves. A natural idea is whether AL can be used to reduce the cost of sample annotation while retaining the powerful learning capabilities of DL. As a result, deep active learning (DAL) has emerged. Although related research has been quite abundant, a comprehensive survey of DAL is still lacking. This paper aims to fill this gap: we provide a formal classification of the existing work, together with a comprehensive and systematic overview. In addition, we analyze and summarize the development of DAL from the perspective of applications. Finally, we discuss the confusions and open problems in DAL and give some possible directions for its development.
As an important data selection schema, active learning emerges as the essential component when iterating an Artificial Intelligence (AI) model. It becomes even more critical given the dominance of deep neural network based models, which are composed of a large number of parameters and data hungry, in application. Despite its indispensable role for developing AI models, research on active learning is not as intensive as other research directions. In this paper, we present a review of active learning through deep active learning approaches from the following perspectives: 1) technical advancements in active learning, 2) applications of active learning in computer vision, 3) industrial systems leveraging or with potential to leverage active learning for data iteration, 4) current limitations and future research directions. We expect this paper to clarify the significance of active learning in a modern AI model manufacturing process and to bring additional research attention to active learning. By addressing data automation challenges and coping with automated machine learning systems, active learning will facilitate democratization of AI technologies by boosting model production at scale.
An outlier is an event or observation defined as an abnormal activity, intrusion, or suspicious data point that lies at an irregular distance from a population. The definition of an outlier event is, however, subjective and depends on the application and the domain (energy, health, wireless networks, etc.). It is important to detect outlier events as carefully as possible in order to avoid infrastructure failures, because anomalous events can cause severe damage to infrastructure. For instance, an attack on a cyber-physical system such as a microgrid may initiate voltage or frequency instability, thereby damaging smart inverters whose repair is very expensive. Unusual activities in a microgrid can be mechanical faults, behavioral changes in the system, human or instrument errors, or malicious attacks. Consequently, and due to its variability, outlier detection (OD) is an ever-growing research field. In this chapter, we discuss the progress of OD methods that use AI techniques. To this end, the fundamental concepts of each OD model are introduced across multiple categories. The broad range of OD methods is divided into six major categories: statistical-based, distance-based, density-based, clustering-based, learning-based, and ensemble methods. For each category, we discuss recent state-of-the-art approaches, their application areas, and their performance. Afterwards, a brief discussion of the advantages, disadvantages, and challenges of the various techniques is provided, along with recommendations on future research directions. This survey aims to guide readers to a better understanding of recent progress in OD methods for the assurance of AI.
Complex event processing (CEP) is a set of methods that allow efficient knowledge extraction from massive data streams using complex and highly descriptive patterns. Numerous applications, such as online finance, healthcare monitoring, and fraud detection, use CEP technologies to capture critical alerts, potential threats, or vital notifications in real time. To date, in many fields, patterns are defined manually by human experts. However, the desired patterns often contain convoluted relations that are difficult for humans to detect, and human expertise is scarce in many domains. We present REDEEMER (a reinforcement-based CEP pattern miner), a novel reinforcement and active learning approach aimed at mining CEP patterns that allows the extracted knowledge to be expanded while reducing the human effort required. This approach includes a novel policy-gradient method for vast multivariate spaces and a new way of combining reinforcement and active learning for CEP rule learning while minimizing the number of labels needed for training. REDEEMER aims to enable CEP integration in domains that could not use it before. To the best of our knowledge, REDEEMER is the first system to suggest new CEP rules that were not previously observed, and the first method aimed at increasing pattern knowledge in fields where experts do not have sufficient information. Our experiments on diverse datasets demonstrate that REDEEMER is able to extend pattern knowledge while outperforming several state-of-the-art reinforcement learning methods for pattern mining.
Machine learning applications often have to cope with dynamic environments in which data are collected in the form of continuous data streams of potentially unbounded length and with transient behavior. Compared to traditional (batch) data mining, stream processing algorithms have additional requirements regarding computational resources and adaptability to data evolution. They must process instances incrementally, because the continuous flow of data prohibits storing the data for multiple passes. Ensemble learning has achieved remarkable predictive performance in this scenario. Implemented as a set of (several) individual classifiers, ensembles are naturally amenable to task parallelism. However, the incremental learning and the dynamic data structures used to capture concept drift increase cache misses and hinder the benefits of parallelism. This paper proposes a mini-batching strategy that improves the memory access locality and performance of several ensemble algorithms for stream mining in multi-core environments. With the aid of a formal framework, we show that mini-batching can significantly reduce the reuse distance (and hence the number of cache misses). Experiments with six different state-of-the-art ensemble algorithms applied to four benchmark datasets with varied characteristics show speedups of up to 5x on 8-core processors. These benefits come at the cost of a small reduction in predictive performance.
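The locality argument can be illustrated by contrasting the two access patterns below (a sketch; the river-style `learn_one` interface of the ensemble members and the batch size are assumptions). Buffering a mini-batch lets each member consume the whole batch while its state stays cache-resident, which is the reuse-distance effect the paper formalizes.

```python
def train_instance_at_a_time(ensemble, stream):
    """Baseline access pattern: every instance touches every member,
    so each member's state is pulled into cache once per instance."""
    for x, y in stream:
        for member in ensemble:
            member.learn_one(x, y)

def train_mini_batched(ensemble, stream, batch_size=50):
    """Mini-batching sketch: buffer `batch_size` instances, then let each
    member consume the whole batch before moving on to the next member."""
    buffer = []
    for x, y in stream:
        buffer.append((x, y))
        if len(buffer) == batch_size:
            for member in ensemble:          # outer loop over members...
                for xb, yb in buffer:        # ...inner loop over buffered data
                    member.learn_one(xb, yb)
            buffer.clear()
```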
Fairness-aware mining of massive data streams is a growing and challenging concern in the contemporary domain of machine learning. Many stream learning algorithms are used to replace humans at critical decision-making points e.g., hiring staff, assessing credit risk, etc. This calls for handling massive incoming information with minimum response delay while ensuring fair and high quality decisions. Recent discrimination-aware learning methods are optimized based on overall accuracy. However, the overall accuracy is biased in favor of the majority class; therefore, state-of-the-art methods mainly diminish discrimination by partially or completely ignoring the minority class. In this context, we propose a novel adaptation of Naïve Bayes to mitigate discrimination embedded in the streams while maintaining high predictive performance for both the majority and minority classes. Our proposed algorithm is simple, fast, and attains multi-objective optimization goals. To handle class imbalance and concept drifts, a dynamic instance weighting module is proposed, which gives more importance to recent instances and less importance to obsolete instances based on their membership in minority or majority class. We conducted experiments on a range of streaming and static datasets and deduced that our proposed methodology outperforms existing state-of-the-art fairness-aware methods in terms of both discrimination score and balanced accuracy.
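A dynamic instance-weighting module of the kind described can be sketched as follows (an illustrative formula, not the authors' exact one): weights decay with instance age and are boosted for the minority class, and could then be passed as `sample_weight` to an incremental Naïve Bayes learner.

```python
import numpy as np

def instance_weights(ages, labels, minority_label, decay=0.98, minority_boost=2.0):
    """Illustrative dynamic instance weighting: recent instances weigh more
    (exponential forgetting), minority-class instances are boosted to
    counter class imbalance."""
    ages = np.asarray(ages, dtype=float)          # 0 = newest instance
    w = decay ** ages                             # forgetting of obsolete instances
    w = np.where(np.asarray(labels) == minority_label, minority_boost * w, w)
    return w
```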
Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Unfortunately, however, model accuracy can degrade due to concept drift, whereby the relationship between the features and the target prediction changes for reasons ranging from software upgrades to seasonality to changes in user behavior. Mitigating concept drift is therefore an essential part of operationalizing machine learning models and, despite its importance, concept drift has not been extensively explored in the context of networking, or for regression models in general. Thus, it is not well understood how to detect or mitigate it for many common network management tasks that currently rely on machine learning models. Unfortunately, as we show, frequently retraining models with newly available data does not sufficiently mitigate concept drift and can even further degrade model accuracy. In this paper, we characterize concept drift in a large cellular network in a major metropolitan area of the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, the training set size, and the time interval, thus necessitating practical approaches to detect, explain, and mitigate it. To this end, we develop Local Error Approximation of Features (LEAF). LEAF detects drift; explains the features and time intervals that contribute the most to the drift; and mitigates drift using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches using more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF is effective on a variety of KPIs and models. LEAF consistently outperforms periodic and triggered retraining while also reducing costly retraining operations.
Concept drift in process mining (PM) is a challenge, as classical methods assume that processes are in a steady state, i.e., that events share the same process version. We conducted a systematic literature review at the intersection of these areas, reviewing concept drift in process mining and presenting a taxonomy of existing techniques for drift detection and online process mining in evolving environments. Existing works show that (i) PM still primarily focuses on offline analysis, and (ii) the evaluation of concept drift techniques in processes is cumbersome due to the lack of common evaluation protocols, datasets, and metrics.
Deployed machine learning models face the problem of data changing over time, a phenomenon also known as concept drift. Although existing concept drift detection methods have already shown convincing results, they require true labels as a prerequisite for successful drift detection. In many real-world application scenarios, such as those covered in this work, true labels are scarce and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drift without access to true labels. Our approach is based on the uncertainty estimates of a deep neural network combined with Monte Carlo dropout. Structural changes over time are detected by applying the ADWIN technique to the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to drift detection based on the input data alone, our approach considers the effect of the current input data on the properties of the prediction model rather than merely detecting changes in the input data (which can lead to unnecessary retraining). We show that UDD outperforms other state-of-the-art strategies on two synthetic and ten real-world datasets, for both regression and classification tasks.
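A minimal sketch of this idea follows (assumptions: a PyTorch classifier with dropout layers, an input tensor with a batch dimension, and river's drift.ADWIN as the change detector; this is not the authors' implementation). Dropout is kept active at inference, the spread of the sampled predictions yields an uncertainty value, and ADWIN watches that signal to trigger retraining.

```python
import torch
import torch.nn as nn
from river import drift  # ADWIN change detector (interface as in recent river versions)

def mc_dropout_uncertainty(model: nn.Module, x: torch.Tensor, n_samples: int = 20) -> float:
    """Predictive entropy via Monte Carlo dropout: keep dropout stochastic at
    inference time and average the sampled class probabilities."""
    model.train()  # keeps nn.Dropout layers active
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_p = probs.mean(dim=0)
    return float(-(mean_p * mean_p.clamp_min(1e-12).log()).sum())

detector = drift.ADWIN()

def process_instance(model, x, retrain):
    """Feed the uncertainty of each unlabeled instance to ADWIN; a detected
    change in the uncertainty stream triggers retraining of the model."""
    u = mc_dropout_uncertainty(model, x)
    detector.update(u)
    if detector.drift_detected:
        retrain()  # e.g. refit on the most recent labeled window
```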
This paper addresses some of the challenges encountered in the democratization of machine learning model deployment. The first challenge is reducing the labeling effort (hence focusing on data quality) with the help of active learning, a feedback loop between model inference and an oracle: as in insurance, where unlabeled data are usually abundant, active learning can become an important asset for reducing labeling costs. To this end, this paper reviews various classical active learning approaches before studying their empirical impact on both synthetic and real datasets. Another key challenge in insurance is the fairness of model inference. We introduce and integrate a post-processing fairness procedure for multi-class tasks into this active learning framework in order to address these two issues. Finally, numerical experiments on unfair datasets highlight that the proposed setup offers a good compromise between model accuracy and fairness.
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g. confidences on labels.
In recent years, interest in online incremental learning has grown. However, there are three major challenges in this area. The first difficulty is concept drift, i.e., the probability distribution of the streaming data changes as the data arrive. The second difficulty is catastrophic forgetting, i.e., forgetting what has been learned previously when learning new knowledge. The last one, which is often overlooked, is learning the latent representation: only a good latent representation can improve the predictive accuracy of the model. Our study builds on this observation and attempts to overcome these difficulties. To this end, we propose Adaptive Online Incremental Learning for evolving data streams (AOIL). We use an auto-encoder with a memory module: on the one hand, it provides the latent features of the input; on the other hand, based on the reconstruction loss of the auto-encoder with the memory module, we can successfully detect the presence of concept drift and trigger an update mechanism that adjusts the model parameters in time. In addition, we divide the features derived from the hidden-layer activations into two parts, which are used to extract common and private features, respectively. With this approach, the model can learn the private features of newly arriving instances without forgetting what has been learned in the past (the shared features), which reduces the occurrence of catastrophic forgetting. At the same time, to obtain the fused feature vector, we use a self-attention mechanism to effectively fuse the extracted features, which further improves latent representation learning.
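The reconstruction-loss trigger can be pictured with the small PyTorch sketch below (an illustration under an assumed architecture and threshold, not the AOIL model itself): the auto-encoder supplies latent features, and a clear rise of the reconstruction error over its running mean is treated as drift.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StreamAutoEncoder(nn.Module):
    """Tiny auto-encoder sketch: the latent code serves as the representation
    and the reconstruction loss serves as a drift signal."""
    def __init__(self, n_features, latent_dim=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

def reconstruction_drift(model, x, running_mean, alpha=0.01, factor=3.0):
    """Flag drift when the instantaneous reconstruction error clearly exceeds
    its running mean (a simple stand-in for a memory-based test)."""
    with torch.no_grad():
        x_hat, _ = model(x)
        err = float(F.mse_loss(x_hat, x))
    drifted = running_mean is not None and err > factor * running_mean
    running_mean = err if running_mean is None else (1 - alpha) * running_mean + alpha * err
    return drifted, running_mean
```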