This paper presents an introduction to the state-of-the-art in anomaly and change-point detection. On the one hand, the main concepts needed to understand the vast scientific literature on those subjects are introduced. On the other, a selection of important surveys and books, as well as two selected active research topics in the field, are presented.
translated by 谷歌翻译
Time series anomaly detection has applications in a wide range of research fields and applications, including manufacturing and healthcare. The presence of anomalies can indicate novel or unexpected events, such as production faults, system defects, or heart fluttering, and is therefore of particular interest. The large size and complex patterns of time series have led researchers to develop specialised deep learning models for detecting anomalous patterns. This survey focuses on providing structured and comprehensive state-of-the-art time series anomaly detection models through the use of deep learning. It providing a taxonomy based on the factors that divide anomaly detection models into different categories. Aside from describing the basic anomaly detection technique for each category, the advantages and limitations are also discussed. Furthermore, this study includes examples of deep anomaly detection in time series across various application domains in recent years. It finally summarises open issues in research and challenges faced while adopting deep anomaly detection models.
translated by 谷歌翻译
该行业许多领域的自动化越来越多地要求为检测异常事件设计有效的机器学习解决方案。随着传感器的普遍存在传感器监测几乎连续地区的复杂基础设施的健康,异常检测现在可以依赖于以非常高的频率进行采样的测量,从而提供了在监视下的现象的非常丰富的代表性。为了充分利用如此收集的信息,观察不能再被视为多变量数据,并且需要一个功能分析方法。本文的目的是探讨近期对实际数据集的功能设置中异常检测技术的性能。在概述最先进的和视觉描述性研究之后,比较各种异常检测方法。虽然功能设置中的异常分类(例如,形状,位置)在文献中记录,但为所识别的异常分配特定类型似乎是一个具有挑战性的任务。因此,鉴于模拟研究中的这些突出显示类型,现有方法的强度和弱点是基准测试。接下来在两个数据集上评估异常检测方法,与飞行中的直升机监测和建筑材料的光谱相同有关。基准分析由从业者的建议指导结束。
translated by 谷歌翻译
Unsupervised anomaly detection in time-series has been extensively investigated in the literature. Notwithstanding the relevance of this topic in numerous application fields, a complete and extensive evaluation of recent state-of-the-art techniques is still missing. Few efforts have been made to compare existing unsupervised time-series anomaly detection methods rigorously. However, only standard performance metrics, namely precision, recall, and F1-score are usually considered. Essential aspects for assessing their practical relevance are therefore neglected. This paper proposes an original and in-depth evaluation study of recent unsupervised anomaly detection techniques in time-series. Instead of relying solely on standard performance metrics, additional yet informative metrics and protocols are taken into account. In particular, (1) more elaborate performance metrics specifically tailored for time-series are used; (2) the model size and the model stability are studied; (3) an analysis of the tested approaches with respect to the anomaly type is provided; and (4) a clear and unique protocol is followed for all experiments. Overall, this extensive analysis aims to assess the maturity of state-of-the-art time-series anomaly detection, give insights regarding their applicability under real-world setups and provide to the community a more complete evaluation protocol.
translated by 谷歌翻译
The detection of anomalies in time series data is crucial in a wide range of applications, such as system monitoring, health care or cyber security. While the vast number of available methods makes selecting the right method for a certain application hard enough, different methods have different strengths, e.g. regarding the type of anomalies they are able to find. In this work, we compare six unsupervised anomaly detection methods with different complexities to answer the questions: Are the more complex methods usually performing better? And are there specific anomaly types that those method are tailored to? The comparison is done on the UCR anomaly archive, a recent benchmark dataset for anomaly detection. We compare the six methods by analyzing the experimental results on a dataset- and anomaly type level after tuning the necessary hyperparameter for each method. Additionally we examine the ability of individual methods to incorporate prior knowledge about the anomalies and analyse the differences of point-wise and sequence wise features. We show with broad experiments, that the classical machine learning methods show a superior performance compared to the deep learning methods across a wide range of anomaly types.
translated by 谷歌翻译
日志数据异常检测是IT操作的人工智能区域中的核心组件。但是,大量现有方法使其难以为特定系统选择正确的方法。更好地了解不同种类的异常,以及哪些算法适合检测它们,将支持研究人员和IT运营商。虽然已经存在的异常分类常见的分类,但尚未专门应用于记录数据,指出该域中的特征和特点。在本文中,我们为不同种类的日志数据异常提供了一种分类,并介绍了一种分析标记数据集中的这种异常的方法。我们将我们的分类系统应用于三个常见的基准数据集Thunderbird,Spirit和BGL,并培训了五种最先进的无监督异常检测算法,以评估它们在检测不同种类的异常中的性能。我们的结果表明,最常见的异常类型也是最容易预测的。此外,基于深度学习的方法在所有异常类型中占据了基于数据的方法,但特别是当涉及到检测语境异常时。
translated by 谷歌翻译
对于由硬件和软件组件组成的复杂分布式系统而言,异常检测是一个重要的问题。对此类系统的异常检测的要求和挑战的透彻理解对于系统的安全性至关重要,尤其是对于现实世界的部署。尽管有许多解决问题的研究领域和应用领域,但很少有人试图对这种系统进行深入研究。大多数异常检测技术是针对某些应用域的专门开发的,而其他检测技术则更为通用。在这项调查中,我们探讨了基于图的算法在复杂分布式异质系统中识别和减轻不同类型异常的重要潜力。我们的主要重点是在分布在复杂分布式系统上的异质计算设备上应用时,可深入了解图。这项研究分析,比较和对比该领域的最新研究文章。首先,我们描述了现实世界分布式系统的特征及其在复杂网络中的异常检测的特定挑战,例如数据和评估,异常的性质以及现实世界的要求。稍后,我们讨论了为什么可以在此类系统中利用图形以及使用图的好处。然后,我们将恰当地深入研究最先进的方法,并突出它们的优势和劣势。最后,我们评估和比较这些方法,并指出可能改进的领域。
translated by 谷歌翻译
本文提出了在时间序列中对在线异常检测的虚假发现率控制(FDRC)进行新规则。在线FDRC规则允许控制统计测试序列的属性。在异常检测的背景下,零假设是观察是正常的,并且替代方案是它是异常的。FDRC规则允许用户在无监督的设置中瞄准精度的较低限制。本文中提出的方法在异常检测的背景下克服了先前FDRC规则的短暂转移,特别是即使当替代方案非常罕见(典型的异常检测)并且测试统计数据串行依赖时,即使在替代的情况下,也是确保电力仍然很高。(典型的在时间序列中)。我们在理论和实验中展示了这些规则的健全性。
translated by 谷歌翻译
Concept drift primarily refers to an online supervised learning scenario when the relation between the input data and the target variable changes over time. Assuming a general knowledge of supervised learning in this paper we characterize adaptive learning process, categorize existing strategies for handling concept drift, overview the most representative, distinct and popular techniques and algorithms, discuss evaluation methodology of adaptive algorithms, and present a set of illustrative applications. The survey covers the different facets of concept drift in an integrated way to reflect on the existing scattered state-of-the-art. Thus, it aims at providing a comprehensive introduction to the concept drift adaptation for researchers, industry analysts and practitioners.
translated by 谷歌翻译
给定传感器读数随着时间的推移从电网上,我们如何在发生异常时准确地检测?实现这一目标的关键部分是使用电网传感器网络在电网上实时地在实时检测到自然故障或恶意的任何不寻常的事件。行业中现有的坏数据探测器缺乏鲁布布利地检测广泛类型的异常,特别是由于新兴网络攻击而造成的复杂性,因为它们一次在网格的单个测量快照上运行。新的ML方法更广泛适用,但通常不会考虑拓扑变化对传感器测量的影响,因此无法适应历史数据中的定期拓扑调整。因此,我们向DynWatch,基于域知识和拓扑知识算法用于使用动态网格上的传感器进行异常检测。我们的方法准确,优于实验中的现有方法20%以上(F-Measure);快速,在60K +分支机用中的每次传感器上平均运行小于1.7ms,使用笔记本电脑,并在图表的大小上线性缩放。
translated by 谷歌翻译
异常值是一个事件或观察,其被定义为不同于距群体的不规则距离的异常活动,入侵或可疑数据点。然而,异常事件的定义是主观的,取决于应用程序和域(能量,健康,无线网络等)。重要的是要尽可能仔细地检测异常事件,以避免基础设施故障,因为异常事件可能导致对基础设施的严重损坏。例如,诸如微电网的网络物理系统的攻击可以发起电压或频率不稳定性,从而损坏涉及非常昂贵的修复的智能逆变器。微电网中的不寻常活动可以是机械故障,行为在系统中发生变化,人体或仪器错误或恶意攻击。因此,由于其可变性,异常值检测(OD)是一个不断增长的研究领域。在本章中,我们讨论了使用AI技术的OD方法的进展。为此,通过多个类别引入每个OD模型的基本概念。广泛的OD方法分为六大类:基于统计,基于距离,基于密度的,基于群集的,基于学习的和合奏方法。对于每个类别,我们讨论最近最先进的方法,他们的应用领域和表演。之后,关于对未来研究方向的建议提供了关于各种技术的优缺点和挑战的简要讨论。该调查旨在指导读者更好地了解OD方法的最新进展,以便保证AI。
translated by 谷歌翻译
成像,散射和光谱是理解和发现新功能材料的基础。自动化和实验技术的当代创新导致这些测量更快,分辨率更高,从而产生了大量的分析数据。这些创新在用户设施和同步射击光源时特别明显。机器学习(ML)方法经常开发用于实时地处理和解释大型数据集。然而,仍然存在概念障碍,进入设施一般用户社区,通常缺乏ML的专业知识,以及部署ML模型的技术障碍。在此,我们展示了各种原型ML模型,用于在国家同步光源II(NSLS-II)的多个波束线上在飞行分析。我们谨慎地描述这些示例,专注于将模型集成到现有的实验工作流程中,使得读者可以容易地将它们自己的ML技术与具有普通基础设施的NSLS-II或设施的实验中的实验。此处介绍的框架展示了几乎没有努力,多样化的ML型号通过集成到实验编程和数据管理的现有Blueske套件中与反馈回路一起运行。
translated by 谷歌翻译
The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory such that they cannot be met in the contexts of supervised learning. Algorithms are chosen and designed based on criteria which are often not clearly stated, for problem settings not clearly defined, tested in unrealistic settings, and/or in isolation from related approaches in the wider literature. This puts into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; and we take a fresh look at what constitutes a supervised data-stream learning task, and a reconsideration of algorithms that may be applied to tackle such tasks. Through and in reflection of this formulation and overview, helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime; and any constraints on memory and time are not specific to streaming. Meanwhile, there exist established techniques for dealing with temporal dependence and concept drift, in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability which are increasingly relevant to learning in data streams in academic and industrial settings.
translated by 谷歌翻译
为了允许机器学习算法从原始数据中提取知识,必须首先清除,转换,并将这些数据置于适当的形式。这些通常很耗时的阶段被称为预处理。预处理阶段的一个重要步骤是特征选择,其目的通过减少数据集的特征量来更好地执行预测模型。在这些数据集中,不同事件的实例通常是不平衡的,这意味着某些正常事件被超出,而其他罕见事件非常有限。通常,这些罕见的事件具有特殊的兴趣,因为它们具有比正常事件更具辨别力。这项工作的目的是过滤提供给这些罕见实例的特征选择方法的实例,从而积极影响特征选择过程。在这项工作过程中,我们能够表明这种过滤对分类模型的性能以及异常值检测方法适用于该过滤。对于某些数据集,所产生的性能增加仅为百分点,但对于其他数据集,我们能够实现高达16%的性能的增加。这项工作应导致预测模型的改进以及在预处理阶段的过程中的特征选择更好的可解释性。本着公开科学的精神,提高了我们的研究领域的透明度,我们已经在公开的存储库中提供了我们的所有源代码和我们的实验结果。
translated by 谷歌翻译
现代高性能计算(HPC)系统的复杂性日益增加,需要引入自动化和数据驱动的方法,以支持系统管理员为增加系统可用性的努力。异常检测是改善可用性不可或缺的一部分,因为它减轻了系统管理员的负担,并减少了异常和解决方案之间的时间。但是,对当前的最新检测方法进行了监督和半监督,因此它们需要具有异常的人体标签数据集 - 在生产HPC系统中收集通常是不切实际的。基于聚类的无监督异常检测方法,旨在减轻准确的异常数据的需求,到目前为止的性能差。在这项工作中,我们通过提出RUAD来克服这些局限性,RUAD是一种新型的无监督异常检测模型。 Ruad比当前的半监督和无监督的SOA方法取得了更好的结果。这是通过考虑数据中的时间依赖性以及在模型体系结构中包括长短期限内存单元的实现。提出的方法是根据tier-0系统(带有980个节点的Cineca的Marconi100的完整历史)评估的。 RUAD在半监督训练中达到曲线(AUC)下的区域(AUC)为0.763,在无监督的训练中达到了0.767的AUC,这改进了SOA方法,在半监督训练中达到0.747的AUC,无需训练的AUC和0.734的AUC在无处不在的AUC中提高了AUC。训练。它还大大优于基于聚类的当前SOA无监督的异常检测方法,其AUC为0.548。
translated by 谷歌翻译
自动日志文件分析可以尽早发现相关事件,例如系统故障。特别是,自我学习的异常检测技术在日志数据中捕获模式,随后向系统操作员报告意外的日志事件事件,而无需提前提供或手动对异常情况进行建模。最近,已经提出了越来越多的方法来利用深度学习神经网络为此目的。与传统的机器学习技术相比,这些方法证明了出色的检测性能,并同时解决了不稳定数据格式的问题。但是,有许多不同的深度学习体系结构,并且编码由神经网络分析的原始和非结构化日志数据是不平凡的。因此,我们进行了系统的文献综述,概述了部署的模型,数据预处理机制,异常检测技术和评估。该调查没有定量比较现有方法,而是旨在帮助读者了解不同模型体系结构的相关方面,并强调未来工作的开放问题。
translated by 谷歌翻译
在许多应用程序中,检测异常行为是新兴的需求,尤其是在安全性和可靠性是关键方面的情况下。尽管对异常的定义严格取决于域框架,但它通常是不切实际的或太耗时的,无法获得完全标记的数据集。使用无监督模型来克服缺乏标签的模型通常无法捕获特定的特定异常情况,因为它们依赖于异常值的一般定义。本文提出了一种新的基于积极学习的方法Alif,以通过减少所需标签的数量并将检测器调整为用户提供的异常的定义来解决此问题。在存在决策支持系统(DSS)的情况下,提出的方法特别有吸引力,这种情况在现实世界中越来越流行。尽管常见的DSS嵌入异常检测功能取决于无监督的模型,但它们没有办法提高性能:Alif能够通过在常见操作期间利用用户反馈来增强DSS的功能。 Alif是对流行的隔离森林的轻巧修改,在许多真实的异常检测数据集中,相对于其他最先进的算法证明了相对于其他最先进算法的出色性能。
translated by 谷歌翻译
Label noise is an important issue in classification, with many potential negative consequences. For example, the accuracy of predictions may decrease, whereas the complexity of inferred models and the number of necessary training samples may increase. Many works in the literature have been devoted to the study of label noise and the development of techniques to deal with label noise. However, the field lacks a comprehensive survey on the different types of label noise, their consequences and the algorithms that consider label noise. This paper proposes to fill this gap. First, the definitions and sources of label noise are considered and a taxonomy of the types of label noise is proposed. Second, the potential consequences of label noise are discussed. Third, label noise-robust, label noise cleansing, and label noise-tolerant algorithms are reviewed. For each category of approaches, a short discussion is proposed to help the practitioner to choose the most suitable technique in its own particular field of application. Eventually, the design of experiments is also discussed, what may interest the researchers who would like to test their own algorithms. In this paper, label noise consists of mislabeled instances: no additional information is assumed to be available like e.g. confidences on labels.
translated by 谷歌翻译
分布(OOD)检测对于确保机器学习系统的可靠性和安全性至关重要。例如,在自动驾驶中,我们希望驾驶系统在发现在训练时间中从未见过的异常​​场景或对象时,发出警报并将控件移交给人类,并且无法做出安全的决定。该术语《 OOD检测》于2017年首次出现,此后引起了研究界的越来越多的关注,从而导致了大量开发的方法,从基于分类到基于密度到基于距离的方法。同时,其他几个问题,包括异常检测(AD),新颖性检测(ND),开放式识别(OSR)和离群检测(OD)(OD),在动机和方法方面与OOD检测密切相关。尽管有共同的目标,但这些主题是孤立发展的,它们在定义和问题设定方面的细微差异通常会使读者和从业者感到困惑。在这项调查中,我们首先提出一个称为广义OOD检测的统一框架,该框架涵盖了上述五个问题,即AD,ND,OSR,OOD检测和OD。在我们的框架下,这五个问题可以看作是特殊情况或子任务,并且更容易区分。然后,我们通过总结了他们最近的技术发展来审查这五个领域中的每一个,特别关注OOD检测方法。我们以公开挑战和潜在的研究方向结束了这项调查。
translated by 谷歌翻译
自2009年比特币成立以来,随着日常交易超过100亿美元,加密货币的市场已经超出了初始预期。随着行业的自动化,自动欺诈探测器的需求变得非常明显。实时检测异常会阻止潜在的事故和经济损失。多元时间序列数据中的异常检测提出了一个特定的挑战,因为它需要同时考虑时间依赖性和变量之间的关系。实时识别异常并不是一项容易的任务,特别是因为他们观察到的确切的异常行为。有些要点可能会呈现全球或局部异常行为,而其他点由于其频率或季节性行为或趋势的变化,可能是异常的。在本文中,我们建议从特定帐户进行以太坊的实时交易,并调查了各种各样的传统和新算法。我们根据他们搜索的策略和异常行为对它们进行分类,并表明当它们将它们捆绑在一起时,它们可以证明是一个很好的实时探测器,警报时间不超过几秒钟,并且非常有高信心。
translated by 谷歌翻译