This paper introduces a generic and scalable framework for automated anomaly detection on large-scale time-series data. Early detection of anomalies plays a key role in maintaining the consistency of a person's data and protects corporations against malicious attackers. Current state-of-the-art anomaly detection approaches suffer from scalability issues, use-case restrictions, difficulty of use, and large numbers of false positives. Our system at Yahoo, EGADS, uses a collection of anomaly detection and forecasting models with an anomaly filtering layer for accurate and scalable anomaly detection on time series. We compare our approach against other anomaly detection systems on real and synthetic data with varying time-series characteristics. We found that our framework allows for a 50-60% improvement in precision and recall for a variety of use cases. Both the data and the framework are being open-sourced. The open-sourcing of the data, in particular, represents a first-of-its-kind effort to establish a standard benchmark for anomaly detection.
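To make the pipeline concrete, here is a minimal sketch of the forecast-then-filter structure the abstract describes; it is not Yahoo's implementation. A rolling-mean forecast stands in for EGADS's collection of models, and a 3-sigma rule stands in for its anomaly filtering layer.

```python
# Minimal sketch only: EGADS itself selects among many forecasting models;
# here a rolling mean is the "model" and a 3-sigma test is the "filter".
import numpy as np

def detect_anomalies(series: np.ndarray, window: int = 24, k: float = 3.0):
    """Return indices whose deviation from a rolling-mean forecast
    exceeds k standard deviations of the recent history."""
    flagged = []
    for t in range(window, len(series)):
        history = series[t - window:t]
        expected = history.mean()                 # stand-in forecast
        if abs(series[t] - expected) > k * history.std():
            flagged.append(t)                     # stand-in filtering layer
    return flagged
```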
Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) is arguably the most widely applied unsupervised anomaly detection technique for networked data streams due to its simplicity and efficiency. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, i.e., anomaly localization. In this paper, we first propose a novel joint sparse PCA method to perform anomaly detection and localization for network data streams. Our key observation is that we can detect anomalies and localize anomalous sources by identifying a low-dimensional abnormal subspace that captures the abnormal behavior of the data. To better capture the sources of anomalies, we incorporate the structure of the network stream data into our anomaly localization framework. In addition, an extended version of PCA, multi-dimensional KLE, is introduced to stabilize the localization performance. We performed comprehensive experimental studies on four real-world data sets from different application domains and compared our proposed techniques with several state-of-the-art methods. Our experimental studies demonstrate the utility of the proposed methods.
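A rough illustration of the localization idea only: project out the normal subspace, then fit sparse components to the residual, so that streams with non-zero loadings emerge as candidate anomaly sources. sklearn's SparsePCA below is a stand-in for the paper's joint sparse PCA formulation, and n_normal and alpha are illustrative.

```python
# Hedged sketch: sparse loadings on the residual (abnormal) part of the data
# point to the streams driving an anomaly. Not the paper's formulation.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA

def localize_anomalies(X: np.ndarray, n_normal: int = 3):
    """X: (time, streams). Returns indices of streams implicated in anomalies."""
    pca = PCA(n_components=n_normal).fit(X)
    residual = X - pca.inverse_transform(pca.transform(X))  # abnormal part
    sparse = SparsePCA(n_components=1, alpha=1.0).fit(residual)
    return np.flatnonzero(sparse.components_[0])            # anomalous sources
```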
In this paper, we present a novel machine learning approach to anomaly detection. Focusing on data with periodic characteristics, where randomly varying period lengths are explicitly allowed, a multi-dimensional time-series analysis is performed by training a data-adaptive classifier composed of deep convolutional neural networks that perform phase classification. The overall algorithm, comprising data preprocessing, period detection, segmentation, and dynamic adjustment of the neural networks, is designed for fully automatic execution. The proposed approach is evaluated on three example data sets from cardiology, intrusion detection, and signal processing, and shows reasonable performance.
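Of the pipeline stages listed above, only period detection is compact enough to sketch here; the autocorrelation heuristic below assumes a fairly clean periodic signal and is not the paper's method.

```python
# Hedged sketch of one pipeline stage: estimate the period as the first local
# maximum of the autocorrelation after the zero-lag peak.
import numpy as np

def detect_period(x: np.ndarray) -> int:
    xc = x - x.mean()
    ac = np.correlate(xc, xc, mode="full")[len(x) - 1:]   # lags 0..n-1
    for lag in range(1, len(ac) - 1):
        if ac[lag - 1] < ac[lag] >= ac[lag + 1]:
            return lag                                    # dominant period
    return len(x)                                         # no periodicity found
```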
Due to the large volumes of data, the demand for autonomous and generic anomaly detection systems is increasing. However, developing a stand-alone, generic anomaly detection system that is both accurate and fast remains a challenge. In this paper, we propose combining conventional time-series analysis methods, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model and Seasonal-Trend decomposition using Loess (STL), to detect complex and varied anomalies. SARIMA and STL are usually applied only to stationary and periodic time series, but we show that, in combination, they can detect anomalies with high accuracy even on noisy and non-periodic data. We compare the algorithm against Long Short-Term Memory (LSTM), a deep-learning-based algorithm used in anomaly detection systems. In total, we used seven real data sets and four artificial data sets with different time-series properties to validate the performance of the proposed algorithm.
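A hedged sketch of the combination described above, via statsmodels: STL removes the seasonal component, SARIMA models what remains, and points with large residuals are flagged. The period, model order, and 3-sigma threshold are illustrative choices, not the paper's.

```python
# Sketch: STL deseasonalizes, SARIMA fits the remainder, residuals beyond
# 3 sigma are flagged as anomalies. All parameters illustrative.
import numpy as np
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.statespace.sarimax import SARIMAX

def stl_sarima_anomalies(y: np.ndarray, period: int = 24):
    stl = STL(y, period=period).fit()              # trend + seasonal + remainder
    deseasonalized = y - stl.seasonal
    model = SARIMAX(deseasonalized, order=(1, 1, 1)).fit(disp=False)
    residuals = y - (model.fittedvalues + stl.seasonal)
    return np.where(np.abs(residuals) > 3 * residuals.std())[0]
```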
Many applications across different domains produce large amounts of time-series data, and accurate forecasting is critical for many decision makers. A variety of time-series forecasting methods exist that use linear models, nonlinear models, or a combination of the two. Studies have shown that combining linear and nonlinear models can effectively improve forecasting performance. However, some of the assumptions made by these existing methods can limit their performance in certain situations. We present a new Autoregressive Integrated Moving Average (ARIMA)-Artificial Neural Network (ANN) hybrid method that works within a more general framework. Experimental results show that the strategies used to decompose the original data and to combine the linear and nonlinear models throughout the hybridization process are key factors in the method's forecasting performance. With appropriate strategies, our hybrid method can both improve on the forecast accuracy achieved by traditional hybrid methods and outperform either of its component methods used alone.
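The abstract leaves the exact hybridization strategy open; this sketch shows the classic decomposition that such frameworks generalize (after Zhang): ARIMA captures the linear component and an MLP models the ARIMA residuals. The orders, lag count, and layer sizes are illustrative.

```python
# Hedged sketch of a classic ARIMA+ANN hybrid, not the paper's framework.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.neural_network import MLPRegressor

def hybrid_forecast(y: np.ndarray, lags: int = 4) -> float:
    linear = ARIMA(y, order=(2, 1, 1)).fit()
    resid = linear.resid                       # nonlinear structure left over
    # Lagged residual matrix: row j is resid[j..j+lags-1], target resid[j+lags].
    X = np.column_stack([resid[i:len(resid) - lags + i] for i in range(lags)])
    ann = MLPRegressor(hidden_layer_sizes=(8,), max_iter=2000).fit(X, resid[lags:])
    # One-step-ahead forecast = linear forecast + predicted residual.
    return linear.forecast(1)[0] + ann.predict(resid[-lags:].reshape(1, -1))[0]
```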
Nowadays, multivariate time-series data are increasingly collected in various real-world systems, e.g., power plants, wearable devices, etc. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status in certain time steps and pinpointing the root causes. Building such a system, however, is challenging, since it not only requires capturing the temporal dependency within each time series, but also needs to encode the correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based on the severity of different incidents. Although many unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) to perform anomaly detection and diagnosis in multivariate time-series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system status at different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time-series) correlations, and an attention-based convolutional long short-term memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, based on the feature maps that encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices, and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods.
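The signature-matrix construction is simple enough to state in code; the sketch below covers only that first stage (the convolutional encoder, attention-based ConvLSTM, and decoder are omitted), and the window scales are illustrative.

```python
# Sketch of MSCRED's first stage only: multi-scale "signature matrices" that
# encode pairwise inter-series correlations over windows of several lengths.
import numpy as np

def signature_matrices(X: np.ndarray, t: int, scales=(10, 30, 60)):
    """X: (n_series, length) array; t must be >= max(scales).
    Returns one (n, n) matrix per scale, where entry (i, j) is the
    windowed inner product between series i and series j."""
    return [X[:, t - w:t] @ X[:, t - w:t].T / w for w in scales]
```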
Anomalies are unusual and significant changes in a network's traffic levels, which can often involve multiple links. Diagnosing anomalies is critical for both network operators and end users. It is a difficult problem because one must extract and interpret anomalous patterns from large amounts of high-dimensional, noisy data. In this paper we propose a general method to diagnose anomalies. This method is based on a separation of the high-dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. We show that this separation can be performed effectively using Principal Component Analysis. Using only simple traffic measurements from links, we study volume anomalies and show that the method can: (1) accurately detect when a volume anomaly is occurring; (2) correctly identify the underlying origin-destination (OD) flow which is the source of the anomaly; and (3) accurately estimate the amount of traffic involved in the anomalous OD flow. We evaluate the method's ability to diagnose (i.e., detect, identify, and quantify) both existing and synthetically injected volume anomalies in real traffic from two backbone networks. Our method consistently diagnoses the largest volume anomalies, and does so with a very low false alarm rate.
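A minimal sketch of the subspace separation: the top-k principal components span the normal subspace, and time bins with large residual energy (squared prediction error, SPE) are flagged. The component count and threshold below are illustrative; the paper sets the detection threshold with a formal Q-statistic.

```python
# Hedged sketch of PCA subspace anomaly detection on a link-traffic matrix.
import numpy as np

def spe_anomalies(Y: np.ndarray, k: int = 4):
    """Y: (time bins, links) matrix of traffic measurements."""
    Yc = Y - Y.mean(axis=0)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    P = Vt[:k].T                        # basis of the normal subspace
    residual = Yc - Yc @ P @ P.T        # projection onto the anomalous subspace
    spe = (residual ** 2).sum(axis=1)   # residual energy per time bin
    return np.where(spe > spe.mean() + 3 * spe.std())[0]
```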
Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD's reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
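PCAD's scoring reduces to a phase-invariant distance to the nearest centroid; the sketch below shows only that measure (the modified k-means that produces the centroids and the sampling machinery are omitted), and it assumes all curves have been resampled to a common length.

```python
# Hedged sketch of PCAD's core measure: distance minimized over all cyclic
# shifts, so unsynchronized periodic light-curves can be compared.
import numpy as np

def phase_invariant_distance(x: np.ndarray, centroid: np.ndarray) -> float:
    return min(np.linalg.norm(x - np.roll(centroid, s)) for s in range(len(x)))

def anomaly_score(x: np.ndarray, centroids) -> float:
    # Farther from every cluster centre at its best alignment = more anomalous.
    return min(phase_invariant_distance(x, c) for c in centroids)
```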
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is anytime, single-pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively.
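A sketch of a SPIRIT-style single-pass update (after the TrackW procedure of Papadimitriou et al.): each incoming n-dimensional tick adjusts k hidden-variable directions with no buffering and no pairwise stream comparisons. The logic that adapts k to keep reconstruction energy within bounds is omitted, and the forgetting factor is illustrative.

```python
# Hedged sketch of an incremental hidden-variable update, SPIRIT-style.
import numpy as np

def spirit_update(W: np.ndarray, d: np.ndarray, x: np.ndarray, lam: float = 0.96):
    """W: (k, n) direction estimates, d: (k,) positive energy estimates,
    x: (n,) new tick. Returns updated (W, d)."""
    x = x.copy()
    for i in range(W.shape[0]):
        y = W[i] @ x                 # i-th hidden variable for this tick
        d[i] = lam * d[i] + y ** 2   # energy with exponential forgetting
        e = x - y * W[i]             # reconstruction error for direction i
        W[i] = W[i] + y * e / d[i]   # nudge direction toward the data
        x = x - y * W[i]             # deflate before the next direction
    return W, d
```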
With the advent of Big Data, databases containing large quantities of similar time series are nowadays available in many applications. Forecasting the time series in these domains with traditional univariate forecasting procedures leaves great potential for producing accurate forecasts untapped. Recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, have demonstrated that they can outperform state-of-the-art univariate time-series forecasting methods in this context when trained across all available time series. However, if the time-series database is heterogeneous, accuracy may degrade, so that on the way towards fully automatic forecasting methods in this space, a notion of similarity between the time series needs to be built into the method. To this end, we propose a forecasting model that can be used with different types of RNN models on subgroups of similar time series, identified by time-series clustering techniques. We evaluate our proposed methodology using LSTM networks, a widely popular RNN variant. Our method achieves competitive results on benchmark data sets under competition evaluation procedures. In particular, in terms of mean sMAPE accuracy it consistently outperforms the baseline LSTM model, and it outperforms all other methods on the CIF2016 forecasting competition data set.
Identifying anomalies rapidly and accurately is critical to the efficient operation of large computer networks. Accurately characterizing important classes of anomalies greatly facilitates their identification; however, the subtleties and complexities of anomalous traffic can easily confound this process. In this paper we report results of signal analysis of four classes of network traffic anomalies: outages, flash crowds, attacks and measurement failures. Data for this study consists of IP flow and SNMP measurements collected over a six-month period at the border router of a large university. Our results show that wavelet filters are quite effective at exposing the details of both ambient and anomalous traffic. Specifically, we show that a pseudo-spline filter tuned at specific aggregation levels will expose distinct characteristics of each class of anomaly. We show that an effective way of exposing anomalies is via the detection of a sharp increase in the local variance of the filtered data. We evaluate traffic anomaly signals at different points within a network based on topological distance from the anomaly source or destination. We show that anomalies can be exposed effectively even when aggregated with a large amount of additional traffic. We also compare the difference between the same traffic anomaly signals as seen in SNMP and IP flow data, and show that the more coarse-grained SNMP data can also be used to expose anomalies effectively.
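A hedged sketch of the detection recipe described above: wavelet-filter the traffic signal, then flag sharp increases in the local variance of the filtered output. PyWavelets' db4 wavelet stands in for the paper's pseudo-spline filters, and the window and 3-sigma threshold are illustrative.

```python
# Sketch: finest-scale wavelet detail + sliding local variance + threshold.
import numpy as np
import pywt

def wavelet_variance_anomalies(signal: np.ndarray, window: int = 32):
    coeffs = pywt.wavedec(signal, "db4", level=3)
    # Reconstruct the finest-scale detail component at full signal length.
    detail = pywt.upcoef("d", coeffs[-1], "db4", level=1, take=len(signal))
    local_var = np.array([detail[max(0, t - window):t + 1].var()
                          for t in range(len(detail))])
    return np.where(local_var > local_var.mean() + 3 * local_var.std())[0]
```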
Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to avail of the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
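A minimal SAX-style sketch of the representation: z-normalize, reduce n points to w segment means with piecewise aggregate approximation (PAA), then discretize against breakpoints that cut the standard normal into equiprobable regions. The breakpoints below are for an alphabet of size 4, and the series length is assumed divisible by w.

```python
# Hedged sketch of symbolic discretization via PAA + Gaussian breakpoints.
import numpy as np

BREAKPOINTS = np.array([-0.6745, 0.0, 0.6745])  # N(0,1) quartile cut points

def sax(series: np.ndarray, w: int = 8, alphabet: str = "abcd") -> str:
    z = (series - series.mean()) / series.std()   # z-normalize first
    paa = z.reshape(w, -1).mean(axis=1)           # PAA: w segment means
    return "".join(alphabet[np.searchsorted(BREAKPOINTS, v)] for v in paa)
```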
We are seeing an enormous increase in the availability of streaming, time-series data. Largely driven by the rise of connected real-time data sources, this data presents technical challenges and opportunities. One fundamental capability for streaming analytics is to model each stream in an unsupervised fashion and detect unusual, anomalous behaviors in real time. Early anomaly detection is valuable, yet it can be difficult to execute reliably in practice. Application constraints require systems to process data in real time, not batches. Streaming data inherently exhibits concept drift, favoring algorithms that learn continuously. Furthermore, the massive number of independent streams in practice requires that anomaly detectors be fully automated. In this paper we propose a novel anomaly detection algorithm that meets these constraints. The technique is based on an online sequence memory algorithm called Hierarchical Temporal Memory (HTM). We also present results using the Numenta Anomaly Benchmark (NAB), a benchmark containing real-world data streams with labeled anomalies. The benchmark, the first of its kind, provides a controlled open-source environment for testing anomaly detection algorithms on streaming data. We present results and analysis for a wide range of algorithms on this benchmark, and discuss future challenges for the emerging field of streaming analytics.
In this study we consider the problem of outlier detection with multiple co-evolving time series data. To capture both the temporal dependence and the inter-series relatedness, a multi-task non-parametric model is proposed, which can be extended to data with a broader exponential family distribution by adopting the notion of Bregman divergence. Albeit convex, the learning problem can be hard as the time series accumulate. In this regard, an efficient randomized block coordinate descent (RBCD) algorithm is proposed. The model and the algorithm are tested with a real-world application, involving outlier detection and event analysis in power distribution networks with high resolution multi-stream measurements. It is shown that the incorporation of inter-series relatedness enables the detection of system level events which would otherwise be unobservable with traditional methods.
In Gardner (1985), I reviewed the research in exponential smoothing since the original work by Brown and Holt. This paper brings the state of the art up to date. The most important theoretical advance is the invention of a complete statistical rationale for exponential smoothing based on a new class of state-space models with a single source of error. The most important practical advance is the development of a robust method for smoothing damped multiplicative trends. We also have a new adaptive method for simple smoothing, the first such method to demonstrate credible improved forecast accuracy over fixed-parameter smoothing. Longstanding confusion in the literature about whether and how to renormalize seasonal indices in the Holt-Winters methods has finally been resolved. There has been significant work in forecasting for inventory control, including the development of new prediction distributions for total lead-time demand and several improved versions of Croston's method for forecasting intermittent time series. Regrettably, there has been little progress in the identification and selection of exponential smoothing methods. The research in this area is best described as inconclusive, and it is still difficult to beat the application of a damped trend to every time series.
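The damped-trend method the review singles out is short enough to sketch; below is the additive-trend variant (Gardner-McKenzie), where phi < 1 flattens the trend at long horizons. The smoothing parameters are illustrative.

```python
# Hedged sketch of damped-trend exponential smoothing (additive variant).
import numpy as np

def damped_trend_forecast(y, alpha=0.3, beta=0.1, phi=0.9, horizon=10):
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev = level
        level = alpha * obs + (1 - alpha) * (prev + phi * trend)
        trend = beta * (level - prev) + (1 - beta) * phi * trend
    # h-step-ahead forecast: level + (phi + phi^2 + ... + phi^h) * trend
    return level + np.cumsum(phi ** np.arange(1, horizon + 1)) * trend
```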
Anomaly detection is the problem of finding patterns in data that do not conform to an a priori expected behavior. This is related to the problem in which some samples are distant, in terms of a given metric, from the rest of the dataset, where these anomalous samples are indicated as outliers. Anomaly detection has recently attracted the attention of the research community, because of its relevance in real-world applications, like intrusion detection, fraud detection, fault detection and system health monitoring, among many others. Anomalies themselves can have a positive or negative nature, depending on their context and interpretation. However, in either case, it is important for decision makers to be able to detect them in order to take appropriate actions. The petroleum industry is one of the application contexts where these problems are present. The correct detection of such types of unusual information empowers the decision maker with the capacity to act on the system in order to correctly avoid, correct or react to the situations associated with them. In that application context, heavy extraction machines for pumping and generation operations, like turbomachines, are intensively monitored by hundreds of sensors each that send measurements with a high frequency for damage prevention. In this paper, we propose a combination of yet another segmentation algorithm (YASA), a novel, fast, and high-quality segmentation algorithm, with a one-class support vector machine approach for efficient anomaly detection in turbomachines. The proposal is meant for dealing with the aforementioned task and for coping with the lack of labeled training data. As a result, we perform a series of empirical studies comparing our approach to other methods applied to benchmark problems and a real-life application related to oil platform turbomachinery anomaly detection.
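A hedged sketch of the anomaly-scoring half of the proposal: a one-class SVM fitted on features of segments assumed normal. The fixed-width windows and the three features below are stand-ins for YASA's segmentation output, and nu is illustrative.

```python
# Sketch: one-class SVM over simple per-segment features; segmentation here
# uses fixed windows as a stand-in for YASA.
import numpy as np
from sklearn.svm import OneClassSVM

def segment_features(x: np.ndarray, width: int = 100) -> np.ndarray:
    segments = x[: len(x) // width * width].reshape(-1, width)
    return np.column_stack([segments.mean(axis=1), segments.std(axis=1),
                            segments.max(axis=1) - segments.min(axis=1)])

detector = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
# detector.fit(segment_features(healthy_signal))
# flags = detector.predict(segment_features(new_signal))   # -1 marks anomalies
```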
We examine the use of three conventional anomaly detection methods and evaluate their potential for on-line tool wear monitoring. With efficient data processing and transformation of the algorithms presented here, these methods were tested in a real-time environment for the rapid evaluation of cutting tools on NC machine tools. The three-dimensional force data streams we used were extracted from a turning experiment of 21 runs, in which each tool was run until it typically met an end-of-life criterion. Our real-time anomaly detection algorithms are scored and optimized according to how precisely they predict the progressive wear of the tool flank. Most of our tool wear predictions were accurate and reliable, as our off-line simulation results show. Particularly when multivariate analysis was applied, the algorithms we developed proved very robust across different scenarios and against parameter variations. Applying our approach elsewhere for real-time tool wear analytics should be fairly straightforward.
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
The use of a Traffic Matrix (TM) to describe the characteristics of a global network has attracted significant interest in network performance research. Due to the high dimensionality and sparsity of network traffic, Principal Component Analysis (PCA) has been successfully applied to TM analysis. PCA is one of the most common methods used in analysis of high-dimensional objects. This paper shows how to apply PCA to TM analysis and anomaly detection. The experiment results demonstrate that the PCA-based method can detect anomalies for both single and multiple nodes with high accuracy and efficiency.