This paper introduces a generic and scalable framework for automated anomaly detection on large-scale time-series data. Early detection of anomalies plays a key role in maintaining the consistency of a person's data and protects corporations against malicious attackers. Current state-of-the-art anomaly detection approaches suffer from scalability issues, use-case restrictions, difficulty of use and a large number of false positives. Our system at Yahoo, EGADS, uses a collection of anomaly detection and forecasting models with an anomaly filtering layer for accurate and scalable anomaly detection on time-series. We compare our approach against other anomaly detection systems on real and synthetic data with varying time-series characteristics. We found that our framework allows for a 50-60% improvement in precision and recall for a variety of use-cases. Both the data and the framework are being open-sourced. The open-sourcing of the data, in particular, represents the first effort of its kind to establish a standard benchmark for anomaly detection.
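The core loop the abstract describes — forecast each point with a model, then filter the residuals — can be sketched in a few lines. This is a hedged illustration, not EGADS itself: the seasonal-naive forecaster, the running mean-absolute-residual scale, and the threshold are stand-ins chosen for brevity.

```python
import random

def seasonal_naive(history, period):
    """Forecast: the value observed one season ago (a stand-in forecaster)."""
    return history[-period]

def detect(series, period=24, threshold=3.0):
    """Flag points whose residual exceeds `threshold` times the running
    mean absolute residual (a crude stand-in for EGADS's filtering layer)."""
    flags, residuals = [], []
    for t in range(len(series)):
        if t < period:
            flags.append(False)
            continue
        r = abs(series[t] - seasonal_naive(series[:t], period))
        scale = sum(residuals) / len(residuals) if residuals else r
        flags.append(bool(residuals) and r > threshold * max(scale, 1e-9))
        residuals.append(r)
    return flags

rng = random.Random(0)
# Hourly-looking series with a daily bump, plus one injected spike.
series = [10 + (5 if t % 24 in (8, 9, 10) else 0) + rng.gauss(0, 0.5)
          for t in range(240)]
series[180] += 12.0
flags = detect(series)
assert flags[180]
```

Swapping in a stronger forecaster (e.g. one of the models a framework like EGADS would select) only changes `seasonal_naive`; the filtering layer is unchanged.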
Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) is arguably the most widely applied unsupervised anomaly detection technique for networked data streams due to its simplicity and efficiency. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, i.e., anomaly localization. In this paper, we first propose a novel joint sparse PCA method to perform anomaly detection and localization for network data streams. Our key observation is that we can detect anomalies and localize anomalous sources by identifying a low-dimensional abnormal subspace that captures the abnormal behavior of the data. To better capture the sources of anomalies, we incorporate the structure of the network stream data into our anomaly localization framework. In addition, an extended version of PCA, multi-dimensional KLE, is introduced to stabilize the localization performance. We perform comprehensive experimental studies on four real-world data sets from different application domains and compare our proposed techniques with several state-of-the-art methods. Our experimental studies demonstrate the utility of the proposed methods.
Anomalies are unusual and significant changes in a network's traffic levels, which can often involve multiple links. Diagnosing anomalies is critical for both network operators and end users. It is a difficult problem because one must extract and interpret anomalous patterns from large amounts of high-dimensional, noisy data. In this paper we propose a general method to diagnose anomalies. This method is based on a separation of the high-dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. We show that this separation can be performed effectively using Principal Component Analysis. Using only simple traffic measurements from links, we study volume anomalies and show that the method can: (1) accurately detect when a volume anomaly is occurring; (2) correctly identify the underlying origin-destination (OD) flow which is the source of the anomaly; and (3) accurately estimate the amount of traffic involved in the anomalous OD flow. We evaluate the method's ability to diagnose (i.e., detect, identify, and quantify) both existing and synthetically injected volume anomalies in real traffic from two backbone networks. Our method consistently diagnoses the largest volume anomalies, and does so with a very low false alarm rate.
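The subspace separation described above can be illustrated compactly: estimate the "normal" subspace from the leading principal component and score each measurement by its residual energy (squared prediction error). A minimal sketch under simplifying assumptions — one principal component, power iteration, and toy two-link data, all of which are illustrative rather than taken from the paper:

```python
import math, random

def power_iteration(cov, iters=200):
    """Leading eigenvector of a small symmetric matrix."""
    d = len(cov)
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        n = math.sqrt(sum(x * x for x in w))
        v = [x / n for x in w]
    return v

def spe_scores(X):
    """Squared prediction error after removing the top principal component."""
    d, m = len(X[0]), len(X)
    mean = [sum(row[i] for row in X) / m for i in range(d)]
    C = [[row[i] - mean[i] for i in range(d)] for row in X]
    cov = [[sum(r[i] * r[j] for r in C) / m for j in range(d)] for i in range(d)]
    v = power_iteration(cov)
    scores = []
    for r in C:
        proj = sum(r[i] * v[i] for i in range(d))
        scores.append(sum((r[i] - proj * v[i]) ** 2 for i in range(d)))
    return scores

random.seed(0)
# Two correlated "link" measurements: normal traffic hugs a 1-D subspace.
X = [[t, 2 * t + random.gauss(0, 0.1)] for t in
     [random.random() for _ in range(200)]]
X.append([0.5, 5.0])  # injected volume anomaly, off the normal subspace
scores = spe_scores(X)
assert scores[-1] == max(scores)
```

In the paper's setting the rows would be link-load vectors and the normal subspace would keep several components; the residual-energy test is the same idea.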
Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD's reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
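A sketch of the centroid-based scoring idea: cluster the data with k-means, then rank each point by its distance to the nearest centroid. Note this is not PCAD itself — PCAD uses a phase-invariant cross-correlation measure and a modified k-means for unsynchronized light-curves; this toy uses Euclidean distance on 2-D points and hand-picked deterministic seed centroids:

```python
import math, random

def kmeans(points, centroids, iters=20):
    """Plain Lloyd iterations from given starting centroids."""
    k = len(centroids)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [
            [sum(q[i] for q in cl) / len(cl) for i in range(len(cl[0]))]
            if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def anomaly_scores(points, centroids):
    """PCAD-style score: distance to the nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]

rng = random.Random(1)
# Two dense "classes" of objects plus one point far from both.
data = [[rng.gauss(0, 0.2), rng.gauss(0, 0.2)] for _ in range(50)]
data += [[rng.gauss(5, 0.2), rng.gauss(5, 0.2)] for _ in range(50)]
data.append([10.0, -10.0])
cents = kmeans(data, centroids=[data[0], data[60]])  # deterministic seeds
scores = anomaly_scores(data, cents)
assert scores.index(max(scores)) == len(data) - 1
```

The ranked output is then simply the points sorted by score; "local" anomalies in PCAD are points that sit unusually far within an otherwise dense cluster.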
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is anytime, single-pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively.
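The hidden-variable idea can be approximated in a few lines with Oja's rule, which tracks the leading principal direction online with no buffering; the reconstruction error of each tick then spots pattern breaks. This is a simplified stand-in for SPIRIT, not the algorithm itself (one hidden variable, fixed learning rate, no energy-based adaptation of the number of hidden variables):

```python
import math, random

def oja_step(w, x, lr=0.05):
    """Project onto the current direction, score the reconstruction error,
    then apply Oja's rule to update the direction online."""
    y = sum(wi * xi for wi, xi in zip(w, x))            # hidden variable
    err = sum((xi - y * wi) ** 2 for wi, xi in zip(w, x))
    w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    n = math.sqrt(sum(wi * wi for wi in w))
    return [wi / n for wi in w], err

rng = random.Random(2)
w, errors = [1.0, 0.0, 0.0], []
for t in range(500):
    base = rng.gauss(0, 1)
    x = [base, 0.5 * base, -base]       # three correlated streams
    if t == 400:
        x = [5.0, -5.0, 5.0]            # the correlation pattern breaks
    w, err = oja_step(w, x)
    errors.append(err)
assert errors.index(max(errors)) == 400
```

As in SPIRIT, each tick is processed once and then discarded; only the direction vector (the model of the correlations) is retained.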
Identifying anomalies rapidly and accurately is critical to the efficient operation of large computer networks. Accurately characterizing important classes of anomalies greatly facilitates their identification; however, the subtleties and complexities of anomalous traffic can easily confound this process. In this paper we report results of signal analysis of four classes of network traffic anomalies: outages, flash crowds, attacks and measurement failures. Data for this study consists of IP flow and SNMP measurements collected over a six month period at the border router of a large university. Our results show that wavelet filters are quite effective at exposing the details of both ambient and anomalous traffic. Specifically, we show that a pseudo-spline filter tuned at specific aggregation levels will expose distinct characteristics of each class of anomaly. We show that an effective way of exposing anomalies is via the detection of a sharp increase in the local variance of the filtered data. We evaluate traffic anomaly signals at different points within a network based on topological distance from the anomaly source or destination. We show that anomalies can be exposed effectively even when aggregated with a large amount of additional traffic. We also compare the difference between the same traffic anomaly signals as seen in SNMP and IP flow data, and show that the more coarse-grained SNMP data can also be used to expose anomalies effectively.
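The variance-based detection step can be sketched directly: take wavelet detail coefficients (here a plain Haar filter rather than the paper's pseudo-spline filter) and look for a sharp increase in their local variance. All signal parameters below are illustrative:

```python
import math, random, statistics

def haar_detail(signal, level=1):
    """Haar wavelet detail coefficients at a given dyadic aggregation level."""
    s = list(signal)
    for _ in range(level - 1):  # coarsen: Haar approximation coefficients
        s = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s) - 1, 2)]
    return [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s) - 1, 2)]

def local_variance(x, win=8):
    """Sliding-window variance; a sharp rise marks anomalous behavior."""
    return [statistics.pvariance(x[i:i + win]) for i in range(len(x) - win + 1)]

rng = random.Random(5)
# Smooth diurnal-like traffic plus measurement noise and one traffic spike.
series = [math.sin(2 * math.pi * t / 256) + rng.gauss(0, 0.1)
          for t in range(1024)]
series[600] += 6.0
detail = haar_detail(series)
lv = local_variance(detail)
peak = lv.index(max(lv))
assert 293 <= peak <= 300  # windows covering detail index 600 // 2 == 300
```

The slow diurnal component and the ambient noise barely register in the detail coefficients, while the spike produces a variance jump orders of magnitude larger — the separation the paper exploits.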
Many high level representations of time series have been proposed for data mining, including Fourier transforms, wavelets, eigenwaves, piecewise polynomial models, etc. Many researchers have also considered symbolic representations of time series, noting that such representations would potentially allow researchers to draw on the wealth of data structures and algorithms from the text processing and bioinformatics communities. While many symbolic representations of time series have been introduced over the past decades, they all suffer from two fatal flaws. First, the dimensionality of the symbolic representation is the same as the original data, and virtually all data mining algorithms scale poorly with dimensionality. Second, although distance measures can be defined on the symbolic approaches, these distance measures have little correlation with distance measures defined on the original time series. In this work we formulate a new symbolic representation of time series. Our representation is unique in that it allows dimensionality/numerosity reduction, and it also allows distance measures to be defined on the symbolic approach that lower bound corresponding distance measures defined on the original series. As we shall demonstrate, this latter feature is particularly exciting because it allows one to run certain data mining algorithms on the efficiently manipulated symbolic representation, while producing identical results to the algorithms that operate on the original data. In particular, we will demonstrate the utility of our representation on various data mining tasks of clustering, classification, query by content, anomaly detection, motif discovery, and visualization.
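The representation described is SAX-style: z-normalize, reduce with piecewise aggregate approximation (PAA), then discretize against breakpoints that cut the standard normal into equiprobable regions. A condensed sketch for alphabet size 4 (the lower-bounding MINDIST distance, which gives the guarantee the abstract highlights, is omitted for brevity):

```python
import bisect, statistics

# Breakpoints cutting N(0,1) into four equiprobable regions (alphabet size 4).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]
ALPHABET = "abcd"

def sax(series, word_len):
    """z-normalize, PAA-reduce to `word_len` segments, then discretize."""
    mu, sd = statistics.mean(series), statistics.pstdev(series)
    z = [(v - mu) / sd for v in series]
    seg = len(z) // word_len
    paa = [sum(z[i * seg:(i + 1) * seg]) / seg for i in range(word_len)]
    return "".join(ALPHABET[bisect.bisect(BREAKPOINTS, v)] for v in paa)

word = sax([0, 0, 0, 0, 1, 1, 1, 1, 5, 5, 5, 5, 1, 1, 1, 1], word_len=4)
assert word == "abdb"  # low, mid-low, high, mid-low
```

The 16-point series becomes a 4-character word, which is the dimensionality/numerosity reduction the abstract contrasts with earlier symbolic schemes.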
We are seeing an enormous increase in the availability of streaming, time-series data. Largely driven by the rise of connected real-time data sources, this data presents technical challenges and opportunities. One fundamental capability for streaming analytics is to model each stream in an unsupervised fashion and detect unusual, anomalous behaviors in real-time. Early anomaly detection is valuable, yet it can be difficult to execute reliably in practice. Application constraints require systems to process data in real-time, not batches. Streaming data inherently exhibits concept drift, favoring algorithms that learn continuously. Furthermore, the massive number of independent streams in practice requires that anomaly detectors be fully automated. In this paper we propose a novel anomaly detection algorithm that meets these constraints. The technique is based on an online sequence memory algorithm called Hierarchical Temporal Memory (HTM). We also present results using the Numenta Anomaly Benchmark (NAB), a benchmark containing real-world data streams with labeled anomalies. The benchmark, the first of its kind, provides a controlled open-source environment for testing anomaly detection algorithms on streaming data. We present results and analysis for a wide range of algorithms on this benchmark, and discuss future challenges for the emerging field of streaming analytics.
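HTM itself is far beyond a short example, but the constraints the abstract lists — online, single-pass, continuously learning, fully automated — can be illustrated with a minimal stand-in: exponentially weighted mean/variance estimates with a score-before-update rule. This is not HTM; every parameter below is illustrative:

```python
import random

class StreamingDetector:
    """Online stand-in (not HTM): exponentially weighted mean/variance,
    scoring each point before the estimates adapt to it."""
    def __init__(self, alpha=0.05, threshold=4.0, warmup=30):
        self.alpha, self.threshold, self.warmup = alpha, threshold, warmup
        self.mean, self.var, self.n = 0.0, 1.0, 0

    def update(self, x):
        self.n += 1
        if self.n == 1:
            self.mean = x
            return False
        d = x - self.mean
        score = abs(d) / (self.var ** 0.5 + 1e-9)
        # Learn continuously (concept drift), but only after scoring,
        # so an anomaly cannot mask itself.
        self.mean += self.alpha * d
        self.var = (1 - self.alpha) * (self.var + self.alpha * d * d)
        return self.n > self.warmup and score > self.threshold

rng = random.Random(3)
det = StreamingDetector()
flags = [det.update(rng.gauss(10, 1) if t != 700 else 30.0)
         for t in range(1000)]
assert flags[700]
```

A benchmark like NAB scores exactly this kind of detector: each point is seen once, in order, and the detector must flag anomalies early while keeping false positives rare.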
In this study we consider the problem of outlier detection with multiple co-evolving time series data. To capture both the temporal dependence and the inter-series relatedness, a multi-task non-parametric model is proposed, which can be extended to data with a broader exponential family distribution by adopting the notion of Bregman divergence. Albeit convex, the learning problem can be hard as the time series accumulate. In this regard, an efficient randomized block coordinate descent (RBCD) algorithm is proposed. The model and the algorithm are tested with a real-world application, involving outlier detection and event analysis in power distribution networks with high resolution multi-stream measurements. It is shown that the incorporation of inter-series relatedness enables the detection of system level events which would otherwise be unobservable with traditional methods.
In Gardner (1985), I reviewed the research in exponential smoothing since the original work by Brown and Holt. This paper brings the state of the art up to date. The most important theoretical advance is the invention of a complete statistical rationale for exponential smoothing based on a new class of state-space models with a single source of error. The most important practical advance is the development of a robust method for smoothing damped multiplicative trends. We also have a new adaptive method for simple smoothing, the first such method to demonstrate credible improved forecast accuracy over fixed-parameter smoothing. Longstanding confusion in the literature about whether and how to renormalize seasonal indices in the Holt-Winters methods has finally been resolved. There has been significant work in forecasting for inventory control, including the development of new prediction distributions for total lead-time demand and several improved versions of Croston's method for forecasting intermittent time series. Regrettably, there has been little progress in the identification and selection of exponential smoothing methods. The research in this area is best described as inconclusive, and it is still difficult to beat the application of a damped trend to every time series.
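The "damped trend" singled out above as the most important practical advance is easy to state in code: Holt's additive-trend smoothing in which the trend's contribution to an h-step forecast is damped geometrically by a factor phi. A minimal sketch; the smoothing parameters are illustrative, not recommended values:

```python
def damped_holt(y, alpha=0.3, beta=0.1, phi=0.9):
    """Additive damped-trend exponential smoothing (illustrative parameters)."""
    level, trend = y[0], y[1] - y[0]
    for x in y[2:]:
        prev = level
        level = alpha * x + (1 - alpha) * (level + phi * trend)
        trend = beta * (level - prev) + (1 - beta) * phi * trend
    return level, trend

def forecast(level, trend, phi, h):
    """h-step-ahead forecast; the trend contribution is geometrically damped."""
    return level + sum(phi ** i for i in range(1, h + 1)) * trend

y = [10.0 + t for t in range(20)]          # a steady upward trend
lvl, tr = damped_holt(y)
f10 = forecast(lvl, tr, phi=0.9, h=10)
assert tr > 0
assert lvl < f10 < lvl + 10 * tr           # damped: below linear extrapolation
```

With phi = 1 this reduces to ordinary Holt smoothing; phi < 1 flattens long-horizon forecasts toward an asymptote, which is why applying a damped trend to every series is such a strong baseline.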
Anomaly detection is the problem of finding patterns in data that do not conform to an a priori expected behavior. This is related to the problem in which some samples are distant, in terms of a given metric, from the rest of the dataset, where these anomalous samples are indicated as outliers. Anomaly detection has recently attracted the attention of the research community, because of its relevance in real-world applications, like intrusion detection, fraud detection, fault detection and system health monitoring, among many others. Anomalies themselves can have a positive or negative nature, depending on their context and interpretation. However, in either case, it is important for decision makers to be able to detect them in order to take appropriate actions. The petroleum industry is one of the application contexts where these problems are present. The correct detection of such types of unusual information empowers the decision maker with the capacity to act on the system in order to correctly avoid, correct or react to the situations associated with them. In that application context, heavy extraction machines for pumping and generation operations, like turbomachines, are intensively monitored by hundreds of sensors each that send measurements with a high frequency for damage prevention. In this paper, we propose a combination of yet another segmentation algorithm (YASA), a novel fast and Sensors 2015, 15 2775 high quality segmentation algorithm, with a one-class support vector machine approach for efficient anomaly detection in turbomachines. The proposal is meant for dealing with the aforementioned task and to cope with the lack of labeled training data. As a result, we perform a series of empirical studies comparing our approach to other methods applied to benchmark problems and a real-life application related to oil platform turbomachinery anomaly detection.
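The overall pipeline — segment the sensor stream, extract features per segment, and fit a one-class model on healthy data only — can be sketched without the actual YASA or SVM machinery. Below, a centroid-plus-radius envelope stands in for the one-class SVM, and fixed-length windows stand in for YASA segments; both substitutions are labeled in the comments:

```python
import math, random, statistics

def features(segment):
    """Per-segment summary features (mean and spread of the window)."""
    return (statistics.mean(segment), statistics.pstdev(segment))

def fit_envelope(train_segments):
    """Centroid + max training radius: a crude stand-in for a one-class SVM."""
    feats = [features(s) for s in train_segments]
    center = tuple(statistics.mean(f[i] for f in feats) for i in range(2))
    radius = max(math.dist(f, center) for f in feats)
    return center, radius

def is_anomalous(segment, center, radius, margin=2.0):
    return math.dist(features(segment), center) > margin * radius

rng = random.Random(4)
# Healthy vibration-like windows (fixed-length stand-ins for YASA segments).
train = [[rng.gauss(0, 1) for _ in range(100)] for _ in range(30)]
center, radius = fit_envelope(train)
healthy = [rng.gauss(0, 1) for _ in range(100)]
faulty = [rng.gauss(0, 3) for _ in range(100)]  # variance blow-up
assert not is_anomalous(healthy, center, radius)
assert is_anomalous(faulty, center, radius)
```

The key property shared with the paper's approach is that only healthy data is needed for training, which is what makes the method usable when labeled faults are unavailable.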
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.

2014-04-18
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we provide a comprehensive exploration of both data mining and machine learning algorithms for these detection tasks. We give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the "why", of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
The use of a Traffic Matrix (TM) to describe the characteristics of a global network has attracted significant interest in network performance research. Due to the high dimensionality and sparsity of network traffic, Principal Component Analysis (PCA) has been successfully applied to TM analysis. PCA is one of the most common methods used in analysis of high-dimensional objects. This paper shows how to apply PCA to TM analysis and anomaly detection. The experiment results demonstrate that the PCA-based method can detect anomalies for both single and multiple nodes with high accuracy and efficiency.