This paper introduces a generic and scalable framework for automated anomaly detection on large-scale time-series data. Early detection of anomalies plays a key role in maintaining the consistency of a person's data and protects corporations against malicious attackers. Current state-of-the-art anomaly detection approaches suffer from scalability issues, use-case restrictions, difficulty of use, and a large number of false positives. Our system at Yahoo, EGADS, uses a collection of anomaly detection and forecasting models with an anomaly filtering layer for accurate and scalable anomaly detection on time series. We compare our approach against other anomaly detection systems on real and synthetic data with varying time-series characteristics. We found that our framework allows for a 50-60% improvement in precision and recall for a variety of use cases. Both the data and the framework are being open-sourced. The open-sourcing of the data, in particular, represents a first-of-its-kind effort to establish a standard benchmark for anomaly detection.
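The forecast-then-filter pipeline described above can be illustrated with a minimal, dependency-free sketch: forecast each point (here a simple EWMA stands in for EGADS's collection of models) and let a filtering layer flag only deviations that are extreme relative to the forecast errors seen so far. This is an illustrative simplification, not Yahoo's actual implementation; `detect_anomalies`, `alpha`, and `k` are hypothetical names and parameters.

```python
def detect_anomalies(series, alpha=0.3, k=3.0):
    """Flag indices whose deviation from an EWMA forecast exceeds
    k standard deviations of the forecast errors seen so far."""
    anomalies, errors = [], []
    forecast = series[0]
    for i in range(1, len(series)):
        err = series[i] - forecast
        if len(errors) >= 5:  # warm-up before the filtering layer engages
            mean = sum(errors) / len(errors)
            var = sum((e - mean) ** 2 for e in errors) / len(errors)
            if abs(err - mean) > k * max(var ** 0.5, 1e-9):
                anomalies.append(i)
        errors.append(err)
        forecast = alpha * series[i] + (1 - alpha) * forecast  # EWMA update
    return anomalies

signal = [10.0] * 30
signal[20] = 50.0                 # injected spike
print(detect_anomalies(signal))   # -> [20]
```

The filtering layer is what keeps the false-positive count down: a raw deviation is not reported unless it is extreme relative to the model's own error history.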
Many applications in different domains produce large volumes of time-series data, and accurate forecasting is critical for many decision makers. Various time-series forecasting methods exist that use linear models, nonlinear models, or a combination of the two. Studies show that combining linear and nonlinear models can effectively improve forecasting performance. However, some assumptions made by these existing methods may limit their performance in certain situations. We provide a new Autoregressive Integrated Moving Average (ARIMA)-Artificial Neural Network (ANN) hybrid method that works in a more general framework. Experimental results show that the strategies for decomposing the original data and for combining linear and nonlinear models throughout the hybridization process are key factors in the performance of a forecasting method. By using appropriate strategies, our hybrid method can be an effective way to improve the forecasting accuracy obtained by traditional hybrid methods, and it can also outperform either of the individual methods used in isolation.
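The linear-plus-nonlinear decomposition described above can be sketched as follows: a least-squares AR(1) fit plays the linear part, and a crude sign-conditioned residual average stands in for the ANN component on the residuals. Every name and parameter here is an illustrative assumption, not the paper's actual method.

```python
def fit_ar1(x):
    """Least-squares AR(1) coefficient for the linear component."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(a * a for a in x[:-1])
    return num / den

def hybrid_forecast(x):
    phi = fit_ar1(x)
    resid = [a - phi * b for a, b in zip(x[1:], x[:-1])]
    last = resid[-1]
    # "nonlinear" component: mean of residuals that followed a residual of
    # the same sign as the latest one (a crude stand-in for the ANN stage)
    same_sign = [r for p, r in zip(resid[:-1], resid[1:])
                 if (p >= 0) == (last >= 0)]
    correction = sum(same_sign) / len(same_sign) if same_sign else 0.0
    return phi * x[-1] + correction

x = [1.0, 0.9, 1.1, 0.95, 1.05, 0.98, 1.02]
print(round(hybrid_forecast(x), 3))
```

The point of the structure is that the final forecast is the sum of a linear model's prediction and a second model's prediction of the linear model's error.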
Determining anomalies in data streams that are collected and transformed from various types of networks has recently attracted significant research interest. Principal Component Analysis (PCA) is arguably the most widely applied unsupervised anomaly detection technique for networked data streams due to its simplicity and efficiency. However, none of the existing PCA-based approaches addresses the problem of identifying the sources that contribute most to the observed anomaly, i.e., anomaly localization. In this paper, we first propose a novel joint sparse PCA method to perform anomaly detection and localization for network data streams. Our key observation is that we can detect anomalies and localize anomalous sources by identifying a low-dimensional abnormal subspace that captures the abnormal behavior of the data. To better capture the sources of anomalies, we incorporate the structure of the network stream data in our anomaly localization framework. Also, an extended version of PCA, multi-dimensional KLE, is introduced to stabilize the localization performance. We perform comprehensive experimental studies on four real-world data sets from different application domains and compare our proposed techniques with several state-of-the-art methods. Our experimental studies demonstrate the utility of the proposed methods.
Anomalies are unusual and significant changes in a network's traffic levels, which can often involve multiple links. Diagnosing anomalies is critical for both network operators and end users. It is a difficult problem because one must extract and interpret anomalous patterns from large amounts of high-dimensional, noisy data. In this paper we propose a general method to diagnose anomalies. This method is based on a separation of the high-dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. We show that this separation can be performed effectively using Principal Component Analysis. Using only simple traffic measurements from links, we study volume anomalies and show that the method can: (1) accurately detect when a volume anomaly is occurring; (2) correctly identify the underlying origin-destination (OD) flow which is the source of the anomaly; and (3) accurately estimate the amount of traffic involved in the anomalous OD flow. We evaluate the method's ability to diagnose (i.e., detect, identify, and quantify) both existing and synthetically injected volume anomalies in real traffic from two backbone networks. Our method consistently diagnoses the largest volume anomalies, and does so with a very low false alarm rate.
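A minimal sketch of the subspace separation idea above: remove the principal component of the measurements, and score each point by its squared prediction error (SPE) in the residual ("abnormal") subspace. This toy works on 2-D data with a hand-rolled power iteration; it is not the paper's OD-flow pipeline, and all names are illustrative.

```python
def pca_residual_scores(train, points):
    """Fit a 1-component PCA on 2-D training data; return the squared
    prediction error (residual-subspace energy) of each query point."""
    n = len(train)
    mx = sum(p[0] for p in train) / n
    my = sum(p[1] for p in train) / n
    centered = [(p[0] - mx, p[1] - my) for p in train]
    # 2x2 covariance matrix
    sxx = sum(x * x for x, _ in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    syy = sum(y * y for _, y in centered) / n
    # leading eigenvector by power iteration
    ux, uy = 1.0, 0.0
    for _ in range(50):
        vx = sxx * ux + sxy * uy
        vy = sxy * ux + syy * uy
        norm = (vx * vx + vy * vy) ** 0.5
        ux, uy = vx / norm, vy / norm
    scores = []
    for p in points:
        cx, cy = p[0] - mx, p[1] - my
        t = cx * ux + cy * uy              # component in the normal subspace
        rx, ry = cx - t * ux, cy - t * uy  # component in the residual subspace
        scores.append(rx * rx + ry * ry)   # SPE
    return scores

train = [(i, 2 * i) for i in range(1, 11)]   # normal behavior: y tracks 2x
spe = pca_residual_scores(train, [(3, 6), (5, 30)])
print(spe)  # the second (anomalous) point has a far larger SPE
```

A point that follows the dominant correlation structure has near-zero SPE; a volume anomaly breaks the correlation and leaks energy into the residual subspace.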
In this paper, we introduce SPIRIT (Streaming Pattern dIscoveRy in multIple Time-series). Given n numerical data streams, all of whose values we observe at each time tick t, SPIRIT can incrementally find correlations and hidden variables, which summarise the key trends in the entire stream collection. It can do this quickly, with no buffering of stream values and without comparing pairs of streams. Moreover, it is anytime, single-pass, and it dynamically detects changes. The discovered trends can also be used to immediately spot potential anomalies, to do efficient forecasting and, more generally, to dramatically simplify further data processing. Our experimental evaluation and case studies show that SPIRIT can incrementally capture correlations and discover trends, efficiently and effectively.
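A one-hidden-variable sketch of the incremental tracking idea, using a simplified PAST-style update; the actual SPIRIT algorithm adapts the number of hidden variables, uses forgetting, and scales to many streams. Function names and constants here are assumptions for illustration.

```python
def spirit_track(streams, lam=0.98):
    """streams: sequence of n-dimensional ticks. Incrementally track one
    hidden variable (projection onto an adaptively updated weight vector)
    with a single pass and no buffering."""
    n = len(streams[0])
    w = [1.0] + [0.0] * (n - 1)   # initial weight vector
    d = 0.01                      # energy accumulator
    hidden = []
    for x in streams:
        y = sum(wi * xi for wi, xi in zip(w, x))    # hidden variable value
        d = lam * d + y * y
        e = [xi - y * wi for wi, xi in zip(w, x)]   # reconstruction error
        w = [wi + (y / d) * ei for wi, ei in zip(w, e)]
        norm = sum(wi * wi for wi in w) ** 0.5
        w = [wi / norm for wi in w]                 # keep w a unit vector
        hidden.append(y)
    return hidden, w

# two perfectly correlated streams collapse onto a single hidden variable
ticks = [(t, 2.0 * t) for t in range(1, 50)]
hidden, w = spirit_track(ticks)
print(w)  # weight vector converges towards the direction of (1, 2)
```

Each tick costs O(n) with no pairwise stream comparisons, which is what makes the approach anytime and single-pass.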
Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time-series data, that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time-series anomaly detection methods for unphased data, and show that PCAD's reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
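The centroid-based scoring idea can be sketched in a few lines: cluster the data with k-means, then rank each item by its distance to the nearest centroid. PCAD additionally phase-aligns light-curves inside the distance computation and modifies k-means accordingly; that step is omitted here, and this 1-D toy is purely illustrative.

```python
def kmeans(points, centroids, iters=20):
    """Plain 1-D k-means with fixed initial centroids."""
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda i: (p - centroids[i]) ** 2)
            groups[j].append(p)
        centroids = [sum(g) / len(g) if g else c
                     for g, c in zip(groups, centroids)]
    return centroids

def anomaly_scores(points, centroids):
    """Score = distance to the nearest centroid."""
    return [min(abs(p - c) for c in centroids) for p in points]

data = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9, 12.0]   # 12.0 is the outlier
cents = kmeans(data, centroids=[0.0, 6.0])
scores = anomaly_scores(data, cents)
print(max(range(len(data)), key=lambda i: scores[i]))  # index of 12.0
```

Sorting by this score yields the ranked anomaly list the abstract describes: items far from every centroid are global anomalies.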
Owing to the sheer volume of data, the demand for autonomous and generic anomaly detection systems is increasing. However, developing a stand-alone generic anomaly detection system that is both accurate and fast remains a challenge. In this paper, we propose conventional time-series analysis approaches, the Seasonal Autoregressive Integrated Moving Average (SARIMA) model and Seasonal-Trend decomposition using Loess (STL), to detect complex and diverse anomalies. Usually, SARIMA and STL are used only for stationary and periodic time series, but we show that, by combining them, they can detect anomalies with high accuracy even on noisy and non-periodic data. We compared the algorithm with Long Short-Term Memory (LSTM), a deep-learning-based algorithm used in anomaly detection systems. We used a total of seven real data sets and four artificial data sets with different time-series properties to verify the performance of the proposed algorithm.
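The decomposition-then-threshold idea above can be sketched with a deliberately crude decomposition: per-phase means stand in for STL's seasonal component, and residuals far from typical are flagged. The paper combines STL with a SARIMA forecast instead; names and the threshold `k` here are illustrative assumptions.

```python
def seasonal_anomalies(series, period, k=3.0):
    """Remove a naive seasonal component (per-phase means) and flag
    residuals more than k standard deviations from the residual mean."""
    seasonal = [0.0] * period
    counts = [0] * period
    for i, x in enumerate(series):
        seasonal[i % period] += x
        counts[i % period] += 1
    seasonal = [s / c for s, c in zip(seasonal, counts)]
    resid = [x - seasonal[i % period] for i, x in enumerate(series)]
    mean = sum(resid) / len(resid)
    std = (sum((r - mean) ** 2 for r in resid) / len(resid)) ** 0.5
    return [i for i, r in enumerate(resid) if abs(r - mean) > k * std]

# ten repetitions of a period-4 pattern with one corrupted value
series = [0.0, 1.0, 2.0, 1.0] * 10
series[17] = 9.0
print(seasonal_anomalies(series, period=4))  # -> [17]
```

Subtracting the seasonal structure first is what lets a simple threshold work: the corrupted value is unremarkable in absolute terms but extreme as a residual.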
With the advent of big data, databases containing large numbers of similar time series are available in many applications today. Forecasting the time series in these domains with traditional univariate forecasting procedures leaves great potential for accurate forecasting untapped. Recurrent neural networks (RNNs), and in particular Long Short-Term Memory (LSTM) networks, have demonstrated that they are able to outperform state-of-the-art univariate time-series forecasting methods in this context when trained across all available time series. However, if the time-series database is heterogeneous, accuracy may degrade, so that on the way towards fully automatic forecasting methods in this space, a notion of similarity between the time series needs to be built into the methods. To this end, we present a prediction model that can be used with different types of RNN models on subgroups of similar time series, which are identified by time-series clustering techniques. We assess our proposed methodology using the widely popular RNN variant, the LSTM network. Our method achieves competitive results on benchmark data sets under competition evaluation procedures. In particular, in terms of mean sMAPE accuracy, it consistently outperforms the baseline LSTM model, and it outperforms all other methods on the CIF2016 forecasting competition data set.
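The subgrouping idea can be sketched minimally: partition the series by a crude similarity feature (here, mean level) and fit one forecaster per subgroup, with a trivial average-step model standing in for the paper's per-cluster LSTMs. All names and the threshold are hypothetical.

```python
def cluster_by_mean(series_list, threshold):
    """Two-way split of series by mean level (a stand-in for proper
    time-series clustering)."""
    low, high = [], []
    for s in series_list:
        (low if sum(s) / len(s) < threshold else high).append(s)
    return low, high

def avg_step(group):
    """Per-subgroup model: the average one-step change across the group."""
    steps = [s[i + 1] - s[i] for s in group for i in range(len(s) - 1)]
    return sum(steps) / len(steps)

def forecast(series, group_step):
    return series[-1] + group_step

rising = [[1, 2, 3], [2, 3, 4]]
flat = [[10, 10, 10], [11, 11, 11]]
low, high = cluster_by_mean(rising + flat, threshold=6)
print(forecast([3, 4, 5], avg_step(low)))  # next value from the rising group's model
```

The design point is that a new series is forecast by the model of the subgroup it resembles, rather than by one global model trained on a heterogeneous pool.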
In Gardner (1985), I reviewed the research in exponential smoothing since the original work by Brown and Holt. This paper brings the state of the art up to date. The most important theoretical advance is the invention of a complete statistical rationale for exponential smoothing based on a new class of state-space models with a single source of error. The most important practical advance is the development of a robust method for smoothing damped multiplicative trends. We also have a new adaptive method for simple smoothing, the first such method to demonstrate credible improved forecast accuracy over fixed-parameter smoothing. Longstanding confusion in the literature about whether and how to renormalize seasonal indices in the Holt-Winters methods has finally been resolved. There has been significant work in forecasting for inventory control, including the development of new prediction distributions for total lead-time demand and several improved versions of Croston's method for forecasting intermittent time series. Regrettably, there has been little progress in the identification and selection of exponential smoothing methods. The research in this area is best described as inconclusive, and it is still difficult to beat the application of a damped trend to every time series.
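The damped multiplicative/additive trend idea the review highlights can be written out directly from the standard recursions. Parameter values below are illustrative, not fitted, and initialization is a common heuristic rather than a prescribed choice.

```python
def damped_holt_forecast(series, alpha=0.5, beta=0.3, phi=0.9, h=3):
    """Additive damped-trend exponential smoothing, h-step-ahead forecast.
    level_t = alpha*x_t + (1-alpha)*(level_{t-1} + phi*trend_{t-1})
    trend_t = beta*(level_t - level_{t-1}) + (1-beta)*phi*trend_{t-1}
    forecast = level_T + (phi + phi^2 + ... + phi^h) * trend_T
    """
    level, trend = series[0], series[1] - series[0]  # heuristic init
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (prev_level + phi * trend)
        trend = beta * (level - prev_level) + (1 - beta) * phi * trend
    damp = sum(phi ** i for i in range(1, h + 1))
    return level + damp * trend

print(round(damped_holt_forecast([10, 12, 14, 16, 18]), 2))
```

With phi < 1 the projected trend flattens out at long horizons, which is exactly the robustness property the review credits for the method's hard-to-beat performance.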
Providing long-range forecasts is a fundamental challenge in time-series modeling, and the challenge is compounded when such forecasts must be formed for a time series that has never been observed before. The latter challenge is the time-series version of the cold-start problem seen in recommender systems, which, to the best of our knowledge, has not been addressed in prior work. A similar problem arises when long-range forecasts are needed after observing only a small number of time points: warm-start forecasting. With these goals in mind, we focus on forecasting seasonal profiles, or baseline demand, over a one-year horizon in three settings: the long-range setting with multiple previously observed seasonal profiles, the cold-start setting with no previously observed seasonal profiles, and the warm-start setting with only a partially observed profile. Classical time-series methods perform iterated one-step-ahead forecasts based on previous observations to provide accurate long-range forecasts; in settings with little observed data, such approaches are simply not applicable. Instead, we present a simple framework that combines ideas from high-dimensional regression and matrix factorization on a carefully constructed data matrix. Key to our formulation and resulting performance is leveraging (1) repeated patterns over fixed periods of time and across series, and (2) metadata associated with the individual series; without this additional data, the cold-start/warm-start problems are nearly unsolvable. We demonstrate that our framework can accurately forecast an array of seasonal profiles on multiple large-scale data sets.
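The warm-start setting can be illustrated with a deliberately simple borrowing scheme: given only the first few points of a new seasonal profile, reuse the remainder of the historical profile whose prefix best matches. The actual framework uses high-dimensional regression plus matrix factorization and series metadata; this nearest-profile stand-in and its names are assumptions.

```python
def warm_start_forecast(history, prefix):
    """Complete a partially observed seasonal profile by borrowing the
    tail of the historical profile closest to the observed prefix."""
    k = len(prefix)
    def dist(row):
        return sum((a - b) ** 2 for a, b in zip(row[:k], prefix))
    best = min(history, key=dist)
    return prefix + best[k:]

history = [[0, 1, 2, 3, 2, 1],     # profile A
           [5, 5, 6, 7, 6, 5]]     # profile B
print(warm_start_forecast(history, [5.1, 4.9]))
```

Even this toy makes the abstract's point concrete: with a couple of observed points plus repeated cross-series structure, a full-year profile becomes forecastable; with neither, it would not be.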
Nowadays, multivariate time-series data are increasingly collected in various real-world systems, e.g., power plants, wearable devices, etc. Anomaly detection and diagnosis in multivariate time series refer to identifying abnormal status at certain time steps and pinpointing the root causes. Building such a system, however, is challenging, since it not only requires capturing the temporal dependency within each time series, but also encoding the correlations between different pairs of time series. In addition, the system should be robust to noise and provide operators with different levels of anomaly scores based on the severity of different incidents. Although many unsupervised anomaly detection algorithms have been developed, few of them can jointly address these challenges. In this paper, we propose a Multi-Scale Convolutional Recurrent Encoder-Decoder (MSCRED) to perform anomaly detection and diagnosis in multivariate time-series data. Specifically, MSCRED first constructs multi-scale (resolution) signature matrices to characterize multiple levels of the system status at different time steps. Subsequently, given the signature matrices, a convolutional encoder is employed to encode the inter-sensor (time-series) correlations, and an attention-based convolutional long short-term memory (ConvLSTM) network is developed to capture the temporal patterns. Finally, based on the feature maps which encode the inter-sensor correlations and temporal information, a convolutional decoder is used to reconstruct the input signature matrices, and the residual signature matrices are further utilized to detect and diagnose anomalies. Extensive empirical studies on a synthetic dataset and a real power plant dataset demonstrate that MSCRED can outperform state-of-the-art baseline methods.
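The first step of the pipeline above, constructing a signature matrix, is simple enough to sketch: for a window, it is the matrix of pairwise inner products between the sensor segments, scaled by the window length. The encoder/decoder stages are omitted; this is a minimal illustration, not the paper's code.

```python
def signature_matrix(window):
    """window: list of sensor segments of equal length w.
    Entry (i, j) is the length-normalized inner product of segments i, j,
    capturing the pairwise inter-sensor correlation within the window."""
    w = len(window[0])
    n = len(window)
    return [[sum(a * b for a, b in zip(window[i], window[j])) / w
             for j in range(n)] for i in range(n)]

win = [[1.0, 2.0, 3.0],      # sensor 1
       [2.0, 4.0, 6.0],      # sensor 2, perfectly correlated with sensor 1
       [1.0, 0.0, -1.0]]     # sensor 3
sig = signature_matrix(win)
print(sig)
```

Computing these matrices at several window lengths yields the "multi-scale" stack; an anomaly then shows up as a reconstruction residual concentrated in the rows/columns of the offending sensors, which is what enables root-cause diagnosis.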
An Intrusion Detection System (IDS) is generally based on an Anomaly Detection System (ADS), or on a combination of anomaly detection and signature-based methods, gathering and analyzing observations and reporting possible suspicious cases to system administrators or other users for further investigation. One of the notorious challenges that even state-of-the-art ADS and IDS have not yet overcome is the possibility of a very high false alarm rate. Especially in very large and complex system settings, the number of low-level alerts can easily overwhelm administrators and increase their tendency to ignore alerts. We can group existing false-alarm mitigation strategies into two main families: the first group covers methods directly customized for, and applied to, an ADS to produce higher-quality anomaly scores; the second group includes methods used in related contexts as filtering steps to reduce the likelihood of false alarms. Given the lack of a comprehensive study on possible approaches to controlling the false alarm rate, in this paper we review the existing false-alarm mitigation techniques for ADS and present the advantages and disadvantages of each. We also study promising techniques applied to signature-based IDS and other related contexts, such as commercial Security Information and Event Management (SIEM) tools, that are applicable and generalizable to the ADS context. Finally, we summarize some directions for future research.
We propose an algorithm to impute and forecast a time series by transforming the observed time series into a matrix, utilizing matrix estimation to recover missing values and de-noise observed entries, and performing linear regression to make predictions. At the core of our analysis is a representation result, which states that for a large class of models, the transformed time-series matrix obtained by our algorithm is (approximately) low-rank. This, in effect, generalizes the widely used Singular Spectrum Analysis (SSA) from the time-series literature and allows us to establish a rigorous link between time-series analysis and matrix estimation. The key is to construct a matrix with non-overlapping entries rather than the Hankel matrix used in the literature, including in SSA. We provide finite-sample analysis for imputation and prediction, leading to the asymptotic consistency of our method. A salient feature of our algorithm is that it is model-agnostic with respect to both the underlying time dynamics and the noise model in the observations. Being noise-agnostic makes the algorithm applicable to settings with hidden state, where we only have access to noisy observations of a hidden Markov model: for example, observing a Poisson process with time-varying parameters without knowing that the process is Poisson, yet still recovering the time-varying parameters accurately. As part of the forecasting algorithm, an important task is to perform regression with noisy observations of the features, i.e., error-in-variables regression. In essence, our method suggests a matrix-estimation-based approach for such settings, which is of interest in its own right. Through synthetic and real-world data sets, we demonstrate that our algorithm outperforms standard software packages (including R libraries) in the presence of missing data as well as high levels of noise.
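The paper's key construction is easy to show side by side with the SSA alternative: fold the series into a matrix of non-overlapping columns (often called a Page matrix) rather than the overlapping Hankel matrix; matrix estimation is then applied to the result. A construction-only sketch, with illustrative names:

```python
def page_matrix(series, rows):
    """Fold the series column-by-column into non-overlapping blocks."""
    cols = len(series) // rows
    return [[series[c * rows + r] for c in range(cols)] for r in range(rows)]

def hankel_matrix(series, rows):
    """The SSA construction: each column is a sliding (overlapping) window."""
    cols = len(series) - rows + 1
    return [[series[c + r] for c in range(cols)] for r in range(rows)]

x = [1, 2, 3, 4, 5, 6]
print(page_matrix(x, 2))    # columns [1,2], [3,4], [5,6] -- no overlap
print(hankel_matrix(x, 2))  # columns [1,2], [2,3], ...   -- overlapping
```

Because each observation appears in exactly one entry of the Page matrix, independent observation noise stays independent across entries, which is what lets off-the-shelf matrix-estimation guarantees carry over.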
In this study we consider the problem of outlier detection with multiple co-evolving time series data. To capture both the temporal dependence and the inter-series relatedness, a multi-task non-parametric model is proposed, which can be extended to data with a broader exponential family distribution by adopting the notion of Bregman divergence. Albeit convex, the learning problem can become hard as the time series accumulate. In this regard, an efficient randomized block coordinate descent (RBCD) algorithm is proposed. The model and the algorithm are tested with a real-world application involving outlier detection and event analysis in power distribution networks with high-resolution multi-stream measurements. It is shown that incorporating inter-series relatedness enables the detection of system-level events that would otherwise be unobservable with traditional methods.
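Randomized block coordinate descent itself can be sketched on a small least-squares problem to illustrate the optimization style the paper adopts; the paper's actual objective is a multi-task non-parametric model, not plain least squares, and the names and step size below are assumptions.

```python
import random

def rbcd_least_squares(A, b, iters=500, step=0.1, seed=0):
    """Minimize ||Ax - b||^2 by updating one randomly chosen coordinate
    block (here a single coordinate) per iteration."""
    rng = random.Random(seed)
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        j = rng.randrange(n)                     # pick a random block
        # residual r = Ax - b, then partial derivative wrt x_j: 2 * A_j^T r
        r = [sum(Ai[k] * x[k] for k in range(n)) - bi
             for Ai, bi in zip(A, b)]
        g = 2.0 * sum(Ai[j] * ri for Ai, ri in zip(A, r))
        x[j] -= step * g                          # gradient step on block j
    return x

A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, 4.0, 3.0]                               # consistent with x = (1, 2)
x = rbcd_least_squares(A, b)
print([round(v, 3) for v in x])
```

Updating one block at a time keeps the per-iteration cost small even as the number of series (and hence parameters) accumulates, which is the practical motivation for RBCD here.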
The Hodrick-Prescott (HP) filter is one of the most widely used econometric methods in applied macroeconomic research. The technique is nonparametric and separates a time series into trend and cyclical components without the aid of economic theory or prior trend specification. Like all nonparametric methods, the HP filter depends critically on a tuning parameter that controls the degree of smoothing. Yet, in contrast to modern nonparametric methods and applied work with such procedures, empirical practice with the HP filter almost universally relies on standard settings for the tuning parameter that have been suggested largely by experimentation with macroeconomic data and heuristic reasoning about the form of economic cycles and trends. As recent research has shown, standard settings may not be adequate for removing trends, particularly stochastic trends, in economic data. This paper proposes an easy-to-implement practical procedure of iterating the HP smoother that is designed to make the filter a smarter smoothing device for trend estimation and trend elimination. We call this iterated HP technique the boosted HP filter, in view of its connection to L2-boosting in machine learning. Limit theory is developed to show that the boosted HP filter asymptotically recovers trend mechanisms that involve unit-root processes, deterministic polynomial drifts, and polynomial drifts with structural breaks, which are the most common trends that appear in macroeconomic data and current modeling methodology. A stopping criterion is used to automate the iterated HP algorithm, making it a data-determined method that is ready for modern data-rich environments in economic research. The methodology is illustrated using three real data examples that highlight the differences between simple HP filtering, the data-determined boosted filter, and an alternative autoregressive approach.
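The boosting loop itself is simple to sketch: repeatedly apply a linear smoother to the current residual and add the result to the running trend estimate. A centered moving average stands in below for the actual HP smoother (which requires solving a pentadiagonal system) to keep the example dependency-free; all names are illustrative.

```python
def smooth(x, half=2):
    """Centered moving average, shrinking the window at the boundaries."""
    n = len(x)
    return [sum(x[max(0, i - half):min(n, i + half + 1)]) /
            (min(n, i + half + 1) - max(0, i - half)) for i in range(n)]

def boosted_trend(series, iterations=5):
    """L2-boosting loop: trend += smooth(residual), residual recomputed."""
    trend = [0.0] * len(series)
    resid = list(series)
    for _ in range(iterations):
        step = smooth(resid)
        trend = [t + s for t, s in zip(trend, step)]
        resid = [x - t for x, t in zip(series, trend)]
    return trend

series = [float(i) for i in range(40)]   # pure linear trend
trend5 = boosted_trend(series, iterations=5)
trend1 = boosted_trend(series, iterations=1)
# boundary bias of the one-pass smoother shrinks under boosting
print(abs(series[0] - trend1[0]), abs(series[0] - trend5[0]))
```

A single smoothing pass under-fits the trend (visible here at the boundary); iterating transfers the leftover trend from the residual back into the estimate, which is the sense in which boosting makes the filter "smarter". The paper's stopping criterion decides when to halt the loop.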
Identifying anomalies rapidly and accurately is critical to the efficient operation of large computer networks. Accurately characterizing important classes of anomalies greatly facilitates their identification; however, the subtleties and complexities of anomalous traffic can easily confound this process. In this paper we report results of signal analysis of four classes of network traffic anomalies: outages, flash crowds, attacks and measurement failures. Data for this study consists of IP flow and SNMP measurements collected over a six month period at the border router of a large university. Our results show that wavelet filters are quite effective at exposing the details of both ambient and anomalous traffic. Specifically, we show that a pseudo-spline filter tuned at specific aggregation levels will expose distinct characteristics of each class of anomaly. We show that an effective way of exposing anomalies is via the detection of a sharp increase in the local variance of the filtered data. We evaluate traffic anomaly signals at different points within a network based on topological distance from the anomaly source or destination. We show that anomalies can be exposed effectively even when aggregated with a large amount of additional traffic. We also compare the difference between the same traffic anomaly signals as seen in SNMP and IP flow data, and show that the more coarse-grained SNMP data can also be used to expose anomalies effectively.
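The detection step described above, flagging a sharp increase in local variance, can be shown in simplified form on a raw signal; a real deployment would first run the traffic through the wavelet filters. Window size and ratio threshold below are illustrative assumptions.

```python
def local_variance(x, w):
    """Variance of the signal over each sliding window of length w."""
    out = []
    for i in range(len(x) - w + 1):
        seg = x[i:i + w]
        m = sum(seg) / w
        out.append(sum((v - m) ** 2 for v in seg) / w)
    return out

def variance_spikes(x, w=5, ratio=10.0):
    """Flag windows whose local variance is a sharp multiple of baseline."""
    lv = local_variance(x, w)
    base = sorted(lv)[len(lv) // 2] or 1e-12   # median as the baseline
    return [i for i, v in enumerate(lv) if v > ratio * base]

signal = [1.0, 2.0] * 20   # steady low-variance traffic
signal[25] = 15.0          # anomalous burst
print(variance_spikes(signal))  # windows touching the burst
```

Using the median variance as the baseline keeps the detector robust: a single burst inflates only a handful of windows, leaving the baseline itself untouched.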
In the statistics community, outlier detection for time series data has been studied for decades. Recently, with advances in hardware and software technology, there has been a large body of work on temporal outlier detection from a computational perspective within the computer science community. In particular, advances in hardware technology have enabled the availability of various forms of temporal data collection mechanisms, and advances in software technology have enabled a variety of data management mechanisms. This has fueled the growth of different kinds of data sets such as data streams, spatio-temporal data, distributed streams, temporal networks, and time series data, generated by a multitude of applications. There arises a need for an organized and detailed study of the work done in the area of outlier detection with respect to such temporal datasets. In this survey, we provide a comprehensive and structured overview of a large set of interesting outlier definitions for various forms of temporal data, novel techniques, and application scenarios in which specific definitions and techniques have been widely used.
Data analysis has been one of the core activities in scientific research, but limited by the availability of analysis methods in the past, data analysis was often relegated to data processing. To accommodate the variety of data generated by nonlinear and nonstationary processes in nature, the analysis method would have to be adaptive. Hilbert-Huang transform, consisting of empirical mode decomposition and Hilbert spectral analysis, is a newly developed adaptive data analysis method, which has been used extensively in geophysical research. In this review, we will briefly introduce the method, list some recent developments, demonstrate the usefulness of the method, summarize some applications in various geophysical research areas, and finally, discuss the outstanding open problems. We hope this review will serve as an introduction of the method for those new to the concepts, as well as a summary of the present frontiers of its applications for experienced research scientists.
We study the use of three conventional anomaly detection methods and assess their potential for on-line tool wear monitoring. With efficient data processing and transformation of the algorithms presented here, these methods were tested in a real-time environment for the rapid assessment of cutting tools on NC machine tools. The three-dimensional force data streams we used were extracted from a turning experiment of 21 runs, in each of which a tool was run until it conventionally met an end-of-life criterion. Our real-time anomaly detection algorithms are scored and optimized according to how precisely they predict the progressive wear of the tool flank. Most of our tool wear predictions are accurate and reliable, as shown by our off-line simulation results. In particular, when multivariate analysis was applied, the algorithms we developed were found to be very robust across different scenarios and under parameter variations. It should be fairly easy to apply our methodology elsewhere for real-time tool wear analysis.
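A minimal multivariate score for force-data streams like those described above: standardize each force axis against a reference (fresh-tool) window and sum the squared z-scores, so that a rising score tracks progressing wear. This is illustrative only; the paper evaluates several conventional detectors, and the names and data here are hypothetical.

```python
def wear_scores(reference, stream):
    """Sum of squared per-axis z-scores against a fresh-tool reference."""
    dims = len(reference[0])
    means, stds = [], []
    for d in range(dims):
        vals = [p[d] for p in reference]
        m = sum(vals) / len(vals)
        s = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5 or 1.0
        means.append(m)
        stds.append(s)
    return [sum(((p[d] - means[d]) / stds[d]) ** 2 for d in range(dims))
            for p in stream]

fresh = [(10.0, 5.0, 2.0), (10.2, 5.1, 2.1), (9.8, 4.9, 1.9)]  # reference cuts
worn = [(10.1, 5.0, 2.0), (13.0, 7.0, 3.5)]  # forces grow as the tool wears
print(wear_scores(fresh, worn))
```

Combining all three axes into one score is the simplest version of the multivariate analysis that the abstract reports as the most robust variant.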
Trends in terrestrial temperature variability are perhaps more relevant for species viability than trends in mean temperature. In this paper, we develop methodology for estimating such trends using multi-resolution climate data from polar orbiting weather satellites. We derive two novel algorithms for computation that are tailored for dense, gridded observations over both space and time. We evaluate our methods with a simulation that mimics these data's features and on a large, publicly available, global temperature dataset with the eventual goal of tracking trends in cloud reflectance temperature variability.
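The quantity of interest above, a trend in variability rather than in level, can be made concrete with a toy estimator: compute rolling variances and fit an OLS slope to them over time. The paper develops multi-resolution methods tailored to dense gridded satellite data instead; this sketch and its names are illustrative.

```python
def rolling_variance(x, w):
    """Variance over each sliding window of length w."""
    out = []
    for i in range(len(x) - w + 1):
        seg = x[i:i + w]
        m = sum(seg) / w
        out.append(sum((v - m) ** 2 for v in seg) / w)
    return out

def ols_slope(y):
    """Least-squares slope of y against its index."""
    n = len(y)
    tm = (n - 1) / 2
    ym = sum(y) / n
    num = sum((t - tm) * (v - ym) for t, v in enumerate(y))
    den = sum((t - tm) ** 2 for t in range(n))
    return num / den

# variability grows over time: the oscillation amplitude increases
series = [((-1) ** t) * (1 + 0.1 * t) for t in range(40)]
slope = ols_slope(rolling_variance(series, 8))
print(slope > 0)  # positive slope: variability is trending upward
```

Here the mean of the series is roughly constant while its variance rises, exactly the kind of signal that a trend-in-mean analysis would miss.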