In recent years, we have seen a significant interest in data-driven deep learning approaches for video anomaly detection, where an algorithm must determine if specific frames of a video contain abnormal behaviors. However, video anomaly detection is particularly context-specific, and the availability of representative datasets heavily limits real-world accuracy. Additionally, the metrics currently reported by most state-of-the-art methods often do not reflect how well the model will perform in real-world scenarios. In this article, we present the Charlotte Anomaly Dataset (CHAD). CHAD is a high-resolution, multi-camera anomaly dataset in a commercial parking lot setting. In addition to frame-level anomaly labels, CHAD is the first anomaly dataset to include bounding box, identity, and pose annotations for each actor. This is especially beneficial for skeleton-based anomaly detection, which is useful for its lower computational demand in real-world settings. CHAD is also the first anomaly dataset to contain multiple views of the same scene. With four camera views and over 1.15 million frames, CHAD is the largest fully annotated anomaly detection dataset including person annotations, collected from continuous video streams from stationary cameras for smart video surveillance applications. To demonstrate the efficacy of CHAD for training and evaluation, we benchmark two state-of-the-art skeleton-based anomaly detection algorithms on CHAD and provide comprehensive analysis, including both quantitative results and qualitative examination.
The existing methods for video anomaly detection mostly utilize videos containing identifiable facial and appearance-based features. The use of videos with identifiable faces raises privacy concerns, especially when used in a hospital or community-based setting. Appearance-based features can also be sensitive to pixel-based noise, straining the anomaly detection methods to model the changes in the background and making it difficult to focus on the actions of humans in the foreground. Structural information in the form of skeletons describing the human motion in the videos is privacy-protecting and can overcome some of the problems posed by appearance-based features. In this paper, we present a survey of privacy-protecting deep learning anomaly detection methods using skeletons extracted from videos. We present a novel taxonomy of algorithms based on the various learning approaches. We conclude that skeleton-based approaches for anomaly detection can be a plausible privacy-protecting alternative for video anomaly detection. Lastly, we identify major open research questions and provide guidelines to address them.
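Multimedia anomaly datasets play a crucial role in automated surveillance. Their applications range from detecting anomalous objects and situations to detecting life-threatening events. The field has attracted substantial research interest for more than 1.5 decades, and as a result a growing number of datasets dedicated to anomalous action and object detection have been created. These public anomaly datasets let researchers build and compare anomaly detection frameworks on the same input data. This paper presents a comprehensive survey of video, audio, and audio-visual datasets, organized by anomaly detection application. The survey aims to address the lack of a comprehensive comparison and analysis of public multimedia datasets for anomaly detection. It can also help researchers select the best available dataset for benchmarking frameworks. In addition, we discuss gaps in existing datasets and offer insights into future directions for developing multimodal anomaly detection datasets.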
Surveillance videos are able to capture a variety of realistic anomalies. In this paper, we propose to learn anomalies by exploiting both normal and anomalous videos. To avoid annotating the anomalous segments or clips in training videos, which is very time consuming, we propose to learn anomalies through a deep multiple instance ranking framework that leverages weakly labeled training videos, i.e., the training labels (anomalous or normal) are at the video level instead of the clip level. In our approach, we consider normal and anomalous videos as bags and video segments as instances in multiple instance learning (MIL), and automatically learn a deep anomaly ranking model that predicts high anomaly scores for anomalous video segments. Furthermore, we introduce sparsity and temporal smoothness constraints in the ranking loss function to better localize anomalies during training. We also introduce a new large-scale, first-of-its-kind dataset of 128 hours of video. It consists of 1900 long, untrimmed real-world surveillance videos, with 13 realistic anomalies such as fighting, road accident, burglary, and robbery, as well as normal activities. This dataset can be used for two tasks: first, general anomaly detection, considering all anomalies in one group and all normal activities in another; second, recognizing each of the 13 anomalous activities. Our experimental results show that our MIL method achieves a significant improvement in anomaly detection performance compared to state-of-the-art approaches. We provide the results of several recent deep learning baselines on anomalous activity recognition. The low recognition performance of these baselines reveals that our dataset is very challenging and opens more opportunities for future work. The dataset is
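The ranking objective described in this abstract can be illustrated with a small sketch. The abstract does not give the formula, so the form below (a hinge term on the highest-scoring segment of each bag, plus a squared-difference smoothness term and a sum-of-scores sparsity term on the anomalous bag) is an assumption about one common way to write such a loss. The function name and the weight values are hypothetical.

```python
from typing import Sequence


def mil_ranking_loss(
    anomalous_scores: Sequence[float],
    normal_scores: Sequence[float],
    smooth_weight: float = 0.01,    # hypothetical weight, not from the abstract
    sparsity_weight: float = 0.01,  # hypothetical weight, not from the abstract
) -> float:
    """Bag-level ranking loss over per-segment anomaly scores in [0, 1].

    Assumed form: the top-scoring segment of the anomalous bag should
    outrank the top-scoring segment of the normal bag by a margin of 1.
    """
    # Hinge term on the maximum-scoring instance of each bag.
    hinge = max(0.0, 1.0 - max(anomalous_scores) + max(normal_scores))
    # Temporal smoothness: adjacent segments should have similar scores.
    smoothness = sum(
        (a - b) ** 2 for a, b in zip(anomalous_scores, anomalous_scores[1:])
    )
    # Sparsity: only a few segments of an anomalous video should score high.
    sparsity = sum(anomalous_scores)
    return hinge + smooth_weight * smoothness + sparsity_weight * sparsity


if __name__ == "__main__":
    print(mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.1]))
```

Only video-level labels are needed here: the loss never asks which segment of the anomalous video is the anomalous one, only that some segment outranks every segment of the normal video.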
People living with dementia often exhibit behavioural and psychological symptoms of dementia that can put their own and others' safety at risk. Existing video surveillance systems in long-term care facilities can be used to monitor such behaviours of risk and alert staff, preventing potential injuries or, in some cases, death. However, these behaviours of risk are heterogeneous and infrequent in comparison to normal events. Moreover, analyzing raw videos can raise privacy concerns. In this paper, we present two novel privacy-protecting video-based anomaly detection approaches to detect behaviours of risk in people with dementia. We either extract body pose information as skeletons or use semantic segmentation masks to replace multiple humans in the scene with their semantic boundaries. Our work differs from most existing approaches for video anomaly detection, which focus on appearance-based features that can put the privacy of a person at risk and are also susceptible to pixel-based noise, including illumination and viewing direction. We used anonymized videos of normal activities to train customized spatio-temporal convolutional autoencoders and identify behaviours of risk as anomalies. We show our results on a real-world study conducted in a dementia care unit with patients with dementia, containing approximately 21 hours of normal activities data for training and 9 hours of data containing both normal events and behaviours of risk for testing. We compared our approaches with the original RGB videos and obtained an equivalent area under the receiver operating characteristic curve performance of 0.807 for the skeleton-based approach and 0.823 for the segmentation mask-based approach. This is one of the first studies to incorporate privacy for the detection of behaviours of risk in people with dementia.
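Video anomaly detection is a core problem in vision. Correctly detecting and identifying anomalous pedestrian behavior in video data will enable safety-critical applications such as surveillance, activity monitoring, and human-robot interaction. In this paper, we propose to leverage trajectory localization and prediction for unsupervised pedestrian anomaly event detection. Unlike previous reconstruction-based approaches, our proposed framework relies on the prediction errors of normal and abnormal pedestrian trajectories to detect anomalies spatially and temporally. We present experimental results on real-world benchmark datasets at varying timescales and show that our proposed trajectory-predictor-based anomaly detection pipeline is effective and efficient at identifying anomalous pedestrian activities in videos. Code will be made available at https://github.com/akanuasiegbu/leveraging-trajectory-prediction-for-pedestrian-video-anomaly-detection.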
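To make the prediction-error idea concrete, the sketch below scores each future step of a pedestrian track by how far the observed position falls from a forecast. It is not the paper's predictor: a constant-velocity extrapolation is swapped in as a deliberately simple stand-in for a learned trajectory model, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def constant_velocity_forecast(history: Sequence[Point], horizon: int) -> List[Point]:
    """Extrapolate the last observed displacement for `horizon` steps.

    Stand-in for a learned trajectory predictor (an assumption of this sketch).
    Requires at least two observed points.
    """
    (x0, y0), (x1, y1) = history[-2], history[-1]
    vx, vy = x1 - x0, y1 - y0
    return [(x1 + vx * k, y1 + vy * k) for k in range(1, horizon + 1)]


def prediction_error_scores(history: Sequence[Point], future: Sequence[Point]) -> List[float]:
    """Per-step Euclidean error between forecast and observed positions.

    Larger errors mean the motion was less predictable, which is treated
    here as more anomalous.
    """
    predicted = constant_velocity_forecast(history, len(future))
    return [math.dist(p, o) for p, o in zip(predicted, future)]


if __name__ == "__main__":
    walking = [(0.0, 0.0), (1.0, 0.0)]
    print(prediction_error_scores(walking, [(2.0, 0.0), (3.0, 0.0)]))  # steady walk
    print(prediction_error_scores(walking, [(2.0, 0.0), (3.0, 4.0)]))  # sudden swerve
```

Because the score is computed per track and per time step, a high value points to both when and where the unexpected motion occurred.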
We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable - our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech and UCSD Ped1, Ped2) and show significant improvements over the previous state-of-the-art.
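Deep learning models have been widely used for anomaly detection in surveillance videos. Typical models learn to reconstruct normal videos and use the reconstruction error on anomalous videos to indicate the degree of abnormality. However, existing approaches suffer from two drawbacks. First, they encode the movement of each identity independently, without considering the interactions among identities, which may also indicate anomalies. Second, they rely on inflexible models whose structure is fixed across scenes, which prevents any understanding of the scene. In this paper, we propose a Hierarchical Spatio-Temporal Graph Convolutional Neural Network (HSTGCNN) to address these problems. HSTGCNN is composed of multiple branches that correspond to different levels of graph representation. High-level graph representations encode the trajectories of people and the interactions among multiple identities, while low-level graph representations encode the local body posture of each person. Furthermore, we propose a weighted combination of the branches, each of which performs better in different scenes. This improves on single-level graph representations and provides an understanding of the scene that serves anomaly detection. High-level graph representations receive higher weights in low-resolution videos, where they encode the moving speed and direction of people, while low-level graph representations receive higher weights in high-resolution videos, where they encode human skeletons. Experimental results show that the proposed HSTGCNN significantly outperforms current state-of-the-art models on four benchmark datasets (UCSD Pedestrian, ShanghaiTech, CUHK Avenue, and IITB-Corridor).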
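Detecting abnormal events in video is commonly framed as a one-class classification task, where training videos contain only normal events, while test videos contain both normal and abnormal events. In this scenario, anomaly detection is an open-set problem. However, some studies treat anomaly detection as action recognition. This is a closed-set scenario that fails to test a system's ability to detect new anomaly types. To this end, we propose UBnormal, a new supervised open-set benchmark composed of multiple virtual scenes for video anomaly detection. Unlike existing datasets, we introduce abnormal events annotated at the pixel level at training time, enabling for the first time the use of fully supervised learning methods for abnormal event detection. To preserve the typical open-set formulation, we make sure to include disjoint sets of anomaly types in our training and test collections. To our knowledge, UBnormal is the first video anomaly detection benchmark to allow a fair head-to-head comparison between one-class open-set models and supervised closed-set models, as shown in our experiments. Moreover, we provide empirical evidence that UBnormal can improve the performance of a state-of-the-art anomaly detection framework on two prominent datasets, Avenue and ShanghaiTech.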
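Recent advances in computer vision have raised the prominence of applications that use neural networks to understand human poses. However, while accuracy on state-of-the-art datasets has been steadily increasing, these datasets often do not address the challenges faced in real-world applications: people far from the camera, people in crowds, and heavily occluded people. As a result, many real-world applications are trained on data that does not reflect the data seen in deployment, leading to poor performance. This article presents ADG-Pose, a method for automatically generating datasets for real-world human pose estimation. These datasets can be customized to set the distributions of person distance, crowdedness, and occlusion. Models trained with our method are able to perform in the presence of these challenges, where models trained on other datasets fail. Using ADG-Pose, end-to-end accuracy for real-world skeleton-based action recognition increases by 20% on scenes with moderate distance and occlusion levels, and by 4x on distant scenes where other models fail to perform better than random.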
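In contemporary society, surveillance anomaly detection, i.e., spotting anomalous events such as crimes or accidents in surveillance videos, is a critical task. Because anomalies occur rarely, most training data consists of unlabeled videos without anomalous events, which makes the task challenging. Most existing methods use an autoencoder (AE) to learn to reconstruct normal videos; they then detect anomalies based on the failure to reconstruct the appearance of abnormal scenes. However, because anomalies are distinguished by both appearance and motion, many previous approaches have explicitly separated appearance and motion information, for example by using a pre-trained optical flow model. This explicit separation restricts the reciprocal representation capability between the two types of information. In contrast, we propose an implicit two-path AE (ITAE), a structure in which two encoders implicitly model appearance and motion features and a single decoder combines them to learn normal video patterns. To handle the complex distribution of normal scenes, we propose density estimation of ITAE features with normalizing flow (NF)-based generative models, which learn tractable likelihoods and identify anomalies through out-of-distribution detection. The NF models strengthen ITAE performance by learning normality from the implicitly learned features. Finally, we demonstrate the effectiveness of ITAE and its feature distribution modeling on six benchmarks, including databases that contain a variety of anomalies in real-world scenarios.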
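Automatic traffic accident detection has attracted the machine vision community because of its implications for the development of autonomous intelligent transportation systems (ITS) and its importance to traffic safety. However, most previous studies on the efficient analysis and prediction of traffic accidents have used small-scale datasets with limited coverage, which limits their effectiveness and applicability. Existing traffic accident datasets are either small-scale, not from surveillance cameras, not open source, or not built for freeway scenes. Because accidents on freeways tend to cause serious damage and unfold too quickly to be caught on the spot, an open-source dataset of freeway traffic accidents collected from surveillance cameras is greatly needed and of practical importance. To help the vision community address these shortcomings, we collect video data of real traffic accidents covering a rich set of scenes. After integration and annotation along several dimensions, we propose a large-scale traffic accident dataset named TAD. We conduct experiments on image classification, object detection, and video classification tasks using mainstream public vision algorithms and frameworks to demonstrate the performance of different methods. The proposed dataset, together with the experimental results, is presented as a new benchmark to advance computer vision research, especially in ITS.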
Video anomaly detection (VAD) is a challenging computer vision task with many practical applications. As anomalies are inherently ambiguous, it is essential for users to understand the reasoning behind a system's decision in order to determine if the rationale is sound. In this paper, we propose a simple but highly effective method that pushes the boundaries of VAD accuracy and interpretability using attribute-based representations. Our method represents every object by its velocity and pose. The anomaly scores are computed using a density-based approach. Surprisingly, we find that this simple representation is sufficient to achieve state-of-the-art performance on ShanghaiTech, the largest and most complex VAD dataset. Combining our interpretable attribute-based representations with implicit, deep representations yields state-of-the-art performance, with AUROC of $99.1\%$, $93.3\%$, and $85.9\%$ on Ped2, Avenue, and ShanghaiTech, respectively. Our method is accurate, interpretable, and easy to implement.
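A density-based score over attribute features can be sketched as follows. The abstract does not say which density estimator is used, so the k-nearest-neighbour distance below is an assumption chosen for simplicity, and the two-dimensional feature vectors stand in for whatever velocity or pose descriptor a real system would extract. The function name is hypothetical.

```python
import math
from typing import Sequence


def knn_anomaly_score(
    query: Sequence[float],
    normal_features: Sequence[Sequence[float]],
    k: int = 1,
) -> float:
    """Mean Euclidean distance from `query` to its k nearest normal exemplars.

    A large distance means the query lies in a low-density region of the
    normal training features (k-NN is an assumed density proxy here).
    """
    distances = sorted(math.dist(query, f) for f in normal_features)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    # Toy "velocity" features collected from normal training objects.
    normal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    print(knn_anomaly_score((0.1, 0.1), normal))  # close to normal data
    print(knn_anomaly_score((3.0, 4.0), normal))  # far from normal data
```

Scoring each attribute separately keeps the result interpretable: a high velocity score and a low pose score suggest the object moved unusually fast while holding an ordinary posture.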
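Detecting dangerous traffic agents in videos captured by vehicle-mounted dashboard cameras (dashcams) is essential for safe navigation in complex environments. Accident-related videos make up only a small portion of driving video data, and the transient pre-accident process is highly dynamic and complex. In addition, risky and non-risky traffic agents can look similar. These factors make risky object localization in driving videos particularly challenging. To this end, this paper proposes an attention-guided multistream feature fusion network (AM-Net) to localize dangerous traffic agents in dashcam videos. Two Gated Recurrent Unit (GRU) networks use object bounding boxes and optical flow features extracted from consecutive video frames to capture spatio-temporal cues for distinguishing dangerous traffic agents. An attention module coupled with the GRUs learns to attend to the traffic agents relevant to an accident. Fusing the two feature streams, AM-Net predicts the riskiness scores of traffic agents in the video. In support of this study, the paper also introduces a benchmark dataset called Risky Object Localization (ROL). The dataset contains spatial, temporal, and categorical annotations with accident-, object-, and scene-level attributes. The proposed AM-Net achieves a promising performance of 85.73% AUC on the ROL dataset. AM-Net also outperforms the current state of the art for video anomaly detection on the DoTA dataset. A thorough ablation study further reveals the merits of AM-Net by evaluating the contributions of its different components.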
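Unmanned aerial vehicles (UAVs) are widely used for inspection, search, and rescue operations because of their low cost, large coverage, and real-time, high-resolution data acquisition. These operations produce massive volumes of aerial video in which normal events usually account for an overwhelming proportion. Manually localizing and extracting abnormal events, which contain potentially valuable information, from long video streams is extremely difficult. We are therefore dedicated to developing anomaly detection methods to solve this problem. In this paper, we create a new dataset, named Drone-Anomaly, for anomaly detection in aerial videos. The dataset provides 37 training video sequences and 22 testing video sequences from 7 different realistic scenes with various anomalous events. There are 87,488 color video frames (51,635 for training and 35,853 for testing) of size $640 \times 640$ at 30 frames per second. Based on this dataset, we evaluate existing methods and provide a benchmark for this task. Furthermore, we present a new baseline model, ANomaly Detection with Transformers (ANDT), which treats consecutive video frames as a sequence of tubelets, uses a Transformer encoder to learn feature representations from the sequence, and uses a decoder to predict the next frame. Our network models normality in the training phase and identifies events with unpredictable temporal dynamics as anomalies in the test phase. Moreover, to comprehensively evaluate the performance of our proposed method, we use not only our Drone-Anomaly dataset but also another dataset. We will make our dataset and code publicly available. A demo video is available at https://youtu.be/ancczyryoby.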
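Yoga is a globally acclaimed and widely recommended practice for healthy living. Maintaining correct posture while performing yoga is of utmost importance. In this work, we employ transfer learning from human pose estimation models to extract 136 key-points spread over the whole body and train a random forest classifier that estimates the yoga pose. The results are evaluated on an extensive in-house yoga video database of 51 subjects recorded from 4 different camera angles. We propose a three-step scheme for evaluating the generalizability of a yoga classifier by testing it on 1) unseen frames, 2) unseen subjects, and 3) unseen camera angles. We argue that for most applications, validation accuracy on unseen subjects and unseen camera angles is the most important. We empirically analyze, over three public datasets, the advantage of transfer learning and the possibility of target leakage. We further demonstrate that classification accuracy depends critically on the cross-validation method employed and can often be misleading. To promote further research, we have made the key-points dataset and code publicly available.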
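The difference between testing on unseen frames and testing on unseen subjects or cameras comes down to how the data is split. The following sketch holds out whole groups so that no subject (or camera) appears on both sides of the split. The record layout and function name are hypothetical and not taken from the paper's code.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical record layout: (subject_id, camera_id, pose_label).
Sample = Tuple[str, str, str]


def split_by_group(
    samples: Sequence[Sample],
    group_index: int,
    held_out: Set[str],
) -> Tuple[List[Sample], List[Sample]]:
    """Hold out every sample whose group value is in `held_out`.

    group_index=0 gives an unseen-subject split and group_index=1 an
    unseen-camera split. A random per-frame split would instead leave
    near-identical frames of the same person in both train and test.
    """
    train = [s for s in samples if s[group_index] not in held_out]
    test = [s for s in samples if s[group_index] in held_out]
    return train, test


if __name__ == "__main__":
    data = [
        ("s1", "camA", "tree"), ("s1", "camB", "tree"),
        ("s2", "camA", "cobra"), ("s2", "camB", "cobra"),
        ("s3", "camA", "tree"), ("s3", "camB", "cobra"),
    ]
    train, test = split_by_group(data, group_index=0, held_out={"s3"})
    print(len(train), len(test))
```

Reporting accuracy under each split side by side is one way to expose the gap the abstract describes between frame-level and subject-level or camera-level validation.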
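Video anomaly detection is currently one of the hot research topics in computer vision, because abnormal events carry a large amount of information. Anomalies are among the main detection targets in surveillance systems and usually require real-time action. Given the limited availability of labeled training data (i.e., there is not enough labeled data for abnormalities), semi-supervised anomaly detection approaches have recently gained interest. This paper offers researchers in the field a new perspective and reviews recent deep-learning-based semi-supervised video anomaly detection approaches, grouped by the common strategy they use for anomaly detection. Our goal is to help researchers develop more effective video anomaly detection methods. Because choosing the right deep neural network (DNN) plays an important role in several parts of this task, we first present a brief comparative review of DNNs. Unlike previous surveys, DNNs are reviewed from the viewpoint of spatiotemporal feature extraction for video anomaly detection. This part of the review can help researchers in the field select suitable networks for the different parts of their methods. In addition, several state-of-the-art anomaly detection methods are critically surveyed according to their detection strategy. The review provides a novel, in-depth look at existing methods and identifies their shortcomings, which can serve as hints for future work.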
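In this paper, an LSTM autoencoder-based architecture is used for drowsiness detection, with ResNet-34 as the feature extractor. The problem is treated as anomaly detection for a single subject. Accordingly, only normal driving representations are learned, and drowsiness representations are expected to be distinguished by the higher reconstruction losses they produce. In our study, the confidence levels of normal and anomalous clips are investigated through a label assignment methodology, so that the training performance of the LSTM autoencoder and the interpretation of anomalies encountered during testing can be analyzed under varying confidence rates. Our method is evaluated on NTHU-DDD and benchmarked against a state-of-the-art anomaly detection method for driver drowsiness. Results show that the proposed model achieves a detection rate of 0.8740 area under the curve (AUC) and provides significant improvements in certain scenarios.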
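Several abstracts in this list report an area under the ROC curve, so a small worked computation may help. The sketch below uses the standard rank interpretation of AUC: the probability that a randomly chosen anomalous clip receives a higher score (for example, a higher reconstruction loss) than a randomly chosen normal clip, with ties counted as one half. The function name and the toy scores are illustrative only.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (anomalous, normal) pairs ranked correctly.

    `labels` uses 1 for anomalous and 0 for normal; ties count as 0.5.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy reconstruction losses: two normal clips, then two anomalous clips.
    losses = [0.10, 0.40, 0.35, 0.80]
    labels = [0, 0, 1, 1]
    print(roc_auc(losses, labels))  # 3 of 4 pairs ranked correctly -> 0.75
```

Because AUC depends only on the ranking of scores, it needs no detection threshold, which is one reason it is the usual headline metric for reconstruction-based detectors.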
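The aim of this work is to detect anomalous events in video and automatically generate high-level explanations for them. Understanding the cause of an anomalous event is crucial, because the required response depends on its nature and severity. Recent works typically use object or action classifiers to detect anomalous events and provide labels for them. However, this constrains detection systems to a finite set of known classes and prevents generalization to unknown objects or behaviors. Here we show how to robustly detect anomalies without using object or action classifiers while still recovering the high-level reason behind the event. We make the following contributions: (1) a method that uses saliency maps to decouple the explanation of anomalous events from object and action classifiers, (2) a demonstration of how to improve the quality of saliency maps using a novel neural architecture that learns discrete representations of video by predicting future frames, and (3) an improvement of 60\% over state-of-the-art anomaly explanation methods on a subset of the public benchmark X-MAN dataset.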
This paper presents a new large scale multi-person tracking dataset -- \texttt{PersonPath22}, which is over an order of magnitude larger than currently available high quality multi-object tracking datasets such as MOT17, HiEve, and MOT20 datasets. The lack of large scale training and test data for this task has limited the community's ability to understand the performance of their tracking systems on a wide range of scenarios and conditions such as variations in person density, actions being performed, weather, and time of day. \texttt{PersonPath22} dataset was specifically sourced to provide a wide variety of these conditions and our annotations include rich meta-data such that the performance of a tracker can be evaluated along these different dimensions. The lack of training data has also limited the ability to perform end-to-end training of tracking systems. As such, the highest performing tracking systems all rely on strong detectors trained on external image datasets. We hope that the release of this dataset will enable new lines of research that take advantage of large scale video based training data.