Machine learning has a long tradition of helping to solve complex information security problems that are difficult to tackle manually. Machine learning techniques learn models from data representations to solve a task, and these data representations are hand-crafted by domain experts. Deep learning is a sub-field of machine learning that uses models composed of multiple layers; consequently, the representations used to solve a task are learned from the data rather than engineered manually. In this survey, we study the use of DL techniques within the domain of information security. We systematically reviewed 77 papers and present them from a data-centric perspective. This data-centric view reflects one of the most important advantages of DL techniques: domain independence. If DL methods succeed in solving a problem for a certain data type in one domain, they most likely will also succeed on similar data from another domain. Other advantages of DL methods are unrivaled scalability and efficiency, both with respect to the number of examples that can be analyzed and the dimensionality of the input data. DL methods generally achieve high performance and generalize well. However, information security is a domain with unique requirements and challenges. Based on an analysis of the papers we reviewed, we point out shortcomings of DL methods with respect to these requirements and discuss further research opportunities.
This paper introduces the problem of Fine-grained Incident Video Retrieval (FIVR). Given a query video, the objective is to retrieve all associated videos, considering several types of association that range from duplicate videos to videos from the same incident. FIVR offers a single framework that contains several retrieval tasks as special cases. To address the benchmarking needs of all these tasks, we construct and present a large-scale annotated video dataset, which we call FIVR-200K, comprising 225,960 videos. To create the dataset, we devise a process for collecting YouTube videos based on major news events of recent years crawled from Wikipedia, and deploy a retrieval pipeline for the automatic selection of query videos based on their estimated suitability as benchmarks. We also devise an annotation protocol for the dataset with respect to the four types of video association defined by FIVR. Finally, we report the results of an experimental study on the dataset comparing a variety of state-of-the-art visual descriptors and aggregation techniques, highlighting the challenges of the problem at hand.
Despite the recent success of neural networks in image feature learning, a major problem in the video domain is the lack of sufficient labeled data for learning to model temporal information. In this paper, we propose an unsupervised temporal modeling method that learns from untrimmed videos. The speed of motion varies constantly, e.g., a man may run quickly or slowly. We therefore train a Multirate Visual Recurrent Model (MVRM) by encoding frames of a clip with different intervals. This learning process makes the learned model more capable of dealing with motion speed variance. Given a clip sampled from a video, we use its past and future neighboring clips as the temporal context, and reconstruct the two temporal transitions, i.e., the present→past transition and the present→future transition, reflecting the temporal information in different views. The proposed method exploits the two transitions simultaneously by incorporating a bidirectional reconstruction which consists of a backward reconstruction and a forward reconstruction. We apply the proposed method to two challenging video tasks, i.e., complex event detection and video captioning, in which it achieves state-of-the-art performance. Notably, our method generates the best single feature for event detection with a relative improvement of 10.4% on the MEDTest-13 dataset and achieves the best performance in video captioning across all evaluation metrics on the YouTube2Text dataset.
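As a rough illustration of the multirate encoding idea above, the following Python sketch (hypothetical helper names, not the authors' code) samples the same clip at several frame intervals so that a recurrent model sees the same content at different speeds:

```python
import numpy as np

def sample_multirate_clips(num_frames, clip_len=16, rates=(1, 2, 4)):
    """Return frame indices for the same clip sampled at several rates.

    A rate of r keeps every r-th frame, so larger rates simulate faster
    playback; encoding all variants exposes the model to speed variance.
    """
    clips = []
    for r in rates:
        span = clip_len * r
        if span > num_frames:
            continue  # clip does not fit in the video at this rate
        start = np.random.randint(0, num_frames - span + 1)
        clips.append(np.arange(start, start + span, r))
    return clips

# Example: a 120-frame video yields 16-frame clips at strides 1, 2 and 4.
for idx in sample_multirate_clips(120):
    print(idx[:4], "...", len(idx), "frames")
```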
With the growing popularity of short-form video sharing platforms such as Instagram and Vine, there has been an increasing need for techniques that automatically extract highlights from video. Whereas prior works have approached this problem with heuristic rules or supervised learning, we present an unsupervised learning approach that takes advantage of the abundance of user-edited videos on social media websites such as YouTube. Based on the idea that the most significant sub-events within a video class are commonly present among edited videos while less interesting ones appear less frequently, we identify the significant sub-events via a robust recurrent auto-encoder trained on a collection of user-edited videos queried for each particular class of interest. The auto-encoder is trained using a proposed shrinking exponential loss function that makes it robust to noise in the web-crawled training data, and is configured with bidirectional long short-term memory (LSTM) [5] cells to better model the temporal structure of highlight segments. Different from supervised techniques, our method can infer highlights using only a set of downloaded edited videos, without also needing their pre-edited counterparts, which are rarely available online. Extensive experiments indicate the promise of our proposed solution in this challenging unsupervised setting.
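A minimal PyTorch sketch of a bidirectional-LSTM sequence auto-encoder of the kind described above is given below; the proposed shrinking exponential loss is specific to the paper, so a standard Huber (smooth L1) reconstruction loss stands in here only as a placeholder.

```python
import torch
import torch.nn as nn

class BiLSTMAutoEncoder(nn.Module):
    """Encode a feature sequence with a bidirectional LSTM and reconstruct it.

    Segments that reconstruct poorly under a model trained on user-edited
    videos would be treated as less 'highlight-like'.
    """
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True,
                               bidirectional=True)
        self.decoder = nn.LSTM(2 * hidden, hidden, batch_first=True,
                               bidirectional=True)
        self.out = nn.Linear(2 * hidden, feat_dim)

    def forward(self, x):               # x: (batch, time, feat_dim)
        z, _ = self.encoder(x)
        y, _ = self.decoder(z)
        return self.out(y)

model = BiLSTMAutoEncoder()
clip = torch.randn(4, 30, 512)           # 4 segments, 30 steps of CNN features
recon = model(clip)
# Placeholder robust objective; the paper's shrinking exponential loss
# down-weights noisy web-crawled training segments more aggressively.
loss = nn.SmoothL1Loss()(recon, clip)
loss.backward()
```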
In the era of the Internet of Things (IoT), an enormous amount of sensing devices collect and/or generate various sensory data over time for a wide range of fields and applications. Based on the nature of the application, these devices will result in big or fast/real-time data streams. Applying analytics over such data streams to discover new information, predict future insights, and make control decisions is a crucial process that makes IoT a worthy paradigm for businesses and a quality-of-life improving technology. In this paper, we provide a thorough overview on using a class of advanced machine learning techniques, namely Deep Learning (DL), to facilitate the analytics and learning in the IoT domain. We start by articulating IoT data characteristics and identifying two major treatments for IoT data from a machine learning perspective, namely IoT big data analytics and IoT streaming data analytics. We also discuss why DL is a promising approach to achieve the desired analytics in these types of data and applications. The potential of using emerging DL techniques for IoT data analytics is then discussed, and its promises and challenges are introduced. We present a comprehensive background on different DL architectures and algorithms. We also analyze and summarize major reported research attempts that leveraged DL in the IoT domain. The smart IoT devices that have incorporated DL in their intelligence background are also discussed. DL implementation approaches on the fog and cloud centers in support of IoT applications are also surveyed. Finally, we shed light on some challenges and potential directions for future research. At the end of each section, we highlight the lessons learned based on our experiments and review of the recent literature.
Current state-of-the-art video understanding approaches adopt temporal jittering to simulate analyzing a video at varying frame rates. However, this does not work well for multirate videos, in which actions or sub-actions occur at different speeds; the frame sampling rate should vary according to the different motion speeds. In this work, we propose a simple yet effective strategy, termed random temporal skipping, to address this situation. The strategy handles multirate videos effectively by randomizing the sampling rate during training. It is an exhaustive approach that can cover all motion speed variations. Furthermore, thanks to the large amount of temporal skipping, our network can see video clips that originally span more than 100 frames, a temporal range sufficient for analyzing most actions and events. We also introduce an optical flow learning method that produces improved motion maps for human action recognition. Our framework is end-to-end trainable, runs in real time, and achieves state-of-the-art performance on six widely adopted video benchmarks.
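The random temporal skipping strategy can be illustrated with a short sampling sketch (assumed parameter names, not the authors' implementation): a random stride is drawn per training clip, so a 16-frame clip may span well over 100 original frames.

```python
import numpy as np

def random_temporal_skip(num_frames, clip_len=16, max_skip=8):
    """Sample clip_len frame indices with a random stride in [1, max_skip].

    With max_skip=8 a 16-frame clip can span up to 121 original frames,
    exposing the network to many effective frame rates during training.
    """
    assert num_frames >= clip_len
    stride = np.random.randint(1, max_skip + 1)
    span = (clip_len - 1) * stride + 1
    if span > num_frames:               # fall back to dense sampling
        stride, span = 1, clip_len
    start = np.random.randint(0, num_frames - span + 1)
    return np.arange(start, start + span, stride)[:clip_len]

print(random_temporal_skip(200))        # 16 indices with a random stride
```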
Driven by the rapid development of computer vision and machine learning, video analysis tasks have been moving from inferring the present state to predicting the future state. Vision-based action recognition and prediction from videos are such tasks, where action recognition infers human actions (the present state) from complete action executions, while action prediction predicts actions (the future state) from incomplete action executions. These two tasks have recently become particularly popular topics because of their explosively emerging real-world applications, such as visual surveillance, autonomous vehicles, entertainment, and video retrieval. Over the past decades, considerable effort has been devoted to building robust and effective frameworks for action recognition and prediction. In this paper, we survey the complete state-of-the-art techniques in action recognition and prediction. Existing models, popular algorithms, technical difficulties, popular action databases, evaluation protocols, and promising future directions are also discussed systematically.
Convolutional Neural Networks (CNN) have been regarded as a powerful class of models for image recognition problems. Nevertheless, it is not trivial when utilizing a CNN for learning spatio-temporal video representation. A few studies have shown that performing 3D convolutions is a rewarding approach to capture both spatial and temporal dimensions in videos. However, the development of a very deep 3D CNN from scratch results in expensive computational cost and memory demand. A valid question is why not recycle off-the-shelf 2D networks for a 3D CNN. In this paper, we devise multiple variants of bottleneck building blocks in a residual learning framework by simulating 3 × 3 × 3 convolutions with 1 × 3 × 3 convolutional filters on the spatial domain (equivalent to 2D CNN) plus 3 × 1 × 1 convolutions to construct temporal connections on adjacent feature maps in time. Furthermore, we propose a new architecture, named Pseudo-3D Residual Net (P3D ResNet), that exploits all the variants of blocks but composes each in different placement of ResNet, following the philosophy that enhancing structural diversity with going deep could improve the power of neural networks. Our P3D ResNet achieves clear improvements on the Sports-1M video classification dataset against 3D CNN and frame-based 2D CNN by 5.3% and 1.8%, respectively. We further examine the generalization performance of video representation produced by our pre-trained P3D ResNet on five different benchmarks and three different tasks, demonstrating superior performances over several state-of-the-art techniques.
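The core factorization can be sketched as follows in PyTorch; this shows only a serial (P3D-A-like) variant with a residual connection, whereas the paper composes several bottleneck variants at different placements within ResNet.

```python
import torch
import torch.nn as nn

class Pseudo3DBlock(nn.Module):
    """Factorize a 3x3x3 convolution into 1x3x3 (spatial) + 3x1x1 (temporal).

    Only the serial variant is shown; the paper also proposes parallel and
    composed variants inside ResNet bottleneck blocks.
    """
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)
        self.bn = nn.BatchNorm3d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                # x: (batch, C, T, H, W)
        out = self.relu(self.spatial(x))
        out = self.bn(self.temporal(out))
        return self.relu(out + x)        # residual connection

x = torch.randn(2, 64, 16, 56, 56)       # a 16-frame feature volume
print(Pseudo3DBlock(64)(x).shape)        # same shape as the input
```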
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that target recognising a limited number of words or phrases, we tackle lip reading as an open-world problem: unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) we compare two models for lip reading, one using a CTC loss and the other using a sequence-to-sequence loss, both built on top of the Transformer self-attention architecture; (2) we investigate to what extent lip reading is complementary to audio speech recognition, especially when the audio signal is noisy; (3) we introduce and publicly release a new dataset for audio-visual speech recognition, LRS2-BBC, consisting of thousands of natural sentences from British television. The models we train surpass the performance of all previous work on a lip reading benchmark dataset by a significant margin.
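For the CTC variant, the training objective can be illustrated with the standard CTC loss over per-frame character logits (a generic sketch with assumed dimensions, not the paper's exact model):

```python
import torch
import torch.nn as nn

# Per-frame character logits from a (hypothetical) visual front-end + encoder:
T, N, C = 75, 2, 40                      # time steps, batch size, character classes
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)

# Target character sequences; class 0 is reserved as the CTC blank symbol.
targets = torch.randint(1, C, (N, 20))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 20, dtype=torch.long)

# CTC marginalizes over all monotonic alignments between video frames and
# output characters, so no frame-level transcription is required.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```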
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on specific fine-grained domains with limited videos and simple descriptions. While researchers have provided several benchmark datasets for image captioning, we are not aware of any large-scale video description dataset with comprehensive categories yet diverse video content. In this paper we present MSR-VTT (standing for "MSR-Video to Text") which is a new large-scale video benchmark for video understanding, especially the emerging task of translating video to text. This is achieved by collecting 257 popular queries from a commercial video search engine, with 118 videos for each query. In its current version, MSR-VTT provides 10K web video clips with 41.2 hours and 200K clip-sentence pairs in total, covering the most comprehensive categories and diverse visual content, and representing the largest dataset in terms of sentence and vocabulary. Each clip is annotated with about 20 natural sentences by 1,327 AMT workers. We present a detailed analysis of MSR-VTT in comparison to a complete set of existing datasets, together with a summarization of different state-of-the-art video-to-text approaches. We also provide an extensive evaluation of these approaches on this dataset, showing that the hybrid Recurrent Neural Network-based approach, which combines single-frame and motion representations with soft-attention pooling strategy, yields the best generalization capability on MSR-VTT.
The goal of this work is to determine the audio-video synchronisation between mouth motion and speech in a video. We propose a two-stream ConvNet architecture that enables a joint embedding between the sound and the mouth images to be learnt from unlabelled data. The trained network is used to determine the lip-sync error in a video. We apply the network to two further tasks: active speaker detection and lip reading. On both tasks we set a new state-of-the-art on standard benchmark datasets.
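Once the two streams are trained to embed audio and mouth crops into a joint space, the lip-sync error can be estimated by sliding one stream against the other and picking the offset with the smallest embedding distance; the sketch below (assumed shapes and helper name) illustrates this search step.

```python
import torch
import torch.nn.functional as F

def best_av_offset(video_emb, audio_emb, max_offset=15):
    """Estimate the lip-sync offset from two embedding sequences of shape (T, D).

    Both streams are assumed to be projected into a joint space by the
    two-stream ConvNet; the predicted offset is the temporal shift that
    minimizes the mean pairwise distance between the two sequences.
    """
    dists = []
    for off in range(-max_offset, max_offset + 1):
        if off >= 0:
            v, a = video_emb[off:], audio_emb[:len(audio_emb) - off]
        else:
            v, a = video_emb[:off], audio_emb[-off:]
        dists.append(F.pairwise_distance(v, a).mean())
    offsets = torch.arange(-max_offset, max_offset + 1)
    return offsets[torch.stack(dists).argmin()].item()

video_emb = torch.randn(100, 256)                     # 100 steps of 256-D features
audio_emb = torch.roll(video_emb, shifts=4, dims=0)   # simulate a 4-step audio lag
print(best_av_offset(video_emb, audio_emb))           # recovers the simulated shift
```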
We propose a new deep network for audio event recognition, called AENet. In contrast to speech, sounds coming from audio events may be produced by a wide variety of sources. Furthermore, distinguishing them often requires analyzing an extended time period due to the lack of clear sub-word units that are present in speech. In order to incorporate this long-time frequency structure of audio events, we introduce a convolutional neural network (CNN) operating on a large temporal input. In contrast to previous works this allows us to train an audio event detection system end-to-end. The combination of our network architecture and a novel data augmentation outperforms previous methods for audio event detection by 16%. Furthermore, we perform transfer learning and show that our model learnt generic audio features, similar to the way CNNs learn generic features on vision tasks. In video analysis, combining visual features and traditional audio features such as MFCC typically only leads to marginal improvements. Instead, combining visual features with our AENet features, which can be computed efficiently on a GPU, leads to significant performance improvements on action recognition and video highlight detection. In video highlight detection, our audio features improve the performance by more than 8% over visual features alone. Index Terms: convolutional neural network, audio feature, large audio event dataset, large input field, highlight detection.
With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting caused by a lack of sufficient training data, and expression-unrelated variations such as illumination, head pose, and identity. In this paper, we provide a comprehensive survey on deep FER, including datasets and algorithms, that provides insights into these intrinsic problems. First, we describe the standard pipeline of a deep FER system together with the related background knowledge and suggestions for applicable implementations at each stage. We then introduce the available datasets that are widely used in the literature and provide accepted data selection and evaluation principles for these datasets. For the state of the art in deep FER, we review existing novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining challenges and corresponding opportunities in this field, as well as future directions for the design of robust deep FER systems.
Recent progress in using recurrent neural networks (RNNs) for image description has motivated the exploration of their application to video description. However, while images are static, working with videos requires modeling their dynamic temporal structure and then properly integrating that information into a natural language description. In this context, we propose an approach that successfully takes into account both the local and global temporal structure of videos to produce descriptions. First, our approach incorporates a spatio-temporal 3-D convolutional neural network (3-D CNN) representation of the short temporal dynamics. The 3-D CNN representation is trained on video action recognition tasks, so as to produce a representation that is tuned to human motion and behavior. Second, we propose a temporal attention mechanism that allows us to go beyond local temporal modeling and learn to automatically select the most relevant temporal segments given the text-generating RNN. Our approach exceeds the current state of the art on both BLEU and METEOR metrics on the Youtube2Text dataset. We also present results on a new, larger, and more challenging dataset of paired videos and natural language descriptions.
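The temporal attention component can be illustrated with a generic soft-attention sketch in PyTorch (assumed dimensions and layer names, not the authors' exact formulation): the decoder state scores each temporal segment, and the context vector is their weighted average.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Soft attention over per-segment video features, as a generic sketch.

    Given the decoder RNN state, produce weights over the T temporal
    segments and return their weighted average as the context vector.
    """
    def __init__(self, feat_dim, state_dim, hidden=128):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.state_proj = nn.Linear(state_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, feats, state):     # feats: (B, T, F), state: (B, S)
        e = torch.tanh(self.feat_proj(feats) + self.state_proj(state).unsqueeze(1))
        alpha = torch.softmax(self.score(e).squeeze(-1), dim=1)    # (B, T)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)         # (B, F)
        return context, alpha

feats = torch.randn(2, 26, 1024)          # e.g. 26 segments of 3-D CNN features
state = torch.randn(2, 512)               # current decoder hidden state
context, alpha = TemporalAttention(1024, 512)(feats, state)
print(context.shape, alpha.sum(dim=1))    # (2, 1024), weights sum to 1
```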
Recognizing actions in videos is a challenging task as video is an information-intensive media with complex variations. Most existing methods have treated video as a flat data sequence while ignoring the intrinsic hierarchical structure of the video content. In particular, an action may span different granularities in this hierarchy including, from small to large, a single frame, consecutive frames (motion), a short clip, and the entire video. In this paper, we present a novel framework to boost action recognition by learning a deep spatio-temporal video representation at hierarchical multi-granularity. Specifically, we model each granularity as a single stream by 2D (for frame and motion streams) or 3D (for clip and video streams) convolutional neural networks (CNNs). The framework therefore consists of multi-stream 2D or 3D CNNs to learn both the spatial and temporal representations. Furthermore, we employ the Long Short-Term Memory (LSTM) networks on the frame, motion, and clip streams to exploit long-term temporal dynamics. With a softmax layer on the top of each stream, the classification scores can be predicted from all the streams, followed by a novel fusion scheme based on the multi-granular score distribution. Our networks are learned in an end-to-end fashion. On two video action benchmarks of UCF101 and HMDB51, our framework achieves promising performance compared with the state-of-the-art.
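A simplified stand-in for the final fusion step is sketched below: each stream (frame, motion, clip, video) produces softmax scores, which are combined into a single class distribution. The paper learns its fusion from the multi-granular score distribution; a plain weighted average is used here only for illustration.

```python
import torch

def fuse_stream_scores(stream_scores, weights=None):
    """Average per-stream softmax scores into a final class distribution.

    stream_scores: list of (batch, num_classes) tensors, one per stream.
    A weighted average stands in for the paper's learned fusion scheme.
    """
    if weights is None:
        weights = [1.0 / len(stream_scores)] * len(stream_scores)
    return sum(w * s for w, s in zip(weights, stream_scores))

frame = torch.softmax(torch.randn(4, 101), dim=1)    # UCF101 has 101 classes
motion = torch.softmax(torch.randn(4, 101), dim=1)
clip = torch.softmax(torch.randn(4, 101), dim=1)
video = torch.softmax(torch.randn(4, 101), dim=1)
pred = fuse_stream_scores([frame, motion, clip, video]).argmax(dim=1)
print(pred)                                          # fused class predictions
```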
This work addresses the problem of accurate semantic labelling of short videos. To this end, a large number of different deep networks are evaluated, including traditional recurrent neural networks (LSTM, GRU), temporally agnostic networks (FV, VLAD, BoW), fully connected neural networks with mid-level AV fusion, and others. In addition, we propose a DNN based on a residual architecture for video classification that achieves state-of-the-art classification performance at significantly reduced complexity. Furthermore, we propose four new approaches for diversity-driven multi-network ensembling, one based on a fast correlation measure and three based on DNN-based combiners. We show that significant performance gains can be achieved by ensembling diverse networks, and we investigate the factors that lead to high diversity. Based on the extensive YouTube-8M dataset, we provide an in-depth evaluation and analysis of their behaviour. We show that the performance of the ensemble is state-of-the-art, achieving the highest accuracy on the YouTube-8M Kaggle test data. The performance of the classifiers was also evaluated on the HMDB51 and UCF101 datasets, and it is shown that the resulting methods achieve accuracy comparable to state-of-the-art methods using similar input features.
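One way to make the diversity notion concrete is to measure how correlated two models' score matrices are on held-out clips; weakly correlated members tend to add more to an ensemble. The sketch below uses a generic Pearson-correlation proxy, not the paper's exact diversity measure.

```python
import numpy as np

def prediction_correlation(preds_a, preds_b):
    """Pearson correlation between two models' flattened score matrices.

    Low correlation indicates diverse errors, which is what makes an
    ensemble member useful; this is a generic proxy for diversity.
    """
    return np.corrcoef(preds_a.ravel(), preds_b.ravel())[0, 1]

# Scores of three hypothetical models on the same validation clips.
rng = np.random.default_rng(0)
base = rng.random((1000, 20))
models = [base + 0.05 * rng.random((1000, 20)),     # near-duplicate of base
          base + 0.5 * rng.random((1000, 20)),      # moderately different
          rng.random((1000, 20))]                   # independent model
for i in range(len(models)):
    for j in range(i + 1, len(models)):
        print(i, j, round(prediction_correlation(models[i], models[j]), 3))
```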
Automatically recognizing and localizing wide ranges of human actions has crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include 'background videos' which share similar scenes and backgrounds as action videos, but are devoid of the specific actions. The three editions of the challenge organized in 2013-2015 have made THUMOS a common benchmark for action classification and detection, and the annual challenge is widely attended by teams from around the world. In this paper we describe the THUMOS benchmark in detail and give an overview of data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos (www.thumos.info), and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges.
This report describes 18 projects that explored how commercial cloud computing services can be utilized for scientific computation at national laboratories. These demonstrations ranged from deploying proprietary software in a cloud environment to leveraging established cloud-based analytics workflows for processing scientific datasets. In general, the projects were highly successful, and collectively they suggest that cloud computing can be a valuable computational resource for scientific computation at national laboratories.
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications, including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey of recent advances towards the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss the various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic co-design, being proposed in academia and industry. The reader will take away the following concepts from this article: an understanding of the key design considerations for DNNs; the ability to evaluate different DNN hardware implementations with benchmarks and comparison metrics; an understanding of the trade-offs between various hardware architectures and platforms; the ability to evaluate the utility of various DNN design techniques for efficient processing; and an understanding of recent implementation trends and opportunities.
LiveSketch is a novel algorithm for searching large image collections using hand-sketched queries. LiveSketch tackles the inherent ambiguity of sketch search by creating visual suggestions that augment the query as it is drawn, making query specification an iterative rather than a one-shot process, and helping to disambiguate the user's search intent. Our technical contributions are: a triplet network architecture that incorporates an RNN-based variational autoencoder to search for images using vector (stroke-based) queries; real-time clustering to identify likely search intents (and thus targets within the search embedding); and the use of backpropagation from those targets to perturb the input stroke sequence, thereby suggesting alterations to the query in order to guide the search. We show improvements in accuracy and time-to-task over contemporary baselines using a corpus of 67M images.
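The triplet training objective can be sketched with the standard triplet margin loss (hypothetical embedding shapes; the actual sketch branch encodes stroke sequences with an RNN-based variational autoencoder):

```python
import torch
import torch.nn as nn

# Hypothetical embeddings from the sketch branch (anchor) and the image
# branch (positive = relevant image, negative = irrelevant image).
anchor = torch.randn(32, 256, requires_grad=True)    # sketch embeddings
positive = torch.randn(32, 256, requires_grad=True)  # matching images
negative = torch.randn(32, 256, requires_grad=True)  # non-matching images

# The triplet objective pulls sketches toward their matching images and
# pushes them away from non-matching ones in the joint search embedding.
loss = nn.TripletMarginLoss(margin=0.2)(anchor, positive, negative)
loss.backward()

# At query time, retrieval reduces to a nearest-neighbour search of the
# sketch embedding against the pre-computed image embeddings.
```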