Many activities of interest are rare events, with only a few labeled examples available. Models for temporal activity detection are therefore expected to learn from only a few examples. In this paper, we present a conceptually simple and general, yet novel, framework for few-shot temporal activity detection that localizes the start and end times of activities in untrimmed videos given only a few examples of the input activities. Our model is end-to-end trainable and can benefit from additional few-shot examples. At test time, each proposal is assigned the label of the few-shot activity class corresponding to its maximum similarity score. Our Similarity R-C3D method outperforms previous work on temporal activity detection in the few-shot setting on three benchmarks (the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets). Our code will be made available.
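The abstract only describes the test-time labeling rule at a high level; the snippet below is a minimal sketch of that rule (assign each proposal the few-shot class of its most similar exemplar), not the paper's actual Similarity R-C3D implementation. The feature dimension, the use of cosine similarity, and the function name `label_proposals` are assumptions.

```python
import torch
import torch.nn.functional as F

def label_proposals(proposal_feats: torch.Tensor,
                    exemplar_feats: torch.Tensor,
                    exemplar_labels: torch.Tensor):
    """Assign each temporal proposal the label of its most similar few-shot exemplar.

    proposal_feats:  (P, D) features of P candidate temporal segments
    exemplar_feats:  (K, D) features of K few-shot example clips
    exemplar_labels: (K,)   class index of each exemplar
    """
    # Cosine similarity between every proposal and every exemplar: (P, K)
    sims = F.cosine_similarity(proposal_feats.unsqueeze(1),
                               exemplar_feats.unsqueeze(0), dim=-1)
    best = sims.argmax(dim=1)                        # index of the closest exemplar
    return exemplar_labels[best], sims.max(dim=1).values

# Example: 10 proposals, 5 exemplars from 3 few-shot classes, 512-d features
labels, scores = label_proposals(torch.randn(10, 512),
                                 torch.randn(5, 512),
                                 torch.tensor([0, 0, 1, 2, 2]))
```

In the paper, the feature encoder and the similarity metric are trained end-to-end; a scoring step of this kind would sit on top of that learned representation.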
Events defined by the interaction of objects in a scene are often of critical importance, yet such events are typically rare and the labeled examples available are insufficient to train conventional deep models that generalize well across object appearances. Most deep learning models for activity recognition focus on global context aggregation and do not explicitly consider object interactions within the video, potentially overlooking important cues relevant to interpreting activity in the scene. In this paper, we show that a novel model for explicit representation of object interactions significantly improves deep video activity classification for driving collision detection. We propose a Spatio-Temporal Action Graph (STAG) network, which incorporates the spatial and temporal relations of objects. The network is learned automatically from data, and a latent graph structure is inferred for the task. As a benchmark for evaluating performance on the collision detection task, we introduce a novel dataset based on data obtained from real-world driving collisions and near-collisions. This dataset reflects the challenging task of detecting and classifying incidents in a richly varying yet highly constrained environment, one that is highly relevant to the evaluation of autonomous driving and alert systems. Our experiments confirm that the STAG model yields significantly improved results for collision activity classification.
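The abstract does not spell out the STAG architecture; purely as an illustration of the general idea of aggregating per-object features through a learned (latent) interaction graph, here is a toy attention-style layer in PyTorch. The layer names, sizes, and the use of soft attention as the "graph" are assumptions, not the published design.

```python
import torch
import torch.nn as nn

class ObjectInteractionLayer(nn.Module):
    """Toy graph layer: each object attends to all objects in the same frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (T, N, D) = frames x detected objects x feature dim
        q, k, v = self.query(obj_feats), self.key(obj_feats), self.value(obj_feats)
        # Latent "graph": soft adjacency from pairwise affinities, per frame
        adj = torch.softmax(q @ k.transpose(-1, -2) / k.shape[-1] ** 0.5, dim=-1)
        return obj_feats + adj @ v                     # residual message passing

# Example: 8 frames, 6 detected objects, 256-d appearance features
feats = torch.randn(8, 6, 256)
refined = ObjectInteractionLayer(256)(feats)
clip_repr = refined.mean(dim=(0, 1))                   # pool over time and objects
```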
Video generation is a challenging task because it requires the model to generate realistic content and motion simultaneously. Existing methods generate motion and content together with a single generator network, but this approach can fail on complex videos. In this paper, we propose a two-stream video generation model, the Two-Stream Variational Adversarial Network (TwoStreamVAN), that decouples content and motion generation into two parallel generators. Given an input action label, our model outputs realistic videos by progressively generating and fusing motion and content features at multiple scales using adaptive motion kernels. In addition, to better evaluate video generation models, we design a new synthetic human action dataset that bridges the difficulty gap between overly complex human action datasets and simple ones. Our model significantly outperforms existing methods on the standard Weizmann Human Action and MUG Facial Expression datasets as well as on our new dataset.
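The adaptive-motion-kernel fusion is only named, not described, in this abstract. The sketch below is one plausible reading (a depthwise kernel predicted from the motion stream and applied to the content stream), offered as an illustration rather than the TwoStreamVAN design; all shapes and module names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelFusion(nn.Module):
    """Fuse content features with a 3x3 depthwise kernel predicted from a motion code."""
    def __init__(self, channels: int, motion_dim: int, ksize: int = 3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.kernel_pred = nn.Linear(motion_dim, channels * ksize * ksize)

    def forward(self, content: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # content: (B, C, H, W) content features; motion: (B, motion_dim) motion code
        B, C, H, W = content.shape
        kernels = self.kernel_pred(motion).view(B, C, 1, self.ksize, self.ksize)
        out = []
        for b in range(B):  # one predicted kernel per sample, applied depthwise
            out.append(F.conv2d(content[b:b + 1], kernels[b],
                                padding=self.ksize // 2, groups=C))
        return content + torch.cat(out, dim=0)         # residual fusion

fused = AdaptiveKernelFusion(64, 128)(torch.randn(4, 64, 16, 16), torch.randn(4, 128))
```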
We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. Second, to learn a fine-grained similarity metric for retrieval, we use visual features to modulate the processing of query sentences at the word level in a recurrent neural network. A multi-task loss is also employed by adding query re-generation as an auxiliary task. Our approach significantly outperforms prior work on two challenging benchmarks: Charades-STA and ActivityNet Captions.
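As a rough illustration of word-level visual modulation of the query sentence (the paper's exact mechanism is not given in this abstract), the sketch below gates each word embedding with a clip-conditioned vector before a GRU encoder; all dimensions and names are placeholders.

```python
import torch
import torch.nn as nn

class VisuallyModulatedQueryEncoder(nn.Module):
    def __init__(self, word_dim: int, clip_dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(clip_dim, word_dim), nn.Sigmoid())
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)

    def forward(self, word_embs: torch.Tensor, clip_feat: torch.Tensor) -> torch.Tensor:
        # word_embs: (B, L, word_dim) query sentence; clip_feat: (B, clip_dim) candidate clip
        g = self.gate(clip_feat).unsqueeze(1)          # (B, 1, word_dim) visual gate
        _, h = self.gru(word_embs * g)                 # modulate every word, then encode
        return h[-1]                                   # (B, hidden) query representation

enc = VisuallyModulatedQueryEncoder(word_dim=300, clip_dim=512, hidden=256)
query_repr = enc(torch.randn(2, 12, 300), torch.randn(2, 512))
```

A retrieval score could then be computed between `query_repr` and the clip feature, with query re-generation added as the auxiliary loss the abstract mentions.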
Dense video captioning is a fine-grained video understanding task that involves two sub-problems: localizing distinct events in a long video stream, and generating captions for the localized events. We propose the Joint Event Detection and Description Network (JEDDi-Net), which solves the dense video captioning task in an end-to-end fashion. Our model continuously encodes the input video stream with three-dimensional convolutional layers, proposes variable-length temporal events based on pooled features, and generates their captions. Proposal features are extracted within each proposal segment through 3D Segment-of-Interest pooling from shared video feature encoding. In order to explicitly model temporal relationships between visual events and their captions in a single video, we also propose a two-level hierarchical captioning module that keeps track of context. On the large-scale ActivityNet Captions dataset, JEDDi-Net demonstrates improved results as measured by standard metrics. We also present the first dense captioning results on the TACoS-MultiLevel dataset.
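For intuition about 3D Segment-of-Interest pooling, here is a minimal sketch: crop a temporal proposal out of a shared 3D-conv feature map and pool it to a fixed length, in the spirit of RoI pooling along the time axis. The shapes and the use of adaptive max pooling are assumptions, not JEDDi-Net's exact operator.

```python
import torch
import torch.nn.functional as F

def soi_pool(feature_map: torch.Tensor, start: int, end: int, out_len: int = 4):
    """Toy 'Segment-of-Interest' pooling over a shared 3D-conv feature map.

    feature_map: (C, T, H, W) features of the whole video
    start, end:  proposal boundaries in feature-map time steps (end exclusive)
    """
    segment = feature_map[:, start:end].unsqueeze(0)   # (1, C, t, H, W)
    # Fixed output size regardless of proposal length, like RoI pooling in time
    return F.adaptive_max_pool3d(segment, (out_len, 1, 1)).flatten()  # (C * out_len,)

feats = torch.randn(512, 96, 7, 7)                     # shared encoding of a long video
proposal_feat = soi_pool(feats, start=20, end=45)      # fed to the captioning module
```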
We address the problem of activity detection in continuous, untrimmed video streams. This is a difficult task that requires extracting meaningful spatio-temporal features to capture activities, accurately localizing the start and end times of each activity. We introduce a new model, Region Convolutional 3D Network (R-C3D), which encodes the video streams using a three-dimensional fully convolutional network, then generates candidate temporal regions containing activities, and finally classifies selected regions into specific activities. Computation is saved due to the sharing of convolutional features between the proposal and the classification pipelines. The entire model is trained end-to-end with jointly optimized localization and classification losses. R-C3D is faster than existing methods (569 frames per second on a single Titan X Maxwell GPU) and achieves state-of-the-art results on THUMOS'14. We further demonstrate that our model is a general activity detection framework that does not rely on assumptions about particular dataset properties by evaluating our approach on ActivityNet and Charades. Our code is available at http://ai.bu.edu/r-c3d/.
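To make the proposal/classification feature-sharing idea concrete, here is a schematic, heavily simplified PyTorch module in the spirit of R-C3D: one 3D-conv encoder feeds both a temporal proposal head and a classification head. Layer sizes and the whole-clip classification shortcut are illustrative only; the released code at the URL above contains the real model.

```python
import torch
import torch.nn as nn

class TinyRC3DStyle(nn.Module):
    """Schematic only: a shared 3D-conv encoder feeding a proposal head and a
    classification head. Not the published R-C3D architecture."""
    def __init__(self, num_classes: int, num_anchors: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),        # keep time, squeeze space
        )
        self.proposal_head = nn.Conv1d(128, num_anchors * 2, kernel_size=3, padding=1)
        self.cls_head = nn.Linear(128, num_classes + 1)

    def forward(self, clip: torch.Tensor):
        # clip: (B, 3, T, H, W) raw frames
        f = self.encoder(clip).squeeze(-1).squeeze(-1)  # (B, 128, T) shared features
        proposals = self.proposal_head(f)               # (B, anchors*2, T) objectness scores
        # For brevity, classify the whole-clip average instead of pooled proposals
        scores = self.cls_head(f.mean(dim=-1))          # (B, num_classes + 1)
        return proposals, scores

props, scores = TinyRC3DStyle(num_classes=20)(torch.randn(1, 3, 16, 112, 112))
```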
Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
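As a rough sketch of the CNN-plus-RNN pipeline the abstract describes, the following model mean-pools per-frame CNN features and decodes a sentence with an LSTM under teacher forcing; the pooling choice, dimensions, and names are assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    """Mean-pool per-frame CNN features, then decode a sentence with an LSTM.
    A stand-in for the CNN+RNN pipeline in the abstract, not the exact model."""
    def __init__(self, feat_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.vocab = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats: torch.Tensor, captions: torch.Tensor):
        # frame_feats: (B, T, feat_dim) CNN features of sampled frames
        # captions:    (B, L) word indices of the target sentence (teacher forcing)
        video = frame_feats.mean(dim=1)                 # pool over time
        h0 = torch.tanh(self.init_h(video)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.vocab(out)                          # (B, L, vocab_size) logits

logits = MeanPoolCaptioner(feat_dim=4096, vocab_size=10000)(
    torch.randn(2, 30, 4096), torch.randint(0, 10000, (2, 12)))
```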
Language model pre-training has proven useful for learning universal language representations. As a state-of-the-art pre-trained language model, BERT (Bidirectional Encoder Representations from Transformers) has achieved remarkable results on many language understanding tasks. In this paper, we conduct exhaustive experiments to investigate different fine-tuning methods of BERT on text classification tasks and provide a general solution for BERT fine-tuning. Finally, the proposed solution obtains new state-of-the-art results on eight widely studied text classification datasets.
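As a baseline illustration of BERT fine-tuning for text classification (the paper studies several fine-tuning strategies that are not detailed in this abstract), here is a minimal training step using the Hugging Face `transformers` library; the library choice, checkpoint, learning rate, and toy data are assumptions, not the paper's setup.

```python
# Minimal BERT fine-tuning step for sequence classification (illustrative only).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

texts = ["the movie was great", "terrible service"]    # toy training batch
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")

model.train()
out = model(**batch, labels=labels)                    # returns loss and logits
out.loss.backward()
optimizer.step()
optimizer.zero_grad()
```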
In this paper, we focus on the facial expression translation task and propose a novel Expression Conditional GAN (ECGAN), which can learn the mapping from one image domain to another conditioned on an additional expression attribute. The proposed ECGAN is a generic framework and is applicable to different expression generation tasks, in which a specific facial expression can be easily controlled by the conditional attribute label. In addition, we introduce a novel face mask to reduce the influence of background changes. Moreover, we propose a complete framework for facial expression generation and recognition in the wild, which consists of two modules: generation and recognition. Finally, we evaluate our framework on several public face datasets in which the subjects have different races, illumination, occlusions, poses, colors, contents, and backgrounds. Even though these datasets are very diverse, both qualitative and quantitative results demonstrate that our approach is able to generate facial expressions accurately and robustly.
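To illustrate how a conditional attribute label can control the generated expression, the toy generator below tiles a one-hot expression label as extra input channels; this is a generic conditional-GAN sketch, not the ECGAN architecture or its face mask mechanism.

```python
import torch
import torch.nn as nn

class ConditionalExpressionGenerator(nn.Module):
    """Toy conditional generator: the target expression label is broadcast as extra
    input channels, so one network can translate a face to any requested expression."""
    def __init__(self, num_expressions: int):
        super().__init__()
        self.num_expressions = num_expressions
        self.net = nn.Sequential(
            nn.Conv2d(3 + num_expressions, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, face: torch.Tensor, expr_label: torch.Tensor) -> torch.Tensor:
        # face: (B, 3, H, W) input image; expr_label: (B,) integer expression ids
        B, _, H, W = face.shape
        onehot = torch.zeros(B, self.num_expressions, H, W, device=face.device)
        onehot[torch.arange(B), expr_label] = 1.0      # spatially tiled condition
        return self.net(torch.cat([face, onehot], dim=1))

fake = ConditionalExpressionGenerator(num_expressions=7)(
    torch.randn(2, 3, 64, 64), torch.tensor([3, 5]))
```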
Rank-based learning with deep neural networks has been widely used for image cropping. However, rank-based methods often perform poorly, mainly for two reasons: 1) image cropping is a listwise ranking task rather than a pairwise comparison, and 2) the rescaling caused by pooling layers and the deformation introduced during view generation harm the performance of composition learning. In this paper, we develop a new model to overcome these problems. To address the first problem, we formulate image cropping as a listwise ranking problem to find the best view composition. For the second problem, a refined view sampling (termed RoIRefine) is proposed to extract refined feature maps for candidate view generation. Given a series of candidate views, the proposed model learns the top-1 probability distribution over the views and picks the best one. By integrating refined sampling and listwise ranking, the proposed network, called LVRN, achieves state-of-the-art performance in both accuracy and speed.
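As an illustration of the listwise (top-1 probability) formulation, here is a ListNet-style loss over the candidate views of one image; the exact loss used by LVRN may differ, and the scores in the example are fabricated toy values.

```python
import torch
import torch.nn.functional as F

def listwise_top1_loss(pred_scores: torch.Tensor, gt_scores: torch.Tensor) -> torch.Tensor:
    """Compare the predicted top-1 probability distribution over candidate crops
    with the distribution induced by ground-truth quality scores (ListNet-style).

    pred_scores, gt_scores: (N,) one score per candidate view of the same image
    """
    pred_dist = F.log_softmax(pred_scores, dim=0)
    gt_dist = F.softmax(gt_scores, dim=0)
    return -(gt_dist * pred_dist).sum()                # cross-entropy between distributions

pred = torch.randn(8, requires_grad=True)              # network scores for 8 candidate crops
gt = torch.tensor([4.2, 3.1, 4.8, 2.0, 3.7, 4.5, 1.9, 3.3])   # annotated quality scores
loss = listwise_top1_loss(pred, gt)
best_view = pred.argmax()                              # at test time, take the top-scoring crop
```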