The potential for agents, whether embodied or software, to learn by observing other agents perform procedures involving objects and actions is rich. Current research on automatic procedure learning relies heavily on action labels or video subtitles, even during the evaluation phase, which makes these methods infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation: segmenting a procedural video into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. Our experiments show that the proposed model outperforms competitive baselines in procedure segmentation.
Can we teach a robot to recognize and make predictions for activities that it has never seen before? We tackle this problem by learning models for video from text. This paper presents a hierarchical model that generalizes instructional knowledge from large-scale text corpora and transfers the knowledge to video. Given a portion of an instructional video, our model recognizes and predicts coherent and plausible actions multiple steps into the future, all in rich natural language. To demonstrate the capabilities of our model, we introduce the \emph{Tasty Videos Dataset V2}, a collection of 4022 recipes for zero-shot learning, recognition and anticipation. Extensive experiments with various evaluation metrics demonstrate the potential of our method for generalization, given limited video data for training models.
Most natural videos contain numerous events. For example, in a video of a "man playing a piano", the video might also contain "another man dancing" or "a crowd clapping". We introduce the task of dense-captioning events, which involves both detecting and describing events in a video. We propose a new model that is able to identify all events in a single pass of the video while simultaneously describing the detected events with natural language. Our model introduces a variant of an existing proposal module that is designed to capture both short as well as long events that span minutes. To capture the dependencies between the events in a video, our model introduces a new captioning module that uses contextual information from past and future events to jointly describe all events. We also introduce ActivityNet Captions, a large-scale benchmark for dense-captioning events. ActivityNet Captions contains 20k videos amounting to 849 video hours with 100k total descriptions, each with its unique start and end time. Finally, we report the performance of our model on dense-captioning events, video retrieval, and localization.
Video moment retrieval aims to find the start and end timestamps of a moment (a part of the video) described by a given natural language query. Fully supervised methods require complete temporal boundary annotations to achieve promising results, which is costly since annotators are required to watch the whole moment. Weakly supervised methods rely only on paired videos and queries, but their performance is relatively poor. In this paper, we look closely into the annotation process and propose a new paradigm called "glance annotation". This paradigm requires only one random timestamp, which we refer to as a "glance", within the temporal boundary of the fully supervised counterpart. We argue that this is beneficial because, compared with weak supervision, it adds only a trivial cost while providing greater potential. Under the glance annotation setting, we propose a contrastive-learning-based method called Video moment retrieval via Glance Annotation (ViGA). ViGA cuts the input video into clips and contrasts the clips against the query, with glance-guided Gaussian weights assigned to all clips. Our extensive experiments show that ViGA outperforms weakly supervised methods by a large margin and, in some cases, even achieves results comparable to fully supervised methods.
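To make the glance-guided weighting concrete, here is a minimal sketch assuming uniformly indexed clips; the function name and the sigma value are illustrative, not taken from the paper:

```python
import numpy as np

def glance_gaussian_weights(num_clips: int, glance_idx: int,
                            sigma: float = 0.2) -> np.ndarray:
    """Gaussian clip weights centred on the glance clip: clips near the
    annotated glance contribute more to the clip-query contrast."""
    positions = np.arange(num_clips)
    dist = (positions - glance_idx) / num_clips   # scale-free distance
    weights = np.exp(-0.5 * (dist / sigma) ** 2)
    return weights / weights.sum()                # normalise to a distribution

# Example: a 10-clip video whose glance annotation falls in clip 3.
print(glance_gaussian_weights(10, 3).round(3))
```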
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network that explicitly considers temporal overlap, thereby achieving high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
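The abstract does not give the loss in closed form; the sketch below is a hedged reconstruction of an overlap-aware classification loss in which the temporal IoU of a proposal caps how cheaply the classifier can be confident. All names and the alpha value are assumptions, not the paper's exact formula:

```python
import torch
import torch.nn.functional as F

def overlap_aware_loss(logits, labels, tiou, alpha: float = 0.25):
    """Softmax loss plus a term tying true-class confidence to the
    proposal's temporal IoU with ground truth, so that high confidence
    is only cheap for high-overlap proposals."""
    ce = F.cross_entropy(logits, labels, reduction="none")
    p_true = F.softmax(logits, dim=1).gather(1, labels[:, None]).squeeze(1)
    is_action = (labels > 0).float()                    # class 0 = background
    reg = 0.5 * (p_true ** 2 / tiou.clamp(min=1e-3) ** alpha - 1.0)
    return (ce + is_action * reg).mean()

loss = overlap_aware_loss(torch.randn(8, 21), torch.randint(0, 21, (8,)),
                          torch.rand(8))
```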
Existing approaches to video activity localization implicitly assume that the activity temporal boundaries labeled for model training are deterministic and precise. However, in unscripted natural videos, different activities mostly transition smoothly, so it is intrinsically ambiguous to determine exactly when an activity starts and ends over time. Such uncertainty in temporal labeling is currently ignored in model training, resulting in learning mis-matched video-text correlations that generalize poorly at test time. In this work, we address this problem by introducing Elastic Moment Bounding (EMB) to accommodate flexible and adaptive activity temporal boundaries, modeling universally interpretable video-text correlations that are tolerant to the temporal uncertainty of pre-fixed annotations. Specifically, we construct elastic boundaries adaptively by mining and discovering frame-wise temporal endpoints that maximize the alignment between video segments and query sentences. To enable both more robust matching (segment content attention) and more accurate localization (segment elastic boundaries), we optimize the selection of frame-wise endpoints through a novel guided attention mechanism. Extensive experiments on three video activity localization benchmarks demonstrate convincingly the advantage of EMB over existing methods that do not model boundary uncertainty.
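As an illustration of elastic boundary mining, the hedged sketch below relaxes an annotated moment by searching nearby endpoints for the best average frame-query alignment; the paper's guided attention mechanism is replaced here by plain similarity scores, and the slack window is an assumption:

```python
import numpy as np

def elastic_boundaries(sims: np.ndarray, s: int, e: int, slack: int = 2):
    """Relax an annotated moment (s, e) into elastic boundaries by
    searching endpoints within +/- slack frames of the (possibly noisy)
    annotation and keeping the pair whose enclosed frames align best
    with the query on average. Purely illustrative."""
    best, best_score = (s, e), -np.inf
    for s2 in range(max(0, s - slack), s + slack + 1):
        for e2 in range(e - slack, min(len(sims) - 1, e + slack) + 1):
            if s2 < e2:
                score = sims[s2:e2 + 1].mean()
                if score > best_score:
                    best, best_score = (s2, e2), score
    return best

sims = np.array([0.2, 0.6, 0.8, 0.7, 0.3, 0.1])
print(elastic_boundaries(sims, s=2, e=4))  # -> (2, 3)
```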
A major challenge in temporal action localization is distinguishing subtle human actions from the various co-occurring ingredients (e.g., context and background) in untrimmed videos. Although previous methods have made significant progress by devising advanced action detectors, they still suffer from these co-occurring ingredients, which often dominate the actual action content in a video. In this paper, we explore two orthogonal but complementary aspects of a video snippet: the action features and the co-occurrence features. In particular, we develop a novel auxiliary task that decouples these two kinds of features within video snippets and recombines them to generate a new feature representation with more salient action information for accurate action localization. We refer to our method as refactoring: it first explicitly decomposes the action content, regularizes its co-occurrence features, and then synthesizes a new action-dominated video representation. Extensive experimental results and ablation studies on THUMOS14 and ActivityNet v1.3 demonstrate that our new representation, combined with a simple action detector, significantly improves action localization performance.
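A hedged sketch of the decouple-and-recombine idea follows; the two linear heads and the additive recombination are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FeatureRefactor(nn.Module):
    """Decouple a snippet feature into an action part and a
    co-occurrence part, then recombine the action part of one snippet
    with the co-occurrence part of another to synthesize an
    action-dominated representation for the auxiliary task."""
    def __init__(self, dim: int):
        super().__init__()
        self.action_head = nn.Linear(dim, dim)
        self.context_head = nn.Linear(dim, dim)

    def forward(self, snippet_a, snippet_b):
        action = self.action_head(snippet_a)     # action evidence from A
        context = self.context_head(snippet_b)   # co-occurrence from B
        return action + context                  # synthesized snippet

refactor = FeatureRefactor(128)
new_feat = refactor(torch.randn(4, 128), torch.randn(4, 128))
```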
Text-based video segmentation aims to segment an actor in a video sequence by specifying the actor and its performed action with a text query. Previous methods fail to align video content with text queries in a fine-grained manner according to the actor and its action, due to the problem of \emph{semantic asymmetry}. \emph{Semantic asymmetry} means that the two modalities contain different amounts of semantic information during multi-modal fusion. To alleviate this problem, we propose a novel actor and action modular network that individually localizes the actor and its action in two separate modules. Specifically, we first learn actor- and action-related content from the video and the text query, and then match them in a symmetric manner to localize the target tube. The target tube contains the desired actor and action, and is then fed into a fully convolutional network to predict a segmentation mask of the actor. Our method also establishes associations of objects across frames with the proposed temporal proposal aggregation mechanism. This enables our method to segment the video effectively and keep the temporal consistency of predictions. The whole model allows joint learning of actor-action matching and segmentation, and achieves state-of-the-art performance on both single-frame segmentation and full-video segmentation on the A2D Sentences and J-HMDB Sentences datasets.
本文解决了自然语言视频本地化(NLVL)的问题。几乎所有现有的作品都遵循“仅一次外观”框架,该框架利用单个模型直接捕获视频疑问对之间的复杂跨和自模式关系并检索相关段。但是,我们认为这些方法忽略了理想本地化方法的两个必不可少的特征:1)帧差异:考虑正/负视频帧的不平衡,在本地化过程中突出显示正帧并削弱负面框架是有效的。 2)边界优先:为了预测确切的段边界,该模型应捕获连续帧之间更细粒度的差异,因为它们的变化通常是平滑的。为此,我们灵感来自于人类如何看待和定位一个细分市场,我们提出了一个两步的人类框架,称为掠夺 - 储存式融合(SLP)。 SLP由脱脂和排列(SL)模块和双向仔细(BP)模块组成。 SL模块首先是指查询语义,并在滤除无关的帧时从视频中选择最佳匹配的帧。然后,BP模块基于此框架构造了初始段,并通过探索其相邻帧来动态更新它,直到没有帧共享相同的活动语义为止。三个具有挑战性的基准测试的实验结果表明,我们的SLP优于最新方法,并将其定位更精确的段边界。
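The two-step procedure can be sketched as follows, with a simple similarity threshold standing in for the paper's learned test of shared activity semantics (the threshold and names are assumptions):

```python
import numpy as np

def skim_then_peruse(frame_sims: np.ndarray, keep_thresh: float = 0.5):
    """Skim-and-locate: pick the single best-matched frame.
    Bidirectional perusal: grow a segment around it while neighbouring
    frames still share the activity semantics, approximated here by a
    similarity floor relative to the peak frame."""
    anchor = int(frame_sims.argmax())
    lo = hi = anchor
    floor = keep_thresh * frame_sims[anchor]
    while lo > 0 and frame_sims[lo - 1] >= floor:
        lo -= 1
    while hi < len(frame_sims) - 1 and frame_sims[hi + 1] >= floor:
        hi += 1
    return lo, hi

sims = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.3, 0.1])
print(skim_then_peruse(sims))  # -> (2, 4)
```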
Temporal action proposal generation is an important yet challenging problem, since temporal proposals with rich action content are indispensable for analysing real-world videos with long durations and a high proportion of irrelevant content. This problem requires methods that not only generate proposals with precise temporal boundaries, but also retrieve proposals that cover ground-truth action instances with high recall and high overlap using relatively few proposals. To address these difficulties, we introduce an effective proposal generation method, named Boundary-Sensitive Network (BSN), which adopts a "local to global" fashion. Locally, BSN first locates temporal boundaries with high probabilities, then directly combines these boundaries as proposals. Globally, with the Boundary-Sensitive Proposal feature, BSN retrieves proposals by evaluating the confidence of whether a proposal contains an action within its region. We conduct experiments on two challenging datasets, ActivityNet-1.3 and THUMOS14, where BSN outperforms other state-of-the-art temporal action proposal generation methods with high recall and high temporal precision. Finally, further experiments demonstrate that by combining existing action classifiers, our method significantly improves the state-of-the-art temporal action detection performance.
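The "local to global" assembly can be sketched directly from this description: find high-probability boundary locations, then pair starts with later ends. The confidence re-ranking from proposal-level features is omitted, and the thresholds are illustrative:

```python
import numpy as np

def local_maxima(p: np.ndarray, thresh: float = 0.5) -> np.ndarray:
    """Interior indices that are probability peaks or exceed a high threshold."""
    peaks = [t for t in range(1, len(p) - 1)
             if (p[t] > p[t - 1] and p[t] > p[t + 1]) or p[t] > thresh]
    return np.array(peaks, dtype=int)

def assemble_proposals(p_start: np.ndarray, p_end: np.ndarray,
                       max_len: int = 64):
    """Pair every candidate start boundary with every later end boundary
    within a duration budget, scored by the product of boundary
    probabilities; proposal-level re-ranking would follow this step."""
    proposals = []
    for s in local_maxima(p_start):
        for e in local_maxima(p_end):
            if s < e <= s + max_len:
                proposals.append((s, e, p_start[s] * p_end[e]))
    return sorted(proposals, key=lambda x: -x[2])

p_s = np.array([0.1, 0.8, 0.2, 0.1, 0.3, 0.1])
p_e = np.array([0.1, 0.1, 0.2, 0.7, 0.2, 0.6])
print(assemble_proposals(p_s, p_e))  # [(1, 3, 0.56)]
```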
Video moment retrieval aims to search for the moment most relevant to a given language query. However, most existing methods in this community often require temporal boundary annotations, which are expensive and time-consuming to label. Hence, weakly supervised methods have recently been proposed that use only coarse video-level labels. Despite being effective, these methods usually process moment candidates independently, ignoring the critical issue of natural temporal dependencies between candidates at different temporal scales. To address this issue, we propose a multi-scale 2D representation learning method for weakly supervised video moment retrieval. Specifically, we first construct a two-dimensional map for each temporal scale to capture the temporal dependencies between candidates; the two dimensions of the map represent the start and end time points of these candidates. We then use a learnable convolutional neural network to select the top-K candidates from each scale-varied map. With a newly designed moment evaluation module, we obtain alignment scores for the selected candidates. Finally, the similarity between captions and language queries serves as supervision to further train the candidate selector. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions, demonstrate that our method achieves state-of-the-art results.
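A minimal sketch of the 2D map construction for one temporal scale follows, assuming mean pooling as the candidate feature (the paper's pooling choice may differ):

```python
import torch

def build_2d_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """Build a 2D candidate map from per-clip features.

    clip_feats: (T, D). Cell (i, j) with i <= j holds a pooled feature
    of the candidate moment spanning clips i..j; cells with i > j stay
    zero. Prefix sums give O(1) mean pooling for any span."""
    T, D = clip_feats.shape
    prefix = torch.cat([torch.zeros(1, D), clip_feats.cumsum(0)], dim=0)
    fmap = torch.zeros(T, T, D)
    for i in range(T):
        for j in range(i, T):
            fmap[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return fmap  # convolve over this map to score/select top-K candidates

fmap = build_2d_map(torch.randn(16, 32))
print(fmap.shape)  # torch.Size([16, 16, 32])
```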
To balance annotation labor against the granularity of supervision, single-frame annotation has been introduced in temporal action localization. It provides a rough temporal location for an action but implicitly overstates the supervision from the annotated frame during training, leading to confusion between actions and backgrounds, i.e., action incompleteness and background false positives. To tackle these two challenges, in this work, we present the Snippet Classification model and the Dilation-Erosion module. In the Dilation-Erosion module, we expand the potential action segments with a loose criterion to alleviate the problem of action incompleteness, and then remove the background from the potential action segments to alleviate the problem of background false positives. Relying on the single-frame annotation and the output of the snippet classification, the Dilation-Erosion module mines pseudo snippet-level ground truth, hard backgrounds and evident backgrounds, which in turn further train the Snippet Classification model. This forms a cyclic dependency. Furthermore, we propose a new embedding loss to aggregate the features of action instances with the same label and separate the features of actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method. Code has been made publicly available (https://github.com/LingJun123/single-frame-TAL).
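A hedged sketch of the mining step on a 1D snippet-score sequence follows. Thresholds are illustrative; loose regions lacking a confident core are discarded entirely, which plays the role of erosion:

```python
import numpy as np

def dilate_erode(scores: np.ndarray, strict: float = 0.6,
                 loose: float = 0.3) -> np.ndarray:
    """Mine a pseudo action mask from per-snippet action scores.

    Dilation: grow each confident core (score >= strict) outwards while
    neighbours stay above a loose criterion, countering incompleteness.
    Regions with no confident core are never marked, which suppresses
    background false positives."""
    mask = np.zeros(len(scores), dtype=int)
    for t in np.flatnonzero(scores >= strict):   # confident cores
        lo, hi = t, t
        while lo > 0 and scores[lo - 1] >= loose:
            lo -= 1
        while hi < len(scores) - 1 and scores[hi + 1] >= loose:
            hi += 1
        mask[lo:hi + 1] = 1
    return mask

print(dilate_erode(np.array([0.1, 0.4, 0.7, 0.8, 0.5, 0.35, 0.1])))
# -> [0 1 1 1 1 1 0]
```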
Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme that relies heavily on numerous hand-crafted components. In this paper, by formulating dense caption generation as a set prediction task, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC). In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively improves the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) without relying on heuristic non-maximum suppression or a recurrent event sequence selection network to remove redundancy, PDVC directly produces an event set of appropriate size; (2) in contrast to two-stage schemes, we feed enhanced representations of event queries into the localization head and the caption head in parallel, making the two subtasks deeply interrelated and mutually promoted through optimization; (3) without bells and whistles, extensive experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results, surpassing state-of-the-art two-stage methods when its localization accuracy is on par with them. Code is available at https://github.com/ttengwang/pdvc.
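A minimal sketch of the event-counter idea follows, assuming a linear counting head over max-pooled decoder queries; the sizes, pooling choice, and names are illustrative:

```python
import torch
import torch.nn as nn

class EventCounter(nn.Module):
    """Predict how many events the video contains from the decoder
    queries, then keep that many of the highest-scoring queries; a
    sketch of the set-prediction idea, not the paper's exact head."""

    def __init__(self, d_model: int = 256, max_events: int = 10):
        super().__init__()
        self.count_head = nn.Linear(d_model, max_events + 1)  # classes 0..max

    def forward(self, queries: torch.Tensor, scores: torch.Tensor):
        # queries: (N, d_model) decoder outputs; scores: (N,) confidences
        pooled = queries.max(dim=0).values           # holistic video summary
        n_events = int(self.count_head(pooled).argmax())
        keep = scores.topk(max(n_events, 1)).indices
        return keep                                  # indices of kept queries

counter = EventCounter()
queries, scores = torch.randn(30, 256), torch.rand(30)
print(counter(queries, scores))
```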
Temporal action localization has long been studied in computer vision. Existing state-of-the-art action localization methods divide each video into multiple action units (i.e., proposals in two-stage methods and segments in one-stage methods) and then act on each of them individually, without explicitly exploiting their relations during learning. In this paper, we claim that the relations between action units play an important role in action localization, and that a more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view over its related context. To this end, we propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods, covering both two-stage and one-stage paradigms. Specifically, we first construct a graph in which each action unit is represented as a node and the relation between two action units as an edge. We use two types of relations: one for capturing the temporal connections between different action units, and the other for characterizing their semantic relationships. For the temporal connections in two-stage methods in particular, we further explore two different kinds of edges: one connecting overlapping action units and the other connecting surrounding but disjoint units. On the constructed graph, we apply graph convolutional networks (GCNs) to model the relations among different action units, which is able to learn more informative representations that enhance action localization. Experimental results show that our GCM consistently improves the performance of existing action localization methods, including two-stage methods (e.g., CBR and R-C3D) and one-stage methods (e.g., D-SSAD), verifying the generality and effectiveness of GCM.
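A hedged sketch of the graph construction and one graph-convolution step follows, using only the temporal-IoU edge type; the semantic edges and the surrounding-but-disjoint edges would be added analogously:

```python
import torch
import torch.nn as nn

def proposal_adjacency(segs: torch.Tensor) -> torch.Tensor:
    """Edges between action units: overlapping units are connected with
    weight equal to their temporal IoU. segs: (N, 2) [start, end] rows."""
    s, e = segs[:, 0], segs[:, 1]
    inter = (torch.min(e[:, None], e[None]) -
             torch.max(s[:, None], s[None])).clamp(min=0)
    union = (e - s)[:, None] + (e - s)[None] - inter
    A = inter / union.clamp(min=1e-6)
    A = A + torch.eye(len(segs))           # self-loops
    return A / A.sum(dim=1, keepdim=True)  # row-normalise

class GCMLayer(nn.Module):
    """One graph-convolution step over action-unit features: each unit
    aggregates context from its temporally related units."""
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.proj(A @ x))

segs = torch.tensor([[0., 4.], [2., 6.], [10., 12.]])
ctx = GCMLayer(128)(torch.randn(3, 128), proposal_adjacency(segs))
```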
Dense video captioning aims to identify the events of interest in an input video and generate a descriptive caption for each event. Prior work usually follows a two-stage generation process that first proposes a segment for each event and then renders a caption for each identified segment. Recent advances in large-scale sequence-generation pretraining have seen great success in unifying task formulations across various tasks, but so far, more complex tasks such as dense video captioning have not been able to fully leverage this powerful paradigm. In this work, we show how to model the two subtasks of dense video captioning jointly as one sequence generation task, predicting the events and the corresponding descriptions simultaneously. Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks such as end-to-end dense video captioning integrated into large-scale pretrained models.
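A minimal sketch of the single-sequence formulation follows, assuming quantised time tokens interleaved with caption text; the token format is an assumption, not the paper's exact vocabulary:

```python
def serialize_events(events, num_bins: int = 100):
    """Flatten (start, end, caption) triples into one target sequence.

    Timestamps are quantised into special tokens <0>..<num_bins-1> over
    the video's duration, so a single text decoder can emit
    localisation and description together."""
    duration = max(e for _, e, _ in events)
    seq = []
    for start, end, caption in sorted(events):
        s_tok = f"<{int(start / duration * (num_bins - 1))}>"
        e_tok = f"<{int(end / duration * (num_bins - 1))}>"
        seq.extend([s_tok, e_tok, caption])
    return " ".join(seq)

events = [(0.0, 12.5, "crack two eggs into a bowl"),
          (12.5, 40.0, "whisk the eggs with milk")]
print(serialize_events(events))
# <0> <30> crack two eggs into a bowl <30> <99> whisk the eggs with milk
```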
Detecting actions in untrimmed videos is an important yet challenging task. In this paper, we present the structured segment network (SSN), a novel framework which models the temporal structure of each action instance via a structured temporal pyramid. On top of the pyramid, we further introduce a decomposed discriminative model comprising two classifiers, respectively for classifying actions and determining completeness. This allows the framework to effectively distinguish positive proposals from background or incomplete ones, thus leading to both accurate recognition and localization. These components are integrated into a unified network that can be efficiently trained in an end-to-end fashion. Additionally, a simple yet effective temporal action proposal scheme, dubbed temporal actionness grouping (TAG), is devised to generate high quality action proposals. On two challenging benchmarks, THUMOS14 and ActivityNet, our method remarkably outperforms previous state-of-the-art methods, demonstrating superior accuracy and strong adaptivity in handling actions with various temporal structures.
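A minimal sketch of stage-structured pooling over an augmented proposal follows; the 1:2:1 stage split and mean pooling are illustrative assumptions:

```python
import torch

def structured_pyramid_feature(snippets: torch.Tensor) -> torch.Tensor:
    """Stage-structured pooling over an augmented proposal.

    snippets: (T, D) features covering [start context | course |
    end context]. Each stage is average-pooled and the three results
    are concatenated, preserving the temporal structure that the
    activity and completeness classifiers rely on."""
    T = snippets.shape[0]
    a, b = T // 4, 3 * T // 4
    stages = [snippets[:a], snippets[a:b], snippets[b:]]
    return torch.cat([s.mean(dim=0) for s in stages])

feat = structured_pyramid_feature(torch.randn(32, 64))
print(feat.shape)  # torch.Size([192]) -> fed to activity & completeness heads
```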
Given an untrimmed video and a natural language query, video sentence grounding aims to localize the target temporal moment in the video. Existing methods mainly tackle this task by matching and aligning the semantics of the descriptive sentence and video segments at a single temporal resolution, while neglecting the temporal consistency of video content across different resolutions. In this work, we propose a novel multi-resolution temporal video sentence grounding network, MRTNet, which consists of a multi-modal feature encoder, a Multi-Resolution Temporal (MRT) module, and a predictor module. The MRT module is an encoder-decoder network whose decoder features are used in conjunction with Transformers to predict the final start and end timestamps. Notably, our MRT module is hot-pluggable, meaning it can be seamlessly incorporated into any anchor-free model. Besides, we utilize a hybrid loss to supervise the cross-modal features in the MRT module for more accurate grounding at three scales: frame-level, clip-level, and sequence-level. Extensive experiments on three prevalent datasets show the effectiveness of MRTNet.
Temporal video segmentation and classification have been advanced greatly by public benchmarks in recent years. However, such research still mainly focuses on human actions and fails to describe videos in a holistic view. In addition, previous research tends to pay much attention to visual information while ignoring the multi-modal nature of videos. To fill this gap, we construct the Tencent `Ads Video Segmentation'~(TAVS) dataset in the ads domain to escalate multi-modal video analysis to a new level. TAVS describes videos from three independent perspectives as `presentation form', `place', and `style', and contains rich multi-modal information such as video, audio, and text. TAVS is organized hierarchically in semantic aspects for comprehensive temporal video segmentation, with three levels of categories for multi-label classification, e.g., `place' - `working place' - `office'. Therefore, TAVS is distinguished from previous temporal segmentation datasets by its multi-modal information, holistic view of categories, and hierarchical granularities. It includes 12,000 videos, 82 classes, 33,900 segments, 121,100 shots, and 168,500 labels. Accompanying TAVS, we also present a strong multi-modal video segmentation baseline coupled with multi-label class prediction. Extensive experiments are conducted to evaluate our proposed method as well as existing representative methods, revealing the key challenges of our dataset TAVS.
YouTube users looking for instructions for a specific task may spend a long time browsing content, trying to find the right video that matches their needs. Creating a visual summary (an abridged version of a video) provides viewers with a quick overview and massively reduces search time. In this work, we focus on summarizing instructional videos, an under-explored area of video summarization. In contrast to generic videos, instructional videos can be parsed into semantically meaningful segments that correspond to important steps of the demonstrated task. Existing video summarization datasets rely on manual frame-level annotations, making them subjective and limited in size. To overcome this, we first automatically generate pseudo summaries for a corpus of instructional videos by exploiting two key assumptions: (i) relevant steps are likely to appear in multiple videos of the same task (task relevance), and (ii) they are more likely to be described verbally by the demonstrator (cross-modal saliency). We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer. Using pseudo summaries as weak supervision, our network constructs a visual summary for an instructional video given only the video and transcribed speech. To evaluate our model, we collect a high-quality test set, WikiHow Summaries, by scraping WikiHow articles that contain video demonstrations and visual depictions of steps, allowing us to obtain ground-truth summaries. We outperform several baselines and a state-of-the-art video summarization model on this new benchmark.
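A hedged sketch of how the two assumptions could be turned into segment scores follows; the max-similarity relevance measure and the linear combination are illustrative assumptions, not the paper's scoring rule:

```python
import numpy as np

def pseudo_summary_scores(seg_feats: np.ndarray,
                          other_video_feats: np.ndarray,
                          speech_sims: np.ndarray, alpha: float = 0.5):
    """Score segments of one video for inclusion in a pseudo summary.

    Task relevance: a segment close to segments of other videos of the
    same task is likely an important step (max cosine similarity across
    the task corpus). Cross-modal saliency: segments the demonstrator
    talks about score higher (precomputed segment-to-speech similarity)."""
    def normed(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    relevance = (normed(seg_feats) @ normed(other_video_feats).T).max(axis=1)
    return alpha * relevance + (1 - alpha) * speech_sims

scores = pseudo_summary_scores(np.random.rand(5, 64),
                               np.random.rand(20, 64), np.random.rand(5))
```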
We introduce a method to learn unsupervised semantic visual information, based on the premise that complex events (e.g., minutes long) can be decomposed into simpler events (e.g., a few seconds), and that these simpler events are shared across several complex events. We split a long video into short frame sequences and extract their latent representations with a three-dimensional convolutional neural network. A clustering method is used to group the representations, producing a visual codebook (i.e., a long video is represented by a sequence of integers given by the cluster labels). A dense representation is learned by encoding the co-occurrence probability matrix over the codebook entries. We demonstrate how this representation can leverage the performance of the dense video captioning task using only visual features. As a result of this approach, we are able to replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals of comparable quality. Furthermore, concatenating our descriptor with the visual signal in a vanilla transformer method achieves state-of-the-art captioning performance compared with methods that explore only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
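A minimal sketch of the codebook-plus-co-occurrence pipeline follows, using k-means and a window-based co-occurrence count; the values of k and the window size are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook_representation(chunk_feats: np.ndarray, k: int = 128,
                                  window: int = 2):
    """Visual codebook + co-occurrence statistics.

    chunk_feats: (N, D) latent features of short video chunks (e.g.,
    from a 3D CNN). K-means assigns each chunk an integer code, so a
    long video becomes a code sequence; a window-based co-occurrence
    matrix over the sequence then yields dense code vectors, analogous
    to count-based word embeddings."""
    codes = KMeans(n_clusters=k, n_init=10).fit_predict(chunk_feats)
    cooc = np.zeros((k, k))
    for i, c in enumerate(codes):
        for j in range(max(0, i - window), min(len(codes), i + window + 1)):
            if j != i:
                cooc[c, codes[j]] += 1
    # Row-normalised co-occurrence probabilities serve as dense code vectors.
    probs = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)
    return codes, probs

codes, vecs = build_codebook_representation(np.random.rand(500, 64), k=16)
```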