Anticipating future events is an essential capability of intelligent systems and embodied AI. However, compared with traditional recognition tasks, the inherent uncertainty of the future and the demand for reasoning make anticipation tasks highly challenging and far from solved. Prior methods in this area typically focus on model architecture design and pay little attention to how to train an anticipation model with a proper learning policy. To this end, in this work we propose a novel training scheme called Dynamic Context Removal (DCR), which dynamically schedules the visibility of the observed future during learning. It follows a human-like curriculum learning process: event context is gradually removed to increase the anticipation difficulty until the final anticipation target is reached. Our learning scheme is plug-and-play and easy to integrate into any reasoning model, including Transformers and LSTMs, with advantages in both effectiveness and efficiency. In extensive experiments, the proposed method achieves state-of-the-art results on four widely used benchmarks. Our code and models will be publicly released at https://github.com/allenxuuu/dcr.
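The curriculum described above, gradually removing observed context until only the anticipation-protocol frames remain, can be sketched as a simple schedule. This is a minimal illustration under assumptions: `visible_context`, the linear pacing, and all numbers are hypothetical, not the paper's actual schedule.

```python
def visible_context(num_frames: int, epoch: int, total_epochs: int,
                    final_gap: int) -> int:
    """Hypothetical curriculum: start with the full observation and
    linearly remove context frames until only num_frames - final_gap
    frames remain visible, matching the final anticipation gap."""
    progress = min(epoch / max(total_epochs - 1, 1), 1.0)
    removed = round(progress * final_gap)
    return num_frames - removed

# Early in training the model sees almost everything; by the last epoch
# it sees only the frames the anticipation protocol allows.
schedule = [visible_context(32, e, 10, 8) for e in range(10)]
```

Each epoch's difficulty is monotonically non-decreasing, which is the essence of the curriculum idea.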
Future activity anticipation is a challenging problem in egocentric vision. As the standard paradigm for future activity anticipation, recursive sequence prediction suffers from error accumulation. To address this problem, we propose a simple and effective self-regulated learning framework, which aims to regulate the intermediate representation consecutively to produce a representation that (a) emphasizes the novel information in the frame at the current time stamp relative to previously observed content, and (b) reflects its correlation with previously observed frames. The former is achieved by minimizing a contrastive loss, and the latter by a dynamic reweighting mechanism that attends to informative frames in the observed content, based on similarity comparison between the feature of the current frame and those of the observed frames. The learned final video representation can be further enhanced by multi-task learning, which performs joint feature learning on the target activity labels and on automatically detected action and object class tokens. SRL sharply outperforms existing state-of-the-art methods in most cases on two egocentric video datasets and two third-person video datasets. Its effectiveness is also verified by the experimental finding that the action and object concepts that support activity semantics can be accurately identified.
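The dynamic reweighting step above can be illustrated with a small sketch: each observed frame is weighted by a softmax over its similarity to the current frame. The function names, the temperature, and the plain cosine similarity are illustrative assumptions; the paper's mechanism operates on learned features inside the network.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def dynamic_weights(current, observed, temperature=0.5):
    """Weight each observed frame by a softmax over its cosine similarity
    to the current frame (a sketch of similarity-based reweighting)."""
    sims = [cosine(current, f) / temperature for f in observed]
    m = max(sims)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [e / z for e in exps]

# The observed frame aligned with the current one receives more weight.
w = dynamic_weights([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```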
In egocentric videos, actions occur in quick succession. We capitalize on an action's temporal context and propose a method that learns to attend to surrounding actions in order to improve recognition performance. To incorporate temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities, with an explicit language model providing action-sequence context to enhance the predictions. We test our approach on the EPIC-Kitchens and EGTEA datasets, reporting state-of-the-art performance. Our ablations showcase the advantage of exploiting temporal context, as well as of incorporating the audio input modality and the language model to rescore predictions. Code and models are at: https://github.com/ekazakos/mtcn.
Anticipating future actions based on video observations is an important task in video understanding, which would be useful for some precautionary systems that require response time to react before an event occurs. Since the input in action anticipation consists only of pre-action frames, models do not have enough information about the target action; moreover, similar pre-action frames may lead to different futures. Consequently, any solution using existing action recognition models can only be suboptimal. Recently, researchers have proposed using a longer video context to remedy the insufficient information in pre-action intervals, as well as self-attention to query past relevant moments to address the anticipation problem. However, the indirect use of video input features as the query might be inefficient, as it only serves as a proxy for the anticipation goal. To this end, we propose an inductive attention model, which transparently uses the prior prediction as the query to derive the anticipation result by induction from past experience. Our method naturally considers the uncertainty of multiple futures via the many-to-many association. On large-scale egocentric video datasets, our model not only shows consistently better performance than the state of the art using the same backbone, and is competitive with methods that employ a stronger backbone, but is also more efficient, using fewer model parameters.
While today's video recognition systems parse snapshots or short clips accurately, they cannot connect the dots and reason across a longer range of time yet. Most existing video architectures can only process <5 seconds of a video without hitting the computation or memory bottlenecks. In this paper, we propose a new strategy to overcome this challenge. Instead of trying to process more frames at once like most existing methods, we propose to process videos in an online fashion and cache "memory" at each iteration. Through the memory, the model can reference prior context for long-term modeling, with only a marginal cost. Based on this idea, we build MeMViT, a Memory-augmented Multiscale Vision Transformer, that has a temporal support 30x longer than existing models with only 4.5% more compute; traditional methods need >3,000% more compute to do the same. On a wide range of settings, the increased temporal support enabled by MeMViT brings large gains in recognition accuracy consistently. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets. Code and models are available at https://github.com/facebookresearch/memvit.
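The online processing with cached "memory" described above can be sketched as a bounded feature cache: each iteration processes one clip and can attend over everything already cached, so temporal support grows while per-step cost stays flat. The `ClipMemory` class and its sizes are illustrative assumptions, not MeMViT's actual implementation.

```python
from collections import deque

class ClipMemory:
    """Bounded cache of per-clip features: prior context is available at
    each step at only the marginal cost of reading the cache."""
    def __init__(self, max_clips: int):
        self.cache = deque(maxlen=max_clips)

    def step(self, clip_feature):
        context = list(self.cache)       # prior context for long-term modeling
        self.cache.append(clip_feature)  # cache (detached) for future steps
        return context

# Temporal support grows with each iteration until the cache is full.
mem = ClipMemory(max_clips=3)
contexts = [len(mem.step(f)) for f in range(5)]
```

The bounded `maxlen` is what keeps memory and compute constant per iteration regardless of video length.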
The objective of this work is to learn object-centric video representations, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a new object-centric video recognition model based on a transformer architecture. The model learns a set of object-centric summary vectors for the video and uses these vectors to fuse the visual and spatio-temporal trajectory "modalities" of the video clip. We also introduce a novel trajectory contrastive loss to further enhance the objectness of these summary vectors. Through experiments on four datasets - SomethingSomething-V2, SomethingElse, Action Genome and EpicKitchens - we show that the object-centric model outperforms prior video representations (both object-agnostic and object-aware) when: (1) classifying actions on unseen objects and unseen environments; (2) low-shot learning of novel classes; (3) linear probing to other downstream tasks; and (4) for standard action classification.
Pre-training models for learning transferable video-text representations has attracted much attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with text, but it is inefficient since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high efficiency for retrieval via a novel pretext task, Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features can be properly established. BridgeFormer can be removed for downstream retrieval, rendering an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on five datasets for text-to-video retrieval under different experimental settings (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly outperforms its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
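The question construction above can be illustrated with a rough sketch: erase one noun or verb from a caption to form a "question" whose answer is the erased word. The function name, the `[?]` placeholder, and the word-set inputs are illustrative assumptions; the actual MCQ task operates on features, with the bridge module recovering the answer from video.

```python
def make_questions(caption: str, nouns: set, verbs: set):
    """Build (question, answer) pairs by erasing one noun or verb at a
    time from the caption (a rough sketch of the MCQ pretext task)."""
    tokens = caption.split()
    questions = []
    for i, tok in enumerate(tokens):
        if tok in nouns or tok in verbs:
            masked = tokens[:i] + ["[?]"] + tokens[i + 1:]
            questions.append((" ".join(masked), tok))
    return questions

# Every content word yields one question; answering noun questions pushes
# the video encoder toward regional content, verb questions toward dynamics.
qs = make_questions("person cuts tomato", {"person", "tomato"}, {"cuts"})
```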
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
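A back-of-the-envelope count makes the bottleneck argument above concrete: full pairwise cross-modal attention scales with the product of the two token counts, while routing everything through B bottleneck latents scales with B times their sum. The token counts below (1568 video patches, 256 audio patches, 4 bottlenecks) are illustrative assumptions.

```python
def attention_pairs(n_audio: int, n_video: int, n_bottleneck: int = 0):
    """Count cross-modal query-key pairs: full pairwise attention versus
    routing all cross-modal exchange through bottleneck latents."""
    if n_bottleneck == 0:                     # full pairwise cross-attention
        return 2 * n_audio * n_video
    # each modality exchanges information only with the bottleneck tokens
    return 2 * n_bottleneck * (n_audio + n_video)

full = attention_pairs(256, 1568)
bottleneck = attention_pairs(256, 1568, n_bottleneck=4)
```

With a handful of bottleneck tokens, cross-modal exchange is cut by roughly 50x in this toy count, which is the intuition behind the reduced computational cost.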
Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated by the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we first delve into how videos are handled at the input level. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
A streaming video recognition model reasons about a video and its actions at every frame. A good streaming recognition model captures both the long-term dynamics and the short-term changes of the video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address this, we reformulate cross-attention in a video transformer through the lens of kernels and apply two kinds of temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention can be reused from frame to frame and requires only a constant-time update per frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes arbitrarily long inputs with constant caching and computation overhead. Specifically, it runs 6x faster than an equivalent sliding-window-based transformer with 2,048 frames in the streaming setting. Furthermore, thanks to the increased temporal span, TeSTra achieves state-of-the-art results on THUMOS'14 and EPIC-Kitchens-100, two standard benchmarks for online action detection and action anticipation, respectively. A real-time version of TeSTra outperforms all prior methods on the THUMOS'14 dataset.
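The constant-time-per-frame property of a box kernel can be sketched with a running sum: when a frame enters the window, add it; when one leaves, subtract it. This toy aggregates scalar values rather than attention keys and values, so it is only an illustration of the recurrence, not TeSTra's attention; the class and window size are assumptions.

```python
from collections import deque

class BoxKernelStream:
    """Constant-time streaming aggregate under a box kernel: keep a
    running sum of the last `window` frame values, updated in O(1)."""
    def __init__(self, window: int):
        self.window = window
        self.buf = deque()
        self.total = 0.0

    def update(self, value: float) -> float:
        self.buf.append(value)
        self.total += value
        if len(self.buf) > self.window:
            self.total -= self.buf.popleft()  # drop the frame leaving the box
        return self.total / len(self.buf)     # windowed average, O(1) per step

s = BoxKernelStream(window=3)
outs = [s.update(v) for v in [3.0, 6.0, 9.0, 12.0]]
```

A Laplace kernel admits an even simpler recurrence (exponential decay of the running total), which is why both kernels support frame-to-frame reuse.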
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-text embedding learned contrastively with these additional auto-generated narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LaViLa obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LaViLa trained with only half the narrations from the Ego4D dataset outperforms baseline models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size.
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models simply focus on image-level pretraining and adaptation, which are limited for dynamic and complex video-level understanding tasks. To fill the gap, we present general video foundation models, InternVideo, by taking advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. All of these results effectively show the generality of our InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
Video recognition has been dominated by the end-to-end learning paradigm - first initializing a video recognition model with the weights of a pretrained image model, and then training it end-to-end on videos. This enables the video network to benefit from the pretrained image model. However, it requires substantial computation and memory resources for finetuning on videos, and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large-scale open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) - an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. In addition, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high-quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.
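The query-token idea above, a learned token attending over frozen per-frame features, can be sketched as scaled dot-product attention pooling. This is a single-head toy on plain lists; the real decoder uses multi-head attention over CLIP features, so the function and its inputs are assumptions for illustration only.

```python
import math

def attention_pool(query, frame_features):
    """One learned query token attending over frozen per-frame features;
    returns the attention-weighted sum (a sketch of the decoder idea)."""
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(len(query))
              for feat in frame_features]
    m = max(scores)                           # stabilize the softmax
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    dim = len(frame_features[0])
    return [sum(w[i] * frame_features[i][d] for i in range(len(w)))
            for d in range(dim)]

# The frame most aligned with the query dominates the pooled feature.
pooled = attention_pool([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
```

Because gradients flow only through the query and decoder, the CLIP backbone can stay frozen, which is the source of the training efficiency.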
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", capturing trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something V2, Diving48 and EPIC-Kitchens-100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/orvit/
Recent action recognition models have achieved impressive results by integrating objects, their locations and interactions. However, obtaining dense structured annotations for each frame is tedious and time-consuming, making these methods expensive to train and less scalable. At the same time, if a small set of annotated images is available, either within or outside the domain of interest, how can we leverage these for a downstream video task? We propose a learning framework, StructureViT (SViT for short), which demonstrates how utilizing the structure of a small number of images, available only during training, can improve a video model. SViT relies on two key insights. First, since both images and videos contain structured information, we enrich the model with a set of object tokens that can be used across images and videos. Second, the scene representations of individual frames in a video should "align" with those of still images. This is achieved via a frame-clip consistency loss, which ensures the flow of structured information between images and videos. We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges. SViT shows strong performance improvements on multiple video understanding tasks and datasets, and it won first place in the Ego4D CVPR'22 Object State Localization challenge. For code and pretrained models, visit the project page at https://eladb3.github.io/svit/
This work presents TeG, a self-supervised learning framework for exploring temporal granularity in learning video representations. In TeG, we sample a long clip from a video, along with a short clip that lies inside the long clip. We then extract their dense temporal embeddings. The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together the global embeddings of the two clips. Our study reveals the impact of temporal granularity with three major findings. 1) Different video tasks may require features of different temporal granularities. 2) Intriguingly, some tasks that are widely considered to require temporal awareness can actually be well addressed by temporally persistent features. 3) The flexibility of TeG yields state-of-the-art results on 8 video benchmarks, outperforming supervised pre-training in most cases.
In this paper, we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes. Accurately categorizing these activities requires not only recognizing the individual steps that compose the task, but also capturing their temporal dependencies. This problem is dramatically different from traditional action classification, where models are typically optimized on videos spanning a few seconds and manually trimmed to contain simple atomic actions. While step annotations could enable the training of models to recognize the individual steps of procedural activities, existing large-scale datasets in this area do not include such segment labels due to the prohibitive cost of manually annotating temporal boundaries in long videos. To address this issue, we propose to automatically identify steps in instructional videos by leveraging the distant supervision of a textual knowledge base (WikiHow) that includes detailed descriptions of the steps needed to carry out a wide variety of complex activities. Our method uses a language model to match automatically transcribed speech from the video to step descriptions in the knowledge base. We demonstrate that video models trained to recognize these automatically labeled steps (without manual supervision) yield a representation that achieves superior generalization performance on four downstream tasks: recognition of procedural activities, step classification, step forecasting and egocentric video classification.
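The distant-supervision matching step above can be sketched as nearest-neighbor assignment in an embedding space: each transcribed speech segment is labeled with its most similar step description, or left unlabeled below a threshold. The function, the dot-product similarity over toy vectors, and the threshold are assumptions; the paper uses a learned language model to produce the embeddings.

```python
def assign_steps(speech_vecs, step_vecs, threshold=0.5):
    """Match each transcribed-speech embedding to its most similar
    knowledge-base step embedding; low-similarity segments stay None."""
    labels = []
    for sv in speech_vecs:
        sims = [sum(a * b for a, b in zip(sv, step)) for step in step_vecs]
        best = max(range(len(sims)), key=sims.__getitem__)
        labels.append(best if sims[best] >= threshold else None)
    return labels

# Two steps in the knowledge base; three speech segments to label.
labels = assign_steps([[1.0, 0.0], [0.0, 1.0], [0.6, 0.6]],
                      [[1.0, 0.0], [0.0, 1.0]])
```

These automatically assigned step labels then stand in for manual segment annotations when training the video model.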
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos. Moreover, the use of human-generated annotations leads to models with biased learning and poor domain generalization and robustness. As an alternative, self-supervised learning provides a way for representation learning which does not require annotations and has shown promise in both image and video domains. Different from the image domain, learning video representations is more challenging due to the temporal dimension, which brings in motion and other environmental dynamics. This also provides opportunities for video-exclusive ideas that advance self-supervised learning in the video and multimodal domains. In this survey, we provide a review of existing approaches to self-supervised learning focusing on the video domain. We summarize these methods into four different categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and the potential future directions in this area.
Action prediction aims to infer the forthcoming human action from a partially observed video, which is challenging since only limited information is available from the early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose an adversarial memory network (AMemNet) to generate the "full video" feature conditioned on a partial-video query, from two new aspects. First, a key-value structured memory generator is designed to memorize different partial videos as key memories and to dynamically write full videos into value memories with a gating mechanism and query attention. Second, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full-video features upon adversarial training. The final prediction result of AMemNet is given by late fusion over the RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.
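The gated write into value memory described above can be sketched as a convex blend between the stored slot and a candidate update, with a sigmoid gate deciding how much to overwrite. The function and its inputs are illustrative assumptions; in the paper, the gate and candidate are produced by learned networks and trained adversarially.

```python
import math

def gated_write(old_value, candidate, gate_logit):
    """Gated update of one value-memory slot: a sigmoid gate blends the
    stored value with the candidate (a sketch of the write mechanism)."""
    g = 1.0 / (1.0 + math.exp(-gate_logit))   # sigmoid gate in (0, 1)
    return [g * c + (1.0 - g) * o for o, c in zip(old_value, candidate)]

# A neutral gate (logit 0) mixes old and new content equally.
slot = gated_write([0.0, 0.0], [1.0, 1.0], gate_logit=0.0)
```

A strongly positive gate logit overwrites the slot with the new full-video feature; a strongly negative one preserves what was memorized.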