In this report, we present the technical details of our submission to the 2022 EPIC-Kitchens Unsupervised Domain Adaptation (UDA) Challenge. Existing UDA methods align global features extracted from whole video clips across the source and target domains, but suffer from spatial redundancy of feature matching in video recognition. Motivated by the observation that, in most cases, a small image region in each video frame is informative enough for the action recognition task, we propose to exploit informative image regions to perform efficient domain alignment. Specifically, we first use a lightweight CNN to extract global information from the input two-stream video frames and select informative image patches with a differentiable, interpolation-based selection strategy. The global information from the video frames and the local information from the image patches are then processed by an existing video adaptation method, i.e., TA3N, to perform feature alignment for the source and target domains. Our method (without model ensembling) ranked fourth on this year's EPIC-Kitchens-100 test set.
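As an illustration of what a differentiable, interpolation-based patch selection can look like, here is a minimal PyTorch sketch (not the authors' implementation): a lightweight CNN scores spatial locations, a soft-argmax turns the score map into a patch centre, and grid_sample crops the patch with bilinear interpolation so gradients flow back into the scoring network. The patch size and layer widths are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPatchSelector(nn.Module):
    """Minimal sketch of differentiable patch selection (illustrative only)."""

    def __init__(self, patch_size=64):
        super().__init__()
        self.patch_size = patch_size
        # Lightweight CNN producing a single-channel saliency map.
        self.scorer = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, frames):
        # frames: (B, 3, H, W) video frames from one stream.
        B, _, H, W = frames.shape
        scores = self.scorer(frames)                     # (B, 1, h, w)
        probs = F.softmax(scores.flatten(1), dim=1)      # soft attention over locations
        h, w = scores.shape[-2:]

        # Soft-argmax: expected patch centre in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=frames.device)
        xs = torch.linspace(-1, 1, w, device=frames.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        cy = (probs * grid_y.flatten()).sum(dim=1)       # (B,)
        cx = (probs * grid_x.flatten()).sum(dim=1)       # (B,)

        # Sampling grid of size patch_size x patch_size around the centre,
        # cropped with bilinear interpolation (differentiable w.r.t. cx, cy).
        p = self.patch_size
        lin = torch.linspace(-1, 1, p, device=frames.device) * (p / max(H, W))
        dy, dx = torch.meshgrid(lin, lin, indexing="ij")
        grid = torch.stack(
            (cx.view(B, 1, 1) + dx, cy.view(B, 1, 1) + dy), dim=-1
        )                                                # (B, p, p, 2), (x, y) order
        patches = F.grid_sample(frames, grid, align_corners=False)
        return patches                                   # (B, 3, p, p)

frames = torch.randn(4, 3, 224, 224)
patches = SoftPatchSelector()(frames)
print(patches.shape)  # torch.Size([4, 3, 64, 64])
```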
In this report, we describe the technical details of our submission to the EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge. To tackle the domain shift that arises in the UDA setting, we first exploit a recent domain generalization (DG) technique called Relative Norm Alignment (RNA). Second, we extend this approach to work on unlabelled target data, allowing the model to adapt to the target distribution in an unsupervised fashion. To this end, we incorporate UDA algorithms such as multi-level adversarial alignment and attentive entropy. By analyzing the challenge setup, we notice the presence of a secondary concurrent shift in the data, commonly referred to as environmental bias, caused by the existence of different environments, i.e., kitchens. To handle these two shifts (environmental and temporal), we extend our system to perform multi-source multi-target domain adaptation. Finally, we employ different models in our final proposal to exploit the potential of popular video architectures, and introduce two additional losses for the ensemble adaptation. Our submission (entry "PLNET") is visible on the leaderboard and ranked second on "verbs" and third on both "nouns" and "actions".
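The Relative Norm Alignment idea balances the feature norms of two modalities (e.g., RGB and audio) so that neither dominates the representation. The snippet below is a hedged sketch of such a loss on a batch of features from two modalities; the exact formulation used in the submission is the one given in the RNA paper, and the loss weight in a full objective is an assumption.

```python
import torch

def relative_norm_alignment_loss(feat_a: torch.Tensor, feat_v: torch.Tensor) -> torch.Tensor:
    """Sketch of an RNA-style loss: penalize the imbalance between the mean
    L2 norms of the two modality feature batches.

    feat_a, feat_v: (B, D) feature tensors from the two modalities.
    """
    mean_norm_a = feat_a.norm(p=2, dim=1).mean()
    mean_norm_v = feat_v.norm(p=2, dim=1).mean()
    # Drive the norm ratio towards 1 so no modality dominates the representation.
    return (mean_norm_a / (mean_norm_v + 1e-8) - 1.0) ** 2

# Example: one modality has much larger norms, so the loss is large.
feat_a, feat_v = torch.randn(8, 512), torch.randn(8, 512) * 3.0
print(float(relative_norm_alignment_loss(feat_a, feat_v)))
```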
Although action recognition has achieved impressive results over recent years, both the collection and the annotation of video training data remain time-consuming and cost-intensive. Image-to-video adaptation has therefore been proposed to exploit labelling-free web image sources for adapting to unlabeled target videos. This poses two major challenges: (1) the spatial domain shift between web images and video frames; (2) the modality gap between image and video data. To address these challenges, we propose Cycle Domain Adaptation (CycDA), a cycle-based approach for unsupervised image-to-video domain adaptation that, on the one hand, leverages the joint spatial information in images and videos and, on the other hand, trains an independent spatio-temporal model to bridge the modality gap. We alternate between spatial and spatio-temporal learning, with knowledge transfer between the two in each cycle. We evaluate our approach on benchmark datasets for image-to-video adaptation as well as for mixed-source domain adaptation, achieving state-of-the-art results and demonstrating the benefits of our cyclic adaptation.
Existing video domain adaptation (DA) methods need to store all temporal combinations of video frames or pair the source and target videos, which is memory-costly and cannot scale to long videos. To address these limitations, we propose the following memory-efficient graph-based video DA approach. First, our method models each source or target video as a graph: nodes represent video frames, and edges represent the temporal or visual similarity relationships between frames. We use a graph attention network to learn the weight of individual frames and simultaneously align the source and target videos into a domain-invariant graph feature space. Instead of storing a large number of sub-videos, our method only constructs one graph with a graph attention mechanism per video, greatly reducing the memory cost. Extensive experiments show that, compared with state-of-the-art methods, we achieve superior performance while reducing the memory cost significantly.
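A minimal sketch of such a per-video frame graph, written in plain PyTorch rather than a graph library: frame features are nodes, attention scores play the role of edge weights, and the attention-pooled output is a single video-level feature on which a domain-alignment loss could operate. All layer sizes are assumptions, and this is not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameGraphAttention(nn.Module):
    """Single-head graph attention over the frames of one video (illustrative)."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)   # scores an edge from node i to node j

    def forward(self, frame_feats):
        # frame_feats: (T, D) features of T frames, treated as a fully connected graph.
        T, D = frame_feats.shape
        h = self.proj(frame_feats)                              # (T, D)
        # Pairwise edge scores e_ij from concatenated node features.
        hi = h.unsqueeze(1).expand(T, T, D)
        hj = h.unsqueeze(0).expand(T, T, D)
        e = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1))).squeeze(-1)  # (T, T)
        alpha = F.softmax(e, dim=-1)                            # attention over neighbours
        nodes = alpha @ h                                       # (T, D) updated node features
        # Frame weights for pooling: how much attention each frame receives overall.
        frame_weight = F.softmax(alpha.sum(dim=0), dim=0)       # (T,)
        video_feat = (frame_weight.unsqueeze(1) * nodes).sum(0) # (D,) graph-level feature
        return video_feat

video_feat = FrameGraphAttention()(torch.randn(16, 256))
print(video_feat.shape)  # torch.Size([256])
```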
Domain adaptation (DA) approaches address domain shift and enable networks to be applied to different scenarios. Although various image DA approaches have been proposed in recent years, there is limited research towards video DA. This is partly due to the complexity of adapting the different modalities of features in videos, which include the correlation features extracted as long-term dependencies of pixels across spatiotemporal dimensions. The correlation features are highly associated with action classes and have proven their effectiveness for accurate video feature extraction through the supervised action recognition task. Yet correlation features of the same action would differ across domains due to domain shift. Therefore, we propose a novel Adversarial Correlation Adaptation Network (ACAN) to align action videos by aligning pixel correlations. ACAN aims to minimize the distribution discrepancy of correlation information, termed Pixel Correlation Discrepancy (PCD). Additionally, video DA research is also limited by the lack of cross-domain video datasets with larger domain shifts. We therefore introduce a novel HMDB-ARID dataset with a larger domain shift caused by a larger statistical difference between domains. This dataset is built in an effort to leverage current datasets for dark video classification. Empirical results demonstrate the state-of-the-art performance of our proposed ACAN on both existing and the new video DA datasets.
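The precise definition of the Pixel Correlation Discrepancy is given in the paper; purely as an illustration of the idea, the sketch below builds a correlation descriptor from a spatio-temporal feature map (a channel Gram matrix, used here as a stand-in for pixel correlations) and measures the source-target gap with a simple mean-matching discrepancy. It is not the authors' formulation.

```python
import torch

def correlation_features(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, T, H, W) spatio-temporal feature map.
    Returns a (B, C*C) correlation descriptor: the channel Gram matrix over all
    spatio-temporal positions, a stand-in for pixel-correlation features."""
    B, C = feat.shape[:2]
    flat = feat.reshape(B, C, -1)                        # (B, C, T*H*W)
    gram = flat @ flat.transpose(1, 2) / flat.shape[-1]  # (B, C, C)
    return gram.reshape(B, -1)

def correlation_discrepancy(src_feat: torch.Tensor, tgt_feat: torch.Tensor) -> torch.Tensor:
    """Simple mean-matching gap between source and target correlation descriptors."""
    return (correlation_features(src_feat).mean(0)
            - correlation_features(tgt_feat).mean(0)).pow(2).sum()

src = torch.randn(4, 64, 8, 14, 14)   # source clip features
tgt = torch.randn(4, 64, 8, 14, 14)   # target clip features
print(float(correlation_discrepancy(src, tgt)))
```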
Currently, action recognition is mostly performed on video data processed by CNNs. We investigate whether the representation process of CNNs can also be leveraged for multimodal action recognition by incorporating image-based audio representations of actions into the task. To this end, we propose the Multimodal Audio-Image and Video Action Recognizer (MAIVAR), a CNN-based audio-image-to-video fusion model that relies on both the video and audio modalities to achieve superior action recognition performance. MAIVAR extracts meaningful image representations of audio and fuses them with the video representation to obtain better performance than either modality alone on a large-scale action recognition dataset.
Over the last few years, Unsupervised Domain Adaptation (UDA) techniques have acquired remarkable importance and popularity in computer vision. However, compared with the extensive literature available for images, the video domain remains relatively unexplored. Meanwhile, the performance of action recognition models is severely affected by domain shift. In this paper, we propose a simple and novel UDA approach for video action recognition. Our approach leverages recent advances in spatio-temporal transformers to build a robust source model that generalizes better to the target domain. Furthermore, our architecture learns domain-invariant features thanks to the introduction of a novel alignment loss term derived from the Information Bottleneck principle. We report results on two video action recognition benchmarks for UDA, showing state-of-the-art performance on HMDB$\leftrightarrow$UCF, as well as on Kinetics$\rightarrow$NEC-Drone, which is more challenging. This demonstrates the effectiveness of our method in handling different levels of domain shift. The source code is available at https://github.com/vturrisi/udavt.
Both visual and auditory information are valuable for determining salient regions in videos. Deep convolutional neural networks (CNNs) have demonstrated a strong capability for the audio-visual saliency prediction task. Due to various factors such as shooting scenes and weather, there often exists a moderate distribution discrepancy between the source training data and the target testing data. The domain discrepancy induces performance degradation of CNN models on the target testing data. This paper makes an early attempt to tackle the unsupervised domain adaptation problem for audio-visual saliency prediction. We propose a dual domain-adversarial learning algorithm to mitigate the domain discrepancy between source and target data. First, a specific domain-discrimination branch is built to align the auditory feature distributions. Then, those auditory features are fused into the visual features through a cross-modal self-attention module. An additional domain-discrimination branch is designed to reduce the domain discrepancy of the visual features and of the audio-visual correlations implied by the fused audio-visual features. Experiments on public benchmarks demonstrate that our method can alleviate the performance degradation caused by the domain discrepancy.
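Domain-discrimination branches of this kind are commonly implemented with a gradient reversal layer (GRL) feeding a small binary domain classifier, so that the feature extractor is trained to confuse the discriminator. The sketch below shows that generic pattern; the layer sizes, and the assumption that a GRL is used, are illustrative rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts source (0) vs. target (1) from a feature vector."""
    def __init__(self, dim=512, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feat):
        feat = GradReverse.apply(feat, self.lambd)
        return self.net(feat).squeeze(-1)   # domain logit

# Usage: adversarial domain loss on (e.g.) auditory or fused audio-visual features.
disc = DomainDiscriminator(dim=512)
feats = torch.randn(8, 512, requires_grad=True)
domain_labels = torch.cat([torch.zeros(4), torch.ones(4)])  # 4 source + 4 target samples
loss = nn.functional.binary_cross_entropy_with_logits(disc(feats), domain_labels)
loss.backward()
```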
Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations. The key idea is to fuse object-centric representations starting from early layers and propagate them into the transformer layers, thus affecting the spatio-temporal representations throughout the network. Our ORViT block consists of two object-level streams: appearance and dynamics. In the appearance stream, an "Object-Region Attention" module applies self-attention over the patches and object regions. In this way, visual object regions interact with uniform patch tokens and enrich them with contextualized object information. We further model object dynamics via a separate "Object-Dynamics Module", which captures trajectory interactions, and show how to integrate the two streams. We evaluate our model on four tasks and five datasets: compositional and few-shot action recognition on SomethingElse, spatio-temporal action detection on AVA, and standard action recognition on Something-Something-V2, Diving48 and EPIC-Kitchens100. We show strong performance improvements across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture. For code and pretrained models, visit the project page at https://roeiherz.github.io/orvit/
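To make the "Object-Region Attention" idea concrete, here is a heavily simplified single-frame sketch (not the ORViT implementation): object boxes are pooled from the patch feature map with torchvision's roi_align, turned into extra tokens, and attended jointly with the patch tokens using a standard multi-head attention layer. Dimensions and the 1x1 pooling are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ObjectRegionAttention(nn.Module):
    """Simplified object-region attention for one frame (illustrative sketch)."""

    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.obj_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_feat, boxes):
        # patch_feat: (B, D, H, W) patch tokens arranged on a grid.
        # boxes: list of length B, each (num_objects, 4) in feature-map coordinates.
        B, D, H, W = patch_feat.shape
        # Pool one token per object region (1x1 RoIAlign, then flatten).
        obj_tokens = roi_align(patch_feat, boxes, output_size=1)   # (sum_objs, D, 1, 1)
        obj_tokens = self.obj_proj(obj_tokens.flatten(1))          # (sum_objs, D)
        # For simplicity, assume the same number of objects per sample.
        obj_tokens = obj_tokens.view(B, -1, D)
        patch_tokens = patch_feat.flatten(2).transpose(1, 2)       # (B, H*W, D)
        tokens = torch.cat([patch_tokens, obj_tokens], dim=1)      # patches + objects
        # Patch tokens attend over both patches and object tokens.
        out, _ = self.attn(patch_tokens, tokens, tokens)
        return out                                                 # (B, H*W, D)

feat = torch.randn(2, 256, 14, 14)
boxes = [torch.tensor([[0., 0., 6., 6.], [4., 4., 13., 13.]]) for _ in range(2)]
print(ObjectRegionAttention()(feat, boxes).shape)  # torch.Size([2, 196, 256])
```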
Modeling the visual changes that an action brings to a scene is critical for video understanding. Currently, CNNs process one local neighbourhood at a time, thus contextual relationships over longer ranges, while still learnable, are indirect. We present TROI, a plug-and-play module for CNNs to reason between mid-level feature representations that are otherwise separated in space and time. The module relates localized visual entities such as hands and interacting objects and transforms their corresponding regions of interest directly in the feature maps of convolutional layers. With TROI, we achieve state-of-the-art action recognition results on the large-scale datasets Something-Something-V2 and EPIC-Kitchens-100.
Video semantic segmentation has achieved great progress under the supervision of large amounts of labelled training data. However, domain adaptive video segmentation, which can mitigate the data labelling constraint by adapting from a labelled source domain to an unlabelled target domain, has been largely neglected. We design Temporal Pseudo Supervision (TPS), a simple and effective method that explores the idea of consistency training for learning effective representations from unlabelled target videos. Unlike traditional consistency training that builds consistency in the spatial space, we explore consistency training in the spatio-temporal space by enforcing model consistency across augmented video frames, which helps learn from more diverse target data. Specifically, we design cross-frame pseudo labelling to provide pseudo supervision from previous video frames while learning from augmented current video frames. The cross-frame pseudo labelling encourages the network to produce high-certainty predictions, which facilitates consistency training with cross-frame augmentation effectively. Extensive experiments over multiple public datasets show that TPS is simpler to implement, more stable to train, and achieves superior video segmentation accuracy compared with the state of the art.
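A minimal sketch of cross-frame pseudo supervision under stated assumptions: a teacher model predicts on the previous target frame, high-confidence predictions become pseudo labels, and the student is trained on an augmented current frame. Warping between frames, the teacher update scheme, and the exact augmentations used by TPS are omitted; the toy segmentation head is hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_frame_pseudo_loss(student: nn.Module, teacher: nn.Module,
                            prev_frame: torch.Tensor, curr_frame_aug: torch.Tensor,
                            conf_thresh: float = 0.9) -> torch.Tensor:
    """prev_frame / curr_frame_aug: (B, 3, H, W) consecutive target-domain frames,
    the current one strongly augmented. Returns a pseudo-label cross-entropy."""
    with torch.no_grad():
        probs = F.softmax(teacher(prev_frame), dim=1)      # (B, K, H, W)
        conf, pseudo = probs.max(dim=1)                    # confidence and argmax label
        pseudo[conf < conf_thresh] = 255                   # ignore low-confidence pixels
    logits = student(curr_frame_aug)                       # (B, K, H, W)
    return F.cross_entropy(logits, pseudo, ignore_index=255)

# Toy usage with a tiny segmentation head (architecture is an assumption).
net = nn.Conv2d(3, 19, kernel_size=1)
prev, curr = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
# Threshold lowered so the untrained toy model keeps some pseudo labels.
print(float(cross_frame_pseudo_loss(net, net, prev, curr, conf_thresh=0.0)))
```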
In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales. We thereby avoid the complexity of two-stage modeling and the problem of insufficient temporal and spatial information at a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales, named the segment scale and the observed global scale. The segment scale leverages temporal differences over consecutive frames for finer motion patterns by applying 2D convolutions. For the observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of the observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
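A compact sketch of the two temporal scales described above, with all layer sizes assumed and no claim of matching the paper's architecture: the segment scale feeds frame differences to a small shared 2D CNN, and the observed global scale runs an LSTM over the resulting per-frame features.

```python
import torch
import torch.nn as nn

class TwoScaleNet(nn.Module):
    """Illustrative two-scale model: frame differences + LSTM over observed frames."""

    def __init__(self, num_classes=10, feat_dim=128):
        super().__init__()
        self.frame_cnn = nn.Sequential(          # shared 2D CNN on difference images
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        # clip: (B, T, 3, H, W) observed frames of a partial video.
        B, T = clip.shape[:2]
        # Segment scale: temporal differences between consecutive frames.
        diffs = clip[:, 1:] - clip[:, :-1]                       # (B, T-1, 3, H, W)
        feats = self.frame_cnn(diffs.flatten(0, 1)).flatten(1)   # (B*(T-1), D)
        feats = feats.view(B, T - 1, -1)
        # Observed global scale: LSTM aggregates motion features over time.
        _, (h_n, _) = self.lstm(feats)
        return self.classifier(h_n[-1])                          # (B, num_classes)

print(TwoScaleNet()(torch.randn(2, 8, 3, 64, 64)).shape)  # torch.Size([2, 10])
```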
Sign language recognition (SLR) aims to overcome the communication barrier for people who are deaf or hard of hearing. Most existing approaches can be divided into two lines, i.e., Skeleton-based and RGB-based methods, but both lines of methods have their limitations. RGB-based approaches usually overlook the fine-grained hand structure, while Skeleton-based methods do not take facial expression into account. In an attempt to address both limitations, we propose a new framework named Spatial-temporal Part-aware network (StepNet), based on RGB parts. As the name implies, StepNet consists of two modules: Part-level Spatial Modeling and Part-level Temporal Modeling. In particular, without using any keypoint-level annotations, Part-level Spatial Modeling implicitly captures the appearance-based properties, such as hands and faces, in the feature space. On the other hand, Part-level Temporal Modeling captures the pertinent properties over time by implicitly mining the long-short term context. Extensive experiments show that our StepNet, thanks to the spatial-temporal modules, achieves competitive Top-1 per-instance accuracy on three widely-used SLR benchmarks, i.e., 56.89% on WLASL, 77.2% on NMFs-CSL, and 77.1% on BOBSL. Moreover, the proposed method is compatible with optical flow input and can yield higher performance if fused. We hope that this work can serve as a preliminary step for people with deafness.
In this work, we present our solution to the EPIC-Kitchens-100 2022 Action Detection Challenge. A One-stage Action Detection Transformer (OADT) is proposed to model the temporal connections of video segments. With OADT, both the category and the temporal boundaries can be recognized simultaneously. After ensembling multiple OADT models trained on different features, our model achieves 21.28% action mAP and ranks first on the test set of the Action Detection Challenge.
The increasing number of surveillance cameras and security concerns have made automatic violent activity detection from surveillance footage an active area of research. Modern deep learning methods have achieved good accuracy in violence detection and proved to be successful because of their applicability in intelligent surveillance systems. However, the models are computationally expensive and large in size because of their inefficient methods for feature extraction. This work presents a novel architecture for violence detection called Two-stream Multi-dimensional Convolutional Network (2s-MDCN), which uses RGB frames and optical flow to detect violence. Our proposed method extracts temporal and spatial information independently with 1D, 2D, and 3D convolutions. Despite combining multi-dimensional convolutional networks, our models are lightweight and efficient due to reduced channel capacity, yet they learn to extract meaningful spatial and temporal information. Additionally, combining RGB frames and optical flow yields 2.2% more accuracy than a single RGB stream. Despite having lower complexity, our models obtain state-of-the-art accuracy of 89.7% on the largest violence detection benchmark dataset.
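As an illustration of mixing 1D (temporal), 2D (spatial), and 3D convolutions on a clip tensor, the block below applies all three to the same input and sums the results; channel widths, kernel sizes, and the fusion by summation are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiDimConvBlock(nn.Module):
    """Illustrative block combining 1D (temporal), 2D (spatial) and 3D convolutions."""

    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv3d = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv1d = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, clip):
        # clip: (B, C, T, H, W), e.g. an RGB or optical-flow stream.
        B, C, T, H, W = clip.shape
        f3d = self.conv3d(clip)                                        # joint spatio-temporal
        # 2D conv applied per frame: fold time into the batch dimension.
        f2d = self.conv2d(clip.transpose(1, 2).reshape(B * T, C, H, W))
        f2d = f2d.reshape(B, T, -1, H, W).transpose(1, 2)              # (B, out, T, H, W)
        # 1D conv applied along time at each spatial location.
        f1d = self.conv1d(clip.permute(0, 3, 4, 1, 2).reshape(-1, C, T))
        f1d = f1d.reshape(B, H, W, -1, T).permute(0, 3, 4, 1, 2)       # (B, out, T, H, W)
        return self.relu(f3d + f2d + f1d)

out = MultiDimConvBlock()(torch.randn(2, 3, 8, 32, 32))
print(out.shape)  # torch.Size([2, 16, 8, 32, 32])
```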
The network trained for domain adaptation is prone to bias toward the easy-to-transfer classes. Since the ground truth label on the target domain is unavailable during training, the bias problem leads to skewed predictions, forgetting to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM), which cuts several objects, including hard-to-transfer classes, out of the source domain video clip and pastes them into the target domain video clip. Unlike image-level domain adaptation, the temporal context should be maintained to mix moving objects across two different videos. Therefore, we design CMOM to mix with consecutive video frames, so that unrealistic movements do not occur. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground truth labels, to learn discriminative target domain features in an unsupervised manner by filtering unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches an mIoU of 53.81% on the VIPER to Cityscapes-Seq benchmark and an mIoU of 56.31% on the SYNTHIA-Seq to Cityscapes-Seq benchmark, surpassing the state-of-the-art methods by large margins.
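A sketch of the mixing step under assumptions: given a source clip, its per-pixel labels, and a set of classes to paste, the same classes are pasted frame by frame into the target clip, so the pasted objects follow their motion in the source clip rather than jumping between frames. The ignore-label handling is an assumption; the paper's exact label mixing may differ.

```python
import torch

def cross_domain_moving_object_mix(src_clip, src_labels, tgt_clip, classes_to_paste):
    """src_clip, tgt_clip: (T, 3, H, W); src_labels: (T, H, W) integer label maps.
    classes_to_paste: 1D tensor of class ids (e.g. hard-to-transfer classes).
    The same classes are pasted in every frame, so the pasted objects keep the
    temporal context of the source clip."""
    mask = torch.isin(src_labels, classes_to_paste)            # (T, H, W) boolean
    mixed_clip = torch.where(mask.unsqueeze(1), src_clip, tgt_clip)
    # Labels for the mixed clip: pasted pixels keep their source label,
    # remaining pixels are ignored (255) unless target pseudo labels are available.
    mixed_labels = torch.where(mask, src_labels, torch.full_like(src_labels, 255))
    return mixed_clip, mixed_labels

T, H, W = 4, 64, 64
src, tgt = torch.randn(T, 3, H, W), torch.randn(T, 3, H, W)
labels = torch.randint(0, 19, (T, H, W))
clip, lab = cross_domain_moving_object_mix(src, labels, tgt, torch.tensor([11, 12, 17]))
print(clip.shape, lab.shape)  # torch.Size([4, 3, 64, 64]) torch.Size([4, 64, 64])
```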
This report describes the approach behind our winning solution to the 2022 EPIC-Kitchens Action Recognition Challenge. Our approach builds upon our recent work, Multiview Transformers for Video Recognition (MTV), and adapts it to multimodal inputs. Our final submission consists of an ensemble of Multimodal MTV (M&M) models varying backbone sizes and input modalities. Our approach achieved 52.8% Top-1 accuracy on the test set for action classes, which is 4.1% higher than last year's winning entry.
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to generalise the best performing hand-crafted features within a data-driven learning framework. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multitask learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video actions benchmarks of UCF-101 and HMDB-51, where it is competitive with the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
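A minimal sketch of the two-stream design: a spatial ConvNet sees a single RGB frame, a temporal ConvNet sees a stack of 2L optical-flow channels, and the class scores are fused by averaging the softmax outputs. The tiny backbones below are placeholders for illustration, not the ConvNets used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def small_convnet(in_channels, num_classes):
    """Tiny placeholder backbone standing in for the paper's ConvNets."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 7, stride=2, padding=3), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, num_classes),
    )

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = small_convnet(3, num_classes)                 # single RGB frame
        self.temporal = small_convnet(2 * flow_stack, num_classes)   # stacked x/y flow

    def forward(self, rgb_frame, flow_stack):
        # Late fusion: average the softmax scores of the two streams.
        p_spatial = F.softmax(self.spatial(rgb_frame), dim=1)
        p_temporal = F.softmax(self.temporal(flow_stack), dim=1)
        return (p_spatial + p_temporal) / 2

net = TwoStreamNet()
scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
print(scores.shape)  # torch.Size([2, 101])
```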
In this paper, we propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which can extract the essential movement information from a long sequence of video frames and map it into a set of a few images, named Squeezed Images. By embedding Temporal Squeeze pooling as a layer into off-the-shelf convolutional neural networks (CNNs), we design a new video classification model, named the Temporal Squeeze Network (TESNet). The resulting Squeezed Images contain the essential movement information from the video frames, optimized for the video classification task. We evaluate our architecture on two video classification benchmarks and compare the results with the state of the art.
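One way to read "mapping T frames into a few squeezed images" is as a learned weighted combination over time, one weighting per squeezed image. The sketch below implements that reading; this is an assumption for illustration, not the paper's exact pooling operator.

```python
import torch
import torch.nn as nn

class TemporalSqueezePool(nn.Module):
    """Illustrative temporal squeeze layer: T frames -> K 'squeezed images'."""

    def __init__(self, num_frames=32, num_squeezed=3):
        super().__init__()
        # One learnable weighting over time per squeezed image.
        self.weights = nn.Parameter(torch.randn(num_squeezed, num_frames))

    def forward(self, clip):
        # clip: (B, T, C, H, W) video frames (or per-frame feature maps).
        w = torch.softmax(self.weights, dim=1)               # (K, T), sums to 1 over time
        squeezed = torch.einsum("kt,btchw->bkchw", w, clip)  # (B, K, C, H, W)
        return squeezed   # each squeezed image can then be fed to a 2D CNN

pool = TemporalSqueezePool(num_frames=32, num_squeezed=3)
print(pool(torch.randn(2, 32, 3, 112, 112)).shape)  # torch.Size([2, 3, 3, 112, 112])
```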
First-person action recognition is a challenging task in video understanding. Because of strong ego-motion and a limited field of view, many background or noisy frames in a first-person video can distract an action recognition model during its learning process. To encode more discriminative features, the model needs to be able to focus on the most relevant parts of the video for action recognition. Previous works have addressed this issue by applying temporal attention, but failed to consider the global context of the full video, which is critical for determining the relatively significant parts. In this work, we propose a simple yet effective Stacked Temporal Attention Module (STAM) to compute temporal attention based on the global knowledge across the clip so as to emphasize the most discriminative features. We achieve this by stacking multiple self-attention layers. Instead of naive stacking, which is experimentally shown to be ineffective, we carefully design the input to each self-attention layer so that both the local and global context of the video is considered during the generation of temporal attention. Experiments demonstrate that our proposed STAM can be built on top of most existing backbones and improve the performance on various datasets.
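A minimal sketch of stacked temporal self-attention over clip-level frame features. How STAM actually conditions each layer on local and global context is described in the paper; the version below only shows the basic stacking, with the clip-level mean feature concatenated to every frame token as one plausible (assumed) way of injecting global context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StackedTemporalAttention(nn.Module):
    """Illustrative stacked temporal attention: weight frames using global context."""

    def __init__(self, dim=512, num_layers=2, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.fuse = nn.Linear(2 * dim, dim)   # mixes each frame token with the clip mean
        self.score = nn.Linear(dim, 1)        # per-frame attention logit

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) per-frame features of one clip.
        x = frame_feats
        for attn in self.layers:
            # Inject global context: concatenate the clip-level mean to every frame token.
            global_ctx = x.mean(dim=1, keepdim=True).expand_as(x)
            x = self.fuse(torch.cat([x, global_ctx], dim=-1))
            x = x + attn(x, x, x)[0]          # residual self-attention over time
        weights = F.softmax(self.score(x), dim=1)             # (B, T, 1) temporal attention
        return (weights * frame_feats).sum(dim=1)             # (B, D) attended clip feature

clip_feat = StackedTemporalAttention()(torch.randn(2, 16, 512))
print(clip_feat.shape)  # torch.Size([2, 512])
```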