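Sports video analysis is a prevalent research topic because of its variety of application areas, ranging from multimedia intelligent devices that deliver user-tailored digests to the analysis of athletes' performance. The Sports Video task is part of the MediaEval 2021 benchmark. This task tackles fine-grained action detection and classification from videos. The focus is on recordings of table tennis games. Running since 2019, the task has offered a classification challenge on untrimmed videos recorded in natural conditions, with known temporal boundaries for each stroke. This year, the dataset is extended and additionally offers a detection challenge on untrimmed videos without annotations. This work aims to create tools for sports coaches and players to analyze sports performance. Movement analysis and player profiling may be built upon such technology to enrich athletes' training experience and improve their performance.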
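While deep learning has been widely used for video analytics, such as video classification and action detection, dense action detection with fast-moving subjects in sports videos is still challenging. In this work, we release yet another sports video dataset, $\textbf{P$^2$A}$, for $\underline{P}$ing $\underline{P}$ong-$\underline{A}$ction detection, which consists of 2,721 video clips collected from broadcast videos of professional table tennis matches in the World Table Tennis Championships and the Olympic Games. We work with a crew of table tennis professionals and referees to obtain annotations for every ping-pong action that appears in the dataset, and formulate two sets of action detection problems: action localization and action recognition. We evaluate a number of common action recognition and action localization models on $\textbf{P$^2$A}$ for both problems, under various settings. These models achieve only 48% area under the AR-AN curve for localization and 82% top-1 accuracy for recognition, since ping-pong actions are dense with fast-moving subjects while the broadcast videos run at only 25 FPS. The results confirm that $\textbf{P$^2$A}$ is still a challenging task and can be used as a benchmark for action detection from videos.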
Visual object analysis researchers are increasingly experimenting with video, because it is expected that motion cues should help with detection, recognition, and other analysis tasks. This paper presents the Cambridge-driving Labeled Video Database (CamVid) as the first collection of videos with object class semantic labels, complete with metadata. The database provides ground truth labels that associate each pixel with one of 32 semantic classes. The database addresses the need for experimental data to quantitatively evaluate emerging algorithms. While most videos are filmed with fixed-position CCTV-style cameras, our data was captured from the perspective of a driving automobile. The driving scenario increases the number and heterogeneity of the observed object classes. Over 10 min of high quality 30 Hz footage is being provided, with corresponding semantically labeled images at 1 Hz and in part, 15 Hz. The CamVid Database offers four contributions that are relevant to object analysis researchers. First, the per-pixel semantic segmentation of over 700 images was specified manually, and was then inspected and confirmed by a second person for accuracy. Second, the high-quality and large resolution color video images in the database represent valuable extended duration digitized footage to those interested in driving scenarios or ego-motion. Third, we filmed calibration sequences for the camera color response and intrinsics, and computed a 3D camera pose for each frame in the sequences. Finally, in support of expanding this or other databases, we present custom-made labeling software for assisting users who wish to paint precise class-labels for other images and videos. We evaluate the relevance of the database by measuring the performance of an algorithm from each of three distinct domains: multi-class object recognition, pedestrian detection, and label propagation.
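Designing activity detection systems that can be successfully deployed in daily-living environments requires datasets that pose the challenges typical of real-world scenarios. In this paper, we introduce a new untrimmed daily-living dataset that features several real-world challenges: Toyota Smarthome Untrimmed (TSU). TSU contains a wide variety of activities performed in a spontaneous manner. The dataset contains dense annotations, including elementary activities, composite activities, and activities involving interactions with objects. We provide an analysis of the real-world challenges featured by the dataset, highlighting the open issues for detection algorithms. We show that current state-of-the-art methods fail to achieve satisfactory performance on the TSU dataset. Therefore, we propose a new baseline method to tackle the novel challenges posed by the dataset. This method leverages one modality (i.e., optical flow) to generate attention weights that guide another modality (i.e., RGB) to better detect activity boundaries. This is particularly beneficial for detecting activities characterized by high temporal variance. We show that the proposed method outperforms state-of-the-art methods on TSU and on another popular challenging dataset, Charades.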
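Body language is an eye-catching social signal, and its automatic analysis can significantly advance the ability of artificial intelligence systems to understand and actively participate in social interactions. While computer vision has made impressive progress on low-level tasks such as head and body pose estimation, the detection of more subtle behaviors such as gesturing, grooming, or fumbling is not well explored. In this paper, we present BBSI, the first set of annotations of complex bodily behaviors embedded in continuous social interactions in a group setting. Based on previous work in psychology, we manually annotated 26 hours of spontaneous human behavior in the MPIIGroupInteraction dataset with 15 distinct body language classes. We present comprehensive descriptive statistics on the resulting dataset as well as the results of annotation quality evaluations. For automatic detection of these behaviors, we adapt the Pyramid Dilated Attention Network (PDAN), a state-of-the-art approach for human action detection. We perform experiments using four variants of spatio-temporal features as input to PDAN: Two-Stream Inflated 3D CNN, Temporal Segment Networks, Temporal Shift Module, and Swin Transformer. The results are promising and indicate considerable room for improvement on this difficult task. Representing a key piece in the puzzle of automatically understanding social behavior, BBSI is fully available to the research community.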
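Introduction: Hand function is a central determinant of independence after stroke. Measuring hand use in the home environment is necessary to evaluate the impact of new interventions, and calls for novel wearable technologies. Egocentric video can capture hand-object interactions in context and show how the more-affected hand is used during bilateral tasks (for stabilization or manipulation). Automated methods are required to extract this information. Objective: To use artificial intelligence-based computer vision to classify hand use and hand role from egocentric videos recorded at home after stroke. Methods: Twenty-one stroke survivors participated in the study. A random forest classifier, a SlowFast neural network, and the Hand Object Detector neural network were applied to identify hand use and hand role at home. Leave-One-Subject-Out Cross-Validation (LOSOCV) was used to evaluate the performance of the three models. Between-group differences among the models were calculated based on the Matthews correlation coefficient (MCC). Results: For hand use detection, the Hand Object Detector had significantly higher performance than the other models. The macro-average MCCs using this model in the LOSOCV were 0.50 ± 0.23 for the more-affected hands and 0.58 ± 0.18 for the less-affected hands. Hand role classification had macro-average MCCs in the LOSOCV that were close to zero for all models. Conclusion: Using egocentric video to capture the hand use of stroke survivors at home is feasible. Pose estimation to track finger movements may benefit hand role classification in the future.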
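The macro-average MCC reported above can be illustrated with a minimal sketch. It assumes that each left-out subject contributes one binary confusion matrix and that the macro average is the plain mean of the per-subject scores; the function names, the zero-denominator convention, and the use of the population standard deviation are illustrative assumptions, not details taken from the study.

```python
import math
import statistics


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: score 0 when a row or column of the
        # confusion matrix is empty and the MCC is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom


def macro_average_mcc(per_subject_counts):
    """Mean and population std of per-subject MCCs.

    per_subject_counts: list of (tp, tn, fp, fn) tuples, one per
    held-out subject in a leave-one-subject-out evaluation.
    """
    scores = [mcc(*counts) for counts in per_subject_counts]
    return statistics.mean(scores), statistics.pstdev(scores)
```

Averaging per subject, rather than pooling all frames, keeps a subject with many recorded frames from dominating the score, which is one plausible reason to report a macro average in this kind of evaluation.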
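With the recent development of deep learning applied to computer vision, sport video understanding has gained a lot of attention, providing much richer information for both sport consumers and leagues. This paper introduces DeepSportradar-v1, a suite of computer vision tasks, datasets, and benchmarks for automated sport understanding. The main purpose of this framework is to close the gap between academic research and real-world settings. To this end, the datasets provide high-resolution raw images, camera parameters, and high-quality annotations. DeepSportradar currently supports four challenging tasks related to basketball: ball 3D localization, camera calibration, player instance segmentation, and player re-identification. For each of the four tasks, a detailed description of the dataset, objective, performance metrics, and proposed baseline method is provided. To encourage further research on advanced methods for sport understanding, a competition is organized as part of the MMSports workshop at the ACM Multimedia 2022 conference, where participants have to develop state-of-the-art methods to solve the above tasks. The four datasets, development kits, and baselines are publicly available.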
This paper introduces a video dataset of spatiotemporally localized Atomic Visual Actions (AVA). The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently. The key characteristics of our dataset are: (1) the definition of atomic visual actions, rather than composite actions; (2) precise spatio-temporal annotations with possibly multiple annotations for each person; (3) exhaustive annotation of these atomic actions over 15-minute video clips; (4) people temporally linked across consecutive segments; and (5) using movies to gather a varied set of action representations. This departs from existing datasets for spatio-temporal action recognition, which typically provide sparse annotations for composite actions in short video clips. AVA, with its realistic scene and action complexity, exposes the intrinsic difficulty of action recognition. To benchmark this, we present a novel approach for action localization that builds upon the current state-of-the-art methods, and demonstrates better performance on JHMDB and UCF101-24 categories. While setting a new state of the art on existing datasets, the overall results on AVA are low at 15.6% mAP, underscoring the need for developing new approaches for video understanding.
First-person vision is gaining interest as it offers a unique viewpoint on people's interaction with objects, their attention, and even intention. However, progress in this challenging domain has been relatively slow due to the lack of sufficiently large datasets. In this paper, we introduce EPIC-KITCHENS, a large-scale egocentric video benchmark recorded by 32 participants in their native kitchen environments. Our videos depict non-scripted daily activities: we simply asked each participant to start recording every time they entered their kitchen. Recording took place in 4 cities (in North America and Europe) by participants belonging to 10 different nationalities, resulting in highly diverse cooking styles. Our dataset features 55 hours of video consisting of 11.5M frames, which we densely labelled for a total of 39.6K action segments and 454.3K object bounding boxes. Our annotation is unique in that we had the participants narrate their own videos (after recording), thus reflecting true intention, and we crowd-sourced ground-truths based on these. We describe our object, action and anticipation challenges, and evaluate several baselines over two test splits, seen and unseen kitchens.
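Detecting fights in still images shared on social media is an important task for limiting the distribution of violent scenes and preventing their negative effects. For this reason, in this study we address the problem of fight detection from still images collected from the web and social media. We explore how well one can detect fights from just a single still image. We also propose a new dataset, named Social Media Fight Images (SMFI), comprising real-world images of fight actions. Results of extensive experiments on the proposed dataset show that fight actions can be recognized successfully from still images. That is, even without exploiting temporal information, it is possible to detect fights with high accuracy by using appearance alone. We also perform cross-dataset experiments to evaluate the representation capacity of the collected dataset. These experiments indicate that, as in other computer vision problems, a dataset bias exists for the fight recognition problem. Although the methods achieve close to 100% accuracy when trained and tested on the same fight dataset, cross-dataset accuracy is significantly lower, around 70% when the more representative datasets are used for training. The SMFI dataset is found to be one of the two most representative datasets among the five fight datasets used.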
In this paper we present a new computer vision task, named video instance segmentation. The goal of this new task is simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm called MaskTrack R-CNN for this task. Our new method introduces a new tracking branch to Mask R-CNN to jointly perform the detection, segmentation and tracking tasks. Finally, we evaluate the proposed method and several strong baselines on our new dataset. Experimental results clearly demonstrate the advantages of the proposed algorithm and reveal insight for future improvement. We believe the video instance segmentation task will motivate the community along the line of research for video understanding.
We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and increases from 15.0% to 19.0% on THUMOS 2014.
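The temporal overlap mentioned above is commonly measured as the intersection-over-union (IoU) of two time intervals. The sketch below shows that measure only; it is not the paper's loss function, and the function name and the (start, end) tuple representation are illustrative assumptions.

```python
def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two temporal segments.

    Each segment is a (start, end) pair in seconds or frames,
    with start <= end.
    """
    start_a, end_a = seg_a
    start_b, end_b = seg_b
    # Length of the overlapping interval, clamped at zero when disjoint.
    inter = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Union length = sum of the two lengths minus the shared part.
    union = (end_a - start_a) + (end_b - start_b) - inter
    return inter / union if union > 0 else 0.0
```

A predicted segment is usually counted as correct when its IoU with a ground-truth segment exceeds a chosen threshold, which is how mAP figures for temporal localization are typically computed.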
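We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which come with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short-term and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g., an onion is peeled, diced, and cooked, where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, and the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, and 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding, and long-term reasoning. For data, code, and leaderboards: http://epic-kitchens.github.io/visor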
Pictionary, the popular sketch-based guessing game, provides an opportunity to analyze shared goal cooperative game play in restricted communication settings. However, some players occasionally draw atypical sketch content. While such content is occasionally relevant in the game context, it sometimes represents a rule violation and impairs the game experience. To address such situations in a timely and scalable manner, we introduce DrawMon, a novel distributed framework for automatic detection of atypical sketch content in concurrently occurring Pictionary game sessions. We build specialized online interfaces to collect game session data and annotate atypical sketch content, resulting in AtyPict, the first ever atypical sketch content dataset. We use AtyPict to train CanvasNet, a deep neural atypical content detection network. We utilize CanvasNet as a core component of DrawMon. Our analysis of post deployment game session data indicates DrawMon's effectiveness for scalable monitoring and atypical sketch content detection. Beyond Pictionary, our contributions also serve as a design guide for customized atypical content response systems involving shared and interactive whiteboards. Code and datasets are available at https://drawm0n.github.io.