在时间动作细分中,时间戳监督只需要每个视频序列的少数标记帧。对于未标记的框架,以前的作品依靠分配硬标签,并且在微妙的违反注释假设的情况下,性能迅速崩溃。我们提出了一种基于新型的期望最大化方法(EM)方法,该方法利用了未标记框架的标签不确定性,并且足够强大以适应可能的注释误差。有了准确的时间戳注释,我们提出的方法会产生SOTA结果,甚至超过了几个指标和数据集中完全监督的设置。当应用于缺少动作段的时间戳注释时,我们的方法呈现出稳定的性能。为了进一步测试我们的配方稳健性,我们介绍了Skip-Tag监督的新挑战性注释设置。此设置会放松约束,并需要对视频中任何固定数量的随机帧进行注释,从而使其比时间戳监督更灵活,同时保持竞争力。
translated by 谷歌翻译
我们为时间动作细分任务提供了半监督的学习方法。该任务的目的是在长时间的未修剪程序视频中暂时检测和细分动作,其中只有一小部分视频被密集标记,并且没有标记的大量视频。为此,我们为未标记的数据提出了两个新的损失函数:动作亲和力损失和动作连续性损失。动作亲和力损失通过施加从标记的集合引起的动作先验来指导未标记的样品学习。动作连续性损失强制执行动作的时间连续性,这也提供了框架分类的监督。此外,我们提出了一种自适应边界平滑(ABS)方法,以建立更粗糙的动作边界,以实现更健壮和可靠的学习。在三个基准上评估了拟议的损失函数和ABS。结果表明,它们以较低的标记数据(5%和10%)的数据显着改善了动作细分性能,并获得了与50%标记数据的全面监督相当的结果。此外,当将ABS整合到完全监督的学习中时,ABS成功地提高了性能。
translated by 谷歌翻译
Temporal action segmentation tags action labels for every frame in an input untrimmed video containing multiple actions in a sequence. For the task of temporal action segmentation, we propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs. The C2F-TCN framework is enhanced with a novel model agnostic temporal feature augmentation strategy formed by the computationally inexpensive strategy of the stochastic max-pooling of segments. It produces more accurate and well-calibrated supervised results on three benchmark action segmentation datasets. We show that the architecture is flexible for both supervised and representation learning. In line with this, we present a novel unsupervised way to learn frame-wise representation from C2F-TCN. Our unsupervised learning approach hinges on the clustering capabilities of the input features and the formation of multi-resolution features from the decoder's implicit structure. Further, we provide the first semi-supervised temporal action segmentation results by merging representation learning with conventional supervised learning. Our semi-supervised learning scheme, called ``Iterative-Contrastive-Classify (ICC)'', progressively improves in performance with more labeled data. The ICC semi-supervised learning in C2F-TCN, with 40% labeled videos, performs similar to fully supervised counterparts.
translated by 谷歌翻译
本文在完全和时间戳监督的设置中介绍了通过序列(SEQ2SEQ)翻译序列(SEQ2SEQ)翻译的统一框架。与当前的最新帧级预测方法相反,我们将动作分割视为SEQ2SEQ翻译任务,即将视频帧映射到一系列动作段。我们提出的方法涉及在标准变压器SEQ2SEQ转换模型上进行一系列修改和辅助损失函数,以应对与短输出序列相对的长输入序列,相对较少的视频。我们通过框架损失为编码器合并了一个辅助监督信号,并在隐式持续时间预测中提出了单独的对齐解码器。最后,我们通过提出的约束K-Medoids算法将框架扩展到时间戳监督设置,以生成伪分段。我们提出的框架在完全和时间戳监督的设置上始终如一地表现,在几个数据集上表现优于或竞争的最先进。
translated by 谷歌翻译
时间动作分割对(长)视频序列中的每个帧的动作进行分类。由于框架明智标签的高成本,我们提出了第一种用于时间动作分割的半监督方法。我们对无监督的代表学习铰接,对于时间动作分割,造成独特的挑战。未经目针视频中的操作长度变化,并且具有未知的标签和开始/结束时间。跨视频的行动订购也可能有所不同。我们提出了一种新颖的方式,通过聚类输入特征来学习来自时间卷积网络(TCN)的帧智表示,其中包含增加的时间接近条件和多分辨率相似性。通过与传统的监督学习合并表示学习,我们开发了一个“迭代 - 对比 - 分类(ICC)”半监督学习计划。通过更多标记的数据,ICC逐步提高性能; ICC半监督学习,具有40%标记的视频,执行类似于完全监督的对应物。我们的ICC分别通过{+1.8,+ 5.6,+2.5}%的{+1.8,+ 5.6,+2.5}%分别改善了100%标记的视频。
translated by 谷歌翻译
Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works are under the supervision of expensive and time-consuming full annotations, which require the surgeons to repeat watching videos to find the precise start and end time for a surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition to train the models with timestamp annotations, where the surgeons are asked to identify only a single timestamp within the temporal boundary of a phase. This annotation can significantly reduce the manual annotation cost compared to the full annotations. To make full use of such timestamp supervisions, we propose a novel method called uncertainty-aware temporal diffusion (UATD) to generate trustworthy pseudo labels for training. Our proposed UATD is motivated by the property of surgical videos, i.e., the phases are long events consisting of consecutive frames. To be specific, UATD diffuses the single labelled timestamp to its corresponding high confident ( i.e., low uncertainty) neighbour frames in an iterative way. Our study uncovers unique insights of surgical phase recognition with timestamp supervisions: 1) timestamp annotation can reduce 74% annotation time compared with the full annotation, and surgeons tend to annotate those timestamps near the middle of phases; 2) extensive experiments demonstrate that our method can achieve competitive results compared with full supervision methods, while reducing manual annotation cost; 3) less is more in surgical phase recognition, i.e., less but discriminative pseudo labels outperform full but containing ambiguous frames; 4) the proposed UATD can be used as a plug and play method to clean ambiguous labels near boundaries between phases, and improve the performance of the current surgical phase recognition methods.
translated by 谷歌翻译
Video action segmentation aims to slice the video into several action segments. Recently, timestamp supervision has received much attention due to lower annotation costs. We find the frames near the boundaries of action segments are in the transition region between two consecutive actions and have unclear semantics, which we call ambiguous intervals. Most existing methods iteratively generate pseudo-labels for all frames in each video to train the segmentation model. However, ambiguous intervals are more likely to be assigned with noisy and incorrect pseudo-labels, which leads to performance degradation. We propose a novel framework to train the model under timestamp supervision including the following two parts. First, pseudo-label ensembling generates pseudo-label sequences with ambiguous intervals, where the frames have no pseudo-labels. Second, iterative clustering iteratively propagates the pseudo-labels to the ambiguous intervals by clustering, and thus updates the pseudo-label sequences to train the model. We further introduce a clustering loss, which encourages the features of frames within the same action segment more compact. Extensive experiments show the effectiveness of our method.
translated by 谷歌翻译
Recent temporal action segmentation approaches need frame annotations during training to be effective. These annotations are very expensive and time-consuming to obtain. This limits their performances when only limited annotated data is available. In contrast, we can easily collect a large corpus of in-domain unannotated videos by scavenging through the internet. Thus, this paper proposes an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences. Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions. Our model also predicts the action order, which is later used as a temporal constraint while estimating frames labels to counter the lack of supervision for unannotated videos. In the end, our evaluation of the proposed approach on two different datasets demonstrates its capability to achieve comparable performance to the full supervision despite limited annotation.
translated by 谷歌翻译
密集的预期旨在预测未来的行为及其持续的持续时间。现有方法依赖于完全标记的数据,即标有所有未来行动及其持续时间的序列。我们仅使用少量全标记的序列呈现(半)弱监督方法,主要是序列,其中仅标记即将到来的动作。为此,我们提出了一个框架,为未来的行动及其持续时间产生伪标签,并通过细化模块自适应地改进它们。仅考虑到即将到来的动作标签作为输入,这些伪标签指南对未来的动作/持续时间预测。我们进一步设计了注意力机制,以预测背景感知的持续时间。早餐和50salads基准测试的实验验证了我们的方法的效率;与完全监督最先进的模型相比,我们竞争甚至。我们将在:https://github.com/zhanghaotong1/wslvideodenseantication提供我们的代码。
translated by 谷歌翻译
视频中的时间动作细分最近引起了很多关注。时间戳监督是完成此任务的一种经济高效的方式。为了获得更多信息以优化模型,现有方法生成的伪框架根据分割模型和时间戳注释的输出进行了迭代标签。但是,这种做法可能在训练过程中引入噪声和振荡,并导致性能变性。为了解决这个问题,我们通过引入与分割模型平行的教师模型来帮助稳定模型优化的过程,为时间戳监督的暂时行动细分提出了一个新的框架。教师模型可以看作是分割模型的合奏,有助于抑制噪声并提高伪标签的稳定性。我们进一步引入了一个分段平滑的损失,该损失更加集中和凝聚力,以实现动作实例中预测概率的平稳过渡。三个数据集的实验表明,我们的方法的表现优于最新方法,并且以较低的注释成本与完全监督的方法相当地执行。
translated by 谷歌翻译
弱监督的时间行动本地化旨在从视频级标签学习实例级别动作模式,其中重大挑战是动作情境混淆。为了克服这一挑战,最近的一个工作建立了一个动作单击监督框。它需要类似的注释成本,但与传统的弱势监督方法相比,可以稳步提高本地化性能。在本文中,通过揭示现有方法的性能瓶颈主要来自后台错误,我们发现更强大的动作定位器可以在背景视频帧上的标签上培训,而不是动作帧上的标签。为此,我们将动作单击监控转换为背景单击监控,并开发一种名为Backtal的新方法。具体地,背塔在背景视频帧上实现两倍建模,即位置建模和特征建模。在适当的建模中,我们不仅在带注释的视频帧上进行监督学习,而且还设计得分分离模块,以扩大潜在的动作帧和背景之间的分数差异。在特征建模中,我们提出了一个亲和力模块,以在计算时间卷积时测量相邻帧之间的特定于帧特定的相似性,并在计算时间卷积时动态地参加信息邻居。进行了三个基准测试的广泛实验,展示了建立的背部的高性能和所提出的背景下单击监督的合理性。代码可用于https://github.com/vididle/backtal。
translated by 谷歌翻译
To balance the annotation labor and the granularity of supervision, single-frame annotation has been introduced in temporal action localization. It provides a rough temporal location for an action but implicitly overstates the supervision from the annotated-frame during training, leading to the confusion between actions and backgrounds, i.e., action incompleteness and background false positives. To tackle the two challenges, in this work, we present the Snippet Classification model and the Dilation-Erosion module. In the Dilation-Erosion module, we expand the potential action segments with a loose criterion to alleviate the problem of action incompleteness and then remove the background from the potential action segments to alleviate the problem of action incompleteness. Relying on the single-frame annotation and the output of the snippet classification, the Dilation-Erosion module mines pseudo snippet-level ground-truth, hard backgrounds and evident backgrounds, which in turn further trains the Snippet Classification model. It forms a cyclic dependency. Furthermore, we propose a new embedding loss to aggregate the features of action instances with the same label and separate the features of actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method. Code has been made publicly available (https://github.com/LingJun123/single-frame-TAL).
translated by 谷歌翻译
我们介绍了一种新颖的方法,用于使用时间戳监督进行时间戳分割。我们的主要贡献是图形卷积网络,该网络以端到端方式学习,以利用相邻帧之间的帧功能和连接,以从稀疏的时间戳标签中生成密集的框架标签。然后可以使用生成的密集框架标签来训练分割模型。此外,我们为分割模型和图形卷积模型进行交替学习的框架,该模型首先初始化,然后迭代地完善学习模型。在四个公共数据集上进行了详细的实验,包括50种沙拉,GTEA,早餐和桌面组件,表明我们的方法优于多层感知器基线,同时在时间活动中表现出色或更好地表现出色或更好在时间戳监督下。
translated by 谷歌翻译
时间动作细分任务段视频暂时,并预测所有帧的动作标签。充分监督这种细分模型需要密集的框架动作注释,这些注释既昂贵又乏味。这项工作是第一个提出一个组成动作发现(CAD)框架的工作,该框架仅需要视频高级复杂活动标签作为时间动作分割的监督。提出的方法会自动使用活动分类任务发现组成视频动作。具体而言,我们定义了有限数量的潜在作用原型来构建视频级别的双重表示,通过活动分类培训共同学习了这些原型。这种设置赋予我们的方法,可以在多个复杂活动中发现潜在的共享动作。由于缺乏行动水平的监督,我们采用匈牙利匹配算法将潜在的动作原型与地面真理语义类别进行评估联系起来。我们表明,通过高级监督,匈牙利的匹配可以从现有的视频和活动级别扩展到全球水平。全球级别的匹配允许跨活动进行行动共享,这在文献中从未考虑过。广泛的实验表明,我们发现的动作可以帮助执行时间动作细分和活动识别任务。
translated by 谷歌翻译
有许多方法可以使用弱监管来培训网络到分段2D图像。相比之下,现有的3D方法依赖于3D图像卷的2D片的子集的全监督。在本文中,我们提出了一种真正无弱监督的方法,即我们只需要在目标对象的表面上提供一组稀疏的3D点,这是一项可以快速完成的便捷任务。我们使用3D点以使3D模板变形,使其大致与目标对象轮廓匹配,并且我们介绍了利用粗略模板提供的监控以培训网络以找到准确边界的体系结构。我们评估我们在计算机断层扫描(CT),磁共振图像(MRI)和电子显微镜(EM)图像数据集中的方法的性能。我们将表明,在减少监督成本下,它始终以3D弱监管方式表现出更传统的方法。
translated by 谷歌翻译
我们提出了一种在数据样本集合中共同推断标签的方法,其中每个样本都包含一个观察和对标签的先验信念。通过隐式假设存在一种生成模型,可区分预测因子是后部,我们得出了一个训练目标,该目标允许在弱信念下学习。该配方统一了各种机器学习设置;弱信念可以以嘈杂或不完整的标签形式出现,由辅助输入的不同预测机制给出的可能性,或反映出有关手头问题结构的知识的常识性先验。我们证明了有关各种问题的建议算法:通过负面培训示例进行分类,从排名中学习,弱和自我监督的空中成像细分,视频框架的共段以及粗糙的监督文本分类。
translated by 谷歌翻译
The potential for agents, whether embodied or software, to learn by observing other agents performing procedures involving objects and actions is rich. Current research on automatic procedure learning heavily relies on action labels or video subtitles, even during the evaluation phase, which makes them infeasible in real-world scenarios. This leads to our question: can the human-consensus structure of a procedure be learned from a large set of long, unconstrained videos (e.g., instructional videos from YouTube) with only visual evidence? To answer this question, we introduce the problem of procedure segmentation-to segment a video procedure into category-independent procedure segments. Given that no large-scale dataset is available for this problem, we collect a large-scale procedure segmentation dataset with procedure segments temporally localized and described; we use cooking videos and name the dataset YouCook2. We propose a segment-level recurrent network for generating procedure segments by modeling the dependencies across segments. The generated segments can be used as pre-processing for other tasks, such as dense video captioning and event parsing. We show in our experiments that the proposed model outperforms competitive baselines in procedure segmentation.
translated by 谷歌翻译
我们为无监督活动分割提出了一种新方法,它使用视频帧聚类作为借口任务,并同时执行表示学习和在线群集。这与先前作品相反,其中通常顺序地执行表示学习和聚类。我们通过采用时间最优运输来利用视频中的时间信息。特别是,我们纳入了一个时间正则化术语,其将活动的时间顺序保留到用于计算伪标签群集分配的标准最佳传输模块中。时间最优传输模块使我们的方法能够学习无监督活动细分的有效陈述。此外,先前的方法需要在以离线方式培养它们之前对整个数据集的学习功能存储在整个数据集中,而我们的方法在在线方式一次处理一个迷你批次。在三个公共数据集,即50沙拉,YouTube说明和早餐以及我们的数据集,即桌面装配的广泛评估表明,我们的方法在PAR或更优于以前的无监督活动分割方法,尽管内存限制显着较低。
translated by 谷歌翻译
本文介绍了一种基于CNN的完全无监督的方法,用于光流量的运动分段。我们假设输入光流可以表示为分段参数运动模型集,通常,仿射或二次运动模型。这项工作的核心思想是利用期望 - 最大化(EM)框架。它使我们能够以良好的方式设计丢失功能和我们运动分割神经网络的培训程序。然而,与经典迭代的EM相比,一旦培训网络,我们就可以为单个推理步骤中的任何看不见的光学流场提供分割,没有对运动模型参数的初始化,因为它们没有估计推断阶段。已经调查了不同的损失功能,包括强大的功能。我们还提出了一种关于光学流场的新型数据增强技术,对性能显着影响。我们在Davis2016数据集上测试了我们的运动分段网络。我们的方法优于相当的无监督方法,非常有效。实际上,它可以在125fps运行,使其可用于实时应用程序。
translated by 谷歌翻译
The task of topical segmentation is well studied, but previous work has mostly addressed it in the context of structured, well-defined segments, such as segmentation into paragraphs, chapters, or segmenting text that originated from multiple sources. We tackle the task of segmenting running (spoken) narratives, which poses hitherto unaddressed challenges. As a test case, we address Holocaust survivor testimonies, given in English. Other than the importance of studying these testimonies for Holocaust research, we argue that they provide an interesting test case for topical segmentation, due to their unstructured surface level, relative abundance (tens of thousands of such testimonies were collected), and the relatively confined domain that they cover. We hypothesize that boundary points between segments correspond to low mutual information between the sentences proceeding and following the boundary. Based on this hypothesis, we explore a range of algorithmic approaches to the task, building on previous work on segmentation that uses generative Bayesian modeling and state-of-the-art neural machinery. Compared to manually annotated references, we find that the developed approaches show considerable improvements over previous work.
translated by 谷歌翻译