Temporal action segmentation in videos has recently attracted much attention, and timestamp supervision is a cost-effective way to approach this task. To obtain more information for optimizing the model, existing methods iteratively generate pseudo-labels for frames based on the outputs of the segmentation model and the timestamp annotations. However, this practice can introduce noise and oscillation during training and lead to performance degradation. To address this problem, we propose a new framework for timestamp-supervised temporal action segmentation that introduces a teacher model, running in parallel with the segmentation model, to help stabilize model optimization. The teacher model can be viewed as an ensemble of the segmentation model, which helps suppress noise and improve the stability of the pseudo-labels. We further introduce a segmentally smoothing loss, which is more focused and cohesive, to enforce a smooth transition of the predicted probabilities within action instances. Experiments on three datasets show that our method outperforms state-of-the-art methods and performs comparably to fully supervised methods at a much lower annotation cost.
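The abstract does not spell out how the teacher "ensemble" is realized; a common instantiation, shown below as a minimal sketch, is a Mean-Teacher-style exponential moving average (EMA) of the segmentation model's weights. The function name and momentum value are illustrative assumptions.

```python
import torch

@torch.no_grad()
def update_ema_teacher(teacher: torch.nn.Module,
                       student: torch.nn.Module,
                       momentum: float = 0.999) -> None:
    """Move teacher weights toward the student's (a temporal ensemble).

    A minimal sketch: the paper's exact ensembling scheme is not given in
    the abstract, so this assumes a standard Mean-Teacher-style EMA update,
    called once per optimizer step.
    """
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
```

Pseudo-labels for the unlabeled frames would then be read from the smoother teacher predictions rather than from the noisier student.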
Video action segmentation aims to slice a video into several action segments. Recently, timestamp supervision has received much attention due to its lower annotation costs. We find that frames near the boundaries of action segments lie in the transition region between two consecutive actions and have unclear semantics, which we call ambiguous intervals. Most existing methods iteratively generate pseudo-labels for all frames in each video to train the segmentation model. However, ambiguous intervals are more likely to be assigned noisy and incorrect pseudo-labels, which leads to performance degradation. We propose a novel framework to train the model under timestamp supervision, comprising two parts. First, pseudo-label ensembling generates pseudo-label sequences with ambiguous intervals, where the frames have no pseudo-labels. Second, iterative clustering propagates the pseudo-labels to the ambiguous intervals by clustering, and thus updates the pseudo-label sequences to train the model. We further introduce a clustering loss, which encourages the features of frames within the same action segment to be more compact. Extensive experiments show the effectiveness of our method.
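The clustering loss is only described qualitatively; one plausible instantiation, sketched here under that assumption, pulls each frame feature toward the centroid of its (pseudo-)segment.

```python
import torch

def clustering_loss(features: torch.Tensor, segment_ids: torch.Tensor) -> torch.Tensor:
    """Encourage frame features within the same action segment to be compact.

    features:    (T, D) frame-wise features.
    segment_ids: (T,) integer id of the (pseudo-)segment each frame belongs to;
                 frames inside ambiguous intervals can carry id -1 and are skipped.
    A hedged sketch -- the paper's exact formulation is not given in the abstract.
    """
    valid = segment_ids >= 0
    segments = segment_ids[valid].unique()
    loss = features.new_zeros(())
    for seg in segments:
        seg_feats = features[segment_ids == seg]           # frames of one segment
        center = seg_feats.mean(dim=0, keepdim=True)       # segment centroid
        loss = loss + ((seg_feats - center) ** 2).sum(dim=1).mean()
    return loss / max(segments.numel(), 1)
```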
To balance annotation labor and the granularity of supervision, single-frame annotation has been introduced in temporal action localization. It provides a rough temporal location for an action but implicitly overstates the supervision from the annotated frame during training, leading to confusion between actions and backgrounds, i.e., action incompleteness and background false positives. To tackle these two challenges, in this work, we present the Snippet Classification model and the Dilation-Erosion module. In the Dilation-Erosion module, we expand the potential action segments with a loose criterion to alleviate the problem of action incompleteness, and then remove the background from the potential action segments to alleviate the problem of background false positives. Relying on the single-frame annotation and the output of the snippet classification, the Dilation-Erosion module mines pseudo snippet-level ground truth, hard backgrounds, and evident backgrounds, which in turn further train the Snippet Classification model, forming a cyclic dependency. Furthermore, we propose a new embedding loss to aggregate the features of action instances with the same label and separate the features of actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate the effectiveness of the proposed method. Code has been made publicly available (https://github.com/LingJun123/single-frame-TAL).
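As a rough illustration of the Dilation-Erosion idea on a one-dimensional actionness score sequence (the module's exact operations are in the paper, not the abstract), grey-scale dilation and erosion can be written with sliding-window max-pooling:

```python
import torch
import torch.nn.functional as F

def dilate_1d(scores: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Grey-scale dilation of a (T,) score sequence: sliding-window maximum.
    Expands potential action segments under a loose criterion (k odd)."""
    x = scores.view(1, 1, -1)
    return F.max_pool1d(x, kernel_size=k, stride=1, padding=k // 2).view(-1)

def erode_1d(scores: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Grey-scale erosion: sliding-window minimum (max-pool of the negation).
    Shrinks segments, peeling off likely background frames at the borders."""
    return -dilate_1d(-scores, k)
```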
Recent temporal action segmentation approaches need frame annotations during training to be effective. These annotations are very expensive and time-consuming to obtain, which limits performance when only limited annotated data is available. In contrast, we can easily collect a large corpus of in-domain unannotated videos by scavenging the internet. Thus, this paper proposes an approach for the temporal action segmentation task that can simultaneously leverage knowledge from annotated and unannotated video sequences. Our approach uses multi-stream distillation that repeatedly refines and finally combines their frame predictions. Our model also predicts the action order, which is later used as a temporal constraint while estimating frame labels to counter the lack of supervision for unannotated videos. In the end, our evaluation of the proposed approach on two different datasets demonstrates its capability to achieve comparable performance to full supervision despite limited annotation.
We present a semi-supervised learning approach for the temporal action segmentation task. The goal of the task is to temporally detect and segment actions in long, untrimmed procedural videos, where only a small portion of the videos are densely labeled and a large number of videos remain unlabeled. To this end, we propose two novel loss functions for the unlabeled data: an action affinity loss and an action continuity loss. The action affinity loss guides learning on the unlabeled samples by imposing action priors induced from the labeled set. The action continuity loss enforces the temporal continuity of actions, which also provides supervision for frame-wise classification. Furthermore, we propose an Adaptive Boundary Smoothing (ABS) method to build coarser action boundaries for more robust and reliable learning. The proposed loss functions and ABS are evaluated on three benchmarks. The results show that they significantly improve action segmentation performance with a low amount of labeled data (5% and 10%) and achieve results comparable to full supervision with 50% labeled data. Moreover, ABS also successfully boosts performance when integrated into fully supervised learning.
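The action continuity loss is described only at a high level; a hedged sketch of the underlying idea, aligning each frame's predicted distribution with that of its previous frame on unlabeled videos:

```python
import torch
import torch.nn.functional as F

def action_continuity_loss(logits: torch.Tensor) -> torch.Tensor:
    """logits: (T, C) frame-wise predictions on an unlabeled video.
    Adjacent frames usually share the same action, so pull each frame's
    predicted distribution toward its (detached) previous frame. A hedged
    sketch of the continuity idea; the paper's loss may differ."""
    log_p = F.log_softmax(logits[1:], dim=1)
    prev = F.softmax(logits[:-1], dim=1).detach()
    return F.kl_div(log_p, prev, reduction='batchmean')
```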
This paper introduces a unified framework for video action segmentation via sequence-to-sequence (seq2seq) translation in both fully and timestamp supervised settings. In contrast to current state-of-the-art frame-level prediction methods, we view action segmentation as a seq2seq translation task, i.e., mapping a sequence of video frames to a sequence of action segments. Our proposed method involves a series of modifications and auxiliary loss functions on the standard Transformer seq2seq translation model to cope with long input sequences opposed to short output sequences, and relatively few videos. We incorporate an auxiliary supervision signal for the encoder via a frame-wise loss and propose a separate alignment decoder for implicit duration prediction. Finally, we extend our framework to the timestamp supervised setting via our proposed constrained k-medoids algorithm to generate pseudo-segmentations. Our proposed framework performs consistently in both fully and timestamp supervised settings, outperforming or competing with the state of the art on several datasets.
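A hedged sketch of a constrained k-medoids procedure for generating pseudo-segmentations from ordered timestamps; the alternating boundary/medoid updates below are one plausible reading of the abstract, not the paper's exact algorithm.

```python
import torch

def constrained_kmedoids(feats: torch.Tensor, stamps: list, iters: int = 10) -> list:
    """feats: (T, D) frame features; stamps: sorted annotated frame indices,
    one timestamp per action segment. Returns K+1 boundary indices that
    partition the video into contiguous pseudo-segments, one per timestamp.
    Sketch assumptions: medoids stay inside their segment, and segments stay
    contiguous and ordered."""
    T = feats.size(0)
    medoids = list(stamps)
    bounds = [0] + [(a + b) // 2 for a, b in zip(stamps, stamps[1:])] + [T]
    for _ in range(iters):
        # boundary step: between consecutive medoids, place the split that
        # minimizes the total distance of frames to their assigned medoid
        for i in range(1, len(bounds) - 1):
            lo, hi = medoids[i - 1] + 1, medoids[i] + 1
            d_left = (feats[lo:hi] - feats[medoids[i - 1]]).norm(dim=1)
            d_right = (feats[lo:hi] - feats[medoids[i]]).norm(dim=1)
            costs = [d_left[:s].sum() + d_right[s:].sum() for s in range(hi - lo)]
            bounds[i] = lo + int(torch.stack(costs).argmin())
        # medoid step: pick the most central frame of each segment
        for i in range(len(medoids)):
            seg = feats[bounds[i]:bounds[i + 1]]
            medoids[i] = bounds[i] + int(torch.cdist(seg, seg).sum(dim=1).argmin())
    return bounds
```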
Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics. While traditional approaches follow a two-step pipeline, by generating framewise probabilities and then feeding them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
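The smoothing loss referenced here is MS-TCN's truncated MSE over frame-wise log-probabilities (T-MSE); a sketch following the published formulation, with the clipping threshold τ = 4 and loss weight 0.15 from the paper:

```python
import torch
import torch.nn.functional as F

def tmse_smoothing_loss(logits: torch.Tensor, tau: float = 4.0) -> torch.Tensor:
    """Truncated MSE over adjacent frame log-probabilities (MS-TCN's T-MSE).

    logits: (C, T) frame-wise class logits of one video (one stage's output).
    Penalizes abrupt frame-to-frame changes to reduce over-segmentation.
    """
    log_probs = F.log_softmax(logits, dim=0)               # (C, T)
    delta = log_probs[:, 1:] - log_probs[:, :-1].detach()  # stop-grad on t-1
    delta = delta.abs().clamp(max=tau)                     # truncate large jumps
    return (delta ** 2).mean()

# per-stage objective: L = cross_entropy + 0.15 * tmse_smoothing_loss(logits)
```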
We introduce a novel approach for temporal action segmentation with timestamp supervision. Our main contribution is a graph convolutional network, learned in an end-to-end manner, that exploits both frame features and the connectivity between neighboring frames to generate dense frame-wise labels from sparse timestamp labels. The generated dense frame-wise labels can then be used to train the segmentation model. In addition, we propose a framework for alternating learning of both the segmentation model and the graph convolutional model, which first initializes and then iteratively refines the learned models. Detailed experiments on four public datasets, including 50Salads, GTEA, Breakfast, and Desktop Assembly, show that our method outperforms the multi-layer perceptron baseline, while performing on par with or better than the state of the art in temporal action segmentation under timestamp supervision.
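A minimal sketch of the propagation idea, reduced to a single graph-convolution layer over a chain graph of temporally adjacent frames; the paper's actual network is more elaborate.

```python
import torch
import torch.nn as nn

class FrameChainGCN(nn.Module):
    """One graph-convolution layer over a chain graph of video frames.

    A hedged sketch: each frame is a node connected to its temporal
    neighbors, so the layer mixes every frame feature with adjacent frames
    before classification, letting sparse timestamp labels influence
    nearby frames through shared, smoothed features."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (T, D); average each node with its left/right neighbors,
        # replicating the border frames (a normalized chain adjacency).
        left = torch.cat([feats[:1], feats[:-1]], dim=0)
        right = torch.cat([feats[1:], feats[-1:]], dim=0)
        h = torch.relu(self.linear((feats + left + right) / 3.0))
        return self.classifier(h)   # (T, num_classes) dense frame-wise labels
```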
Surgical phase recognition is a fundamental task in computer-assisted surgery systems. Most existing works are trained under the supervision of expensive and time-consuming full annotations, which require surgeons to repeatedly watch videos to find the precise start and end times of a surgical phase. In this paper, we introduce timestamp supervision for surgical phase recognition, training models with timestamp annotations where surgeons are asked to identify only a single timestamp within the temporal boundary of a phase. This annotation can significantly reduce the manual annotation cost compared to full annotations. To make full use of such timestamp supervision, we propose a novel method called uncertainty-aware temporal diffusion (UATD) to generate trustworthy pseudo-labels for training. Our proposed UATD is motivated by the property of surgical videos that phases are long events consisting of consecutive frames. To be specific, UATD diffuses each single labeled timestamp to its high-confidence (i.e., low-uncertainty) neighboring frames in an iterative way. Our study uncovers unique insights into surgical phase recognition with timestamp supervision: 1) timestamp annotation reduces annotation time by 74% compared with full annotation, and surgeons tend to annotate timestamps near the middle of phases; 2) extensive experiments demonstrate that our method achieves competitive results compared with fully supervised methods while reducing the manual annotation cost; 3) less is more in surgical phase recognition, i.e., fewer but discriminative pseudo-labels outperform full labels that contain ambiguous frames; and 4) the proposed UATD can be used as a plug-and-play method to clean ambiguous labels near the boundaries between phases and improve the performance of current surgical phase recognition methods.
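A simplified sketch of the diffusion idea: starting from each annotated timestamp, the label is expanded frame by frame while the model stays confident. The plain softmax-probability threshold below is an assumption; the paper's uncertainty estimate may differ.

```python
import torch

def diffuse_timestamp_labels(probs: torch.Tensor,
                             timestamps: list,
                             conf_thresh: float = 0.9) -> torch.Tensor:
    """probs: (T, C) frame-wise class probabilities from the current model.
    timestamps: list of (frame_index, phase_label) single-frame annotations.
    Returns (T,) pseudo-labels; -1 marks frames left unlabeled.
    A hedged sketch of uncertainty-aware diffusion."""
    T = probs.size(0)
    pseudo = torch.full((T,), -1, dtype=torch.long)
    for t0, label in timestamps:
        pseudo[t0] = label
        for step in (1, -1):                   # diffuse right, then left
            t = t0 + step
            while 0 <= t < T and probs[t, label] >= conf_thresh:
                pseudo[t] = label              # confident neighboring frame
                t += step
    return pseudo
```

Iterating this with model retraining matches the "iterative way" the abstract describes: better models diffuse labels further, which in turn trains better models.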
The temporal action segmentation task segments videos in time and predicts action labels for all frames. Fully supervising such a segmentation model requires dense frame-wise action annotations, which are expensive and tedious to collect. This work is the first to propose a constituent action discovery (CAD) framework that requires only video-level high-level complex activity labels as supervision for temporal action segmentation. The proposed method automatically discovers constituent video actions using an activity classification task. Specifically, we define a finite number of latent action prototypes to construct video-level dual representations, and these prototypes are jointly learned through the activity classification training. This setting endows our method with the capability to discover potentially shared actions across multiple complex activities. Due to the lack of action-level supervision, we adopt the Hungarian matching algorithm to relate latent action prototypes to ground-truth semantic classes for evaluation. We show that, with high-level supervision, the Hungarian matching can be extended from the existing video level and activity level to the global level. The global-level matching allows action sharing across activities, which has never been considered in the literature. Extensive experiments demonstrate that our discovered actions can help perform temporal action segmentation and activity recognition tasks.
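The evaluation-time Hungarian matching can be done with scipy.optimize.linear_sum_assignment; a sketch assuming a prototype-class co-occurrence matrix, pooled over all videos for the global-level matching, as the matching score:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_prototypes_to_classes(pred: np.ndarray, gt: np.ndarray,
                                n_proto: int, n_cls: int) -> dict:
    """Global-level Hungarian matching of latent prototypes to semantic classes.

    pred, gt: (N,) frame-wise prototype ids and ground-truth class ids pooled
    over all videos (the 'global level' described above). The co-occurrence
    score is an illustrative assumption.
    """
    cooc = np.zeros((n_proto, n_cls))
    for p, c in zip(pred, gt):
        cooc[p, c] += 1                        # co-occurrence counts
    rows, cols = linear_sum_assignment(-cooc)  # maximize total overlap
    return dict(zip(rows, cols))               # prototype id -> class id
```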
Surgical tool detection in minimally invasive surgery is an essential component of computer-assisted interventions. Current approaches are mostly based on supervised methods that require large amounts of fully labeled data to train supervised models, and they suffer from pseudo-label bias due to the class imbalance problem. However, large image datasets with bounding-box annotations are often scarcely available. Semi-supervised learning (SSL) has recently emerged as a means of training large models using only a modest amount of annotated data; besides reducing annotation cost, SSL has also shown promise in producing more robust and generalizable models. Therefore, in this paper, we introduce an SSL framework in the surgical tool detection paradigm that aims to mitigate training data scarcity and data imbalance through a knowledge distillation approach. In the proposed work, we train a model with labeled data, which initializes teacher-student joint learning, where the student is trained on teacher-generated pseudo-labels from unlabeled data. We propose a multi-class distance with a margin-based classification loss function in the region-of-interest head of the detector to effectively segregate foreground classes from background regions. Our results on the M2CAI16-Tool-Locations dataset demonstrate the superiority of our approach on different supervised data settings (1%, 2%, 5%, and 10% of annotated data), where our model achieves overall improvements of 8%, 12%, and 27% in mAP (on 1% labeled data) over state-of-the-art SSL methods and a fully supervised baseline. The code is available at https://github.com/mansoor-at/semi-supervise-surgical-tool-det
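A hedged sketch of a margin-based separation loss in the detector's RoI head; the background-class index and the exact loss form are assumptions, not the paper's definition.

```python
import torch

def roi_margin_loss(logits: torch.Tensor, labels: torch.Tensor,
                    margin: float = 1.0, bg_class: int = 0) -> torch.Tensor:
    """logits: (N, C) RoI-head class logits; labels: (N,) assigned classes.
    For each foreground RoI, require its true-class logit to exceed the
    background logit by `margin`, pushing tool classes away from background
    regions. A sketch; `bg_class = 0` is an assumption."""
    fg = labels != bg_class
    if fg.sum() == 0:
        return logits.new_zeros(())
    fg_logits = logits[fg]
    true_logit = fg_logits.gather(1, labels[fg].unsqueeze(1)).squeeze(1)
    gap = true_logit - fg_logits[:, bg_class]
    return torch.relu(margin - gap).mean()
```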
Temporal action segmentation tags action labels for every frame in an input untrimmed video containing multiple actions in a sequence. For the task of temporal action segmentation, we propose an encoder-decoder-style architecture named C2F-TCN featuring a "coarse-to-fine" ensemble of decoder outputs. The C2F-TCN framework is enhanced with a novel model-agnostic temporal feature augmentation strategy formed by the computationally inexpensive strategy of stochastic max-pooling of segments. It produces more accurate and well-calibrated supervised results on three benchmark action segmentation datasets. We show that the architecture is flexible for both supervised and representation learning. In line with this, we present a novel unsupervised way to learn frame-wise representation from C2F-TCN. Our unsupervised learning approach hinges on the clustering capabilities of the input features and the formation of multi-resolution features from the decoder's implicit structure. Further, we provide the first semi-supervised temporal action segmentation results by merging representation learning with conventional supervised learning. Our semi-supervised learning scheme, called "Iterative-Contrastive-Classify (ICC)", progressively improves in performance with more labeled data. ICC semi-supervised learning in C2F-TCN, with 40% labeled videos, performs similarly to its fully supervised counterpart.
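The stochastic max-pooling augmentation admits a simple reading: split a video's frame features into randomly sized contiguous chunks and max-pool each chunk. A sketch under that assumption:

```python
import torch

def stochastic_segment_maxpool(feats: torch.Tensor, num_segments: int) -> torch.Tensor:
    """feats: (T, D) frame features; assumes num_segments <= T.
    Cut the sequence at random positions into `num_segments` contiguous
    chunks and max-pool each chunk, producing a (num_segments, D) coarser
    view of the video. A sketch of one reading of C2F-TCN's augmentation;
    the exact procedure may differ."""
    T = feats.size(0)
    cuts = torch.randperm(T - 1)[: num_segments - 1].sort().values + 1
    bounds = [0, *cuts.tolist(), T]
    pooled = [feats[a:b].max(dim=0).values for a, b in zip(bounds, bounds[1:])]
    return torch.stack(pooled)
```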
In temporal action segmentation, timestamp supervision requires only a handful of labeled frames per video sequence. For unlabeled frames, previous works rely on assigning hard labels, and performance rapidly collapses under subtle violations of the annotation assumptions. We propose a novel Expectation-Maximization (EM) based approach that leverages the label uncertainty of unlabeled frames and is robust enough to accommodate possible annotation errors. With accurate timestamp annotations, our proposed method produces SOTA results and even exceeds the fully supervised setup in several metrics and datasets. When applied to timestamp annotations with missing action segments, our method presents stable performance. To further test the robustness of our formulation, we introduce the new challenging annotation setup of Skip-Tag supervision. This setup relaxes the constraints and requires annotations of any fixed number of random frames in a video, making it more flexible than timestamp supervision while remaining competitive.
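A schematic of one EM-style step under timestamp supervision, with the E-step posterior stood in for by the model's own softmax restricted to the video's annotated actions; this is an illustrative simplification, not the paper's derivation.

```python
import torch
import torch.nn.functional as F

def em_soft_label_step(logits: torch.Tensor, video_actions: torch.Tensor) -> torch.Tensor:
    """One illustrative EM-style update for the unlabeled frames of a video.

    logits: (T, C) frame-wise logits; video_actions: (K,) class indices whose
    timestamps were annotated in this video.
    E-step: form a soft posterior over only the annotated actions.
    M-step: minimize cross-entropy against that (detached) soft posterior,
    so label uncertainty is kept instead of committing to hard labels.
    """
    masked = torch.full_like(logits, float('-inf'))
    masked[:, video_actions] = logits[:, video_actions]
    posterior = masked.softmax(dim=1).detach()           # E-step (soft labels)
    log_probs = F.log_softmax(logits, dim=1)
    return -(posterior * log_probs).sum(dim=1).mean()    # M-step loss
```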
To date, the most powerful semi-supervised object detectors (SS-OD) are based on pseudo-boxes, which require a series of post-processing steps with fine-tuned hyper-parameters. In this work, we propose replacing sparse pseudo-boxes with dense predictions as a unified and straightforward form of pseudo-label. Compared to pseudo-boxes, our dense pseudo-labels (DPL) do not involve any post-processing method, thus retaining richer information. We also introduce a region selection technique to highlight the key information while suppressing the noise carried by dense labels. We name our proposed SS-OD algorithm that leverages the DPL Dense Teacher. On COCO and VOC, Dense Teacher shows superior performance under various settings compared with pseudo-box-based methods.
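The region selection technique can be illustrated as keeping only the highest-scoring fraction of dense locations for distillation; the selection criterion and ratio below are assumptions.

```python
import torch

def select_learning_regions(teacher_scores: torch.Tensor, ratio: float = 0.01) -> torch.Tensor:
    """teacher_scores: (N,) per-location max classification scores from the
    teacher's dense head, flattened over the feature pyramid. Keep the top
    `ratio` fraction of locations as the region where dense pseudo-labels
    are distilled; the rest is suppressed as probable noise. A sketch of
    the region-selection idea, not Dense Teacher's exact rule."""
    k = max(1, int(ratio * teacher_scores.numel()))
    thresh = teacher_scores.topk(k).values[-1]
    return teacher_scores >= thresh            # boolean learning-region mask
```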
Dense anticipation aims to predict future actions and their durations into the far future. Existing methods rely on fully labeled data, i.e., sequences labeled with all future actions and their durations. We present a (semi-)weakly supervised method that uses only a small number of fully labeled sequences and predominantly sequences in which only the upcoming action is labeled. To this end, we propose a framework that generates pseudo-labels for future actions and their durations and adaptively refines them through a refinement module. Given only the upcoming action label as input, these pseudo-labels guide action and duration prediction for the future. We further design an attention mechanism to predict background-aware durations. Experiments on the Breakfast and 50Salads benchmarks verify our method's efficiency; we are competitive with, or even outperform, fully supervised state-of-the-art models. We will make our code available at: https://github.com/zhanghaotong1/wslvideodenseantication
Spatio-temporal action detection is an important component of video understanding. Current spatio-temporal action detection methods first use an object detector to obtain person candidate proposals, and the model then classifies the candidates into different action categories. These so-called two-stage methods are heavyweight and hard to apply in real-world applications. Some existing methods use a unified model structure, but they perform poorly with a vanilla model and usually need extra modules to boost performance. In this paper, we explore strategies for building an end-to-end spatio-temporal action detector with minimal modifications. To this end, we propose a new method named ME-STAD, which solves the spatio-temporal action detection problem in an end-to-end manner. Besides the model design, we also propose a novel labeling strategy to handle sparse annotations in spatio-temporal datasets. The proposed ME-STAD achieves better results (2.2% mAP gain) than the original two-stage detector with 80% fewer FLOPs. Moreover, our ME-STAD requires only minimal modifications over previous methods and does not need any extra components. Our code will be made public.
Left-ventricular ejection fraction (LVEF) is an important indicator of heart failure. Existing methods for LVEF estimation from video require large amounts of annotated data to achieve high performance, e.g., using 10,030 labeled echocardiogram videos to achieve a mean absolute error (MAE) of 4.10. Labeling these videos is time-consuming, however, and limits potential downstream applications to other heart diseases. This paper presents the first semi-supervised approach for LVEF prediction. Unlike general video prediction tasks, LVEF prediction is specifically related to changes in the left ventricle (LV) in echocardiogram videos. By incorporating knowledge learned from predicting LV segmentations into LVEF regression, we can provide additional context to the model for better predictions. To this end, we propose a novel Cyclical Self-Supervision (CSS) method for learning video-based LV segmentation, which is motivated by the observation that the heartbeat is a cyclical process with temporal repetition. Prediction masks from our segmentation model can then be used as additional input for LVEF regression to provide spatial context for the LV region. We also introduce teacher-student distillation to distill the information from LV segmentation masks into an end-to-end LVEF regression model that only requires video inputs. Results show our method outperforms alternative semi-supervised methods and can achieve an MAE of 4.17, which is competitive with state-of-the-art supervised performance, using half the number of labels. Validation on an external dataset also shows improved generalization ability from using our method. Our code is available at https://github.com/xmed-lab/CSS-SemiVideo.
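A minimal sketch of the teacher-student distillation objective described above; the L1 loss form and weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def lvef_distillation_loss(student_pred: torch.Tensor,
                           teacher_pred: torch.Tensor,
                           target: torch.Tensor,
                           distill_weight: float = 1.0) -> torch.Tensor:
    """student_pred: (B,) LVEF regressed by the video-only student.
    teacher_pred:  (B,) LVEF from the teacher that also sees LV masks.
    target:        (B,) ground-truth LVEF for the labeled samples.
    A hedged sketch of the distillation objective, not the paper's exact loss."""
    supervised = F.l1_loss(student_pred, target)               # labeled supervision
    distill = F.l1_loss(student_pred, teacher_pred.detach())   # mimic the teacher
    return supervised + distill_weight * distill
```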
Weakly supervised temporal action localization aims to learn instance-level action patterns from video-level labels, where a significant challenge is action-context confusion. To overcome this challenge, one recent work builds an action-click supervision framework. It requires similar annotation costs but can steadily improve localization performance compared to conventional weakly supervised methods. In this paper, by revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background video frames rather than on action frames. To this end, we convert action-click supervision to background-click supervision and develop a novel method named BackTAL. Specifically, BackTAL implements two-fold modeling on the background video frames, i.e., position modeling and feature modeling. In position modeling, we not only conduct supervised learning on the annotated video frames but also design a score separation module to enlarge the score difference between potential action frames and backgrounds. In feature modeling, we propose an affinity module to measure frame-specific similarities among neighboring frames and dynamically attend to informative neighbors when computing temporal convolutions. Extensive experiments on three benchmarks demonstrate the high performance of the established BackTAL and the rationality of the proposed background-click supervision. Code is available at https://github.com/vididle/backtal.
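The score separation module's goal, enlarging the gap between scores of background-click frames and potential action frames, can be illustrated with a simple hinge-style loss (a sketch; BackTAL's exact formulation is in the paper):

```python
import torch

def score_separation_loss(actionness: torch.Tensor,
                          bg_clicks: torch.Tensor,
                          margin: float = 0.5) -> torch.Tensor:
    """actionness: (T,) class-agnostic action scores in [0, 1].
    bg_clicks: (T,) boolean mask of background-click annotated frames.
    Push the mean background score at least `margin` below the mean score
    of the remaining (potential action) frames."""
    bg_score = actionness[bg_clicks].mean()
    action_score = actionness[~bg_clicks].mean()   # potential action frames
    return torch.relu(margin - (action_score - bg_score))
```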
In this work, we focus on semi-supervised learning for video action detection, which utilizes both labeled and unlabeled data. We propose a simple end-to-end consistency-based approach that effectively utilizes the unlabeled data. Video action detection requires not only action class prediction but also spatio-temporal localization of actions. Therefore, we investigate two types of constraints: classification consistency and spatio-temporal consistency. The presence of predominant background and static regions in a video makes it challenging to utilize spatio-temporal consistency for action detection. To address this, we propose two novel regularization constraints for spatio-temporal consistency: 1) temporal coherency and 2) gradient smoothness. Both these aspects exploit the temporal continuity of actions in videos and are found to be effective for utilizing unlabeled videos for action detection. We demonstrate the effectiveness of the proposed approach on two action detection benchmark datasets, UCF101-24 and JHMDB-21. In addition, we also show the effectiveness of the proposed approach for video object segmentation on Youtube-VOS, which demonstrates its generalization capability. The proposed approach achieves competitive performance using merely 20% of annotations on UCF101-24 compared to recent fully supervised methods; on UCF101-24, it improves by +8.9% and +11% at 0.5 f-mAP and v-mAP, respectively, over the supervised baseline.
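The two regularizers can be sketched on a predicted spatio-temporal localization map of shape (T, H, W): temporal coherency pulls adjacent-frame predictions together, and gradient smoothness penalizes abrupt changes in the temporal gradient. An illustration consistent with the description above, not the authors' exact formulation:

```python
import torch

def temporal_coherency_loss(maps: torch.Tensor) -> torch.Tensor:
    """maps: (T, H, W) predicted action localization maps of one video.
    Adjacent frames of an action should agree spatially."""
    return ((maps[1:] - maps[:-1]) ** 2).mean()

def gradient_smoothness_loss(maps: torch.Tensor) -> torch.Tensor:
    """Penalize abrupt changes in the temporal gradient of the predictions
    (a second-order smoothness term along the time axis)."""
    grad = maps[1:] - maps[:-1]
    return (grad[1:] - grad[:-1]).abs().mean()
```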
In this paper, we delve into two key techniques in semi-supervised object detection (SSOD), namely pseudo labeling and consistency training. We observe that these two techniques currently neglect some important properties of object detection, hindering efficient learning on unlabeled data. Specifically, for pseudo labeling, existing works only focus on the classification score yet fail to guarantee the localization precision of pseudo boxes; for consistency training, the widely adopted random-resize training only considers label-level consistency but misses feature-level consistency, which also plays an important role in ensuring scale invariance. To address the problems incurred by noisy pseudo boxes, we design Noisy Pseudo-box Learning (NPL), which includes Prediction-guided Label Assignment (PLA) and Positive-proposal Consistency Voting (PCV). PLA relies on model predictions to assign labels and is robust even to coarse pseudo boxes. PCV leverages the regression consistency of positive proposals to reflect the localization quality of pseudo boxes. Furthermore, for consistency training, we propose Multi-view Scale-invariant Learning (MSL), which includes mechanisms for both label- and feature-level consistency, where feature consistency is achieved by aligning shifted feature pyramids between two images with identical content but varied scales. On the COCO benchmark, our method, termed PSEudo labeling and COnsistency training (PseCo), outperforms the SOTA (Soft Teacher) by 2.0, 1.8, and 2.0 points under 1%, 5%, and 10% labeling ratios, respectively. It also significantly improves the learning efficiency of SSOD; e.g., PseCo halves the training time of the SOTA approach while achieving even better performance. The code is available at https://github.com/ligang-cs/pseco.
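PCV's core idea, letting the regressed boxes of positive proposals vote on a pseudo box's localization quality, can be sketched as a mean-IoU score (a hedged reading of the abstract):

```python
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """IoU between boxes a: (N, 4) and one box b: (4,), as (x1, y1, x2, y2)."""
    lt = torch.maximum(a[:, :2], b[:2])
    rb = torch.minimum(a[:, 2:], b[2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).clamp(min=0).prod(dim=1)
    area_b = (b[2:] - b[:2]).clamp(min=0).prod()
    return inter / (area_a + area_b - inter).clamp(min=1e-6)

def pcv_quality(regressed: torch.Tensor, pseudo_box: torch.Tensor) -> torch.Tensor:
    """Positive-proposal Consistency Voting, sketched: the regressed boxes of
    all positive proposals (N, 4) 'vote' on a pseudo box; their mean IoU with
    it serves as a localization-quality weight for that pseudo box's loss."""
    return box_iou(regressed, pseudo_box).mean()
```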