One of the fundamental challenges in video object segmentation is finding an effective representation of the target and background appearance. The best performing methods resort to extensive fine-tuning of a convolutional neural network for this purpose. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end, since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns powerful representations of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue that is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We approach the fine-tuning based methods on DAVIS17 while running at 15 FPS on a single GPU, and our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
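A minimal sketch of the kind of generative appearance module described above (not the authors' implementation): per-class diagonal Gaussians are fitted to the annotated frame's pixel features, and posterior target/background probabilities for a new frame follow from Bayes' rule with equal priors. Feature layout and the diagonal-covariance choice are illustrative assumptions.

```python
import torch

def fit_appearance_model(feats, mask, eps=1e-5):
    # feats: (C, H, W) features of the annotated frame; mask: (H, W) with 1 = target
    x = feats.flatten(1).t()                  # (H*W, C)
    m = mask.flatten().bool()
    params = []
    for sel in (m, ~m):                       # target pixels, then background pixels
        xs = x[sel]
        params.append((xs.mean(0), xs.var(0) + eps))   # diagonal Gaussian per class
    return params

def class_posteriors(feats, params):
    # feats: (C, H, W) features of a new frame -> (2, H, W) posterior class probabilities
    C, H, W = feats.shape
    x = feats.flatten(1).t()                  # (N, C)
    log_liks = []
    for mu, var in params:
        log_liks.append(-0.5 * (((x - mu) ** 2 / var) + var.log()).sum(dim=1))
    post = torch.softmax(torch.stack(log_liks, dim=1), dim=1)   # Bayes rule, equal priors
    return post.t().reshape(2, H, W)
```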
The current strive towards end-to-end trainable computer vision systems imposes major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires the online learning of a robust target-specific appearance model during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Because of these difficulties, the popular Siamese paradigm simply predicts a target feature template. However, such a model possesses limited discriminative power, since it cannot integrate background information. We develop an end-to-end tracking architecture capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that is able to predict a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state of the art on six tracking benchmarks, achieving an EAO score of 0.440 on VOT2018 while running at over 40 FPS.
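As a rough illustration of optimization-based target model prediction (a stand-in for the learned steepest-descent optimizer the abstract alludes to), the sketch below fits a 1x1 correlation filter with a regularized least-squares discriminative loss over a few gradient steps; the filter shape, the L2 loss, and plain SGD are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def predict_target_model(feats, labels, steps=5, reg=1e-2, lr=1.0):
    # feats: (B, C, H, W) training-frame features; labels: (B, 1, H, W) target confidence maps
    _, C, _, _ = feats.shape
    w = torch.zeros(1, C, 1, 1, requires_grad=True)   # 1x1 target model (filter)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):                            # a handful of optimizer iterations
        opt.zero_grad()
        scores = F.conv2d(feats, w)                   # (B, 1, H, W) classification scores
        loss = F.mse_loss(scores, labels) + reg * (w ** 2).sum()
        loss.backward()
        opt.step()
    return w.detach()

# Applying the predicted model to a test frame:
# test_scores = F.conv2d(test_feats, predict_target_model(train_feats, train_labels))
```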
We present a novel solution for semi-supervised video object segmentation. By the nature of the problem, the available cues (e.g., video frames with object masks) become richer with intermediate predictions, yet existing methods are unable to fully exploit this rich source of information. We address the issue by leveraging memory networks and learning to read relevant information from all available sources. In our framework, the past frames with object masks form an external memory, and the current frame, as the query, is segmented using the mask information in the memory. Specifically, the query and the memory are densely matched in feature space, covering all space-time pixel locations in a feed-forward fashion. Compared to previous methods, the abundant use of guidance information allows us to better handle challenges such as appearance changes and occlusions. We validate our method on the latest benchmarks and achieve state-of-the-art performance (an overall score of 79.4 on the YouTube-VOS val set, and J of 88.7 and 79.2 on the DAVIS 2016/2017 val sets, respectively), while maintaining a fast runtime (0.16 seconds/frame on the DAVIS 2016 val set).
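A minimal sketch of a space-time memory read in the spirit of the key/value matching described above (the tensor layout, dot-product similarity, and scale factor are assumptions, not the paper's exact design): every query pixel attends over all memory pixels and retrieves a weighted sum of memory values, which is then passed on to a decoder.

```python
import torch

def memory_read(query_key, query_val, mem_key, mem_val):
    # query_key: (Ck, H, W), query_val: (Cv, H, W)
    # mem_key:  (T, Ck, H, W), mem_val: (T, Cv, H, W)  -- T past frames with masks
    Ck, H, W = query_key.shape
    qk = query_key.flatten(1)                               # (Ck, HW)
    mk = mem_key.permute(1, 0, 2, 3).reshape(Ck, -1)        # (Ck, T*HW)
    mv = mem_val.permute(1, 0, 2, 3).reshape(mem_val.shape[1], -1)   # (Cv, T*HW)
    affinity = torch.softmax(mk.t() @ qk / Ck ** 0.5, dim=0)   # (T*HW, HW), over memory
    read = mv @ affinity                                    # (Cv, HW) retrieved values
    return torch.cat([read.view(-1, H, W), query_val], dim=0)  # input to a decoder
```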
While recent years have witnessed astonishing improvements in visual tracking robustness, the advances in tracking accuracy have been severely limited. As the focus has been on developing powerful classifiers, the problem of accurate target state estimation has been largely overlooked. Instead, most methods resort to a simple multi-scale search to estimate the target bounding box. We argue that this approach is fundamentally limited, since target estimation is a complex task requiring high-level knowledge about the object. We therefore address the problem of target state estimation in tracking. We propose a novel tracking architecture consisting of dedicated target estimation and classification components. Owing to the complexity of target estimation, we propose a component that can be trained entirely offline on large-scale datasets. Our target estimation component is trained to predict the overlap between the target object and an estimated bounding box. By carefully integrating target-specific information into the prediction, our approach achieves previously unseen bounding box accuracy. In addition, we integrate an online-trained classification component to guarantee high discriminative power in the presence of distractors. Our final tracking framework consists of a unified multi-task architecture and sets a new state of the art on four challenging benchmarks. On the large-scale TrackingNet dataset, our tracker ATOM achieves a 15% relative gain while running at over 30 FPS.
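A hedged sketch of the overlap-maximization idea: the bounding box is refined by gradient ascent on a predicted IoU. Here `iou_predictor` is a hypothetical module standing in for the offline-trained overlap estimator, and the step size and size-relative scaling are illustrative choices.

```python
import torch

def refine_box(box, feat, iou_predictor, steps=10, lr=0.05):
    # box: (4,) tensor (x, y, w, h); feat: image features
    # iou_predictor(feat, box) -> scalar predicted IoU (hypothetical, differentiable in box)
    box = box.clone().float().requires_grad_(True)
    for _ in range(steps):
        iou = iou_predictor(feat, box)
        grad, = torch.autograd.grad(iou, box)
        with torch.no_grad():
            box += lr * grad * box[2:].detach().repeat(2)   # scale steps by box size
    return box.detach()
```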
In this paper we show how to perform both visual object tracking and semi-supervised video object segmentation, in real time, with a single simple approach. Our method, dubbed SiamMask, improves the offline training procedure of popular fully-convolutional Siamese approaches for object tracking by augmenting their loss with a binary segmentation task. Once trained, SiamMask relies solely on a single bounding-box initialisation, operates online, and produces class-agnostic object segmentation masks and rotated bounding boxes at 35 frames per second. Despite its simplicity, versatility and speed, our strategy allows us to establish a new state of the art among real-time trackers on VOT-2018, while at the same time demonstrating competitive performance and the best speed for the semi-supervised video object segmentation task on DAVIS-2016 and DAVIS-2017. The project website is: http://www.robots.ox.ac.uk/~qwang/SiamMask.
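A small sketch of the fully-convolutional Siamese operation such approaches build on: depth-wise cross-correlation between template and search-region features, whose per-location responses can then feed branches such as a mask head. The toy 1x1-convolution mask head predicting a flattened low-resolution mask per location is an assumption for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_xcorr(search, template):
    # search: (B, C, Hs, Ws); template: (B, C, Ht, Wt) -> (B, C, Hs-Ht+1, Ws-Wt+1)
    B, C = search.shape[:2]
    s = search.reshape(1, B * C, *search.shape[2:])
    t = template.reshape(B * C, 1, *template.shape[2:])
    out = F.conv2d(s, t, groups=B * C)            # correlate each channel independently
    return out.reshape(B, C, *out.shape[2:])

class ToyMaskHead(nn.Module):
    """One flattened low-resolution mask per correlation-map location (illustrative)."""
    def __init__(self, channels, mask_size=63):
        super().__init__()
        self.proj = nn.Conv2d(channels, mask_size * mask_size, kernel_size=1)
        self.mask_size = mask_size

    def forward(self, corr_feat):
        logits = self.proj(corr_feat)             # (B, mask_size^2, H', W')
        B, _, H, W = logits.shape
        return logits.view(B, self.mask_size, self.mask_size, H, W)
```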
Video object segmentation is challenging yet important for a wide variety of video analysis applications. Recent works formulate video object segmentation as a prediction task using deep networks to achieve appealing performance. Due to this formulation, most of these methods require fine-tuning at test time so that the deep network memorizes the appearance of the object of interest in the given video. However, fine-tuning is time-consuming and computationally expensive, so the resulting algorithms are far from real time. To address this issue, we develop a novel matching-based algorithm for video object segmentation. In contrast to classification techniques that memorize appearance, the proposed method learns to match extracted features to a provided template without memorizing the object's appearance. We validate the effectiveness and robustness of the proposed method on the challenging DAVIS-16, DAVIS-17, Youtube-Objects and JumpCut datasets. Extensive results show that our method achieves comparable performance without fine-tuning and is much more favorable in terms of computation time.
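A minimal sketch of matching-based segmentation (not the paper's exact matching layer): test-frame features are compared by cosine similarity to foreground and background template features taken from the annotated frame, and the average of the top-k similarities per pixel decides the class. The value of k and the softmax normalization are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def match_segment(feat, fg_templ, bg_templ, k=20):
    # feat: (C, H, W) test-frame features; fg_templ / bg_templ: (C, N) template pixel features
    C, H, W = feat.shape
    x = F.normalize(feat.flatten(1), dim=0)                  # (C, HW), unit-norm per pixel
    scores = []
    for templ in (fg_templ, bg_templ):
        sim = F.normalize(templ, dim=0).t() @ x              # (N, HW) cosine similarities
        kk = min(k, sim.shape[0])
        scores.append(sim.topk(kk, dim=0).values.mean(0))    # (HW,) top-k average
    prob = torch.softmax(torch.stack(scores, dim=0), dim=0)  # (2, HW): fg vs. bg
    return prob.view(2, H, W)
```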
Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce the video object segmentation problem as a concept of guided instance segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convnet trained with static images only. The key ingredient of our approach is a combination of offline and online learning strategies, where the former serves to produce a refined mask from the previous frame estimate and the latter allows to capture the appearance of the specific object instance. Our method can handle different types of input annotations: bounding boxes and segments, as well as incorporate multiple annotated frames, making the system suitable for diverse applications. We obtain competitive results on three different datasets, independently from the type of input annotation.
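A toy sketch of the guidance mechanism: the previous frame's mask estimate is stacked with the RGB image as a fourth input channel, so a network trained on static images (with perturbed masks standing in for the previous frame) can refine it into the current frame's mask. The tiny network below is purely illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GuidedSegmentationNet(nn.Module):
    """RGB image + previous-frame mask (4 input channels) -> current-frame mask logits."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, image, prev_mask):
        # image: (B, 3, H, W), prev_mask: (B, 1, H, W) rough estimate from frame t-1
        return self.body(torch.cat([image, prev_mask], dim=1))
```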
Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the result of these methods can deteriorate significantly. This paper explores the orthogonal approach of processing each frame independently, i.e. disregarding the temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS-S), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance-level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent single-object video segmentation databases, which show that OSVOS-S is both the fastest and most accurate method in the state of the art. Experiments on multi-object video segmentation show that OSVOS-S obtains competitive results.
The problem of video object segmentation can become extremely challenging when multiple instances co-exist. While each instance may exhibit large scale and pose variations, the problem is compounded when instances occlude each other causing failures in tracking. In this study, we formulate a deep recurrent network that is capable of segmenting and tracking objects in video simultaneously by their temporal continuity, yet able to re-identify them when they reappear after a prolonged occlusion. We combine both temporal propagation and re-identification functionalities into a single framework that can be trained end-to-end. In particular, we present a re-identification module with template expansion to retrieve missing objects despite their large appearance changes. In addition, we contribute a new attention-based recurrent mask propagation approach that is robust to distractors not belonging to the target segment. Our approach achieves a new state-of-the-art global mean (Region Jaccard and Boundary F measure) of 68.2 on the challenging DAVIS 2017 benchmark [28] (test-dev set), outperforming the winning solution [21] which achieves a global mean of 66.1 on the same partition.
A major effort in video object segmentation is labeling object masks for training instances. We therefore propose to prepare inexpensive yet high-quality pseudo ground truth with motion cues for training video object segmentation. Our method performs semantic segmentation with an instance segmentation network and then selects the segmented object of interest as pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune a pretrained objectness network to facilitate object segmentation in the remaining frames of the video. We show that the pseudo ground truth can effectively improve segmentation performance. This intuitive unsupervised video object segmentation method is more efficient than existing approaches. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods on various benchmark datasets. The category-agnostic pseudo ground truth has great potential to extend to tracking multiple arbitrary objects.
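A small sketch of the pseudo-ground-truth selection step as described: instance masks from an off-the-shelf instance segmentation network are kept only if they overlap a motion-derived foreground mask sufficiently. The IoU criterion and the threshold value are assumptions for illustration.

```python
import numpy as np

def select_pseudo_ground_truth(instance_masks, motion_mask, iou_thresh=0.5):
    # instance_masks: list of (H, W) boolean arrays; motion_mask: (H, W) boolean array
    selected = []
    for mask in instance_masks:
        inter = np.logical_and(mask, motion_mask).sum()
        union = np.logical_or(mask, motion_mask).sum()
        if union > 0 and inter / union >= iou_thresh:
            selected.append(mask)        # keep as pseudo ground truth for fine-tuning
    return selected
```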
Deep CNNs have achieved superior performance in many tasks of computer vision and image understanding. However, it is still difficult to effectively apply deep CNNs to video object segmentation (VOS) since treating video frames as separate and static will lose the information hidden in motion. To tackle this problem, we propose a Motion-guided Cascaded Refinement Network for VOS. By assuming the object motion is normally different from the background motion, for a video frame we first apply an active contour model on optical flow to coarsely segment objects of interest. Then, the proposed Cascaded Refinement Network (CRN) takes the coarse segmentation as guidance to generate an accurate segmentation of full resolution. In this way, the motion information and the deep CNNs can well complement each other to accurately segment objects from video frames. Furthermore, in CRN we introduce a Single-channel Residual Attention Module to incorporate the coarse segmentation map as attention, making our network effective and efficient in both training and testing. We perform experiments on the popular benchmarks and the results show that our method achieves state-of-the-art performance at a much faster speed.
Many recent successful methods for video object segmentation (VOS) are overly complicated, heavily rely on fine-tuning on the first frame, and/or are slow, and are hence of limited practical use. In this work, we propose FEELVOS as a simple and fast method that does not rely on fine-tuning. To segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding together with a global and a local matching mechanism to transfer information from the first frame and the previous frame of the video to the current frame. In contrast to previous work, our embedding is only used as internal guidance of a convolutional network. Our novel dynamic segmentation head allows training the network, including the embedding, end-to-end for the multi-object segmentation task with a cross-entropy loss. We achieve a new state of the art in video object segmentation without fine-tuning, with a J&F measure of 69.1% on the DAVIS 2017 validation set.
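A rough sketch of the global matching step only (the local matching within a window around each pixel and any learned distance transform are omitted): each current-frame pixel embedding is compared against all first-frame pixels, and its nearest-neighbor distances to foreground and background pixels serve as guidance maps for the segmentation head.

```python
import torch

def global_matching(emb, ref_emb, ref_mask):
    # emb, ref_emb: (C, H, W) pixel embeddings; ref_mask: (H, W), 1 = object in first frame
    C, H, W = emb.shape
    x = emb.flatten(1).t()                     # (HW, C) current-frame pixels
    r = ref_emb.flatten(1).t()                 # (HW, C) first-frame pixels
    dists = torch.cdist(x, r)                  # (HW, HW) pairwise embedding distances
    m = ref_mask.flatten().bool()
    fg_dist = dists[:, m].min(dim=1).values    # nearest foreground pixel per location
    bg_dist = dists[:, ~m].min(dim=1).values   # nearest background pixel per location
    return torch.stack([fg_dist, bg_dist]).view(2, H, W)   # guidance maps for the head
```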
We tackle the task of semi-supervised video object segmentation, i.e. segmenting the pixels belonging to an object in a video using the ground truth pixel mask for the first frame. We build on the recently introduced one-shot video object segmentation (OSVOS) approach which uses a pretrained network and fine-tunes it on the first frame. While achieving impressive performance, at test time OSVOS uses the fine-tuned network in unchanged form and is not able to adapt to large changes in object appearance. To overcome this limitation, we propose Online Adaptive Video Object Segmentation (OnAVOS) which updates the network online using training examples selected based on the confidence of the network and the spatial configuration. Additionally, we add a pre-training step based on objectness, which is learned on PASCAL. Our experiments show that both extensions are highly effective and improve the state of the art on DAVIS to an intersection-over-union score of 85.7%.
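A hedged sketch of the online-adaptation idea: pixels the network already predicts as foreground with very high confidence become positive training targets, clearly negative pixels become negatives, uncertain pixels are ignored, and one or a few gradient steps are taken on the current frame. The thresholds are illustrative, and the spatial-configuration criterion mentioned in the abstract is omitted here.

```python
import torch
import torch.nn.functional as F

def online_adaptation_step(model, optimizer, image, conf_thresh=0.97, steps=1):
    # image: (1, 3, H, W); model(image) -> (1, 1, H, W) foreground logits
    model.eval()
    with torch.no_grad():
        prob = torch.sigmoid(model(image))
    positives = (prob > conf_thresh).float()          # confident foreground -> targets
    negatives = (prob < 1.0 - conf_thresh).float()    # confident background
    weight = positives + negatives                    # uncertain pixels get zero weight
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(image)
        loss = F.binary_cross_entropy_with_logits(logits, positives, weight=weight)
        loss.backward()
        optimizer.step()
```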
This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allows to propagate spatial information over time. We evaluate our method extensively on two benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.
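The "visual memory" is realized with convolutional gated recurrent units; a plain ConvGRU cell, shown below as a generic sketch of that building block (channel counts and kernel size are arbitrary), keeps a spatial hidden state that is updated frame by frame.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU: propagates spatial information over time as a visual memory."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        self.gates = nn.Conv2d(in_channels + hidden_channels, 2 * hidden_channels,
                               kernel_size, padding=padding)
        self.candidate = nn.Conv2d(in_channels + hidden_channels, hidden_channels,
                                   kernel_size, padding=padding)
        self.hidden_channels = hidden_channels

    def forward(self, x, h=None):
        # x: (B, in_channels, H, W) frame features; h: (B, hidden_channels, H, W) memory
        if h is None:
            h = x.new_zeros(x.shape[0], self.hidden_channels, *x.shape[2:])
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_new = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_new                # updated visual memory
```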
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in the image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against the state-of-the-art algorithms.
Figure 1. Example result of our technique: the segmentation of the first frame (red) is used to learn the model of the specific object to track, which is segmented in the rest of the frames independently (green). One every 20 frames shown of 90 in total.
This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).
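The one-shot procedure amounts to fine-tuning a pretrained foreground-segmentation network on the single annotated frame and then running it independently on every other frame; a minimal sketch follows (the optimizer, step count, loss, and threshold are illustrative choices, not the paper's exact settings).

```python
import torch
import torch.nn.functional as F

def one_shot_video_segmentation(model, first_frame, first_mask, frames, steps=200, lr=1e-5):
    # first_frame: (1, 3, H, W); first_mask: (1, 1, H, W); frames: list of (1, 3, H, W) tensors
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):                       # fine-tune on the annotated frame only
        optimizer.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(first_frame), first_mask)
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():                        # segment each remaining frame independently
        return [torch.sigmoid(model(f)) > 0.5 for f in frames]
```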
We address semi-supervised video object segmentation, the task of automatically generating accurate and consistent pixel masks for objects in a video sequence, given the ground-truth annotation of the first frame. To this end, we present the PReMVOS algorithm (Proposal-generation, Refinement and Merging for Video Object Segmentation). Our approach separates the problem into two steps: first generating a set of accurate object segmentation mask proposals for each video frame, and then selecting and merging these proposals into accurate and temporally consistent pixel-wise object tracks over the video sequence. We develop novel approaches for solving each sub-problem and combine them into a method designed specifically to tackle the difficulties involved in segmenting multiple objects across a video sequence. Our approach surpasses all previous state-of-the-art results on the DAVIS 2017 video object segmentation benchmark, with a J&F mean score of 71.6 on the test-dev set, and achieved first place in the DAVIS 2018 Video Object Segmentation Challenge.
Anticipating future events is an important prerequisite for intelligent behavior, and video forecasting has been studied as a proxy task towards this goal. Recent work has shown that, to predict the semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting them. In this paper, we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-size convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN to the predicted features to produce the instance segmentation of future frames. Experiments show that this approach significantly improves over strong baselines based on optical flow and on repurposing instance segmentation architectures.
High-quality computer vision models typically address the problem of understanding the general distribution of real-world images. However, most cameras observe only a very small fraction of this distribution. This offers the possibility of achieving more efficient inference by specializing compact, low-cost models to the specific distribution of frames observed by a single camera. In this paper, we employ model distillation (supervising a low-cost student model with the output of a high-cost teacher) to specialize accurate, low-cost semantic segmentation models to a target video stream. Rather than learning a specialized student model offline on data from the video stream, we train the student online on the live video, intermittently running the teacher to provide learning targets. Online model distillation yields semantic segmentation models that approach the accuracy of their Mask R-CNN teacher at 7 to 17x lower inference runtime cost (11 to 26x in FLOPs), even when the distribution of the target video is non-stationary. Our method requires no offline pretraining on the target video stream and achieves higher accuracy and lower cost than solutions based on optical flow or video object segmentation. We also provide a new video dataset for evaluating the efficiency of inference over long-running video streams.
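A hedged sketch of the online distillation loop: the cheap student segments every frame, and every few frames the expensive teacher is run so the student can take a gradient step toward its output. The fixed teacher cadence and the per-pixel cross-entropy against the teacher's hard labels are simplifying assumptions (the abstract implies the teacher is run intermittently; the exact schedule is not reproduced here).

```python
import torch
import torch.nn.functional as F

def online_distillation(student, teacher, frame_stream, optimizer, teacher_period=8):
    # student(frame) / teacher(frame): (1, num_classes, H, W) segmentation logits
    teacher.eval()
    for i, frame in enumerate(frame_stream):
        student.eval()
        with torch.no_grad():
            prediction = student(frame).argmax(dim=1)      # cheap per-frame inference
        if i % teacher_period == 0:                        # intermittently query the teacher
            with torch.no_grad():
                target = teacher(frame).argmax(dim=1)      # (1, H, W) hard teacher labels
            student.train()
            optimizer.zero_grad()
            loss = F.cross_entropy(student(frame), target)
            loss.backward()
            optimizer.step()
        yield prediction
```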
This paper addresses the problem of video object segmentation, where the initial object mask is given in the first frame of an input video. We propose a novel spatio-temporal Markov Random Field (MRF) model defined over pixels to handle this problem. Unlike conventional MRF models, the spatial dependencies among pixels in our model are encoded by a Convolutional Neural Network (CNN). Specifically, for a given object, the probability of a labeling to a set of spatially neighboring pixels can be predicted by a CNN trained for this specific object. As a result, higher-order, richer dependencies among pixels in the set can be implicitly modeled by the CNN. With temporal dependencies established by optical flow, the resulting MRF model combines both spatial and temporal cues for tackling video object segmentation. However, performing inference in the MRF model is very difficult due to the very high-order dependencies. To this end, we propose a novel CNN-embedded algorithm to perform approximate inference in the MRF. This algorithm proceeds by alternating between a temporal fusion step and a feed-forward CNN step. When initialized with an appearance-based one-shot segmentation CNN, our model outperforms the winning entries of the DAVIS 2017 Challenge, without resorting to model ensembling or any dedicated detectors.