A major component of video object segmentation is labeling object masks for the training instances. We therefore propose to prepare inexpensive yet high-quality pseudo ground truth, selected with motion cues, for training video object segmentation. Our method first performs semantic segmentation with an instance segmentation network, and then selects the segmented objects of interest as pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune a pretrained objectness network, so as to facilitate object segmentation in the remaining frames of the video. We show that the pseudo ground truth effectively improves segmentation performance, and that this intuitive unsupervised video object segmentation approach is more efficient than existing methods. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods on these benchmark datasets. The category-agnostic pseudo ground truth also has great potential to extend to multiple arbitrary object tracking.
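As a rough illustration of the selection step described above, the sketch below (not the authors' code) keeps the instance masks that overlap sufficiently with a motion mask obtained by thresholding optical-flow magnitude. The function name, the flow threshold, and the overlap criterion are assumptions made here for clarity.

```python
import numpy as np

def select_pseudo_ground_truth(instance_masks, flow, flow_thresh=2.0, overlap_thresh=0.5):
    """Pick instance masks that overlap strongly with moving regions.

    instance_masks: list of HxW boolean arrays from an instance segmentation network.
    flow:           HxWx2 optical flow between consecutive frames.
    Returns the union of the selected masks as a pseudo ground-truth mask.
    """
    # Pixels whose flow magnitude exceeds a threshold are treated as "moving".
    motion = np.linalg.norm(flow, axis=2) > flow_thresh

    pseudo_gt = np.zeros(motion.shape, dtype=bool)
    for mask in instance_masks:
        area = mask.sum()
        if area == 0:
            continue
        # The fraction of the instance that is moving decides whether it is kept.
        moving_fraction = np.logical_and(mask, motion).sum() / area
        if moving_fraction > overlap_thresh:
            pseudo_gt |= mask
    return pseudo_gt
```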
Video Object Segmentation, and video processing in general, has been historically dominated by methods that rely on the temporal consistency and redundancy in consecutive video frames. When the temporal smoothness is suddenly broken, such as when an object is occluded, or some frames are missing in a sequence, the result of these methods can deteriorate significantly. This paper explores the orthogonal approach of processing each frame independently, i.e. disregarding the temporal information. In particular, it tackles the task of semi-supervised video object segmentation: the separation of an object from the background in a video, given its mask in the first frame. We present Semantic One-Shot Video Object Segmentation (OSVOS-S), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one shot). We show that instance-level semantic information, when combined effectively, can dramatically improve the results of our previous method, OSVOS. We perform experiments on two recent single-object video segmentation databases, which show that OSVOS-S is both the fastest and most accurate method in the state of the art. Experiments on multi-object video segmentation show that OSVOS-S obtains competitive results.
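The one-shot adaptation step can be pictured with the following minimal PyTorch sketch, which fine-tunes a generic pretrained segmentation network on the single annotated frame. The number of steps, the learning rate, and the loss are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def one_shot_finetune(net, first_frame, first_mask, steps=200, lr=1e-5):
    """Adapt a pretrained segmentation network to one annotated frame.

    first_frame: 1x3xHxW image tensor; first_mask: 1x1xHxW float binary tensor.
    net is assumed to output a single-channel logit map of the same spatial size.
    """
    net.train()
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = net(first_frame)
        loss = F.binary_cross_entropy_with_logits(logits, first_mask)
        loss.backward()
        optimizer.step()
    net.eval()
    return net
```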
Deep CNNs have achieved superior performance in many tasks of computer vision and image understanding. However, it is still difficult to effectively apply deep CNNs to video object segmentation (VOS) since treating video frames as separate and static will lose the information hidden in motion. To tackle this problem, we propose a Motion-guided Cascaded Refinement Network for VOS. By assuming the object motion is normally different from the background motion, for a video frame we first apply an active contour model on optical flow to coarsely segment objects of interest. Then, the proposed Cascaded Refinement Network (CRN) takes the coarse segmentation as guidance to generate an accurate segmentation of full resolution. In this way, the motion information and the deep CNNs can well complement each other to accurately segment objects from video frames. Furthermore, in CRN we introduce a Single-channel Residual Attention Module to incorporate the coarse segmentation map as attention, making our network effective and efficient in both training and testing. We perform experiments on the popular benchmarks and the results show that our method achieves state-of-the-art performance at a much faster speed.
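One plausible reading of the Single-channel Residual Attention Module is that the coarse segmentation map modulates the deep features in residual form. The sketch below assumes that formulation, feat * (1 + attention); the module name, interpolation details, and the residual form itself are our assumptions, not taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class SingleChannelResidualAttention(nn.Module):
    """Modulate deep features with a single-channel coarse segmentation map.

    The residual form feat * (1 + att) lets the coarse map emphasise likely
    object regions without completely suppressing the original features.
    """

    def forward(self, features, coarse_mask):
        # features: NxCxHxW; coarse_mask: Nx1xhxw coarse segmentation in [0, 1].
        att = F.interpolate(coarse_mask, size=features.shape[2:],
                            mode='bilinear', align_corners=False)
        return features * (1.0 + att)
```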
We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel level segmentation masks for all prominent objects. We formulate the task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework. Since large-scale video datasets with pixel level segmentations are lacking, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves the state-of-the-art results for segmenting generic (unseen) objects. Code and pre-trained models are available on the project website.
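The two-stream idea can be sketched as follows: an appearance branch over RGB and a motion branch over optical flow, fused late by concatenation. The layer widths and the 1x1 fusion layer are simplifying assumptions and do not reproduce the paper's architecture.

```python
import torch
import torch.nn as nn

class TwoStreamFusionNet(nn.Module):
    """Fuse an appearance stream (RGB) and a motion stream (optical flow)."""

    def __init__(self, base_channels=16):
        super().__init__()
        self.appearance = nn.Sequential(
            nn.Conv2d(3, base_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels, 3, padding=1), nn.ReLU(inplace=True))
        self.motion = nn.Sequential(
            nn.Conv2d(2, base_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_channels, base_channels, 3, padding=1), nn.ReLU(inplace=True))
        # Late fusion by concatenation, followed by a 1x1 prediction layer.
        self.fuse = nn.Conv2d(2 * base_channels, 1, kernel_size=1)

    def forward(self, rgb, flow):
        # rgb: Nx3xHxW image, flow: Nx2xHxW optical flow; returns Nx1xHxW logits.
        return self.fuse(torch.cat([self.appearance(rgb), self.motion(flow)], dim=1))
```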
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in the image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against the state-of-the-art algorithms.
This paper addresses the problem of semi-supervised video object segmentation, that is, segmenting an object in a sequence given its mask in the first frame. One of the main challenges in this scenario is the change in appearance of the objects of interest; their semantics, on the other hand, do not vary. This paper investigates how to take advantage of this invariance by introducing a semantic prior that guides the appearance model. Specifically, given the segmentation mask of the first frame of a sequence, we estimate the semantics of the object of interest and propagate that knowledge throughout the sequence, refining the results with the appearance model. We present Semantically-Guided Video Object Segmentation (SGV), which improves on the previous state of the art on two different datasets under a variety of evaluation metrics, while running at half a second per frame.
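One simple way to realize such a semantic prior, sketched below under our own assumptions, is to take the instance-segmentation detection that best matches the first-frame mask, record its category, and keep only detections of that category in later frames. The helper names and the IoU-based matching are illustrative, not the paper's exact procedure.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def estimate_semantic_prior(first_mask, first_frame_detections):
    """Pick the semantic category whose detection best matches the given mask.

    first_frame_detections: list of (mask, category) from an instance segmenter.
    """
    best_mask, best_category = max(first_frame_detections,
                                   key=lambda d: iou(first_mask, d[0]))
    return best_category

def filter_by_prior(detections, category):
    """Keep only detections of the category estimated on the first frame."""
    return [mask for mask, cat in detections if cat == category]
```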
This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a "visual memory" in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the "visual memory" specific to the video, acquired automatically without any manually-annotated frames. The visual memory is implemented with convolutional gated recurrent units, which allows to propagate spatial information over time. We evaluate our method extensively on two benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.
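The visual memory is described as convolutional gated recurrent units; a generic ConvGRU cell, not the authors' implementation, might look like the following PyTorch sketch.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: a spatially-aware recurrent unit for video features."""

    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Update and reset gates computed jointly, candidate state separately.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 2 * hidden_ch, kernel_size, padding=pad)
        self.candidate = nn.Conv2d(in_ch + hidden_ch, hidden_ch, kernel_size, padding=pad)
        self.hidden_ch = hidden_ch

    def forward(self, x, h=None):
        # x: NxCxHxW input features; h: NxhiddenxHxW previous state (or None).
        if h is None:
            h = x.new_zeros(x.size(0), self.hidden_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde
```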
We propose an end-to-end learning framework for segmenting generic objects in images and videos. Given a novel image or video, our method produces a pixel-level mask for all "object-like" regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented with a deep fully convolutional network. When applied to video, our model further incorporates a motion stream, and the network learns to combine motion and appearance, attempting to extract all prominent objects whether they are moving or not. Beyond the core model, a second contribution is how it leverages training annotations of varying strength. Pixel-level annotations are difficult to obtain, yet essential for training deep segmentation networks. We therefore propose methods to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object-category examples with image-level labels together with relatively few images with boundary-level annotations. For videos, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/
Unsupervised video object segmentation is a crucial application in video analysis that does not require any prior knowledge about the objects. It becomes extremely challenging when multiple objects appear and interact in a given video clip. This paper proposes a novel unsupervised video object segmentation approach based on distractor-aware online adaptation (DOA). DOA models the spatial-temporal consistency of the video sequence by capturing background dependencies from adjacent frames. Instance proposals are generated by an instance segmentation network for each frame and then selected by motion information as hard negatives, if they exist, or as positives. To adopt high-quality hard negatives, a block matching algorithm is then applied to subsequent frames to track the associated hard negatives. General negatives are also introduced in case there are no hard negatives in the sequence, and the experiments demonstrate that both kinds of negatives (distractors) are complementary. Finally, we conduct DOA using the positive, general-negative, and hard-negative masks to update the foreground/background segmentation. The proposed method achieves state-of-the-art results on two benchmark datasets, DAVIS 2016 and FBMS-59.
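The block-matching step used to track hard negatives could be implemented, in a simplified exhaustive-search form, as below. The search radius, the sum-of-squared-differences cost, and the grayscale inputs are assumptions for illustration only.

```python
import numpy as np

def block_match(prev_frame, next_frame, box, search=8):
    """Track a hard-negative region into the next frame by exhaustive block matching.

    box: (y, x, h, w) of the region in prev_frame (grayscale float arrays).
    Returns the box in next_frame with the smallest sum of squared differences.
    """
    y, x, h, w = box
    template = prev_frame[y:y + h, x:x + w]
    best_cost, best_box = np.inf, box
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ny, nx = y + dy, x + dx
            if ny < 0 or nx < 0 or ny + h > next_frame.shape[0] or nx + w > next_frame.shape[1]:
                continue
            cost = np.sum((next_frame[ny:ny + h, nx:nx + w] - template) ** 2)
            if cost < best_cost:
                best_cost, best_box = cost, (ny, nx, h, w)
    return best_box
```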
The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original high resolution. We further improve this labeling with an objectness map and a conditional random field, to account for errors in optical flow, and also to focus on moving "things" rather than "stuff". The output label of each pixel denotes whether it has undergone independent motion, i.e., irrespective of camera motion. We demonstrate the benefits of this learning framework on the moving object segmentation task, where the goal is to segment all objects in motion. Our approach outperforms the top method on the recently released DAVIS benchmark dataset, comprising real-world sequences, by 5.6%. We also evaluate on the Berkeley motion segmentation database, achieving state-of-the-art results.
This paper proposes a deep learning model to efficiently detect salient regions in videos. It addresses two important issues: (1) deep video saliency model training with the absence of sufficiently large and pixel-wise annotated video data, and (2) fast video saliency training and detection. The proposed deep video saliency network consists of two modules, for capturing the spatial and temporal saliency information, respectively. The dynamic saliency model, explicitly incorporating saliency estimates from the static saliency model, directly produces spatiotemporal saliency inference without time-consuming optical flow computation. We further propose a novel data augmentation technique that simulates video training data from existing annotated image datasets, which enables our network to learn diverse saliency information and prevents overfitting with the limited number of training videos. Leveraging our synthetic video data (150K video sequences) and real videos, our deep video saliency model successfully learns both spatial and temporal saliency cues, thus producing accurate spatiotemporal saliency estimates. We advance the state-of-the-art on the DAVIS dataset (MAE of .06) and the FBMS dataset (MAE of .07), and do so with much improved speed (2 fps with all steps).
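A minimal sketch of the data-augmentation idea, simulating a consecutive frame from an annotated still image by a random shift (with wrap-around here for brevity), is given below. The actual deformations used in the paper are richer than this; the function name and parameters are ours.

```python
import numpy as np

def synthesize_frame_pair(image, mask, max_shift=10, rng=None):
    """Create a synthetic "next frame" by shifting an annotated still image.

    image: HxWx3 array; mask: HxW array.
    Returns (shifted_image, shifted_mask, (dy, dx)), where (dy, dx) is the
    known displacement that can serve as supervision for temporal cues.
    """
    rng = rng or np.random.default_rng()
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    # np.roll wraps around at the borders, which is a deliberate simplification.
    shifted_image = np.roll(image, (dy, dx), axis=(0, 1))
    shifted_mask = np.roll(mask, (dy, dx), axis=(0, 1))
    return shifted_image, shifted_mask, (dy, dx)
```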
One of the fundamental challenges in video object segmentation is finding an effective representation of the target and background appearance. For this purpose, the best-performing methods resort to extensive fine-tuning of a convolutional neural network. Besides being prohibitively expensive, this strategy cannot be truly trained end-to-end, since the online fine-tuning procedure is not integrated into the offline training of the network. To address these issues, we propose a network architecture that learns a powerful representation of the target and background appearance in a single forward pass. The introduced appearance module learns a probabilistic generative model of the target and background feature distributions. Given a new image, it predicts the posterior class probabilities, providing a highly discriminative cue that is processed in later network modules. Both the learning and prediction stages of our appearance module are fully differentiable, enabling true end-to-end training of the entire segmentation pipeline. Comprehensive experiments demonstrate the effectiveness of the proposed approach on three video object segmentation benchmarks. We close in on the best online fine-tuning based methods on DAVIS 2017, while running at 15 FPS on a single GPU. Furthermore, our method outperforms all published approaches on the large-scale YouTube-VOS dataset.
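The generative appearance idea can be illustrated with a class-conditional Gaussian model over first-frame features that yields posterior foreground probabilities for a new frame. The diagonal-covariance assumption and the NumPy formulation below are ours, and the sketch omits the differentiable, end-to-end aspect of the actual module.

```python
import numpy as np

def fit_appearance_model(features, mask, eps=1e-6):
    """Fit diagonal Gaussians to foreground and background feature vectors.

    features: HxWxC feature map of the first frame; mask: HxW boolean target mask.
    """
    fg, bg = features[mask], features[~mask]
    return {
        'fg': (fg.mean(0), fg.var(0) + eps),
        'bg': (bg.mean(0), bg.var(0) + eps),
        'prior_fg': mask.mean(),
    }

def log_gaussian(x, mean, var):
    """Log-density of a diagonal Gaussian, summed over the channel axis."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(-1)

def posterior_foreground(features, model):
    """Posterior probability of the target class at every spatial location."""
    log_fg = log_gaussian(features, *model['fg']) + np.log(model['prior_fg'])
    log_bg = log_gaussian(features, *model['bg']) + np.log(1 - model['prior_fg'])
    return 1.0 / (1.0 + np.exp(log_bg - log_fg))
```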
Figure 1. Example result of our technique: the segmentation of the first frame (red) is used to learn the model of the specific object to track, which is segmented in the rest of the frames independently (green). One every 20 frames shown, of 90 in total.

This paper tackles the task of semi-supervised video object segmentation, i.e., the separation of an object from the background in a video, given the mask of the first frame. We present One-Shot Video Object Segmentation (OSVOS), based on a fully-convolutional neural network architecture that is able to successively transfer generic semantic information, learned on ImageNet, to the task of foreground segmentation, and finally to learning the appearance of a single annotated object of the test sequence (hence one-shot). Although all frames are processed independently, the results are temporally coherent and stable. We perform experiments on two annotated video segmentation databases, which show that OSVOS is fast and improves the state of the art by a significant margin (79.8% vs 68.0%).
Image semantic segmentation is more and more being of interest for computer vision and machine learning researchers. Many applications on the rise need accurate and efficient segmentation mechanisms: autonomous driving, indoor navigation, and even virtual or augmented reality systems to name a few. This demand coincides with the rise of deep learning approaches in almost every field or application target related to computer vision, including semantic segmentation or scene understanding. This paper provides a review on deep learning methods for semantic segmentation applied to various application areas. Firstly, we formulate the semantic segmentation problem and define the terminology of this field as well as interesting background concepts. Next, the main datasets and challenges are exposed to help researchers decide which are the ones that best suit their needs and goals. Then, existing methods are reviewed, highlighting their contributions and their significance in the field. We also devote a part of the paper to review common loss functions and error metrics for this problem. Finally, quantitative results are given for the described methods and the datasets in which they were evaluated, following up with a discussion of the results. At last, we point out a set of promising future works and draw our own conclusions about the state of the art of semantic segmentation using deep learning techniques.
Unsupervised online video object segmentation (VOS) aims at automatically segmenting moving objects over an unconstrained video without any requirement of prior information about the objects or the camera motion. It is therefore a very challenging problem for high-level video analysis. So far, only a limited number of such methods have been reported in the literature, and most of them still fall short of satisfactory performance. Targeting this challenging problem, in this paper we propose a novel unsupervised online VOS framework that understands the motion property as the meaning of "moving" in concurrence with "a generic object" for the segmented regions. By incorporating salient motion detection and object proposals, a pixel-wise fusion strategy is developed to effectively remove detection noise such as background motion and stationary objects. Furthermore, by leveraging the segmentation obtained from the previous frame, a forward propagation algorithm is proposed to deal with unreliable motion detection and object proposals. Experimental results on the DAVIS-2016 and SegTrack-v2 benchmark datasets show that the proposed method outperforms other state-of-the-art unsupervised online segmentation approaches by at least 5.6% absolute improvement, and even achieves better performance than the best unsupervised offline method on the DAVIS-2016 dataset. Another noteworthy advantage is that, in all our experiments, only one trained model for object proposals (Mask R-CNN on the COCO dataset) is used without any fine-tuning, which demonstrates the robustness of the approach. The biggest contribution of this work may be to reveal the potential of, and to motivate more research on, VOS frameworks based on characteristic motion properties.
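The forward-propagation idea, reusing the previous frame's segmentation when motion detection and proposals are unreliable, could be sketched as warping the previous mask with optical flow and fusing it with the current detection. The nearest-neighbor warping, the assumption of backward flow, and the simple OR fusion below are simplifications for illustration.

```python
import numpy as np

def propagate_mask(prev_mask, flow):
    """Warp the previous frame's mask to the current frame using backward flow.

    prev_mask: HxW binary mask; flow: HxWx2 flow from the current frame back to
    the previous one (dx in channel 0, dy in channel 1).
    """
    h, w = prev_mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return prev_mask[src_y, src_x]

def fuse_with_propagation(current_detection, prev_mask, flow):
    """Rescue unreliable detections by OR-ing with the propagated previous mask."""
    return np.logical_or(current_detection, propagate_mask(prev_mask, flow))
```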
Online video object segmentation is a challenging task, as it requires processing the image sequence in a timely and accurate manner. To segment a target object through the video, numerous CNN-based methods have been developed by heavily finetuning on the object mask in the first frame, which is time-consuming for online applications. In this paper, we propose a fast and accurate video object segmentation algorithm that can immediately start the segmentation process once receiving the images. We first utilize a part-based tracking method to deal with challenging factors such as large deformation, occlusion, and cluttered background. Based on the tracked bounding boxes of parts, we construct a region-of-interest segmentation network to generate part masks. Finally, a similarity-based scoring function is adopted to refine these object parts by comparing them to the visual information in the first frame. Our method performs favorably against state-of-the-art algorithms in accuracy on the DAVIS benchmark dataset, while achieving much faster runtime performance.
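The similarity-based scoring of part masks against the first-frame appearance might, in its simplest form, compare pooled part features to a reference feature with cosine similarity, as in the sketch below. The threshold, the pooling, and the function names are assumptions, not the paper's exact scoring function.

```python
import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def score_part_masks(part_features, reference_feature, threshold=0.5):
    """Keep part masks whose pooled features resemble the first-frame object.

    part_features:     list of (mask, feature_vector) for candidate parts.
    reference_feature: pooled feature vector of the annotated first-frame object.
    """
    return [mask for mask, feat in part_features
            if cosine_similarity(feat, reference_feature) > threshold]
```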
We address the highly challenging problem of video object segmentation. Given only the initial mask, the task is to segment the target in the subsequent frames. In order to effectively handle appearance changes and similar background objects, a robust representation of the target is required. Previous approaches either rely on fine-tuning a segmentation network on the first frame, or employ a generative appearance model. Although partially successful, these methods often suffer from impractically low frame rates or unsatisfactory robustness. We propose a novel approach, based on a dedicated target appearance model that is exclusively learned online to discriminate between the target and background image regions. Importantly, we design a specialized loss and customized optimization techniques to achieve efficient online training. Our light-weight target model is integrated into a carefully designed segmentation network, trained offline to enhance the predictions generated by the target model. Extensive experiments are performed on three datasets. Our approach achieves an overall score of over 70 on YouTube-VOS, while operating at 25 frames per second.
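A lightweight discriminative target model of this kind can be approximated by a linear model fit to first-frame features with ridge regression, as sketched below. The closed-form solver stands in for the paper's specialized loss and customized optimizer, so this is only a conceptual illustration under our own assumptions.

```python
import numpy as np

def train_target_model(features, mask, reg=1e-2):
    """Fit a linear target model by ridge regression on first-frame features.

    features: HxWxC feature map; mask: HxW binary target mask used as the target.
    """
    X = features.reshape(-1, features.shape[-1])   # N x C design matrix
    y = mask.reshape(-1).astype(np.float64)        # N regression targets
    C = X.shape[1]
    return np.linalg.solve(X.T @ X + reg * np.eye(C), X.T @ y)

def apply_target_model(features, w):
    """Coarse target score map for a new frame; a refinement network would sharpen it."""
    scores = features.reshape(-1, features.shape[-1]) @ w
    return scores.reshape(features.shape[:2])
```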
Going beyond the existing single-person and multi-person parsing tasks on static images, this paper makes the first attempt to investigate a more realistic video instance-level human parsing task, which simultaneously segments out person instances and parses each instance into finer-grained parts (e.g., head, leg, dress). We introduce a novel Adaptive Temporal Encoding Network (ATEN) that alternately performs temporal encoding among key frames and flow-guided feature propagation from other consecutive frames between two key frames. Specifically, ATEN first incorporates a Parsing-RCNN to produce the instance-level parsing result for each key frame, integrating both global human parsing and instance-level human segmentation into a unified model. To balance accuracy and efficiency, flow-guided feature propagation is used to directly parse consecutive frames according to their identified temporal consistency with key frames. On the other hand, ATEN leverages a convolutional gated recurrent unit (convGRU) to exploit temporal changes over a series of key frames, which are further used to facilitate frame-level instance-level parsing. By alternately performing direct feature propagation between consecutive frames and temporal encoding among key frames, our ATEN achieves a good balance between frame-level accuracy and temporal efficiency, which is a common crucial problem in video object segmentation research. To demonstrate the superiority of our ATEN, extensive experiments are conducted on the most popular video segmentation benchmark (DAVIS) and a newly collected Video Instance-level Parsing (VIP) dataset, which is the first video instance-level human parsing dataset, consisting of 404 sequences and over 20k frames with instance-level and pixel-wise annotations.
In interventional radiology, short video sequences of vein structures in motion are captured in order to help medical personnel identify vascular issues or plan interventions. By indicating the accurate position of vessels and instruments, semantic segmentation can greatly improve the usefulness of these videos and reduce ambiguity. We propose a real-time segmentation method for these tasks, based on a U-Net network trained in a Siamese architecture from automatically generated annotations. We make use of noisy, low-level binary segmentation and optical flow to generate multi-class annotations that are successively improved in a multi-stage segmentation approach. We significantly improve the performance of a state-of-the-art U-Net at a processing speed of 90 fps.