The ability to predict the future is important for intelligent systems such as autonomous vehicles and robots, which must plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments: the former provides dense semantic information, i.e. what objects will be present and where they will appear, while the latter provides dense motion information, i.e. how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To the best of our knowledge, this is the first attempt to jointly predict scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups, while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results than well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we demonstrate that our model can be used to predict the steering angle of a vehicle, which further verifies its ability to learn latent representations of scene dynamics.
As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth. It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, faraway mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we start by training a deep network, using fully automatic supervision, to predict relative scene depth from single images. The relative depth training images are automatically derived from simple videos of cars moving through a scene, using recent motion segmentation techniques, and no human-provided labels. The proxy task of predicting relative depth from a single image induces features in the network that result in large improvements in a set of downstream tasks including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation, over a network trained from scratch. The improvement on the semantic segmentation task is greater than that produced by any other automatically supervised methods. Moreover, for monocular depth estimation, our unsupervised pre-training method even outperforms supervised pre-training with ImageNet. In addition, we demonstrate benefits from learning to predict (again, completely unsupervised) relative depth in the specific videos associated with various downstream tasks (e.g., KITTI). We adapt to the specific scenes in those tasks in an unsupervised manner to improve performance. In summary, for semantic segmentation, we present state-of-the-art results among methods that do not use supervised pre-training, and we even exceed the performance of supervised ImageNet pre-trained models for monocular depth estimation, achieving results that are comparable with state-of-the-art methods.
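As a minimal, hypothetical illustration of the inverse depth/motion relationship this proxy task exploits (not the authors' actual pipeline, which derives training targets from motion segmentation of driving videos), one could turn a dense flow field into relative-depth targets as follows:

```python
import numpy as np

def relative_depth_targets(flow, eps=1e-6):
    """Turn a dense optical flow field (H, W, 2) into a *relative* depth map.

    Assumes a (mostly) translating camera, so apparent motion is roughly
    inversely proportional to depth: small displacement -> far, large -> near.
    The result is only defined up to an unknown scale, hence "relative".
    """
    magnitude = np.linalg.norm(flow, axis=-1)   # pixel displacement per frame
    rel_depth = 1.0 / (magnitude + eps)         # inverse of motion magnitude
    rel_depth /= np.median(rel_depth)           # normalize away the global scale
    return rel_depth

# Toy example: a far-away region moving 1 px and a nearby region moving 10 px.
flow = np.zeros((4, 4, 2))
flow[:2] = [1.0, 0.0]
flow[2:] = [10.0, 0.0]
print(relative_depth_targets(flow))  # top half ~10x "deeper" than bottom half
```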
In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing framework. (1) Predictive feature learning from nearly unlimited unlabeled video data. Different from existing methods learning features from single-frame parsing, we learn spatiotemporal discriminative features by enforcing a parsing network to predict future frames and their parsing maps (if available) given only historical frames. In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations. (2) Prediction steering parsing architecture that effectively adapts the learned spatiotemporal features to scene parsing tasks and provides strong guidance for any off-the-shelf parsing model to achieve better video scene parsing performance. Extensive experiments over two challenging datasets, Cityscapes and CamVid, have demonstrated the effectiveness of our methods by showing significant improvement over well-established baselines.
The difficulty of annotating training data is a major obstacle to applying CNNs to low-level vision tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these problems: a network is first trained on a task for which supervision is easier to obtain or which can be trained without supervision at all, and the trained network is then fine-tuned on a small amount of ground-truth data for the original task. Here, we study frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN for temporal interpolation without supervision. Such a network implicitly estimates motion, but cannot handle certain regions. By fine-tuning on a small amount of ground-truth flow, the network learns to fill in homogeneous regions and to compute full optical flow fields. With this unsupervised pre-training, our network outperforms a comparable architecture trained with supervision on synthetic optical flow.
The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original high resolution. We further improve this labeling with an objectness map and a conditional random field, to account for errors in optical flow, and also to focus on moving "things" rather than "stuff". The output label of each pixel denotes whether it has undergone independent motion, i.e., irrespective of camera motion. We demonstrate the benefits of this learning framework on the moving object segmentation task, where the goal is to segment all objects in motion. Our approach outperforms the top method on the recently released DAVIS benchmark dataset, comprising real-world sequences, by 5.6%. We also evaluate on the Berkeley motion segmentation database, achieving state-of-the-art results.
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a small extra computational cost while improving performance when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
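The central operation described here is warping a previous frame's internal feature maps to the current frame using optical flow. Below is a minimal PyTorch sketch of such a bilinear feature warp; it is an illustration under simplifying assumptions, not the exact NetWarp module, which additionally learns how to combine the warped and current representations:

```python
import torch
import torch.nn.functional as F

def warp_features(feat_prev, flow):
    """Warp a feature map from the previous frame into the current frame.

    feat_prev: (N, C, H, W) features computed on frame t-1.
    flow:      (N, 2, H, W) flow from frame t to frame t-1, in pixels
               (for each pixel in frame t, where to sample in frame t-1).
    Returns features aligned with frame t, via bilinear sampling.
    """
    n, _, h, w = flow.shape
    # Base sampling grid in pixel coordinates (x, y).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                             # (N, 2, H, W)
    # Normalize to [-1, 1] for grid_sample, ordered (x, y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)              # (N, H, W, 2)
    return F.grid_sample(feat_prev, grid, mode="bilinear", align_corners=True)

# Example: warping with zero flow is (numerically) the identity.
feat = torch.randn(1, 64, 32, 64)
flow = torch.zeros(1, 2, 32, 64)
assert torch.allclose(warp_features(feat, flow), feat, atol=1e-4)
```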
In the era of end-to-end deep learning, many advances in computer vision are driven by large amounts of labeled data. In the optical flow setting, however, obtaining dense per-pixel ground truth for real scenes is difficult and thus such data is rare. Therefore, recent end-to-end convolutional networks for optical flow rely on synthetic datasets for supervision, but the domain mismatch between training and test scenarios continues to be a challenge. Inspired by classical energy-based optical flow methods, we design an unsupervised loss based on occlusion-aware bidirectional flow estimation and the robust census transform to circumvent the need for ground truth flow. On the KITTI benchmarks, our unsupervised approach outperforms previous unsupervised deep networks by a large margin, and is even more accurate than similar supervised methods trained on synthetic datasets alone. By optionally fine-tuning on the KITTI training data, our method achieves competitive optical flow accuracy on the KITTI 2012 and 2015 benchmarks, thus in addition enabling generic pre-training of supervised networks for datasets with limited amounts of ground truth.
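A common way to make such a loss occlusion-aware is a forward-backward consistency check: pixels whose forward and backward flows do not roughly cancel out are excluded from the photometric term. The sketch below is a simplified illustration only; it uses nearest-neighbor lookup instead of bilinear warping, and the thresholds alpha1/alpha2 are typical values from the unsupervised-flow literature, not necessarily the paper's settings:

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, alpha1=0.01, alpha2=0.5):
    """Forward-backward consistency check for occlusion-aware flow losses.

    flow_fw, flow_bw: (H, W, 2) flow fields in pixels, (dx, dy) per pixel.
    Returns a boolean (H, W) mask, True where the pixel is deemed occluded
    and should be excluded from the photometric loss.
    """
    h, w, _ = flow_fw.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbor lookup of the backward flow at the forward-warped location.
    xt = np.clip(np.round(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    flow_bw_warped = flow_bw[yt, xt]                       # (H, W, 2)

    sq = lambda f: (f ** 2).sum(axis=-1)
    mismatch = sq(flow_fw + flow_bw_warped)                # ~0 if consistent
    bound = alpha1 * (sq(flow_fw) + sq(flow_bw_warped)) + alpha2
    return mismatch > bound
```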
This work addresses the problem of semantic scene understanding under fog. Although significant progress has been made in semantic scene understanding, it has mainly focused on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog, using both labeled synthetic foggy data and unlabeled real foggy data. The method is based on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be exploited to bootstrap the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present the following main standalone contributions: 1) a novel method for adding synthetic fog to real, clear-weather scenes using semantic input; 2) a new fog density estimator; 3) a novel fog densification method that densifies the fog in real foggy scenes without using depth; 4) the Foggy Zurich dataset, comprising 3808 real foggy images, with pixel-level semantic annotations for 40 images under dense fog. Our experiments show that: 1) our fog simulation and fog density estimation outperform their state-of-the-art counterparts for semantic foggy scene understanding (SFSU); 2) CMAda significantly improves the performance of state-of-the-art models for SFSU, benefiting from both our synthetic and real foggy data. The dataset and code are available on the project website.
We present an unsupervised learning framework for jointly training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods typically exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as an additional supervisory signal. Our core idea is that, for rigid regions, we can use the predicted scene depth and camera motion to synthesize 2D optical flow by back-projecting the induced 3D scene flow. The discrepancy between this rigid flow (from depth prediction and camera motion) and the estimated flow (from the optical flow model) allows us to impose a cross-task consistency loss. Although all the networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.
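The rigid flow used for this cross-task consistency follows from standard pinhole geometry: back-project each pixel with its predicted depth, move the 3D point by the predicted relative camera pose, and re-project it. A minimal numpy sketch under these assumptions is given below; the actual method implements this differentiably inside the networks:

```python
import numpy as np

def rigid_flow(depth, K, T):
    """Synthesize the 2D flow induced purely by camera motion ("rigid flow").

    depth: (H, W) predicted depth of frame t.
    K:     (3, 3) camera intrinsics.
    T:     (4, 4) relative camera pose from frame t to frame t+1.
    Returns (H, W, 2) flow: per-pixel displacement from t to t+1, assuming a
    static (rigid) scene.
    """
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=0).reshape(3, -1)  # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)                # back-project to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])               # homogeneous 3D points
    cam2 = (T @ cam_h)[:3]                                             # move into frame t+1
    pix2 = K @ cam2
    pix2 = pix2[:2] / np.clip(pix2[2:], 1e-6, None)                    # re-project to pixels
    return (pix2 - pix[:2]).T.reshape(h, w, 2)

# Sanity check: identity pose -> zero rigid flow everywhere.
K = np.array([[100.0, 0, 32], [0, 100.0, 24], [0, 0, 1]])
print(np.abs(rigid_flow(np.full((48, 64), 5.0), K, np.eye(4))).max())  # ~0
```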
Anticipating future events is an important prerequisite for intelligent behavior. Video prediction has been studied as a proxy task towards this goal. Recent work has shown that, to predict the semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting them. In this paper, we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To handle the varying number of output labels per image, we develop a predictive model in the space of the fixed-sized convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN to the predicted features to produce instance segmentations of future frames. Experiments show that this approach significantly improves over strong baselines based on optical flow and on repurposing the instance segmentation architecture.
We propose an end-to-end learning framework for segmenting generic objects in images and videos. Given a novel image or video, our approach produces a pixel-level mask for all "object-like" regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented using a deep fully convolutional network. When applied to video, our model additionally incorporates a motion stream, and the network learns to combine appearance and motion, attempting to extract all prominent objects whether they are moving or not. Beyond the core model, a second contribution is how it leverages training annotations of varying strength. Pixel-level annotations are difficult to obtain, yet crucial for training deep segmentation networks. We therefore propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object category examples that have only image-level labels with relatively few images that have boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we show how our approach benefits image retrieval and image retargeting, both of which flourish given high-quality foreground maps. Code, models, and videos are available at: http://vision.cs.utexas.edu/projects/pixelobjectness/
Labeling object masks for training instances is a major burden in video object segmentation. We therefore propose to prepare inexpensive yet high-quality pseudo ground truth with motion cues for training video object segmentation. Our method uses an instance segmentation network to perform semantic segmentation and then selects the segmented objects of interest as pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune a pre-trained object network to facilitate object segmentation in the remaining frames of the video. We show that the pseudo ground truth can effectively improve segmentation performance. This intuitive approach to unsupervised video object segmentation is also more efficient than existing methods. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods on various benchmark datasets. The category-agnostic pseudo ground truth has great potential to extend to multiple arbitrary-object tracking.
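One plausible way to realize the selection step described above is to keep only those instance masks that overlap sufficiently with a motion mask. The sketch below is a hypothetical illustration under that assumption; the paper's exact selection criterion may differ:

```python
import numpy as np

def select_pseudo_gt(instance_masks, motion_mask, min_overlap=0.5):
    """Pick instance-segmentation masks that agree with motion cues.

    instance_masks: list of boolean (H, W) masks from an instance segmentation network.
    motion_mask:    boolean (H, W) mask of moving pixels (e.g. thresholded flow magnitude).
    Returns the union of the selected masks, usable as pseudo ground truth.
    """
    h, w = motion_mask.shape
    pseudo_gt = np.zeros((h, w), dtype=bool)
    for mask in instance_masks:
        overlap = (mask & motion_mask).sum() / max(mask.sum(), 1)
        if overlap >= min_overlap:   # mostly-moving instance -> keep it
            pseudo_gt |= mask
    return pseudo_gt
```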
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. For semantic urban scene understanding, however, no current dataset adequately captures the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset for training and testing approaches for pixel-level and instance-level semantic labeling. Cityscapes consists of a large, diverse set of stereo video sequences recorded in the streets of 50 different cities. 5000 of these images have high-quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
We address the problem of synthesizing new video frames in an existing video, either in between existing frames (interpolation) or after them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is difficult, while newer neural-network-based methods that directly hallucinate pixel values often produce blurry results. We combine the strengths of both approaches by training a deep network that synthesizes video frames by flowing existing pixel values, which we call deep voxel flow. Our method requires no human supervision, and any video can serve as training data by dropping, and then learning to predict, existing frames. The technique is efficient and can be applied at any video resolution. We demonstrate that our method produces results that improve both quantitatively and qualitatively over the state of the art.
The ability to predict and therefore to anticipate the future is an important attribute of intelligence. It is also of utmost importance in real-time systems, e.g. in robotics or autonomous driving, which depend on visual scene understanding for decision making. While prediction of the raw RGB pixel values in future video frames has been studied in previous work, here we introduce the novel task of predicting semantic segmentations of future frames. Given a sequence of video frames, our goal is to predict segmentation maps of not yet observed video frames that lie up to a second or further in the future. We develop an autoregressive convolutional neural network that learns to iteratively generate multiple frames. Our results on the Cityscapes dataset show that directly predicting future segmentations is substantially better than predicting and then segmenting future RGB frames. Prediction results up to half a second in the future are visually convincing and are much more accurate than those of a baseline based on warping semantic segmentations using optical flow.
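The flow-warping baseline mentioned above can be made concrete with a simple nearest-neighbor warp of the current label map. The sketch below assumes backward flow from the future frame to the current one and illustrates only the baseline, not the proposed autoregressive model:

```python
import numpy as np

def warp_labels(seg, flow):
    """Flow-warping baseline for future segmentation: propagate the current
    label map one step forward with nearest-neighbor sampling.

    seg:  (H, W) integer label map at frame t.
    flow: (H, W, 2) flow from frame t+1 back to frame t, in pixels.
    Returns the predicted (H, W) label map at frame t+1.
    """
    h, w = seg.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return seg[src_y, src_x]
```

Applying this repeatedly gives multi-step predictions, which is roughly the kind of baseline the abstract compares against.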
Motion is a dominant cue in automated driving systems. Moving objects are typically detected by computing optical flow, and depth is estimated using triangulation. In this paper, our motivation is to leverage readily available dense optical flow to improve the performance of semantic segmentation. To provide a systematic study, we construct four different architectures: RGB only, flow only, RGBF concatenated, and two-stream RGB + flow. We evaluate these networks on two automotive datasets, Virtual KITTI and Cityscapes, using the state-of-the-art flow estimator FlowNet v2. We also make use of the ground-truth optical flow in Virtual KITTI as an ideal estimator, and the standard Farneback optical flow algorithm, to study the effect of noise. Using the flow ground truth in Virtual KITTI, the two-stream architecture achieves the best results, with a 4% improvement in IoU. As expected, moving objects such as trucks, vans, and cars improve substantially, with IoU increases of 38%, 28%, and 6%. With FlowNet, the mean IoU improves by 2.4%, with large gains for moving objects, corresponding to improvements of 26%, 11%, and 5% for trucks, vans, and cars. On Cityscapes, adding flow provides improvements for moving objects such as motorcycles and trains, with IoU increases of 17% and 7%.
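To make the architectural variants concrete, here is a minimal PyTorch sketch of the RGBF (early-fusion) input stem; the two-stream variant would instead run separate encoders on RGB and flow and fuse their features later. The module name and channel sizes are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

# RGBF early fusion: stack the 2-channel flow onto the 3 RGB channels and let the
# first convolution of an otherwise unchanged segmentation network mix them.
class RGBFStem(nn.Module):
    def __init__(self, out_channels=64):
        super().__init__()
        self.conv = nn.Conv2d(3 + 2, out_channels, kernel_size=3, padding=1)

    def forward(self, rgb, flow):            # rgb: (N,3,H,W), flow: (N,2,H,W)
        return self.conv(torch.cat([rgb, flow], dim=1))

stem = RGBFStem()
out = stem(torch.randn(1, 3, 64, 128), torch.randn(1, 2, 64, 128))
print(out.shape)   # torch.Size([1, 64, 64, 128])
```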
Video semantic segmentation has recently become one of the research hotspots in computer vision. It serves as a perceptual foundation for many fields, such as robotics and autonomous driving. The rapid development of semantic segmentation is largely attributable to large-scale datasets, especially for learning-based methods. Several semantic segmentation datasets for complex urban scenes already exist, such as the Cityscapes and CamVid datasets, and they are the standard datasets for comparing semantic segmentation methods. In this paper, we introduce a new high-resolution UAV video semantic segmentation dataset, UAVid, as a complement. Our UAV dataset consists of 30 video sequences capturing high-resolution images. In total, 300 images are densely labeled with 8 classes for the urban scene understanding task. Our dataset brings new challenges. We provide several deep learning baseline methods, among which the proposed novel Multi-Scale-Dilation net performs best via multi-scale feature extraction. We also explore the usability of the sequence data by applying a CRF model in both the spatial and temporal domains.
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in image segmentation tasks, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against the state-of-the-art algorithms.
Training deep networks to perform semantic segmentation requires large amounts of labeled data. To alleviate the manual effort of annotating real images, researchers have investigated the use of synthetic data, which can be labeled automatically. Unfortunately, networks trained on synthetic data perform relatively poorly on real images. While this can be addressed by domain adaptation, existing methods all require access to real images during training. In this paper, we introduce a drastically different way to handle synthetic images, one that does not require seeing any real images at any point. Our approach builds on the observation that foreground and background classes are not affected in the same manner by the domain shift, and thus should be treated differently. In particular, the former should be handled in a detection-based manner to better account for the fact that, while their textures in synthetic images are not photo-realistic, their shapes look natural. Our experiments demonstrate the effectiveness of our approach on Cityscapes and CamVid with models trained only on synthetic data.
For the state-of-the-art semantic segmentation task, training convolutional neural networks (CNNs) requires dense pixel-wise ground truth (GT) labeling, which is expensive and involves extensive human effort. In this work, we study the possibility of using auxiliary ground truth, so-called pseudo ground truth (PGT), to improve performance. The PGT is obtained by propagating the labels of a GT frame to its subsequent frames in the video using a simple CRF-based cue integration framework. Our main contribution is to demonstrate the use of noisy PGT along with GT to improve the performance of a CNN. We perform a systematic analysis to find the right kind of PGT that needs to be added along with the GT for training a CNN. In this regard, we explore three aspects of PGT which influence the learning of a CNN: i) the PGT labeling has to be of good quality; ii) the PGT images have to be different from the GT images; iii) the PGT has to be trusted differently than GT. We conclude that PGT which is diverse from the GT images and has good labeling quality can indeed help improve the performance of a CNN. Also, when the PGT is several folds larger than the GT, weighing down the trust on the PGT helps improve accuracy. Finally, we show that using PGT along with GT increases the IoU accuracy of a Fully Convolutional Network (FCN) on the CamVid data by 2.7%. We believe such an approach can be used to train CNNs for semantic video segmentation, where sequentially labeled image frames are needed. To this end, we provide recommendations for using PGT strategically for semantic segmentation and hence bypass the need for extensive human effort in labeling.
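The idea of trusting PGT less than GT can be expressed as a weighted sum of two cross-entropy terms. The sketch below is a hypothetical illustration only; the weight 0.25 is an arbitrary example, not the paper's setting:

```python
import torch
import torch.nn.functional as F

def gt_pgt_loss(logits_gt, labels_gt, logits_pgt, labels_pgt, pgt_weight=0.25):
    """Combine ground-truth (GT) and pseudo-ground-truth (PGT) supervision,
    trusting the noisier but more plentiful PGT less than the GT.

    logits_*: (N, C, H, W) network outputs; labels_*: (N, H, W) class indices.
    pgt_weight < 1 down-weights the PGT term, as suggested when PGT is
    several folds larger than GT.
    """
    loss_gt = F.cross_entropy(logits_gt, labels_gt)
    loss_pgt = F.cross_entropy(logits_pgt, labels_pgt)
    return loss_gt + pgt_weight * loss_pgt

# Example with random tensors (2 images, 5 classes, 8x8 resolution).
logits = lambda: torch.randn(2, 5, 8, 8)
labels = lambda: torch.randint(0, 5, (2, 8, 8))
print(gt_pgt_loss(logits(), labels(), logits(), labels()).item())
```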