The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original high resolution. We further improve this labeling with an objectness map and a conditional random field, to account for errors in optical flow, and also to focus on moving "things" rather than "stuff". The output label of each pixel denotes whether it has undergone independent motion, i.e., irrespective of camera motion. We demonstrate the benefits of this learning framework on the moving object segmentation task, where the goal is to segment all objects in motion. Our approach outperforms the top method on the recently released DAVIS benchmark dataset, comprising real-world sequences, by 5.6%. We also evaluate on the Berkeley motion segmentation database, achieving state-of-the-art results.
The difficulty of annotating training data is the main obstacle to applying CNNs to low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these problems: a network is first trained for a task that is easier or that can be trained without supervision, and the trained network is then fine-tuned for the original task using a small amount of ground-truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN for temporal interpolation without supervision. Such a network implicitly estimates motion, but cannot handle certain regions. By fine-tuning on a small amount of ground-truth flow, the network learns to fill in homogeneous regions and to compute the full optical flow field. With this unsupervised pre-training, our network outperforms a similar architecture trained with supervision on synthetic optical flow.
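A rough sketch of the two-stage proxy-task schedule described above, assuming a generic encoder-style network: unsupervised pre-training on temporal interpolation from raw video, followed by supervised fine-tuning on a small amount of ground-truth flow. The function names, loaders, and losses are illustrative assumptions, not the authors' exact setup.

```python
import torch
import torch.nn.functional as F

def pretrain_interpolation(net, raw_video_loader, optimizer, epochs=10):
    """Stage 1: train `net` to predict the middle frame from its two neighbors."""
    for _ in range(epochs):
        for frame0, frame1, frame2 in raw_video_loader:   # three consecutive frames
            pred_middle = net(torch.cat([frame0, frame2], dim=1))
            loss = F.l1_loss(pred_middle, frame1)          # photometric proxy loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def finetune_flow(net, small_flow_gt_loader, optimizer, epochs=5):
    """Stage 2: fine-tune the same backbone to regress optical flow from sparse GT.
    In practice the output head is swapped (3 image channels -> 2 flow channels)
    before this stage; that detail is omitted here."""
    for _ in range(epochs):
        for frame0, frame1, flow_gt in small_flow_gt_loader:
            pred_flow = net(torch.cat([frame0, frame1], dim=1))
            loss = F.l1_loss(pred_flow, flow_gt)           # simple endpoint-style error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```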
As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth. It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, faraway mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we start by training a deep network, using fully automatic supervision, to predict relative scene depth from single images. The relative depth training images are automatically derived from simple videos of cars moving through a scene, using recent motion segmentation techniques, and no human-provided labels. The proxy task of predicting relative depth from a single image induces features in the network that result in large improvements in a set of downstream tasks including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation, over a network trained from scratch. The improvement on the semantic segmentation task is greater than that produced by any other automatically supervised method. Moreover, for monocular depth estimation, our unsupervised pre-training method even outperforms supervised pre-training with ImageNet. In addition, we demonstrate benefits from learning to predict (again, completely unsupervised) relative depth in the specific videos associated with various downstream tasks (e.g., KITTI). We adapt to the specific scenes in those tasks in an unsupervised manner to improve performance. In summary, for semantic segmentation, we present state-of-the-art results among methods that do not use supervised pre-training, and we even exceed the performance of supervised ImageNet pre-trained models for monocular depth estimation, achieving results that are comparable with state-of-the-art methods.
The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, which must plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments: the former provides dense semantic information, i.e., what objects will be present and where they will appear, while the latter provides dense motion information, i.e., how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To the best of our knowledge, this is the first attempt at jointly predicting scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups, while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we also demonstrate that our model can be used to predict the steering angle of the vehicles, which further verifies the ability of our model to learn latent representations of scene dynamics.
We address the problem of synthesizing new video frames in an existing video, either in between existing frames (interpolation) or after them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that directly hallucinate pixel values often produce blurry results. We combine the advantages of these two approaches by training a deep network that synthesizes video frames by flowing existing pixel values, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient and can be applied at any video resolution. We demonstrate that our method produces results that improve both quantitatively and qualitatively over the state of the art.
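A minimal sketch of the "flow pixels from existing frames" idea behind deep voxel flow: per output pixel, the network predicts a spatial displacement (dx, dy) and a temporal blend weight t, and the output frame is a blend of the two neighboring frames sampled at the displaced locations. Tensor layouts, the sign convention of the displacement, and the name `voxel_flow` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def synthesize_frame(frame0, frame1, voxel_flow):
    """frame0, frame1: (B,3,H,W); voxel_flow: (B,3,H,W) = (dx, dy, t),
    with dx, dy assumed to be in normalized [-1, 1] coordinates."""
    B, _, H, W = frame0.shape
    dx, dy = voxel_flow[:, 0], voxel_flow[:, 1]
    t = voxel_flow[:, 2].unsqueeze(1)                       # temporal blend weight

    # Base sampling grid in normalized coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=frame0.device),
        torch.linspace(-1, 1, W, device=frame0.device), indexing="ij")
    base_x = xs.expand(B, H, W)
    base_y = ys.expand(B, H, W)

    # Sample one neighbor forward along the flow and the other backward.
    grid0 = torch.stack([base_x + dx, base_y + dy], dim=-1)
    grid1 = torch.stack([base_x - dx, base_y - dy], dim=-1)
    warped0 = F.grid_sample(frame0, grid0, align_corners=True)
    warped1 = F.grid_sample(frame1, grid1, align_corners=True)

    # Blend the two warped neighbors according to the temporal weight.
    return (1.0 - t) * warped0 + t * warped1
```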
A major cost in video object segmentation is labeling object masks for training instances. We therefore propose to prepare inexpensive but high-quality pseudo ground truth with motion cues for training video object segmentation. Our method uses an instance segmentation network for semantic segmentation, and then selects the segmented objects of interest as pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune a pre-trained objectness network to facilitate object segmentation in the remaining frames of the video. We show that the pseudo ground truth can effectively improve segmentation performance. This intuitive unsupervised video object segmentation method is more efficient than existing methods. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods on various benchmark datasets. The category-agnostic pseudo ground truth has great potential to extend to multiple arbitrary object tracking.
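A rough sketch of the pseudo-ground-truth selection step described above: instance masks from an off-the-shelf instance segmentation network are kept as pseudo labels only if they overlap sufficiently with a binary motion mask (e.g. thresholded optical-flow magnitude). The overlap threshold and helper names are assumptions for illustration.

```python
import numpy as np

def select_moving_instances(instance_masks, motion_mask, min_overlap=0.5):
    """instance_masks: list of (H, W) boolean arrays; motion_mask: (H, W) bool."""
    pseudo_gt = np.zeros_like(motion_mask, dtype=bool)
    for mask in instance_masks:
        area = mask.sum()
        if area == 0:
            continue
        overlap = np.logical_and(mask, motion_mask).sum() / area
        if overlap >= min_overlap:          # keep instances that are (mostly) moving
            pseudo_gt |= mask
    return pseudo_gt                        # used to fine-tune the objectness network
```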
We propose an end-to-end learning framework for segmenting generic objects in images and videos. Given a novel image or video, our approach produces pixel-level masks for all "object-like" regions, even for object categories never seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented with a deep fully convolutional network. When applied to video, our model further incorporates a motion stream, and the network learns to combine appearance and motion, attempting to extract all prominent objects whether they are moving or not. Beyond the core model, a second contribution is how it leverages varying strengths of training annotations. Pixel-level annotations are difficult to obtain, yet essential for training deep segmentation networks. We therefore propose methods to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object-category examples with image-level labels together with relatively few images with boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we show how our approach benefits image retrieval and image retargeting, both of which flourish when given high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/
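A schematic two-stream layout in the spirit of the video model described above: an appearance stream over the RGB frame and a motion stream over the optical flow, fused into a per-pixel object/background prediction. The tiny backbone and late fusion by summation are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TwoStreamObjectness(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        def stream(in_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(64, num_classes, 1))            # per-pixel logits
        self.appearance = stream(in_ch=3)                  # RGB frame
        self.motion = stream(in_ch=2)                      # optical flow (u, v)

    def forward(self, rgb, flow):
        # Late fusion of the two streams' per-pixel object/background logits.
        return self.appearance(rgb) + self.motion(flow)
```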
In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing framework. (1) Predictive feature learning from nearly unlimited unlabeled video data. Different from existing methods learning features from single frame parsing, we learn spatiotemporal discriminative features by enforcing a parsing network to predict future frames and their parsing maps (if available) given only historical frames. In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations. (2) Prediction steering parsing architecture that effectively adapts the learned spatiotemporal features to scene parsing tasks and provides strong guidance for any off-the-shelf parsing model to achieve better video scene parsing performance. Extensive experiments over two challenging datasets, Cityscapes and CamVid, have demonstrated the effectiveness of our methods by showing significant improvement over well-established baselines.
We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods typically exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as an additional supervisory signal. Our core idea is that, for rigid regions, we can use the predicted scene depth and camera motion to synthesize 2D optical flow by back-projecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from the optical flow model) allows us to impose a cross-task consistency loss. While all networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.
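A minimal sketch of the cross-task consistency idea, assuming known intrinsics and a predicted relative camera pose: rigid flow is obtained by back-projecting pixels with the predicted depth, transforming them with the camera motion, and re-projecting them; its discrepancy to the flow network's output is penalized in rigid regions. Shapes, variable names, and the L1 penalty are assumptions.

```python
import torch

def rigid_flow(depth, K, T):
    """depth: (B,1,H,W); K: (B,3,3) intrinsics; T: (B,4,4) relative camera pose."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=depth.device, dtype=torch.float32),
        torch.arange(W, device=depth.device, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).reshape(1, 3, -1).expand(B, 3, -1)

    # Back-project pixels to 3D, move them with the camera, and re-project.
    cam = torch.inverse(K) @ pix * depth.reshape(B, 1, -1)                 # (B,3,HW)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=depth.device)], dim=1)
    proj = K @ (T @ cam_h)[:, :3]                                          # (B,3,HW)
    pix_new = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    return (pix_new - pix[:, :2]).reshape(B, 2, H, W)                      # rigid flow

def cross_task_consistency(flow_pred, depth, K, T, rigid_mask):
    # L1 discrepancy between the estimated flow and the depth/pose-induced rigid
    # flow, applied only where the scene is assumed rigid.
    return ((flow_pred - rigid_flow(depth, K, T)).abs() * rigid_mask).mean()
```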
In the era of end-to-end deep learning, many advances in computer vision are driven by large amounts of labeled data. In the optical flow setting, however, obtaining dense per-pixel ground truth for real scenes is difficult and thus such data is rare. Therefore, recent end-to-end convolutional networks for optical flow rely on synthetic datasets for supervision, but the domain mismatch between training and test scenarios continues to be a challenge. Inspired by classical energy-based optical flow methods, we design an unsupervised loss based on occlusion-aware bidirectional flow estimation and the robust census transform to circumvent the need for ground truth flow. On the KITTI benchmarks, our unsupervised approach outperforms previous unsupervised deep networks by a large margin, and is even more accurate than similar supervised methods trained on synthetic datasets alone. By optionally fine-tuning on the KITTI training data, our method achieves competitive optical flow accuracy on the KITTI 2012 and 2015 benchmarks, thus in addition enabling generic pre-training of supervised networks for datasets with limited amounts of ground truth.
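A hedged sketch of an occlusion-aware bidirectional photometric loss in the spirit described above: pixels failing a forward-backward consistency check are treated as occluded and excluded from the data term. The census transform is omitted here and a plain L1 photometric difference is used instead; the thresholds follow common practice, not necessarily the paper's values.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp `img` (B,C,H,W) with `flow` (B,2,H,W) given in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=img.device, dtype=img.dtype),
        torch.arange(W, device=img.device, dtype=img.dtype), indexing="ij")
    grid_x = (xs + flow[:, 0]) / (W - 1) * 2 - 1          # normalize to [-1, 1]
    grid_y = (ys + flow[:, 1]) / (H - 1) * 2 - 1
    return F.grid_sample(img, torch.stack([grid_x, grid_y], dim=-1),
                         align_corners=True)

def occlusion_aware_loss(img1, img2, flow_fw, flow_bw, alpha=0.01, beta=0.5):
    # Forward-backward check: forward flow and back-warped backward flow should cancel.
    flow_bw_warped = warp(flow_bw, flow_fw)
    fb_sq = (flow_fw + flow_bw_warped).pow(2).sum(1, keepdim=True)
    mag_sq = flow_fw.pow(2).sum(1, keepdim=True) + flow_bw_warped.pow(2).sum(1, keepdim=True)
    not_occluded = (fb_sq < alpha * mag_sq + beta).float()

    # Photometric term, evaluated only on non-occluded pixels.
    photometric = (img1 - warp(img2, flow_fw)).abs().mean(1, keepdim=True)
    return (photometric * not_occluded).sum() / not_occluded.sum().clamp(min=1.0)
```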
We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches either resample past frames, guided by a learned future optical flow, or generate pixels directly. Flow-based resampling is insufficient because it cannot handle disocclusions, while generative models currently lead to blurry results. Recent approaches synthesize a pixel by convolving input patches with a predicted kernel; however, their memory requirement grows with the kernel size. Here, we use a spatially-displaced convolution (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel, and synthesize a pixel by applying the kernel at a displaced location in the source image defined by the predicted motion vector. Our approach inherits the merits of both vector-based and kernel-based methods while mitigating their respective shortcomings. We train our model on 428K unlabeled 1080p video game frames. Our approach produces state-of-the-art results, achieving an SSIM score of 0.904 on high-definition YouTube-8M videos and 0.918 on Caltech Pedestrian videos. Our model handles large motion effectively and synthesizes crisp frames with consistent motion.
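A simplified sketch of spatially-displaced convolution as described above: each output pixel is synthesized by applying a per-pixel kernel at a per-pixel displaced location in the source image. The loop over kernel offsets is for clarity only (the real module is far more memory-efficient), and the tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def sdc_synthesis(src, motion, kernels, k=5):
    """src: (B,C,H,W); motion: (B,2,H,W) pixel displacements;
    kernels: (B,k*k,H,W) per-pixel kernel weights (assumed normalized)."""
    B, C, H, W = src.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=src.device, dtype=src.dtype),
        torch.arange(W, device=src.device, dtype=src.dtype), indexing="ij")
    out = torch.zeros_like(src)
    r = k // 2
    idx = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            # Sample the source at the displaced location plus the kernel offset.
            gx = (xs + motion[:, 0] + dx) / (W - 1) * 2 - 1
            gy = (ys + motion[:, 1] + dy) / (H - 1) * 2 - 1
            sampled = F.grid_sample(src, torch.stack([gx, gy], dim=-1),
                                    align_corners=True)
            out = out + kernels[:, idx:idx + 1] * sampled   # weight and accumulate
            idx += 1
    return out
```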
Video semantic segmentation has recently become one of the research hotspots in computer vision. It serves as a perception foundation for many fields, such as robotics and autonomous driving. The rapid development of semantic segmentation is largely attributed to large-scale datasets, especially for learning-based methods. Several semantic segmentation datasets for complex urban scenes already exist, such as the Cityscapes and CamVid datasets; they serve as standard datasets for comparing semantic segmentation methods. In this paper, we introduce a new high-resolution UAV video semantic segmentation dataset as a complement, UAVid. Our UAV dataset consists of 30 video sequences capturing high-resolution images. In total, 300 images are densely labeled with 8 classes for the urban scene understanding task. Our dataset brings new challenges. We provide several deep learning baseline methods, among which the proposed novel Multi-Scale-Dilation net performs best via multi-scale feature extraction. We also explore the usability of the sequence data by leveraging a CRF model in both the spatial and temporal domains.
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only a little extra computational cost, while improving performance, when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
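A hedged sketch of the core mechanism: intermediate feature maps of the previous frame are warped to the current frame with (downscaled) optical flow and combined with the current frame's features. The combination via learned per-channel weights is an assumption for illustration; the paper's module may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWarp(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Learnable blending weights between current and warped previous features.
        self.w_cur = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.w_prev = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, feat_cur, feat_prev, flow):
        """feat_*: (B,C,h,w); flow: (B,2,h,w) in pixels at feature resolution."""
        B, C, h, w = feat_cur.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, device=flow.device, dtype=flow.dtype),
            torch.arange(w, device=flow.device, dtype=flow.dtype), indexing="ij")
        gx = (xs + flow[:, 0]) / (w - 1) * 2 - 1
        gy = (ys + flow[:, 1]) / (h - 1) * 2 - 1
        warped = F.grid_sample(feat_prev, torch.stack([gx, gy], dim=-1),
                               align_corners=True)
        # Weighted combination of current and temporally warped features.
        return self.w_cur * feat_cur + self.w_prev * warped
```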
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in the image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against state-of-the-art algorithms.
Learning to predict future video frames is a challenging task. Recent approaches for natural scenes directly predict pixels by inferring appearance flow and using flow-guided warping. Such models excel when motion estimates are accurate, but in many real scenes motion can be ambiguous or erroneous, and motion-based prediction yields poor results when scene motion exposes new regions of the scene. However, learning to directly hallucinate new pixels also requires extensive training. In this work, we propose an information-aware spatial-temporal context encoder for video prediction, termed flow-grounded video prediction (FGVP), in which motion propagation and novel pixel generation are first disentangled and then fused according to a computed flow uncertainty map. For regions where motion-based prediction shows low confidence, our model uses a conditional context encoder to hallucinate content. We test our method on the standard Caltech Pedestrian dataset as well as the more challenging KITTI Flow dataset with larger motion and occlusions. Compared to prior works, our method produces sharp and natural predictions, achieving state-of-the-art performance on both datasets.
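A minimal sketch of the fusion step described above: a motion-propagated (warped) prediction and a hallucinated prediction from a context encoder are blended per pixel according to a flow-uncertainty map, so that low-confidence (e.g. disoccluded) regions rely on generation. The names and the convex-combination form are illustrative assumptions.

```python
import torch

def fuse_predictions(warped_frame, generated_frame, flow_uncertainty):
    """warped_frame, generated_frame: (B,3,H,W); flow_uncertainty: (B,1,H,W) in [0, 1]."""
    # High uncertainty -> trust the hallucinated content; low -> trust the warping.
    return (1.0 - flow_uncertainty) * warped_frame + flow_uncertainty * generated_frame
```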
Learning to estimate 3D geometry from a single frame and optical flow from consecutive frames by watching unlabeled videos with deep convolutional networks has recently made significant progress. Current state-of-the-art (SOTA) methods treat the two tasks independently. An important assumption of current depth estimation pipelines is that the scene contains no moving objects, which can instead be handled by optical flow. In this paper, we propose to address the two problems holistically, i.e., to jointly understand per-pixel 3D geometry and motion. This eliminates the need for the static-scene assumption and enforces geometric consistency during learning, significantly improving the results of both tasks. We call our method "Every Pixel Counts++" or "EPC++". Specifically, during training, given two consecutive frames from a video, we adopt three parallel networks to respectively predict the camera motion (MotionNet), dense depth maps (DepthNet), and per-pixel optical flow between the two frames (FlowNet). Comprehensive experiments were conducted on the KITTI 2012 and KITTI 2015 datasets. Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving object segmentation, and scene flow estimation shows that our approach outperforms other SOTA methods, demonstrating the effectiveness of each module of our proposed method.
Visual understanding of complex urban street scenes is an enabling factor for a wide range of applications. Object detection has benefited enormously from large-scale datasets, especially in the context of deep learning. However, for semantic urban scene understanding, current datasets do not adequately capture the complexity of real-world urban scenes. To address this, we introduce Cityscapes, a benchmark suite and large-scale dataset for training and testing approaches for pixel-level and instance-level semantic labeling. Cityscapes consists of a large, diverse set of stereo video sequences recorded in streets from 50 different cities. 5000 of these images have high-quality pixel-level annotations; 20000 additional images have coarse annotations to enable methods that leverage large volumes of weakly labeled data. Crucially, our effort exceeds previous attempts in terms of dataset size, annotation richness, scene variability, and complexity. Our accompanying empirical study provides an in-depth analysis of the dataset characteristics, as well as a performance evaluation of several state-of-the-art approaches based on our benchmark.
Anticipating future events is an important prerequisite for intelligent behavior. Video prediction has been studied as a proxy task towards this goal. Recent work has shown that, to predict semantic segmentation of future frames, forecasting at the semantic level is more effective than forecasting RGB frames and then segmenting these. In this paper, we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of fixed-size convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN to the predicted features to produce instance segmentations of future frames. Experiments show that this approach significantly improves over strong baselines based on optical flow and repurposed instance segmentation architectures.
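A schematic sketch of the pipeline described above: a small convolutional forecaster predicts the future fixed-size backbone features from those of past frames, and a frozen pre-trained detection head is then applied to the forecast features. The forecaster design and the `detection_head` callable are assumptions standing in for the actual Mask R-CNN components.

```python
import torch
import torch.nn as nn

class FeatureForecaster(nn.Module):
    def __init__(self, channels, n_past=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels * n_past, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, channels, 3, padding=1))

    def forward(self, past_features):           # list of (B,C,h,w) tensors, oldest first
        return self.net(torch.cat(past_features, dim=1))

def predict_future_instances(past_features, forecaster, detection_head):
    future_feat = forecaster(past_features)     # forecast features, not pixels
    with torch.no_grad():                       # the pre-trained detection head stays frozen
        return detection_head(future_feat)      # instance segmentations for the future frame
```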
Unsupervised learning for visual perception of 3D geometry is of great interest to autonomous systems. Recent works on unsupervised learning have made considerable progress on geometry perception; however, they perform poorly on dynamic objects and in dark and noisy environments. In contrast, supervised learning algorithms, which are robust, require large labeled datasets. This paper introduces SIGNet, a novel framework that provides robust geometry perception without requiring geometrically informative labels. Specifically, SIGNet integrates semantic information to make unsupervised, robust geometric predictions for dynamic objects in low-lighting and noisy environments. SIGNet is shown to improve upon state-of-the-art unsupervised learning for geometry perception by 30% (in squared relative error for depth prediction). In particular, SIGNet improves the dynamic object class performance by 39% in depth prediction and by 29% in flow prediction.
This work addresses the problem of semantic scene understanding under fog. Although significant progress has been made in semantic scene understanding, it has mainly focused on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog, using both labeled synthetic foggy data and unlabeled real foggy data. The method is based on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be exploited to bootstrap the solution of the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present three other main stand-alone contributions: 1) a novel method to add synthetic fog to real, clear-weather scenes using semantic input; 2) a new fog density estimator; 3) a novel fog densification method to densify the fog in real foggy scenes without using depth; and 4) the Foggy Zurich dataset, comprising 3808 real foggy images, with pixel-level semantic annotations for 40 images under dense fog. Our experiments show that 1) our fog simulation and fog density estimation outperform their state-of-the-art counterparts for semantic foggy scene understanding (SFSU), and 2) CMAda significantly improves the performance of state-of-the-art models for SFSU, benefiting from both our synthetic and real foggy data. The dataset and code are available on the project website.