The problem of determining whether an object is in motion, irrespective of camera motion, is far from being solved. We address this challenging task by learning motion patterns in videos. The core of our approach is a fully convolutional network, which is learned entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation. This encoder-decoder style architecture first learns a coarse representation of the optical flow field features, and then refines it iteratively to produce motion labels at the original high-resolution. We further improve this labeling with an objectness map and a conditional random field, to account for errors in optical flow, and also to focus on moving "things" rather than "stuff". The output label of each pixel denotes whether it has undergone independent motion, i.e., irrespective of camera motion. We demonstrate the benefits of this learning framework on the moving object segmentation task, where the goal is to segment all objects in motion. Our approach outperforms the top method on the recently released DAVIS benchmark dataset, comprising real-world sequences, by 5.6%. We also evaluate on the Berkeley motion segmentation database, achieving state-of-the-art results.
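To make the encoder-decoder idea concrete, here is a minimal sketch of a fully convolutional network that maps a two-channel optical flow field to per-pixel moving/static logits. It is written in PyTorch; the layer widths, depths, and the MotionSegNet name are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch): an encoder-decoder FCN that maps a 2-channel optical
# flow field to per-pixel moving/static logits. Layer widths and depths are
# illustrative placeholders, not the exact architecture from the paper.
import torch
import torch.nn as nn

class MotionSegNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: learn a coarse representation of the flow field.
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Decoder: upsample back to the input resolution (H, W assumed multiples of 8).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 2, 4, stride=2, padding=1),  # 2 classes: moving / static
        )

    def forward(self, flow):                      # flow: (N, 2, H, W)
        return self.decoder(self.encoder(flow))   # logits: (N, 2, H, W)

# Training would use ground-truth motion masks from synthetic sequences, e.g.:
# loss = nn.CrossEntropyLoss()(MotionSegNet()(flow), motion_labels)
```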
Predicting the future is an important aspect of decision making for robotic or autonomous driving systems, which rely heavily on visual scene understanding. While prior work attempts to predict future video pixels, anticipate activities, or forecast future scene semantic segments from the segmentation of preceding frames, no method predicts future semantic segmentation solely from preceding-frame RGB data in a single end-to-end trainable model. In this paper, we propose a temporal encoder-decoder network architecture that encodes past RGB frames and decodes the future semantic segmentation. The network is coupled with a new knowledge distillation training framework dedicated to the forecasting task. Our method, seeing only the preceding video frames, implicitly models the scene segments while accounting for object dynamics to infer the future scene semantic segments. Our results on Cityscapes outperform the baseline and current state-of-the-art methods. Code is available at https://github.com/eddyhkchiu/segmenting_the_future/.
The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these problems: a network is first trained on a task that is easier to annotate or that can be trained without supervision, and the trained network is then fine-tuned on the original task using only a small amount of ground-truth data. Here, we study frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle certain regions. By fine-tuning on a small amount of ground-truth flow, the network learns to fill in homogeneous regions and compute the full optical flow field. With this unsupervised pre-training, our network outperforms a similar architecture trained with supervision using synthetic optical flow.
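As a concrete illustration of the proxy objective described above, the sketch below trains a network to predict the middle frame of a triplet from its two neighbors, so any real video supplies training signal without labels. The interp_net placeholder and the plain L1 loss are assumptions for illustration, not the specific architecture or loss used in the paper.

```python
# Minimal sketch (PyTorch) of the unsupervised proxy objective: predict the
# middle frame of a triplet from its two neighbors. `interp_net` is any CNN
# taking 6 input channels and producing a 3-channel image (assumed here).
import torch
import torch.nn.functional as F

def interpolation_loss(interp_net, frame0, frame1, frame2):
    """frame0/1/2: consecutive frames, shape (N, 3, H, W), values in [0, 1]."""
    pred_middle = interp_net(torch.cat([frame0, frame2], dim=1))
    # L1 photometric loss against the held-out real middle frame; no flow
    # labels are required, so any real video provides training data.
    return F.l1_loss(pred_middle, frame1)
```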
The ability to predict the future is important for intelligent systems, e.g. autonomous vehicles and robots, to plan early and make decisions accordingly. Future scene parsing and optical flow estimation are two key tasks that help agents better understand their environments, as the former provides dense semantic information, i.e. what objects will be present and where they will appear, while the latter provides dense motion information, i.e. how the objects will move. In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To the best of our knowledge, this is the first attempt at jointly predicting scene parsing and motion dynamics. In particular, scene parsing enables structured motion prediction by decomposing optical flow into different groups, while optical flow estimation brings reliable pixel-wise correspondence to scene parsing. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results when compared with well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we demonstrate that our model can be used to predict the steering angle of a vehicle, which further verifies its ability to learn latent representations of scene dynamics.
A major requirement in video object segmentation is labeling object masks for training instances. We therefore propose to prepare inexpensive, yet high-quality, pseudo ground truth with motion cues for training video object segmentation. Our method obtains semantic segmentations with an instance segmentation network and then selects the segmented object of interest as pseudo ground truth based on motion information. The pseudo ground truth is then used to fine-tune a pre-trained objectness network, which facilitates object segmentation in the remaining frames of the video. We show that the pseudo ground truth can effectively improve segmentation performance. This intuitive approach to unsupervised video object segmentation is more efficient than existing methods. Experimental results on DAVIS and FBMS show that the proposed method outperforms state-of-the-art unsupervised segmentation methods on various benchmark datasets. The category-agnostic pseudo ground truth has great potential to be extended to multiple arbitrary object tracking.
We address the problem of synthesizing new video frames in an existing video, either in between existing frames (interpolation) or subsequent to them (extrapolation). This problem is challenging because video appearance and motion can be highly complex. Traditional optical-flow-based solutions often fail where flow estimation is challenging, while newer neural-network-based methods that directly hallucinate pixel values often produce blurry results. We combine the advantages of these two approaches by training a deep network that synthesizes video frames by flowing existing pixel values, which we call deep voxel flow. Our method requires no human supervision, and any video can be used as training data by dropping, and then learning to predict, existing frames. The technique is efficient and can be applied at any video resolution. We demonstrate that our method produces results that improve both quantitatively and qualitatively over the state of the art.
In this work, we address the challenging video scene parsing problem by developing effective representation learning methods given limited parsing annotations. In particular, we contribute two novel methods that constitute a unified parsing framework. (1) Predictive feature learning from nearly unlimited unlabeled video data. Different from existing methods learning features from single frame parsing, we learn spatiotemporal discriminative features by enforcing a parsing network to predict future frames and their parsing maps (if available) given only historical frames. In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations. (2) Prediction steering parsing architecture that effectively adapts the learned spatiotemporal features to scene parsing tasks and provides strong guidance for any off-the-shelf parsing model to achieve better video scene parsing performance. Extensive experiments over two challenging datasets, Cityscapes and CamVid, have demonstrated the effectiveness of our methods by showing significant improvement over well-established baselines.
As an agent moves through the world, the apparent motion of scene elements is (usually) inversely proportional to their depth. It is natural for a learning agent to associate image patterns with the magnitude of their displacement over time: as the agent moves, faraway mountains don't move much; nearby trees move a lot. This natural relationship between the appearance of objects and their motion is a rich source of information about the world. In this work, we start by training a deep network, using fully automatic supervision, to predict relative scene depth from single images. The relative depth training images are automatically derived from simple videos of cars moving through a scene, using recent motion segmentation techniques, and no human-provided labels. The proxy task of predicting relative depth from a single image induces features in the network that result in large improvements in a set of downstream tasks including semantic segmentation, joint road segmentation and car detection, and monocular (absolute) depth estimation, over a network trained from scratch. The improvement on the semantic segmentation task is greater than that produced by any other automatically supervised methods. Moreover, for monocular depth estimation, our unsupervised pre-training method even outperforms supervised pre-training with ImageNet. In addition, we demonstrate benefits from learning to predict (again, completely unsupervised) relative depth in the specific videos associated with various downstream tasks (e.g., KITTI). We adapt to the specific scenes in those tasks in an unsupervised manner to improve performance. In summary, for semantic segmentation, we present state-of-the-art results among methods that do not use supervised pre-training, and we even exceed the performance of supervised ImageNet pre-trained models for monocular depth estimation, achieving results that are comparable with state-of-the-art methods.
We propose an end-to-end learning framework for segmenting generic objects in images and videos. Given a novel image or video, our approach produces pixel-level masks for all "object-like" regions, even for object categories not seen during training. We formulate the task as a structured prediction problem of assigning an object/background label to each pixel, implemented with a deep fully convolutional network. When applied to video, our model further incorporates a motion stream, and the network learns to combine motion and appearance so as to extract all prominent objects, whether they are moving or not. Beyond the core model, a second contribution is how it leverages varying strengths of training annotations. Pixel-level annotations are difficult to obtain, yet crucial for training deep segmentation networks. We therefore propose ways to exploit weakly labeled data for learning dense foreground segmentation. For images, we show the value of mixing object category examples that carry only image-level labels with a relatively small number of images that carry boundary-level annotations. For video, we show how to bootstrap weakly annotated videos together with the network trained for image segmentation. Through experiments on multiple challenging image and video segmentation benchmarks, our method offers consistently strong results and improves the state of the art for fully automatic segmentation of generic (unseen) objects. In addition, we demonstrate how our approach benefits image retrieval and image retargeting, both of which flourish when given high-quality foreground maps. Code, models, and videos are at: http://vision.cs.utexas.edu/projects/pixelobjectness/
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very little extra computational cost. This module is called NetWarp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network representations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to-end training. Experiments validate that the proposed approach incurs only little extra computational cost, while improving performance, when video streams are available. We achieve new state-of-the-art results on the CamVid and Cityscapes benchmark datasets and show consistent improvements over different baseline networks. Our code and models are available at http://segmentation.is.tue.mpg.de
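The sketch below illustrates the core warping idea: bilinearly sampling the previous frame's feature map at positions given by optical flow, then fusing with the current features. It assumes PyTorch, a flow field expressed in pixels that maps current-frame locations to previous-frame locations, and a simple average in place of the paper's learned combination.

```python
# Minimal sketch (PyTorch): warp the previous frame's intermediate features to
# the current frame using optical flow, then fuse (here a plain average).
# Assumes `flow` maps current-frame pixels to previous-frame locations, in pixels,
# with channel order (dx, dy).
import torch
import torch.nn.functional as F

def warp_features(prev_feat, flow):
    """prev_feat: (N, C, H, W) features of frame t-1; flow: (N, 2, H, W)."""
    n, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_feat.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                                   # sampling positions
    # Normalize to [-1, 1] for grid_sample, which expects an (N, H, W, 2) grid of (x, y).
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(prev_feat, grid, mode="bilinear", align_corners=True)

def netwarp_fuse(cur_feat, prev_feat, flow):
    # The paper learns the combination; a plain average is used here for brevity.
    return 0.5 * cur_feat + 0.5 * warp_features(prev_feat, flow)
```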
We propose "Hide-and-Seek", a general-purpose data augmentation technique that is complementary to existing data augmentation techniques and beneficial for various visual recognition tasks. The key idea is to randomly hide patches in a training image, forcing the network to seek other relevant content when the most discriminative content is hidden. Our approach only needs to modify the input image and can work with any network to improve its performance; during testing, no patches need to be hidden. The main advantage of Hide-and-Seek over existing data augmentation techniques is its ability to improve object localization accuracy in the weakly-supervised setting, and we therefore use this task to motivate the approach. However, Hide-and-Seek is not tied to the image localization task alone, and can be extended to other forms of visual input such as video, as well as other recognition tasks such as image classification, temporal action localization, semantic segmentation, emotion recognition, age/gender estimation, and person re-identification. We conduct extensive experiments to showcase the advantage of Hide-and-Seek on these various visual recognition problems.
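A minimal sketch of the hiding step follows: the training image is divided into a grid and each patch is hidden with some probability, while test images are left untouched. The grid size, hiding probability, and fill value are illustrative choices (the paper fills hidden patches with the dataset mean rather than zeros).

```python
# Minimal sketch (NumPy) of Hide-and-Seek augmentation: split the image into a
# grid and hide (overwrite) each patch with some probability during training.
# Grid size, hide probability, and the fill value are illustrative choices.
import numpy as np

def hide_and_seek(image, grid=4, p_hide=0.5, fill=0.0, rng=None):
    """image: (H, W, C) float array; returns a copy with random patches hidden."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            if rng.random() < p_hide:
                out[i * ph:(i + 1) * ph, j * pw:(j + 1) * pw] = fill
    return out

# At test time the image is left untouched; hiding is applied only during training.
```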
Anticipating future events is an important prerequisite for intelligent behavior. Video prediction has been studied as a proxy task towards this goal. Recent work has shown that, to predict the semantic segmentation of future frames, predicting at the semantic level is more effective than predicting RGB frames and then segmenting them. In this paper, we consider the more challenging problem of future instance segmentation, which additionally segments out individual objects. To deal with a varying number of output labels per image, we develop a predictive model in the space of the fixed-size convolutional features of the Mask R-CNN instance segmentation model. We apply the "detection head" of Mask R-CNN to the predicted features to produce instance segmentation of future frames. Experiments show that this approach improves significantly over strong baselines based on optical flow and on reusing the instance segmentation architecture.
It has been recently shown that a convolutional neural network can learn optical flow estimation with unsupervised learning. However, the performance of the unsupervised methods still has a relatively large gap compared to its supervised counterpart. Occlusion and large motion are some of the major factors that limit the current unsupervised learning of optical flow methods. In this work we introduce a new method which models occlusion explicitly and a new warping way that facilitates the learning of large motion. Our method shows promising results on Flying Chairs, MPI-Sintel and KITTI benchmark datasets. Especially on KITTI dataset where abundant unlabeled samples exist, our unsupervised method outperforms its counterpart trained with supervised learning.
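One widely used way to make occlusion explicit in unsupervised flow learning is a forward-backward consistency check: a pixel is flagged as occluded when its forward flow and the backward flow warped to that pixel do not cancel out. The sketch below shows this generic check, with the usual quadratic thresholds as illustrative constants; it is not necessarily the exact occlusion model of this particular paper.

```python
# Minimal sketch (PyTorch): mark a pixel as occluded when its forward flow and
# the backward flow warped to the same pixel do not cancel out. The thresholds
# follow the common alpha1 * |.|^2 + alpha2 heuristic and are illustrative.
import torch

def squared_norm(t):
    # Sum of squared flow components per pixel: (N, 2, H, W) -> (N, 1, H, W).
    return (t ** 2).sum(dim=1, keepdim=True)

def occlusion_mask(flow_fw, flow_bw_warped, alpha1=0.01, alpha2=0.5):
    """flow_fw, flow_bw_warped: (N, 2, H, W); returns (N, 1, H, W), 1 = occluded."""
    mismatch = squared_norm(flow_fw + flow_bw_warped)
    bound = alpha1 * (squared_norm(flow_fw) + squared_norm(flow_bw_warped)) + alpha2
    return (mismatch > bound).float()
```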
This work addresses the problem of semantic scene understanding under fog. While significant progress has been made in semantic scene understanding, it has mainly focused on clear-weather scenes. Extending semantic segmentation methods to adverse weather conditions such as fog is crucial for outdoor applications. In this paper, we propose a novel method, named Curriculum Model Adaptation (CMAda), which gradually adapts a semantic segmentation model from light synthetic fog to dense real fog, using both labeled synthetic foggy data and unlabeled real foggy data. The method builds on the fact that the results of semantic segmentation in moderately adverse conditions (light fog) can be bootstrapped to solve the same problem in highly adverse conditions (dense fog). CMAda is extensible to other adverse conditions and provides a new paradigm for learning with synthetic data and unlabeled real data. In addition, we present several further stand-alone contributions: 1) a novel method to add synthetic fog to real, clear-weather scenes using semantic input; 2) a new fog density estimator; 3) a novel fog densification method that densifies the fog in real foggy scenes without using depth; and 4) the Foggy Zurich dataset, comprising 3808 real foggy images, with pixel-level semantic annotations for 40 images under dense fog. Our experiments show that: 1) our fog simulation and fog density estimation outperform their state-of-the-art counterparts for semantic foggy scene understanding (SFSU); and 2) CMAda significantly improves the performance of state-of-the-art models for SFSU, benefiting from both our synthetic and real foggy data. The dataset and code are available on the project website.
This paper proposes an end-to-end trainable network, SegFlow, for simultaneously predicting pixel-wise object segmentation and optical flow in videos. The proposed SegFlow has two branches where useful information of object segmentation and optical flow is propagated bidirectionally in a unified framework. The segmentation branch is based on a fully convolutional network, which has been proved effective in image segmentation task, and the optical flow branch takes advantage of the FlowNet model. The unified framework is trained iteratively offline to learn a generic notion, and fine-tuned online for specific objects. Extensive experiments on both the video object segmentation and optical flow datasets demonstrate that introducing optical flow improves the performance of segmentation and vice versa, against the state-of-the-art algorithms.
We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals. Our core idea is that, for rigid regions, we can use the predicted scene depth and camera motion to synthesize 2D optical flow by backprojecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from the optical flow model) allows us to impose a cross-task consistency loss. While all networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.
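The rigid-flow synthesis at the heart of the cross-task consistency idea can be sketched as follows: backproject each pixel to 3D using the predicted depth, transform it with the predicted camera motion, and reproject it. The intrinsics matrix K, the 4x4 relative pose T_rel, and the coordinate conventions are assumptions for illustration.

```python
# Minimal sketch (NumPy): synthesize rigid 2D flow from predicted depth and
# camera motion by backprojecting pixels to 3D, applying the relative pose,
# and reprojecting. K is the 3x3 intrinsics; T_rel is a 4x4 pose taking frame-t
# coordinates to frame-(t+1) coordinates (conventions assumed for illustration).
import numpy as np

def rigid_flow(depth, K, T_rel):
    """depth: (H, W) predicted depth for frame t; returns flow of shape (H, W, 2)."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    ones = np.ones_like(xs, dtype=np.float64)
    pix = np.stack([xs, ys, ones], axis=0).reshape(3, -1)       # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)         # backproject to 3D
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    cam2 = (T_rel @ cam_h)[:3]                                  # move into frame t+1
    proj = K @ cam2
    proj = proj[:2] / np.clip(proj[2:], 1e-6, None)             # reproject to pixels
    return (proj - pix[:2]).T.reshape(h, w, 2)

# The cross-task consistency loss then penalizes |rigid_flow - estimated_flow|
# on rigid, non-occluded regions.
```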
We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows. Previous approaches resample past frames, guided by a learned future optical flow, or generate pixels directly. Flow-based resampling is insufficient because it cannot handle disocclusions, and generative models currently lead to blurry results. Recent approaches synthesize a pixel by convolving input patches with a predicted kernel; however, their memory requirement grows with kernel size. Here, we use a spatially-displaced convolution (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel, and synthesize the pixel by applying the kernel at a displaced location in the source image, defined by the predicted motion vector. Our approach inherits the merits of both vector-based and kernel-based methods, while ameliorating their respective disadvantages. We train our model on 428K unlabeled 1080p video game frames. Our approach produces state-of-the-art results, achieving an SSIM score of 0.904 on high-definition YouTube-8M videos and 0.918 on Caltech Pedestrian videos. Our model handles large motion effectively and synthesizes crisp frames with consistent motion.
Per-pixel ground-truth depth data is challenging to acquire at scale. To overcome this limitation, self-supervised learning has emerged as a promising alternative for training models to perform monocular depth estimation. In this paper, we propose a set of improvements, which together result in both quantitatively and qualitatively improved depth maps compared to competing self-supervised methods. Research on self-supervised monocular training usually explores increasingly complex architectures, loss functions, and image formation models, all of which have recently helped to close the gap with fully-supervised methods. We show that a surprisingly simple model, and associated design choices, lead to superior predictions. In particular, we propose (i) a minimum reprojection loss, designed to robustly handle occlusions, (ii) a full-resolution multi-scale sampling method that reduces visual artifacts, and (iii) an auto-masking loss to ignore training pixels that violate camera motion assumptions. We demonstrate the effectiveness of each component in isolation, and show high quality, state-of-the-art results on the KITTI benchmark.
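The minimum reprojection loss and auto-masking can be sketched as below in PyTorch. For brevity the photometric error is plain L1, whereas the paper combines SSIM and L1, so this is an illustrative simplification rather than the exact training loss.

```python
# Minimal sketch (PyTorch) of the per-pixel minimum reprojection loss with
# auto-masking. The photometric error here is plain L1; the paper uses an
# SSIM + L1 combination, so treat this as an illustrative simplification.
import torch

def min_reprojection_loss(target, warped_sources, raw_sources):
    """target: (N,3,H,W); warped_sources / raw_sources: lists of (N,3,H,W) tensors."""
    def photo(a, b):
        return (a - b).abs().mean(dim=1, keepdim=True)           # (N,1,H,W)

    # Per-pixel minimum over source frames handles occlusions robustly.
    reproj = torch.min(torch.stack([photo(w, target) for w in warped_sources]), dim=0).values
    # Auto-masking: ignore pixels where the *unwarped* source already matches the
    # target better (static scenes or objects moving with the camera).
    identity = torch.min(torch.stack([photo(s, target) for s in raw_sources]), dim=0).values
    mask = (reproj < identity).float()
    return (reproj * mask).sum() / mask.sum().clamp(min=1.0)
```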
In the era of end-to-end deep learning, many advances in computer vision are driven by large amounts of labeled data. In the optical flow setting, however, obtaining dense per-pixel ground truth for real scenes is difficult and thus such data is rare. Therefore, recent end-to-end convolutional networks for optical flow rely on synthetic datasets for supervision, but the domain mismatch between training and test scenarios continues to be a challenge. Inspired by classical energy-based optical flow methods, we design an unsupervised loss based on occlusion-aware bidirectional flow estimation and the robust census transform to circumvent the need for ground truth flow. On the KITTI benchmarks, our unsupervised approach outperforms previous unsupervised deep networks by a large margin, and is even more accurate than similar supervised methods trained on synthetic datasets alone. By optionally fine-tuning on the KITTI training data, our method achieves competitive optical flow accuracy on the KITTI 2012 and 2015 benchmarks, thus in addition enabling generic pre-training of supervised networks for datasets with limited amounts of ground truth.
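A census-transform photometric term of the kind mentioned above can be sketched as follows: each pixel is described by soft sign comparisons with its 3x3 neighborhood, and the loss compares the census signatures of the warped source and the target image. The patch size, the soft comparison, and the omission of occlusion masking are illustrative simplifications.

```python
# Minimal sketch (PyTorch) of a census-transform photometric loss: each pixel is
# described by soft sign comparisons with its 3x3 neighborhood, and the loss is a
# distance between the census signatures of the warped source image and the
# target image. Occluded pixels would additionally be masked out in practice.
import torch
import torch.nn.functional as F

def census_transform(img, patch=3):
    """img: (N,1,H,W) grayscale in [0,1]; returns a (N, patch*patch, H, W) signature."""
    pad = patch // 2
    neighbors = F.unfold(F.pad(img, [pad] * 4, mode="replicate"), patch)   # (N, p*p, H*W)
    neighbors = neighbors.view(img.shape[0], patch * patch, *img.shape[2:])
    diff = neighbors - img                       # compare each neighbor with the center
    return diff / torch.sqrt(0.81 + diff ** 2)   # soft, noise-robust sign

def census_loss(warped, target):
    return (census_transform(warped) - census_transform(target)).abs().mean()
```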
Occlusions play an important role in disparity and optical flow estimation, since matching costs are not available in occluded areas and occlusions indicate depth or motion boundaries. Moreover, occlusions are relevant for motion segmentation and scene flow estimation. In this paper, we present an efficient learning-based approach to estimate occlusion areas jointly with disparity or optical flow. The estimated occlusions and motion boundaries clearly improve over the state of the art. Moreover, we present networks with state-of-the-art performance on the popular KITTI benchmark and with good generic performance. Making use of the estimated occlusions, we further show improved results for motion segmentation and scene flow estimation.