In a traditional convolutional layer, the learned filters stay fixed after training. In contrast, we introduce a new framework, the Dynamic Filter Network, where filters are generated dynamically conditioned on an input. We show that this architecture is a powerful one, with increased flexibility thanks to its adaptive nature, yet without an excessive increase in the number of model parameters. A wide variety of filtering operations can be learned this way, including local spatial transformations, but also others like selective (de)blurring or adaptive feature extraction. Moreover, multiple such layers can be combined, e.g. in a recurrent architecture. We demonstrate the effectiveness of the dynamic filter network on the tasks of video and stereo prediction, and reach state-of-the-art performance on the moving MNIST dataset with a much smaller model. By visualizing the learned filters, we illustrate that the network has picked up flow information by only looking at unlabelled training data. This suggests that the network can be used to pretrain networks for various supervised tasks in an unsupervised way, like optical flow and depth estimation. * X. Jia and B. De Brabandere contributed equally to this work and listed in alphabetical order.
translated by 谷歌翻译
Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which explicitly allows the spatial manipulation of data within the network. This differentiable module can be inserted into existing convolutional architectures, giving neural networks the ability to actively spatially transform feature maps, conditional on the feature map itself, without any extra training supervision or modification to the optimisation process. We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.
translated by 谷歌翻译
A core challenge for an agent learning to interact with the world is to predict how its actions affect objects in its environment. Many existing methods for learning the dynamics of physical interactions require labeled object information. However, to scale real-world interaction learning to a variety of scenes and objects, acquiring labeled data becomes increasingly impractical. To learn about physical object motion without labels, we develop an action-conditioned video prediction model that explicitly models pixel motion, by predicting a distribution over pixel motion from previous frames. Because our model explicitly predicts motion, it is partially invariant to object appearance, enabling it to generalize to previously unseen objects. To explore video prediction for real-world interactive agents, we also introduce a dataset of 59,000 robot interactions involving pushing motions, including a test set with novel objects. In this dataset, accurate prediction of videos conditioned on the robot's future actions amounts to learning a "visual imagination" of different futures based on different courses of action. Our experiments show that our proposed method produces more accurate video predictions both quantitatively and qualitatively, when compared to prior methods.
translated by 谷歌翻译
近期对抗性生成建模的突破导致了能够生产高质量的视频样本的模型,即使在真实世界视频的大型和复杂的数据集上也是如此。在这项工作中,我们专注于视频预测的任务,其中给出了从视频中提取的一系列帧,目标是生成合理的未来序列。我们首先通过对鉴别器分解进行系统的实证研究并提出产生更快的收敛性和更高性能的系统来提高本领域的最新技术。然后,我们分析发电机中的复发单元,并提出了一种新的复发单元,其根据预测的运动样本来改变其过去的隐藏状态,并改进它以处理DIS闭塞,场景变化和其他复杂行为。我们表明,这种经常性单位始终如一地优于以前的设计。我们的最终模型导致最先进的性能中的飞跃,从大型动力学-600数据集中获得25.7的测试集Frechet视频距离为25.7,下降到69.2。
translated by 谷歌翻译
This paper tackles the problem of motion deblurring of dynamic scenes. Although end-to-end fully convolutional designs have recently advanced the state-of-the-art in nonuniform motion deblurring, their performance-complexity trade-off is still sub-optimal. Existing approaches achieve a large receptive field by increasing the number of generic convolution layers and kernel-size, but this comes at the expense of of the increase in model size and inference speed. In this work, we propose an efficient pixel adaptive and feature attentive design for handling large blur variations across different spatial locations and process each test image adaptively. We also propose an effective content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also by dynamically exploiting neighboring pixel information. We use a patch-hierarchical attentive architecture composed of the above module that implicitly discovers the spatial variations in the blur present in the input image and in turn, performs local and global modulation of intermediate features. Extensive qualitative and quantitative comparisons with prior art on deblurring benchmarks demonstrate that our design offers significant improvements over the state-of-the-art in accuracy as well as speed.
translated by 谷歌翻译
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.
translated by 谷歌翻译
作为许多自主驾驶和机器人活动的基本组成部分,如自我运动估计,障碍避免和场景理解,单眼深度估计(MDE)引起了计算机视觉和机器人社区的极大关注。在过去的几十年中,已经开发了大量方法。然而,据我们所知,对MDE没有全面调查。本文旨在通过审查1970年至2021年之间发布的197个相关条款来弥补这一差距。特别是,我们为涵盖各种方法的MDE提供了全面的调查,介绍了流行的绩效评估指标并汇总公开的数据集。我们还总结了一些代表方法的可用开源实现,并比较了他们的表演。此外,我们在一些重要的机器人任务中审查了MDE的应用。最后,我们通过展示一些有希望的未来研究方向来结束本文。预计本调查有助于读者浏览该研究领域。
translated by 谷歌翻译
Motion blur from camera shake is a major problem in videos captured by hand-held devices. Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result the best performing methods rely on the alignment of nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must therefore be able to identify which regions have been accurately aligned and which have not, a task that requires high level scene understanding. In this work, we introduce a deep learning solution to video deblurring, where a CNN is trained end-toend to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high frame rate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines 1 .
translated by 谷歌翻译
人类自然有效地在复杂的场景中找到突出区域。通过这种观察的动机,引入了计算机视觉中的注意力机制,目的是模仿人类视觉系统的这一方面。这种注意机制可以基于输入图像的特征被视为动态权重调整过程。注意机制在许多视觉任务中取得了巨大的成功,包括图像分类,对象检测,语义分割,视频理解,图像生成,3D视觉,多模态任务和自我监督的学习。在本调查中,我们对计算机愿景中的各种关注机制进行了全面的审查,并根据渠道注意,空间关注,暂时关注和分支注意力进行分类。相关的存储库https://github.com/menghaoguo/awesome-vision-tions致力于收集相关的工作。我们还建议了未来的注意机制研究方向。
translated by 谷歌翻译
This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires fewer training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available 5 .
translated by 谷歌翻译
The goal of precipitation nowcasting is to predict the future rainfall intensity in a local region over a relatively short period of time. Very few previous studies have examined this crucial and challenging weather forecasting problem from the machine learning perspective. In this paper, we formulate precipitation nowcasting as a spatiotemporal sequence forecasting problem in which both the input and the prediction target are spatiotemporal sequences. By extending the fully connected LSTM (FC-LSTM) to have convolutional structures in both the input-to-state and state-to-state transitions, we propose the convolutional LSTM (ConvLSTM) and use it to build an end-to-end trainable model for the precipitation nowcasting problem. Experiments show that our ConvLSTM network captures spatiotemporal correlations better and consistently outperforms FC-LSTM and the state-of-theart operational ROVER algorithm for precipitation nowcasting.
translated by 谷歌翻译
We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or multiple decoder LSTMs to perform different tasks, such as reconstructing the input sequence, or predicting the future sequence. We experiment with two kinds of input sequences -patches of image pixels and high-level representations ("percepts") of video frames extracted using a pretrained convolutional net. We explore different design choices such as whether the decoder LSTMs should condition on the generated output. We analyze the outputs of the model qualitatively to see how well the model can extrapolate the learned video representation into the future and into the past. We try to visualize and interpret the learned features. We stress test the model by running it on longer time scales and on out-of-domain data. We further evaluate the representations by finetuning them for a supervised learning problemhuman action recognition on the UCF-101 and HMDB-51 datasets. We show that the representations help improve classification accuracy, especially when there are only a few training examples. Even models pretrained on unrelated datasets (300 hours of YouTube videos) can help action recognition performance.
translated by 谷歌翻译
Standard video frame interpolation methods first estimate optical flow between input frames and then synthesize an intermediate frame guided by motion. Recent ap-proaches merge these two steps into a single convolution process by convolving input frames with spatially adaptive kernels that account for motion and re-sampling simultaneously. These methods require large kernels to handle large motion, which limits the number of pixels whose kernels can be estimated at once due to the large memory demand. To address this problem, this paper formulates frame interpolation as local separable convolution over input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. Our method develops a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. Since our method is able to estimate kernels and synthesizes the whole video frame at once, it allows for the incorporation of perceptual loss to train the neural network to produce visually pleasing frames. This deep neural network is trained end-to-end using widely available video data without any human annotation. Both qualitative and quantitative experiments show that our method provides a practical solution to high-quality video frame interpolation.
translated by 谷歌翻译
尽管运动补偿大大提高了视频质量,但单独执行运动补偿和视频脱张需要大量的计算开销。本文提出了一个实时视频Deblurring框架,该框架由轻巧的多任务单元组成,该单元以有效的方式支持视频脱张和运动补偿。多任务单元是专门设计的,用于使用单个共享网络处理两个任务的大部分,并由多任务详细网络和简单的网络组成,用于消除和运动补偿。多任务单元最大程度地减少了将运动补偿纳入视频Deblurring的成本,并实现了实时脱毛。此外,通过堆叠多个多任务单元,我们的框架在成本和过度质量之间提供了灵活的控制。我们通过实验性地验证了方法的最先进的质量,与以前的方法相比,该方法的运行速度要快得多,并显示了实时的实时性能(在DVD数据集中测量了30.99db@30fps)。
translated by 谷歌翻译
A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose a unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with small, known camera motion between the two such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map for the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and known inter-view displacement, to reconstruct the source image; the photometric error in the reconstruction is the reconstruction loss for the encoder. The acquisition of this training data is considerably simpler than for equivalent systems, requiring no manual annotation, nor calibration of depth sensor to camera. We show that our network trained on less than half of the KITTI dataset gives comparable performance to that of the state-of-the-art supervised methods for single view depth estimation. 1 1 Find the model and other imformation on the project github page: https://github. com/Ravi-Garg/Unsupervised_Depth_Estimation
translated by 谷歌翻译
时间一致的深度估计对于诸如增强现实之类的实时应用至关重要。虽然立体声深度估计已经接受了显着的注意,导致逐帧的改进,虽然相对较少的工作集中在跨越帧的时间一致性。实际上,基于我们的分析,当前立体声深度估计技术仍然遭受不良时间一致性。由于并发对象和摄像机运动,在动态场景中稳定深度是挑战。在在线设置中,此过程进一步加剧,因为只有过去的帧可用。在本文中,我们介绍了一种技术,在线设置中的动态场景中产生时间一致的深度估计。我们的网络增强了具有新颖运动和融合网络的当前每帧立体声网络。通过预测每个像素SE3变换,运动网络占对象和相机运动。融合网络通过用回归权重聚合当前和先前预测来提高预测的一致性。我们在各种数据集中进行广泛的实验(合成,户外,室内和医疗)。在零射泛化和域微调中,我们证明我们所提出的方法在数量和定性的时间稳定和每个帧精度方面优于竞争方法。我们的代码将在线提供。
translated by 谷歌翻译
可以通过定期预测未来的框架以增强虚拟现实应用程序中的用户体验,从而解决了低计算设备上图形渲染高帧速率视频的挑战。这是通过时间视图合成(TVS)的问题来研究的,该问题的目标是预测给定上一个帧的视频的下一个帧以及上一个和下一个帧的头部姿势。在这项工作中,我们考虑了用户和对象正在移动的动态场景的电视。我们设计了一个将运动解散到用户和对象运动中的框架,以在预测下一帧的同时有效地使用可用的用户运动。我们通过隔离和估计过去框架的3D对象运动,然后推断它来预测对象的运动。我们使用多平面图像(MPI)作为场景的3D表示,并将对象运动作为MPI表示中相应点之间的3D位移建模。为了在估计运动时处理MPI中的稀疏性,我们将部分卷积和掩盖的相关层纳入了相应的点。然后将预测的对象运动与给定的用户或相机运动集成在一起,以生成下一帧。使用不合格的填充模块,我们合成由于相机和对象运动而发现的区域。我们为动态场景的电视开发了一个新的合成数据集,该数据集由800个以全高清分辨率组成的视频组成。我们通过数据集和MPI Sintel数据集上的实验表明我们的模型优于文献中的所有竞争方法。
translated by 谷歌翻译
We propose a novel deep learning architecture for regressing disparity from a rectified pair of stereo images. We leverage knowledge of the problem's geometry to form a cost volume using deep feature representations. We learn to incorporate contextual information using 3-D convolutions over this volume. Disparity values are regressed from the cost volume using a proposed differentiable soft argmin operation, which allows us to train our method end-to-end to sub-pixel accuracy without any additional post-processing or regularization. We evaluate our method on the Scene Flow and KITTI datasets and on KITTI we set a new stateof-the-art benchmark, while being significantly faster than competing approaches.
translated by 谷歌翻译
不确定性在未来预测中起关键作用。未来是不确定的。这意味着可能有很多可能的未来。未来的预测方法应涵盖坚固的全部可能性。在自动驾驶中,涵盖预测部分中的多种模式对于做出安全至关重要的决策至关重要。尽管近年来计算机视觉系统已大大提高,但如今的未来预测仍然很困难。几个示例是未来的不确定性,全面理解的要求以及嘈杂的输出空间。在本论文中,我们通过以随机方式明确地对运动进行建模并学习潜在空间中的时间动态,从而提出了解决这些挑战的解决方案。
translated by 谷歌翻译