Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art.
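The abstract leaves the exact form of the self-supervised search metric unspecified; a common proxy in unsupervised optical flow combines a photometric warping error with a smoothness prior. Below is a minimal PyTorch sketch of such a metric, assuming the metric belongs to this general family (function names, the smoothness weight, and the warping details are illustrative, not the authors' implementation).

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Backward-warp img (B,C,H,W) with flow (B,2,H,W) via grid_sample."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                            torch.arange(w, device=img.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()           # (2,H,W), x then y
    coords = grid.unsqueeze(0) + flow                      # target sampling locations
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0          # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=-1), align_corners=True)

def self_supervised_metric(model, frame1, frame2, smooth_weight=0.1):
    """Photometric warping error plus first-order smoothness as a proxy for EPE."""
    flow = model(frame1, frame2)                           # assumed (B,2,H,W) output
    photometric = (frame1 - warp(frame2, flow)).abs().mean()
    smoothness = (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean() + \
                 (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean()
    return photometric + smooth_weight * smoothness
```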
The training pipeline for optical flow CNNs consists of a pretraining stage on synthetic datasets followed by a fine-tuning stage on a target dataset. However, obtaining ground truth flows from target videos requires tremendous effort. This paper proposes a practical fine-tuning method to adapt a pretrained model to a target dataset without ground truth flows, a setting that has not been extensively explored. Specifically, we propose a flow supervisor for self-supervision, which consists of parameter separation and a student output connection. This design aims at stable convergence and better accuracy, whereas conventional self-supervision methods are unstable on the fine-tuning task. Experimental results show the effectiveness of our method compared to different self-supervision methods for semi-supervised learning. In addition, we achieve meaningful improvements over state-of-the-art optical flow models on the Sintel and KITTI benchmarks by exploiting additional unlabeled datasets. Code is available at https://github.com/iwbn/flow-supervisor.
Obtaining ground truth labels from videos is challenging, since manual annotation of pixel-wise flow labels is extremely expensive and laborious. Moreover, existing approaches try to adapt models trained on synthetic datasets to real videos, which inevitably suffer from the domain gap and hinders performance in real-world applications. To address these problems, we propose RealFlow, an Expectation-Maximization based framework that can create large-scale optical flow datasets directly from any unlabeled realistic videos. Specifically, we first estimate the optical flow between a pair of video frames and then synthesize a new image from this pair according to the predicted flow. Thus, the new image pairs and their corresponding flows can be regarded as a new training set. In addition, we design a Realistic Image Pair Rendering (RIPR) module that adopts softmax splatting and bi-directional hole-filling techniques to alleviate artifacts of the image synthesis. In the E-step, RIPR renders new images to create a large amount of training data. In the M-step, we use the generated training data to train an optical flow network, which can be used to estimate optical flows in the next E-step. Over the iterative learning steps, the capability of the flow network gradually improves, and so do the accuracy of the flow and the quality of the synthesized dataset. Experimental results show that RealFlow outperforms previous dataset generation methods. Furthermore, based on the generated dataset, our method achieves state-of-the-art performance on two standard benchmarks compared with both supervised and unsupervised optical flow methods. Our code and dataset are available at https://github.com/megvii-research/realflow
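The E-step/M-step alternation described above can be summarized in a short loop. The sketch below is a hypothetical outline in the spirit of RealFlow; `estimate_flow`, `render_new_pair` (standing in for RIPR), and `train_flow_net` are assumed callables, not the released code.

```python
from typing import Callable, Iterable, List, Tuple

def realflow_em(video_pairs: Iterable[Tuple],
                flow_net,
                estimate_flow: Callable,
                render_new_pair: Callable,
                train_flow_net: Callable,
                num_rounds: int = 3):
    """EM-style dataset-generation loop (helpers are hypothetical callables)."""
    dataset: List[Tuple] = []
    for _ in range(num_rounds):
        dataset = []
        for img1, img2 in video_pairs:
            # E-step: predict flow with the current network and render a new
            # second image consistent with that flow (RIPR-style rendering).
            flow = estimate_flow(flow_net, img1, img2)
            new_img2 = render_new_pair(img1, img2, flow)
            dataset.append((img1, new_img2, flow))        # rendered pair + flow label
        # M-step: retrain / fine-tune the flow network on the rendered data.
        flow_net = train_flow_net(flow_net, dataset)
    return flow_net, dataset
```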
Supervised training of optical flow predictors generally yields better accuracy than unsupervised training. However, the improved performance typically comes at a high annotation cost. Semi-supervised training trades off accuracy against annotation cost. We use a simple yet effective semi-supervised training method to show that even a small fraction of labels can improve flow accuracy over unsupervised training. In addition, we propose active learning methods based on simple heuristics to further reduce the number of labels required to achieve the same target accuracy. Our experiments on both synthetic and real optical flow datasets show that our semi-supervised networks generally need around 50% of the labels to achieve close to full-label accuracy, and only around 20% with active learning on Sintel. We also analyze and show insights on the factors that may influence active learning performance. Code is available at https://github.com/duke-vision/optical-flow-active-learning-release.
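A simple way to realize the semi-supervised trade-off the abstract describes is to sum a supervised end-point-error term on the labeled subset with an unsupervised photometric term on unlabeled data. A minimal PyTorch sketch, assuming this additive form and an illustrative weighting (not necessarily the authors' exact loss):

```python
import torch

def semi_supervised_loss(pred_flow_labeled, gt_flow, photo_residual_unlabeled,
                         unsup_weight=1.0):
    """Supervised EPE on labeled data plus an unsupervised photometric term on
    unlabeled data (additive form and weight are assumptions)."""
    epe = torch.norm(pred_flow_labeled - gt_flow, dim=1).mean()   # per-pixel end-point error
    unsup = photo_residual_unlabeled.abs().mean()                 # e.g. I1 - warp(I2)
    return epe + unsup_weight * unsup
```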
Recent works have shown that optical flow can be learned by deep networks from unlabelled image pairs based on the brightness constancy assumption and a smoothness prior. Current approaches additionally impose an augmentation regularization term for continual self-supervision, which has been shown to be effective on difficult matching regions. However, this method also amplifies the inevitable mismatches of the unsupervised setting, hindering the learning process from reaching an optimal solution. To break the dilemma, we propose a novel mutual distillation framework to transfer reliable knowledge back and forth between the teacher and student networks for alternate improvement. Concretely, taking the estimates of an off-the-shelf unsupervised approach as pseudo labels, our insight lies in defining a confidence selection mechanism to extract relatively good matches, and then adding diverse data augmentation to distill adequate and reliable knowledge from teacher to student. Thanks to the decoupled nature of our method, we can choose a stronger student architecture for sufficient learning. Finally, the better student prediction is adopted to transfer knowledge back to the efficient teacher without additional costs in real deployment. Rather than formulating it as a supervised task, we find that introducing an extra unsupervised term for multi-target learning achieves the best final results. Extensive experiments show that our approach, termed MDFlow, achieves state-of-the-art real-time accuracy and generalization ability on challenging benchmarks. Code is available at https://github.com/ltkong218/MDFlow.
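One standard way to implement the confidence selection the abstract mentions is a forward-backward consistency check on the teacher's flow, with distillation applied only where the check passes. The PyTorch sketch below illustrates this idea; the thresholds and the exact selection rule are assumptions rather than MDFlow's published mechanism.

```python
import torch

def confidence_mask(flow_fw, flow_bw_warped, alpha=0.01, beta=0.5):
    """Select reliable teacher matches via forward-backward consistency
    (thresholds are illustrative). flow_bw_warped is the backward flow sampled
    at the forward-flow target locations."""
    fb_diff = (flow_fw + flow_bw_warped).pow(2).sum(dim=1)                  # (B,H,W)
    flow_mag = flow_fw.pow(2).sum(dim=1) + flow_bw_warped.pow(2).sum(dim=1)
    return fb_diff < alpha * flow_mag + beta                                # boolean mask

def distillation_loss(student_flow, teacher_flow, mask):
    """Distill only on pixels where the teacher is deemed confident."""
    diff = (student_flow - teacher_flow).abs().sum(dim=1)                   # (B,H,W)
    return (diff * mask.float()).sum() / (mask.float().sum() + 1e-6)
```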
How important are training details and datasets to recent optical flow models like RAFT? And do they generalize? To explore these questions, rather than developing a new model, we revisit three prominent models, PWC-Net, IRR-PWC, and RAFT, with a common set of modern training techniques and datasets, and observe significant performance gains, demonstrating the importance and generality of these training details. Our newly trained PWC-Net and IRR-PWC models show surprisingly large improvements, up to 30% versus the originally published results on the Sintel and KITTI 2015 benchmarks. They outperform the more recent Flow1D on KITTI 2015 while being 3x faster during inference. Our newly trained RAFT achieves 4.31% on KITTI 2015, more accurate than all published optical flow methods at the time of writing. Our results demonstrate the benefit of separating the contributions of models, training techniques, and datasets when analyzing the performance gains of optical flow methods. Our source code will be publicly available.
Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-network-based methods rely on image brightness consistency and local smoothness constraints to train the networks. Their performance degrades in regions with repetitive textures or occlusions. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method that incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate the "chicken-and-egg" type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmark datasets show that our method achieves competitive performance compared with supervised methods and outperforms state-of-the-art unsupervised deep learning methods.
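For intuition, one common way to turn the epipolar constraint into a training signal is to penalize the Sampson distance of the correspondences induced by the predicted flow, given a fundamental matrix. The sketch below assumes this particular formulation; the paper investigates several alternatives.

```python
import torch

def sampson_epipolar_loss(pts1, flow, F_mat, eps=1e-6):
    """Penalize deviation of flow-induced correspondences from the epipolar
    constraint via the Sampson distance (an assumed formulation).
    pts1: (N,2) pixel coordinates in frame 1; flow: (N,2); F_mat: (3,3)."""
    pts2 = pts1 + flow                                   # correspondences in frame 2
    ones = torch.ones_like(pts1[:, :1])
    x1 = torch.cat([pts1, ones], dim=1)                  # homogeneous coordinates (N,3)
    x2 = torch.cat([pts2, ones], dim=1)
    Fx1 = x1 @ F_mat.t()                                 # rows are F x1
    Ftx2 = x2 @ F_mat                                    # rows are F^T x2
    num = (x2 * Fx1).sum(dim=1) ** 2                     # (x2^T F x1)^2
    den = Fx1[:, 0]**2 + Fx1[:, 1]**2 + Ftx2[:, 0]**2 + Ftx2[:, 1]**2
    return (num / (den + eps)).mean()
```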
Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams. Assuming the world is mostly static, this scalable approach leverages projective geometry and ego-motion to learn via view synthesis. Dynamic scenes, which are common in autonomous driving and human-robot interaction, violate this assumption. Therefore, they require explicitly modeling dynamic objects, for instance by estimating pixel-wise 3D motion, i.e. scene flow. However, the simultaneous self-supervised learning of depth and scene flow is ill-posed, as there are infinitely many combinations that result in the same 3D point. In this paper we propose DRAFT, a new method capable of jointly learning depth, optical flow, and scene flow by combining synthetic data with geometric self-supervision. Building upon the RAFT architecture, we learn optical flow as an intermediate task to bootstrap depth and scene flow learning via triangulation. Our algorithm also leverages temporal and geometric consistency losses across tasks to improve multi-task learning. DRAFT simultaneously establishes a new state of the art in all three tasks in the self-supervised monocular setting on the standard KITTI benchmark. Project page: https://sites.google.com/tri.global/draft.
We present a compact but effective CNN model for optical flow, called PWC-Net. PWC-Net has been designed according to simple and well-established principles: pyramidal processing, warping, and the use of a cost volume. Cast in a learnable feature pyramid, PWC-Net uses the current optical flow estimate to warp the CNN features of the second image. It then uses the warped features and features of the first image to construct a cost volume, which is processed by a CNN to estimate the optical flow. PWC-Net is 17 times smaller in size and easier to train than the recent FlowNet2 model. Moreover, it outperforms all published optical flow methods on the MPI Sintel final pass and KITTI 2015 benchmarks, running at about 35 fps on Sintel resolution (1024×436) images. Our models are available on https://github.com/NVlabs/PWC-Net.
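The core PWC-Net operations named above are warping the second image's features with the current flow estimate and correlating them with the first image's features within a small search window. The sketch below shows a simplified local cost volume in PyTorch (a plain loop over displacements, not the optimized correlation layer used in practice); the warped features are assumed to be computed beforehand.

```python
import torch
import torch.nn.functional as F

def cost_volume(feat1, feat2_warped, max_disp=4):
    """Local correlation cost volume over a (2*max_disp+1)^2 search window."""
    b, c, h, w = feat1.shape
    padded = F.pad(feat2_warped, [max_disp] * 4)          # pad H and W by max_disp
    costs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            # channel-averaged dot product between reference and shifted features
            costs.append((feat1 * shifted).mean(dim=1, keepdim=True))
    return torch.cat(costs, dim=1)                        # (B, (2d+1)^2, H, W)
```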
Synthetic datasets are often used to pretrain end-to-end optical flow networks, due to the lack of a large amount of labeled, real-scene data. But major drops in accuracy occur when moving from synthetic to real scenes. How do we better transfer the knowledge learned from synthetic to real domains? To this end, we propose CLIP-Flow, a semi-supervised iterative pseudo-labeling framework to transfer the pretraining knowledge to the target real domain. We leverage large-scale, unlabeled real data to facilitate transfer learning with the supervision of iteratively updated pseudo-ground truth labels, bridging the domain gap between the synthetic and the real. In addition, we propose a contrastive flow loss on reference features and the warped features by pseudo ground truth flows, to further boost the accurate matching and dampen the mismatching due to motion, occlusion, or noisy pseudo labels. We adopt RAFT as the backbone and obtain an F1-all error of 4.11%, i.e. a 19% error reduction from RAFT (5.10%) and ranking 2$^{nd}$ place at submission on the KITTI 2015 benchmark. Our framework can also be extended to other models, e.g. CRAFT, reducing the F1-all error from 4.79% to 4.66% on KITTI 2015 benchmark.
In this paper, the fundamental problem of robust visual odometry (VO) is tackled by incorporating geometry-based methods into a deep learning architecture in a self-supervised manner. Generally, purely geometry-based algorithms are less robust than deep learning in feature point extraction and matching, but perform well in ego-motion estimation thanks to their mature geometric theory. In this work, a novel optical flow network (PANet) built on a position-aware mechanism is first proposed. Then, a novel system is proposed that jointly estimates depth, optical flow, and ego-motion without a typical network for learning ego-motion. The key component of the proposed system is an improved bundle adjustment module containing multiple sampling, ego-motion initialization, dynamic damping factor adjustment, and Jacobian matrix weighting. In addition, a novel relative photometric loss function is put forward to improve depth estimation accuracy. The experiments show that the proposed system not only outperforms other state-of-the-art methods in depth, flow, and VO estimation among self-supervised learning-based methods, but also significantly improves robustness compared with geometry-based, learning-based, and hybrid VO systems. Further experiments show that our model achieves outstanding generalization ability and performance in challenging indoor (TUM-RGBD) and outdoor (KAIST) scenes.
Tracking pixels in videos is typically studied as an optical flow estimation problem, where every pixel is described with a displacement vector that locates it in the next frame. Even though wider temporal context is freely available, prior efforts to take it into account have yielded only small gains over 2-frame methods. In this paper, we revisit Sand and Teller's "particle video" approach, and study pixel tracking as a long-range motion estimation problem, where every pixel is described with a trajectory that locates it in multiple future frames. We re-build this classic approach using components that drive the current state of the art in flow and object tracking, such as dense cost maps, iterative optimization, and learned appearance updates. We train our models using long-range amodal point trajectories mined from existing optical flow data, augmented with synthesized multi-frame occlusions. We test our approach in trajectory estimation benchmarks and in keypoint label propagation tasks, and compare against state-of-the-art optical flow and feature tracking methods.
Recent approaches for optical flow estimation rely on deep learning, which requires complex sequential training schemes to reach optimal performance on real-world data. In this work, we introduce COMBO, a deep network that explicitly exploits the brightness constancy (BC) model used in traditional methods. Since BC is an approximate physical model violated in several situations, we propose to train a physically constrained network complemented by a data-driven one. We introduce a unique and meaningful flow decomposition between the physical prior and the data-driven complement, including an uncertainty quantification of the BC model. We derive a joint training scheme for learning the different components of the decomposition, ensuring optimal cooperation in a supervised but also in a semi-supervised setting. Experiments show that COMBO improves performance over state-of-the-art supervised networks such as RAFT, reaching state-of-the-art results on several benchmarks. We highlight how COMBO can leverage the BC model and adapt to its limitations. Finally, we show that our semi-supervised method can significantly simplify the training procedure.
Given two consecutive frames, video interpolation aims at generating intermediate frame(s) to form both spatially and temporally coherent video sequences. While most existing methods focus on single-frame interpolation, we propose an end-to-end convolutional neural network for variable-length multi-frame video interpolation, where the motion interpretation and occlusion reasoning are jointly modeled. We start by computing bi-directional optical flow between the input images using a U-Net architecture. These flows are then linearly combined at each time step to approximate the intermediate bi-directional optical flows. These approximate flows, however, only work well in locally smooth regions and produce artifacts around motion boundaries. To address this shortcoming, we employ another U-Net to refine the approximated flow and also predict soft visibility maps. Finally, the two input images are warped and linearly fused to form each intermediate frame. By applying the visibility maps to the warped images before fusion, we exclude the contribution of occluded pixels to the interpolated intermediate frame to avoid artifacts. Since none of our learned network parameters are time-dependent, our approach is able to produce as many intermediate frames as needed. To train our network, we use 1,132 240-fps video clips, containing 300K individual video frames. Experimental results on several datasets, predicting different numbers of interpolated frames, demonstrate that our approach performs consistently better than existing methods.
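The "linear combination at each time step" and the visibility-weighted fusion described above can be written compactly. The sketch below assumes the locally-linear-motion coefficients commonly used for this approximation and an illustrative fusion; it is a sketch, not the released implementation.

```python
def intermediate_flows(flow_01, flow_10, t):
    """Approximate flows from time t to frames 0 and 1 by linearly combining
    the two input flows (coefficients assume locally linear motion)."""
    flow_t0 = -(1.0 - t) * t * flow_01 + t * t * flow_10
    flow_t1 = (1.0 - t) ** 2 * flow_01 - t * (1.0 - t) * flow_10
    return flow_t0, flow_t1

def fuse_frame(warped_0, warped_1, vis_0, vis_1, t, eps=1e-6):
    """Visibility-weighted fusion of the two warped inputs into frame t,
    suppressing the contribution of occluded pixels."""
    num = (1.0 - t) * vis_0 * warped_0 + t * vis_1 * warped_1
    den = (1.0 - t) * vis_0 + t * vis_1
    return num / (den + eps)
```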
Current computer vision tasks based on deep learning require a large amount of annotated data for model training or testing, especially in dense estimation tasks such as optical flow, segmentation, and depth estimation. In practice, manual labeling for dense estimation tasks is very difficult or even impossible, and the scenes of existing datasets are often limited to a small range, which greatly restricts the development of the community. To overcome this deficiency, we propose a synthetic dataset generation method to obtain scalable datasets without heavy manual labor. With this method, we construct a dataset named MineNavi, which contains first-person-view video footage from aircraft matched with accurate ground truth for depth estimation in aircraft navigation applications. We also provide quantitative experiments to show that pre-training on the MineNavi dataset can improve the performance of depth estimation models and speed up model convergence on real-scene data. Since synthetic datasets have an effect similar to real-world datasets during the training of deep models, we further provide additional experiments with monocular depth estimation methods to demonstrate the impact of various factors in our dataset, such as lighting conditions and motion patterns.
In recent years, deep neural networks have demonstrated outstanding capabilities in solving many computer vision tasks, including scene flow prediction. However, most of these advances depend on the availability of a large amount of dense, per-pixel ground truth annotations, which are very difficult to obtain for real-life scenarios. Therefore, synthetic data is often relied upon for supervision, resulting in a representation gap between the training and test data. Even though a large amount of unlabeled real-world data is available, there is a significant lack of self-supervised methods for scene flow prediction. Hence, we explore the extension of a self-supervised loss based on the census transform and occlusion-aware bidirectional displacements to the problem of scene flow prediction. On the KITTI scene flow benchmark, our method outperforms the corresponding supervised pre-training of the same network and shows improved generalization capabilities while achieving much faster convergence.
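For reference, a census-based self-supervised term typically compares soft census descriptors of the first image and the backward-warped second image under a validity (occlusion) mask. The PyTorch sketch below shows one such formulation; the patch size, soft-sign constants, and robust distance are common choices from the unsupervised-flow literature, not necessarily those used in this work.

```python
import torch
import torch.nn.functional as F

def census_transform(img, patch=3):
    """Soft census descriptor: soft sign of the difference between each pixel
    and its neighbors in a patch (single-channel input assumed)."""
    b, _, h, w = img.shape
    neighbors = F.unfold(img, kernel_size=patch, padding=patch // 2)   # (B, patch*patch, H*W)
    neighbors = neighbors.view(b, patch * patch, h, w)
    diff = neighbors - img                                             # compare to center pixel
    return diff / torch.sqrt(0.81 + diff ** 2)                         # soft sign in (-1, 1)

def census_loss(img1, img2_warped, valid_mask):
    """Occlusion-aware photometric loss on census descriptors."""
    d = census_transform(img1) - census_transform(img2_warped)
    dist = (d ** 2 / (0.1 + d ** 2)).sum(dim=1, keepdim=True)          # robust per-pixel distance
    return (dist * valid_mask).sum() / (valid_mask.sum() + 1e-6)
```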
Convolutional neural networks (CNNs) have recently been very successful in a variety of computer vision tasks, especially on those linked to recognition. Optical flow estimation has not been among the tasks where CNNs were successful. In this paper we construct appropriate CNNs which are capable of solving the optical flow estimation problem as a supervised learning task. We propose and compare two architectures: a generic architecture and another one including a layer that correlates feature vectors at different image locations. Since existing ground truth datasets are not sufficiently large to train a CNN, we generate a synthetic Flying Chairs dataset. We show that networks trained on this unrealistic data still generalize very well to existing datasets such as Sintel and KITTI, achieving competitive accuracy at frame rates of 5 to 10 fps.
Understanding 3D scenes is a key prerequisite for autonomous agents. Recently, LiDAR and other sensors have provided large amounts of data in the form of temporal sequences of point cloud frames. In this work, we propose a new problem, sequential scene flow estimation (SSFE), which aims to predict the 3D scene flow for all point clouds in a given sequence. This differs from the previously studied problem of scene flow estimation, which focuses on two frames. We introduce the SPCM-Net architecture, which solves this problem by computing multi-scale spatiotemporal correlations between neighboring point clouds and then aggregating the correlations across time with an order-invariant recurrent unit. Our experimental evaluation confirms that recurrent processing of point cloud sequences leads to significantly better SSFE than using only two frames. In addition, we demonstrate that this approach can be effectively modified for sequential point cloud forecasting (SPF), a related problem that requires forecasting future point cloud frames. Our experimental results are evaluated using a new benchmark for both SSFE and SPF, consisting of synthetic and real datasets. Previously, datasets for scene flow estimation were limited to two frames. We provide non-trivial extensions of these datasets for multi-frame estimation and forecasting. Due to the difficulty of obtaining ground truth motion for real-world datasets, we use self-supervised training and evaluation metrics. We believe this benchmark will be pivotal to future research in this area. All code for the benchmark and models will be made accessible.
The FlowNet demonstrated that optical flow estimation can be cast as a learning problem. However, the state of the art with regard to the quality of the flow has still been defined by traditional methods. Particularly on small displacements and real-world data, FlowNet cannot compete with variational methods. In this paper, we advance the concept of end-to-end learning of optical flow and make it work really well. The large improvements in quality and speed are caused by three major contributions: first, we focus on the training data and show that the schedule of presenting data during training is very important. Second, we develop a stacked architecture that includes warping of the second image with intermediate optical flow. Third, we elaborate on small displacements by introducing a subnetwork specializing on small motions. FlowNet 2.0 is only marginally slower than the original FlowNet but decreases the estimation error by more than 50%. It performs on par with state-of-the-art methods, while running at interactive frame rates. Moreover, we present faster variants that allow optical flow computation at up to 140fps with accuracy matching the original FlowNet.
Recent work has shown that optical flow estimation can be formulated as a supervised learning task and can be successfully solved with convolutional networks. Training of the so-called FlowNet was enabled by a large synthetically generated dataset. The present paper extends the concept of optical flow estimation via convolutional networks to disparity and scene flow estimation. To this end, we propose three synthetic stereo video datasets with sufficient realism, variation, and size to successfully train large networks. Our datasets are the first large-scale datasets to enable training and evaluating scene flow methods. Besides the datasets, we present a convolutional network for real-time disparity estimation that provides state-of-the-art results. By combining a flow and disparity estimation network and training it jointly, we demonstrate the first scene flow estimation with a convolutional network.