The goal of this paper is to detect objects by exploiting their interrelationships. Rather than relying on predefined and labeled graph structures, we infer a graph prior from object co-occurrence statistics. The key idea of our paper is to model object relations as a function of initial class predictions and co-occurrence priors to generate a graph representation of an image for improved classification and bounding box regression. We additionally learn the object-relation joint distribution via energy based modeling. Sampling from this distribution generates a refined graph representation of the image which in turn produces improved detection performance. Experiments on the Visual Genome and MS-COCO datasets demonstrate our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. What is more, we establish a consistent improvement over object detectors like DETR and Faster-RCNN, as well as state-of-the-art methods modeling object interrelationships.
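As a rough illustration of what a graph prior inferred from object co-occurrence statistics could look like, the sketch below estimates the conditional frequency with which class j appears in an image given that class i appears. The function name `cooccurrence_prior` and the toy annotations are hypothetical, and the abstract does not specify the exact form of the prior, so this is one plausible construction rather than the authors' method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_prior(image_labels, classes):
    """Estimate P(class j present | class i present) from per-image label lists."""
    count = Counter()  # number of images containing class i
    pair = Counter()   # number of images containing both i and j
    for labels in image_labels:
        present = set(labels)
        for c in present:
            count[c] += 1
        for a, b in combinations(sorted(present), 2):
            pair[(a, b)] += 1
            pair[(b, a)] += 1
    prior = {}
    for i in classes:
        for j in classes:
            if i != j and count[i] > 0:
                prior[(i, j)] = pair[(i, j)] / count[i]
            else:
                prior[(i, j)] = 0.0  # no self-edges, no evidence for unseen classes
    return prior


# Toy annotations (hypothetical): one list of class names per image.
images = [
    ["person", "horse"],
    ["person", "dog"],
    ["person", "horse", "dog"],
    ["car"],
]
prior = cooccurrence_prior(images, ["person", "horse", "dog", "car"])
```

In this toy data every image with a horse also contains a person, so the edge from "horse" to "person" gets the largest weight, while "car" is disconnected from the other classes.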
Line segments are ubiquitous in our human-made world and are increasingly used in vision tasks. They are complementary to feature points thanks to their spatial extent and the structural information they provide. Traditional line detectors based on the image gradient are extremely fast and accurate, but lack robustness in noisy images and challenging conditions. Their learned counterparts are more repeatable and can handle challenging images, but at the cost of a lower accuracy and a bias towards wireframe lines. We propose to combine traditional and learned approaches to get the best of both worlds: an accurate and robust line detector that can be trained in the wild without ground truth lines. Our new line segment detector, DeepLSD, processes images with a deep network to generate a line attraction field, before converting it to a surrogate image gradient magnitude and angle, which is then fed to any existing handcrafted line detector. Additionally, we propose a new optimization tool to refine line segments based on the attraction field and vanishing points. This refinement improves the accuracy of current deep detectors by a large margin. We demonstrate the performance of our method on low-level line detection metrics, as well as on several downstream tasks using multiple challenging datasets. The source code and models are available at https://github.com/cvg/DeepLSD.
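We introduce a scalable framework for novel view synthesis from RGB-D images with largely incomplete scene coverage. While generative neural approaches have demonstrated spectacular results on 2D images, they have not yet achieved similarly photorealistic results in combination with scene completion, where a spatial 3D understanding of the scene is essential. To this end, we propose a generative pipeline that operates on a grid-based neural scene representation and completes unobserved scene parts by modeling a distribution of scenes in a 2.5D-3D-2.5D manner. We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to infer the missing regions. Photorealistic image sequences can finally be obtained via consistency-aware differentiable rendering. Comprehensive experiments show that the graphical outputs of our method outperform the state of the art, especially within unobserved scene parts.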
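Many hand-held or mixed reality devices are used with a single sensor for 3D reconstruction, although they often contain multiple sensors. Multi-sensor depth fusion can substantially improve the robustness and accuracy of 3D reconstruction methods, but existing techniques are not robust enough to handle sensors that operate with different value ranges as well as different noise and outlier statistics. To this end, we introduce SenFuNet, a depth fusion approach that learns sensor-specific noise and outlier statistics and combines the data streams of depth frames in an online fashion. Our method fuses multi-sensor depth streams regardless of time synchronization and calibration, and generalizes well with little training data. We conduct experiments with various sensor combinations on real-world data and the Scene3D dataset, as well as on the Replica dataset. Experiments demonstrate that our fusion strategy outperforms traditional and recent online depth fusion approaches. In addition, the combination of multiple sensors yields more robust outlier handling and more precise surface reconstruction than the use of a single sensor. The source code and data are available at https://github.com/tfy14esa/senfunet.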
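Building upon recent progress in novel view synthesis, we propose its application to improve monocular depth estimation. In particular, we propose a novel training method split into three main steps. First, the prediction results of a monocular depth network are warped to an additional viewpoint. Second, we apply an additional image synthesis network, which corrects and improves the quality of the warped RGB image. The output of this network is required to look as similar as possible to the ground-truth view by minimizing the pixel-wise RGB reconstruction error. Third, we reapply the same monocular depth estimation to the synthesized second viewpoint and ensure that the depth predictions are consistent with the associated ground-truth depth. Experimental results show that our method achieves state-of-the-art or comparable performance on the KITTI and NYU-Depth-v2 datasets with a lightweight and simple vanilla U-Net architecture.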
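The first step, warping a depth prediction to an additional viewpoint, can be illustrated with the standard pinhole back-project, transform, and re-project sequence for a single pixel. This is a minimal sketch under assumed conventions: the function `warp_pixel`, the intrinsics tuple `(fx, fy, cx, cy)`, and the rigid transform `(R, t)` mapping first-camera coordinates to second-camera coordinates are all illustrative choices, since the abstract does not describe how the warping is implemented.

```python
def warp_pixel(u, v, depth, K, R, t):
    """Warp pixel (u, v) with a given depth from camera 1 into camera 2.

    K is (fx, fy, cx, cy); R is a 3x3 rotation (list of rows) and t a
    3-vector such that X2 = R @ X1 + t. Returns (u2, v2, depth2), or None
    if the point ends up behind the second camera.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the first camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    z = depth
    # Apply the rigid transform into the second camera frame.
    p = [R[r][0] * x + R[r][1] * y + R[r][2] * z + t[r] for r in range(3)]
    if p[2] <= 0:
        return None
    # Project into the second image with the same intrinsics.
    u2 = fx * p[0] / p[2] + cx
    v2 = fy * p[1] / p[2] + cy
    return u2, v2, p[2]


# Hypothetical setup: identity rotation, second camera shifted sideways.
K = (500.0, 500.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [-0.1, 0.0, 0.0]
warped = warp_pixel(320.0, 240.0, 2.0, K, R, t)
```

Applying this to every pixel of a predicted depth map (and carrying the RGB values along) would yield the warped image that the synthesis network then corrects; occlusion handling and interpolation are omitted here.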
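We propose a method for jointly estimating the 3D motion, 3D shape, and appearance of highly motion-blurred objects from a video. To this end, we model the blurred appearance of a fast-moving object in a generative fashion over the duration of a predefined time window spanning multiple frames. Using differentiable rendering, we are able to estimate all parameters by minimizing the pixel-wise reprojection error to the input video, where motion blur is accounted for by averaging the rendered output over short time intervals. For that purpose, we also estimate the camera exposure gap time within the same optimization. To account for abrupt motion changes such as bounces, we model the motion trajectory as a piecewise polynomial, which allows us to estimate the specific time of a bounce at sub-frame accuracy. Experiments on established benchmark datasets demonstrate that our method outperforms previous methods for fast-moving object deblurring and 3D reconstruction.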
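The two modeling ideas named in the abstract, a piecewise polynomial trajectory and motion blur formed by averaging sharp renderings over the exposure interval, can be sketched in a one-dimensional toy setting. Everything here is an illustrative assumption: the helper names `position`, `render_sharp`, and `render_blurred`, the segment encoding, and the box-shaped "object" stand in for the differentiable 3D renderer the authors use.

```python
def position(segments, t):
    """Evaluate a piecewise polynomial trajectory.

    segments is a list of (t_start, coeffs) sorted by t_start; each piece is
    sum(coeffs[i] * (t - t_start) ** i) and is active from its t_start onward.
    """
    start, coeffs = segments[0]
    for s, c in segments:
        if s <= t:
            start, coeffs = s, c
    dt = t - start
    return sum(c * dt ** i for i, c in enumerate(coeffs))


def render_sharp(pos, width, radius=1.0):
    """Render a sharp 1D 'image': pixels within radius of pos are lit."""
    return [1.0 if abs(x - pos) <= radius else 0.0 for x in range(width)]


def render_blurred(segments, t_open, t_close, width, n_samples=50):
    """Approximate motion blur by averaging sharp renderings over the exposure."""
    acc = [0.0] * width
    for k in range(n_samples):
        # Midpoint sampling of the exposure interval.
        t = t_open + (t_close - t_open) * (k + 0.5) / n_samples
        frame = render_sharp(position(segments, t), width)
        acc = [a + f for a, f in zip(acc, frame)]
    return [a / n_samples for a in acc]


# Hypothetical trajectory: move right from x=2 at speed 10, bounce at t=0.5.
bounce = [(0.0, [2.0, 10.0]), (0.5, [7.0, -10.0])]
blurred = render_blurred(bounce, 0.0, 1.0, width=10)
static = render_blurred([(0.0, [5.0])], 0.0, 1.0, width=10)
```

A static object reproduces the sharp rendering, while the bouncing object is smeared along its path with no pixel fully lit. In an optimization setting, the segment start time of the second piece would play the role of the bounce time estimated at sub-frame accuracy.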
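In this paper, we propose a simple attention mechanism that we call box-attention. It enables spatial interaction between grid features, sampled from boxes of interest, and improves the learning capability of transformers for several vision tasks. Specifically, we present BoxeR, short for Box Transformer, which attends to a set of boxes by predicting their transformation from a reference window on an input feature map. BoxeR computes attention weights on these boxes by taking their grid structure into account. Notably, BoxeR-2D naturally reasons about box information in addition to content information within its attention module, making it suitable for end-to-end instance detection and segmentation tasks. By learning invariance to rotation in the box-attention module, BoxeR-3D is capable of generating discriminative information from a bird's-eye-view plane for end-to-end 3D object detection. Our experiments demonstrate that the proposed BoxeR-2D achieves better results on COCO detection and reaches performance comparable to the well-established and highly optimized Mask R-CNN on COCO instance segmentation. BoxeR-3D already obtains compelling performance for the vehicle category of Waymo Open, without any class-specific optimization. The code will be released.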
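This paper presents a real-time online vision framework to jointly recover the 3D structure and semantic labels of an indoor scene. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed deep-neural-network-based approach learns to fuse the depth over frames with suitable semantic labels in scene space. Our approach exploits a joint volumetric representation of depth and semantics in the scene feature space to solve this task. For a compelling real-time online fusion of semantic labels and geometry, we introduce an efficient vortex pooling block while dropping the routing network used in online depth fusion, in order to preserve high-frequency surface details. We show that the context information provided by the semantics of the scene helps the depth fusion network learn noise-resistant features. Moreover, it helps overcome the shortcomings of current online depth fusion methods in dealing with thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our approach can perform depth fusion at 37 and 10 frames per second with an average reconstruction F-score of 88% and 91%, respectively, depending on the depth map resolution. In addition, our model achieves an average IoU score of 0.515 on the ScanNet 3D semantic benchmark leaderboard.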
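Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose PriSMONet, a novel approach based on prior shape knowledge for learning multi-object 3D scene decomposition and representations from single images. Our approach learns to decompose images of synthetic scenes with multiple objects on a planar surface into their constituent scene objects and to infer their 3D properties from a single view. A recurrent encoder regresses a latent representation of 3D shape, pose, and texture from an input RGB image. Through differentiable rendering, we train our model to decompose scenes from RGB-D images in a self-supervised way. The 3D shapes are represented continuously in function space as signed distance functions, which we pre-train from example shapes in a supervised way. These shape priors provide weak supervision signals to better condition the challenging overall learning task. We evaluate the accuracy of our model in inferring 3D scene layout, demonstrate its generative capabilities, assess its generalization to real images, and point out the benefits of the learned representation.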
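For readers unfamiliar with signed distance functions, the sketch below shows the representation in its simplest analytic form: a function that maps any 3D point to its distance from a surface, negative inside and positive outside. The sphere and the `min`-based scene composition are textbook illustrations chosen for this example; the abstract describes learned, pre-trained shape functions, which are not reproduced here.

```python
import math


def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere: negative inside, zero on the surface."""
    return math.dist(p, center) - radius


def scene_sdf(p, spheres):
    """Signed distance to a union of spheres: the minimum over the individual objects."""
    return min(sphere_sdf(p, c, r) for c, r in spheres)


# Hypothetical two-object scene given as (center, radius) pairs.
scene = [((0.0, 0.0, 0.0), 1.0), ((3.0, 0.0, 0.0), 0.5)]
inside = scene_sdf((0.0, 0.0, 0.0), scene)    # center of the first sphere
on_surface = scene_sdf((1.0, 0.0, 0.0), scene)
outside = scene_sdf((2.0, 0.0, 0.0), scene)   # between the two spheres
```

Because the function is defined at every point in space, the surface is represented continuously rather than on a fixed grid, which is what makes such representations convenient to combine with differentiable rendering.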
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.