Autonomous driving is an exciting new industry, posing important research questions. Within the perception module, 3D human pose estimation is an emerging technology that can enable the autonomous vehicle to perceive and understand the subtle and complex behaviors of pedestrians. While hardware systems and sensors have dramatically improved over the decades -- with cars potentially boasting complex LiDAR and vision systems, and with a growing body of dedicated datasets for these newly available signals -- relatively little work has been done to harness them for the core problem of 3D human pose estimation. Our method, which we coin HUM3DIL (HUMan 3D from Images and LiDAR), makes efficient use of these complementary signals in a semi-supervised fashion and outperforms existing methods by a large margin. It is a fast and compact model suitable for onboard deployment. Specifically, we embed LiDAR points into pixel-aligned multi-modal features, which we pass through a sequence of Transformer refinement stages. Quantitative experiments on the Waymo Open Dataset support these claims: we achieve state-of-the-art results on the task of 3D pose estimation.
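To make the described pipeline concrete, below is a minimal sketch in PyTorch of one way to realize it: LiDAR points are projected into the image plane, pixel-aligned features are sampled at the projected locations, and the resulting multi-modal tokens pass through Transformer refinement before a 3D joint regressor. All names (e.g. project_to_image), tensor shapes, and the pooling-based head are illustrative assumptions rather than the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def project_to_image(points, intrinsics, image_size=(640, 480)):
        # Pinhole projection of camera-frame points to normalized [-1, 1] coordinates.
        uv = torch.einsum('bij,bnj->bni', intrinsics, points)
        uv = uv[..., :2] / uv[..., 2:3].clamp(min=1e-6)
        w, h = image_size
        return 2.0 * uv / uv.new_tensor([w, h]) - 1.0

    class LidarImagePoseNet(nn.Module):
        # Sketch: pixel-aligned multi-modal point features + Transformer refinement.
        def __init__(self, feat_dim=128, num_layers=4, num_joints=15):
            super().__init__()
            self.point_mlp = nn.Sequential(
                nn.Linear(3 + feat_dim, feat_dim), nn.ReLU(),
                nn.Linear(feat_dim, feat_dim))
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
            self.refine = nn.TransformerEncoder(layer, num_layers=num_layers)
            self.head = nn.Linear(feat_dim, num_joints * 3)  # 3D joint regressor

        def forward(self, points, image_feats, intrinsics):
            # points: (B, N, 3) LiDAR points; image_feats: (B, C, H, W) RGB features.
            B = points.shape[0]
            uv = project_to_image(points, intrinsics)
            pix = F.grid_sample(image_feats, uv.unsqueeze(2),
                                align_corners=False).squeeze(-1).transpose(1, 2)
            tokens = self.point_mlp(torch.cat([points, pix], dim=-1))
            tokens = self.refine(tokens)            # Transformer refinement stages
            return self.head(tokens.mean(dim=1)).view(B, -1, 3)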
We present PhoMoH, a neural network methodology to construct generative models of photorealistic 3D geometry and appearance of human heads including hair, beards, clothing and accessories. In contrast to prior work, PhoMoH models the human head using neural fields, thus supporting complex topology. Instead of learning a head model from scratch, we propose to augment an existing expressive head model with new features. Concretely, we learn a highly detailed geometry network layered on top of a mid-resolution head model together with a detailed, local geometry-aware, and disentangled color field. Our proposed architecture allows us to learn photorealistic human head models from relatively little data. The learned generative geometry and appearance networks can be sampled individually and allow the creation of diverse and realistic human heads. Extensive experiments validate our method qualitatively and across different metrics.
We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which, in turn, helps model accessories, hair, and loose clothing. Building on this representation, we present a complete 3D transformer-based attention framework that, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, produced by a single end-to-end model trained in a semi-supervised manner and requiring no additional postprocessing. We show that our S3F model surpasses the previous state of the art on various tasks, including monocular 3D reconstruction as well as albedo and shading estimation. Moreover, we show that the proposed methodology allows for novel view synthesis, relighting, and re-posing of the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications.
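As a rough, hypothetical illustration of this representation (layer names, shapes, and the projection callback are assumptions, not the released model), the sketch below attaches pixel-aligned image features and per-point semantics to points sampled from a parametric body surface, lets the points shift freely in 3D, and aggregates everything with transformer attention:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Structured3DFeatures(nn.Module):
        # Sketch: body-surface points carry pixel-aligned features and semantics,
        # shift freely in 3D, and interact through transformer attention.
        def __init__(self, feat_dim=256, num_semantics=24, num_layers=6):
            super().__init__()
            self.semantic_embed = nn.Embedding(num_semantics, feat_dim)
            self.offset_head = nn.Linear(feat_dim, 3)   # free 3D displacement per point
            layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

        def forward(self, surface_points, semantics, image_feats, project_fn):
            # surface_points: (B, N, 3) points sampled on a fitted parametric body mesh
            # semantics: (B, N) integer body-part label per point
            # project_fn: maps 3D points to normalized image coordinates in [-1, 1]
            uv = project_fn(surface_points)                               # (B, N, 2)
            pix = F.grid_sample(image_feats, uv.unsqueeze(2),
                                align_corners=False).squeeze(-1).transpose(1, 2)
            tokens = pix + self.semantic_embed(semantics)                 # (B, N, C)
            points_3d = surface_points + self.offset_head(tokens)         # cover hair/clothing
            tokens = self.encoder(tokens)             # attention over all structured features
            return points_3d, tokens                  # downstream heads predict albedo etc.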
In this paper, we propose a new approach to learned optimization. As is common in the literature, we represent the computation of the optimizer's update step with a neural network. The parameters of the optimizer are then learned on a set of training optimization tasks so that it performs minimisation efficiently. Our main innovation is a new neural network architecture for the learned optimizer, inspired by the classic BFGS algorithm. As in BFGS, we estimate a preconditioning matrix as a sum of rank-one updates, but use a transformer-based neural network to predict these updates jointly with the step length and direction. In contrast to several recent learned optimization approaches, our formulation allows for conditioning across different dimensions of the parameter space of the target problem while remaining applicable to optimization tasks of variable dimensionality without retraining. We demonstrate the advantages of our approach on a benchmark composed of objective functions traditionally used for the evaluation of optimization algorithms, as well as on the real-world task of physics-based reconstruction of articulated 3D human motion.
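A toy sketch of the described update rule may help: a network predicts rank-one terms that are summed into a preconditioning matrix, together with a positive step length. This version uses a small per-coordinate MLP purely for brevity, whereas the paper predicts the updates jointly with a transformer; all names and sizes are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RankOnePreconditioner(nn.Module):
        # Predicts rank-one updates u_k v_k^T summed into a preconditioner,
        # plus a positive step length, and applies one optimizer step.
        def __init__(self, hidden=64, num_updates=4):
            super().__init__()
            self.coord_net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(),
                                           nn.Linear(hidden, 2 * num_updates))
            self.pool = nn.Linear(2, hidden)
            self.step_net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                          nn.Linear(hidden, 1))

        def forward(self, params, grads):
            # params, grads: flat (D,) tensors for the current iterate.
            x = torch.stack([params, grads], dim=-1)          # (D, 2) per-coordinate input
            u, v = self.coord_net(x).chunk(2, dim=-1)         # (D, K) factors each
            precond = torch.eye(params.numel(),
                                device=params.device) + u @ v.T   # I + sum_k u_k v_k^T
            alpha = F.softplus(self.step_net(self.pool(x).mean(dim=0)))  # step length > 0
            return params - alpha * (precond @ grads)         # preconditioned update

The property mirrored from BFGS is that curvature information enters only through a sum of rank-one terms, which is what keeps the learned optimizer applicable to problems of varying dimensionality.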
We present BlazePose GHUM Holistic, a lightweight neural network pipeline for 3D human body landmark and pose estimation, tailored specifically for real-time inference. BlazePose GHUM Holistic enables motion capture from a single RGB image, with applications including avatar control, fitness tracking, and AR/VR effects. Our main contributions include i) a novel method for 3D ground-truth data acquisition, ii) updated 3D body tracking with additional hand landmarks, and iii) full-body pose estimation from monocular images.
Many hand-held or mixed-reality devices are used with a single sensor for 3D reconstruction, even though they often comprise multiple sensors. Multi-sensor depth fusion can substantially improve the robustness and accuracy of 3D reconstruction methods, but existing techniques are not robust enough to handle sensors with different value ranges and diverse noise and outlier statistics. To this end, we introduce SenFuNet, a depth fusion approach that learns sensor-specific noise and outlier statistics and combines the data streams of depth frames in an online fashion. Our method fuses multi-sensor depth streams regardless of time synchronization and calibration, and it generalizes well from little training data. We conduct experiments with various sensor combinations on real-world datasets, including Scene3D, as well as on the Replica dataset. The experiments show that our fusion strategy outperforms traditional and state-of-the-art online depth fusion approaches. Moreover, the combination of multiple sensors yields more robust outlier handling and more precise surface reconstruction than the use of a single sensor. The source code and data are available at https://github.com/tfy14esa/senfunet.
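As a deliberately simplified sketch of the fusion principle (two already-aligned depth frames, invented layer names; the actual method handles unsynchronized, uncalibrated streams and fuses in scene space), per-sensor encoders capture noise and outlier behaviour and a shared head predicts per-pixel blend weights:

    import torch
    import torch.nn as nn

    class LearnedDepthFusion(nn.Module):
        # Per-sensor encoders + a shared head that predicts per-pixel blend weights.
        def __init__(self, hidden=16):
            super().__init__()
            self.enc_a = nn.Conv2d(1, hidden, 3, padding=1)   # e.g. time-of-flight stream
            self.enc_b = nn.Conv2d(1, hidden, 3, padding=1)   # e.g. stereo stream
            self.weight_head = nn.Sequential(
                nn.Conv2d(2 * hidden, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, 1, 1), nn.Sigmoid())

        def forward(self, depth_a, depth_b):
            # depth_a, depth_b: (B, 1, H, W) depth maps from the two sensors
            feats = torch.cat([torch.relu(self.enc_a(depth_a)),
                               torch.relu(self.enc_b(depth_b))], dim=1)
            w = self.weight_head(feats)                   # (B, 1, H, W), values in (0, 1)
            fused = w * depth_a + (1.0 - w) * depth_b     # outlier-aware blend
            return fused, w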
We study lifelong visual perception in an embodied setting, where we develop new models and compare various agents that navigate inside buildings and occasionally request annotations which, in turn, are used to refine their visual perception capabilities. The goal of the agents is to recognize objects and other semantic classes in the whole building at the end of a process that combines exploration and active visual learning. Since we study this task in a lifelong learning context, the agents should use knowledge gained in earlier visited environments to guide their exploration and active learning strategy in buildings visited later on. We use semantic segmentation performance as a proxy for general visual perception and study this novel task for several exploration and annotation methods, ranging from frontier-exploration baselines that use heuristic active learning to a fully learnable approach. For the latter, we introduce a deep reinforcement learning (RL) based agent that jointly learns both navigation and active learning. A point-goal navigation formulation, coupled with a global planner that supplies goals, is integrated into the RL model to provide further incentives for the systematic exploration of novel scenes. Through extensive experiments on the Matterport3D dataset, we show how the proposed agents exploit knowledge from previously explored scenes when exploring new ones, e.g. through less granular exploration and less frequent requests for annotations. The results also show that a learning-based agent is able to use its prior visual knowledge more effectively than heuristic alternatives.
Advances in the state of the art for 3D human sensing are currently limited by the lack of visual datasets with 3D ground truth that include multiple people, in motion, operating in real-world environments, with complex illumination or occlusion, and possibly observed by a moving camera. Sophisticated scene understanding requires estimating human pose and shape as well as gestures, towards representations that ultimately combine useful metric and behavioral signals with free-viewpoint photo-realistic visualization. To sustain progress, we build a large-scale photo-realistic dataset, Human-SPACE (HSPACE), of animated humans placed in complex synthetic indoor and outdoor environments. We combine a hundred diverse subjects of varying age, gender, proportions, and ethnicity with hundreds of motions and scenes, as well as parametric variations in body shape (for a total of 1,600 different humans), to generate an initial dataset of over one million frames. Human animations are obtained by fitting an expressive human body model to single scans of people, followed by novel re-targeting and positioning procedures that support the realistic animation of dressed humans, statistical variation of body proportions, and jointly consistent scene placement of multiple moving people. Assets are generated automatically, at scale, and are compatible with existing real-time rendering and game engines. The dataset, together with an evaluation server, will be made available for research. Our large-scale analysis of the impact of synthetic data, in connection with real data and weak supervision, underlines the considerable potential for continued quality improvements and for narrowing the gap to real data in this practical setting, in conjunction with increased model capacity.
We present neural radiance fields for the rendering and temporal (4D) reconstruction of humans in motion, as captured by sparse sets of cameras or even from monocular video. Our approach combines ideas from neural scene representation, novel view synthesis, and implicit statistical geometric human representations, coupled using novel loss functions. Instead of learning a radiance field with a uniform occupancy prior, we constrain it by a structured implicit human body model represented using signed distance functions. This allows us to robustly fuse information from sparse views and to generalize well beyond the poses or views observed in training. Moreover, we apply geometric constraints to co-learn the structure of the observed subject -- including both body and clothing -- and to regularize the radiance field towards geometrically plausible solutions. Extensive experiments on multiple datasets demonstrate the robustness and accuracy of our approach, its generalization capabilities significantly beyond a small set of training poses and views, and its statistical extrapolation beyond observed shapes.
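The choice of a signed distance representation over a free-form occupancy is typically realized by mapping signed distance to volume-rendering density; the following is a minimal sketch of the standard Laplace-CDF mapping and the associated compositing weights (illustrative of the general technique, not necessarily the exact transform used in this work):

    import torch

    def sdf_to_density(sdf, beta=0.01):
        # Laplace-CDF mapping: near-zero density outside the surface (sdf > 0),
        # density approaching 1/beta deep inside (sdf < 0).
        alpha = 1.0 / beta
        return alpha * torch.where(sdf >= 0,
                                   0.5 * torch.exp(-sdf / beta),
                                   1.0 - 0.5 * torch.exp(sdf / beta))

    def render_weights(sdf_samples, deltas, beta=0.01):
        # Standard volumetric compositing weights along one ray.
        sigma = sdf_to_density(sdf_samples, beta)           # (S,) densities
        opacity = 1.0 - torch.exp(-sigma * deltas)          # per-segment opacity
        trans = torch.cumprod(
            torch.cat([torch.ones_like(opacity[:1]), 1.0 - opacity + 1e-10]), dim=0)[:-1]
        return opacity * trans                              # contribution of each sample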
This paper presents a real-time online vision framework that jointly recovers the 3D structure and the semantic labels of an indoor scene. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed deep neural network based approach learns to fuse frames, together with suitable semantic labels, in scene space. Our approach exploits a joint volumetric representation of depth and semantics in the scene feature space to solve this task. For a compelling online fusion of semantic labels and geometry in real time, we introduce an efficient vortex pooling block while removing the routing network used in online depth fusion, so as to preserve high-frequency surface details. We show that the context information provided by the semantics of the scene helps the depth fusion network learn noise-resilient features. Beyond that, it helps overcome the shortcomings of current online depth fusion methods in handling thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our method can perform depth fusion at 37 and 10 frames per second, with average reconstruction F-scores of 88% and 91% respectively, depending on the depth map resolution. In addition, our model achieves an average IoU score of 0.515 on the ScanNet 3D semantic benchmark leaderboard.
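To ground the notion of a joint volumetric representation, here is a heavily simplified classical baseline in NumPy (running weighted averages of TSDF values and semantic probabilities per voxel); the paper replaces this hand-crafted update with a learned fusion network and a vortex pooling block, which are not reproduced here:

    import numpy as np

    class JointVolume:
        # Running weighted-average fusion of TSDF values and semantic
        # probabilities into one shared voxel grid (classical baseline only).
        def __init__(self, resolution=128, num_classes=20):
            self.tsdf = np.ones((resolution,) * 3, dtype=np.float32)
            self.sem = np.zeros((resolution,) * 3 + (num_classes,), dtype=np.float32)
            self.weight = np.zeros((resolution,) * 3, dtype=np.float32)

        def integrate(self, voxel_idx, tsdf_obs, sem_obs, obs_weight=1.0):
            # voxel_idx: (N, 3) integer voxel indices touched by the current frame
            # tsdf_obs:  (N,) truncated signed distances observed for those voxels
            # sem_obs:   (N, num_classes) per-voxel class probabilities from 2D labels
            i, j, k = voxel_idx.T
            w_old = self.weight[i, j, k]
            w_new = w_old + obs_weight
            self.tsdf[i, j, k] = (w_old * self.tsdf[i, j, k]
                                  + obs_weight * tsdf_obs) / w_new
            self.sem[i, j, k] = (w_old[:, None] * self.sem[i, j, k]
                                 + obs_weight * sem_obs) / w_new[:, None]
            self.weight[i, j, k] = w_new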