Recent progress in geometric computer vision has shown significant advances in reconstruction and novel view rendering from multiple views by capturing the scene as a neural radiance field. Such approaches have changed the paradigm of reconstruction but need a plethora of views and do not make use of object shape priors. On the other hand, deep learning has shown how to use priors in order to infer shape from single images. Such approaches, though, require that the object is reconstructed in a canonical pose or assume that object pose is known during training. In this paper, we address the problem of how to compute equivariant priors for reconstruction from a few images, given the relative poses of the cameras. Our proposed reconstruction is $SE(3)$-gauge equivariant, meaning that it is equivariant to the choice of world frame. To achieve this, we make two novel contributions to light field processing: we define light field convolution and we show how it can be approximated by intra-view $SE(2)$ convolutions because the original light field convolution is computationally and memory-wise intractable; we design a map from the light field to $\mathbb{R}^3$ that is equivariant to the transformation of the world frame and to the rotation of the views. We demonstrate equivariance by obtaining robust results in roto-translated datasets without performing transformation augmentation.
translated by 谷歌翻译
Recent methods for neural surface representation and rendering, for example NeuS, have demonstrated remarkably high-quality reconstruction of static scenes. However, the training of NeuS takes an extremely long time (8 hours), which makes it almost impossible to apply them to dynamic scenes with thousands of frames. We propose a fast neural surface reconstruction approach, called NeuS2, which achieves two orders of magnitude improvement in terms of acceleration without compromising reconstruction quality. To accelerate the training process, we integrate multi-resolution hash encodings into a neural surface representation and implement our whole algorithm in CUDA. We also present a lightweight calculation of second-order derivatives tailored to our networks (i.e., ReLU-based MLPs), which achieves a factor two speed up. To further stabilize training, a progressive learning strategy is proposed to optimize multi-resolution hash encodings from coarse to fine. In addition, we extend our method for reconstructing dynamic scenes with an incremental training strategy. Our experiments on various datasets demonstrate that NeuS2 significantly outperforms the state-of-the-arts in both surface reconstruction accuracy and training speed. The video is available at https://vcai.mpi-inf.mpg.de/projects/NeuS2/ .
translated by 谷歌翻译
We propose a novel method for 3D shape completion from a partial observation of a point cloud. Existing methods either operate on a global latent code, which limits the expressiveness of their model, or autoregressively estimate the local features, which is highly computationally extensive. Instead, our method estimates the entire local feature field by a single feedforward network by formulating this problem as a tensor completion problem on the feature volume of the object. Due to the redundancy of local feature volumes, this tensor completion problem can be further reduced to estimating the canonical factors of the feature volume. A hierarchical variational autoencoder (VAE) with tiny MLPs is used to probabilistically estimate the canonical factors of the complete feature volume. The effectiveness of the proposed method is validated by comparing it with the state-of-the-art method quantitatively and qualitatively. Further ablation studies also show the need to adopt a hierarchical architecture to capture the multimodal distribution of possible shapes.
translated by 谷歌翻译
The ability to capture detailed interactions among individuals in a social group is foundational to our study of animal behavior and neuroscience. Recent advances in deep learning and computer vision are driving rapid progress in methods that can record the actions and interactions of multiple individuals simultaneously. Many social species, such as birds, however, live deeply embedded in a three-dimensional world. This world introduces additional perceptual challenges such as occlusions, orientation-dependent appearance, large variation in apparent size, and poor sensor coverage for 3D reconstruction, that are not encountered by applications studying animals that move and interact only on 2D planes. Here we introduce a system for studying the behavioral dynamics of a group of songbirds as they move throughout a 3D aviary. We study the complexities that arise when tracking a group of closely interacting animals in three dimensions and introduce a novel dataset for evaluating multi-view trackers. Finally, we analyze captured ethogram data and demonstrate that social context affects the distribution of sequential interactions between birds in the aviary.
translated by 谷歌翻译
我们在从傅立叶角度得出的同质空间上引入了一个统一的框架。我们解决了卷积层之前和之后的特征场的情况。我们通过利用提起的特征场的傅立叶系数的稀疏性来提出通过傅立叶域的统一推导。当同质空间的稳定子亚组是一个紧凑的谎言组时,稀疏性就会出现。我们进一步通过元素定位元素非线性引入了一种激活方法,并通过均等卷积抬起并投射回现场。我们表明,其他将特征视为稳定器亚组中傅立叶系数的方法是我们激活的特殊情况。$ SO(3)$和$ SE(3)$进行的实验显示了球形矢量场回归,点云分类和分子完成中的最新性能。
translated by 谷歌翻译
适应数据分布的结构(例如对称性和转型Imarerces)是机器学习中的重要挑战。通过架构设计或通过增强数据集,可以内在学习过程中内置Inhormces。两者都需要先验的了解对称性的确切性质。缺乏这种知识,从业者求助于昂贵且耗时的调整。为了解决这个问题,我们提出了一种新的方法来学习增强变换的分布,以新的\ emph {转换风险最小化}(trm)框架。除了预测模型之外,我们还优化了从假说空间中选择的转换。作为算法框架,我们的TRM方法是(1)有效(共同学习增强和模型,以\ emph {单训练环}),(2)模块化(使用\ emph {任何训练算法),以及(3)一般(处理\ \ ich {离散和连续}增强)。理论上与标准风险最小化的TRM比较,并在其泛化误差上给出PAC-Bayes上限。我们建议通过块组成的新参数化优化富裕的增强空间,导致新的\ EMPH {随机成分增强学习}(SCALE)算法。我们在CIFAR10 / 100,SVHN上使用先前的方法(快速自身自动化和武术器)进行实际比较规模。此外,我们表明规模可以在数据分布中正确地学习某些对称性(恢复旋转Mnist上的旋转),并且还可以改善学习模型的校准。
translated by 谷歌翻译
加强学习是机器人获得从经验中获得技能的强大框架,但通常需要大量的在线数据收集。结果,很难收集机器人概括所需的足够多样化的经验。另一方面,人类的视频是一种易于获得的广泛和有趣的经历来源。在本文中,我们考虑问题:我们可以直接进行强化学习,以便在人类收集的经验吗?这种问题特别困难,因为这种视频没有用动作注释并相对于机器人的实施例展示了大量的视觉畴偏移。为了解决这些挑战,我们提出了一种与视频(RLV)的强化学习框架。 RLV使用人类收集的经验结合机器人收集的数据来了解策略和价值函数。在我们的实验中,我们发现RLV能够利用此类视频来学习基于视觉的愿景技能,以不到一半的样本作为从头开始学习的RL方法。
translated by 谷歌翻译
Model-based human pose estimation is currently approached through two different paradigms. Optimizationbased methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate imagemodel alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins. The project website with videos, results, and code can be found at https://seas.upenn.edu/ ˜nkolot/projects/spin.
translated by 谷歌翻译
This paper addresses the problem of 3D human pose and shape estimation from a single image. Previous approaches consider a parametric model of the human body, SMPL, and attempt to regress the model parameters that give rise to a mesh consistent with image evidence. This parameter regression has been a very challenging task, with modelbased approaches underperforming compared to nonparametric solutions in terms of pose estimation. In our work, we propose to relax this heavy reliance on the model's parameter space. We still retain the topology of the SMPL template mesh, but instead of predicting model parameters, we directly regress the 3D location of the mesh vertices. This is a heavy task for a typical network, but our key insight is that the regression becomes significantly easier using a Graph-CNN. This architecture allows us to explicitly encode the template mesh structure within the network and leverage the spatial locality the mesh has to offer. Image-based features are attached to the mesh vertices and the Graph-CNN is responsible to process them on the mesh structure, while the regression target for each vertex is its 3D location. Having recovered the complete 3D geometry of the mesh, if we still require a specific model parametrization, this can be reliably regressed from the vertices locations. We demonstrate the flexibility and the effectiveness of our proposed graphbased mesh regression by attaching different types of features on the mesh vertices. In all cases, we outperform the comparable baselines relying on model parameter regression, while we also achieve state-of-the-art results among model-based pose estimation approaches. 1
translated by 谷歌翻译
This work addresses the problem of estimating the full body 3D human pose and shape from a single color image. This is a task where iterative optimization-based solutions have typically prevailed, while Convolutional Networks (ConvNets) have suffered because of the lack of training data and their low resolution 3D predictions. Our work aims to bridge this gap and proposes an efficient and effective direct prediction method based on ConvNets. Central part to our approach is the incorporation of a parametric statistical body shape model (SMPL) within our end-to-end framework. This allows us to get very detailed 3D mesh results, while requiring estimation only of a small number of parameters, making it friendly for direct network prediction. Interestingly, we demonstrate that these parameters can be predicted reliably only from 2D keypoints and masks. These are typical outputs of generic 2D human analysis ConvNets, allowing us to relax the massive requirement that images with 3D shape ground truth are available for training. Simultaneously, by maintaining differentiability, at training time we generate the 3D mesh from the estimated parameters and optimize explicitly for the surface using a 3D per-vertex loss. Finally, a differentiable renderer is employed to project the 3D mesh to the image, which enables further refinement of the network, by optimizing for the consistency of the projection with 2D annotations (i.e., 2D keypoints or masks). The proposed approach outperforms previous baselines on this task and offers an attractive solution for direct prediction of 3D shape from a single color image.
translated by 谷歌翻译