Real-world robotic grasping can be performed robustly if complete 3D Point Cloud Data (PCD) of an object is available. In practice, however, PCDs are often incomplete because objects are viewed from only a few sparse viewpoints before grasping, which leads to wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from a partial PCD in order to produce reliable grasp poses. Our PCD completion network is a Transformer-based encoder-decoder with an Offset-Attention layer. The network is inherently invariant to object pose and point permutations, and it generates completed PCDs that are geometrically consistent. Experiments on a wide range of partial PCDs show that 3DSGrasp outperforms the best state-of-the-art method on the PCD completion task and largely improves the grasping success rate in real-world scenarios. The code and dataset will be made available upon acceptance.
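Where the abstract mentions the Offset-Attention layer, the idea (popularized by the Point Cloud Transformer) is to pass the offset between the input point features and their attended version through a small feed-forward block before the residual connection. A minimal PyTorch-style sketch with hypothetical layer sizes, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetAttention(nn.Module):
    """Offset-attention block in the style of PCT: the residual branch is the
    offset (x - attention(x)) passed through a small feed-forward layer."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.q = nn.Linear(dim, dim // 4, bias=False)
        self.k = nn.Linear(dim, dim // 4, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.ReLU())

    def forward(self, x):                                   # x: (B, N, dim), one feature per point
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(1, 2), dim=-1)     # (B, N, N)
        attn = attn / (1e-9 + attn.sum(dim=1, keepdim=True))  # PCT-style re-normalization
        attended = attn @ v                                  # (B, N, dim)
        offset = x - attended                                # the "offset" part
        B, N, C = offset.shape
        out = self.ff(offset.reshape(B * N, C)).reshape(B, N, C)
        return x + out                                       # residual connection
```

Because attention treats the points as a set, such a block is equivariant to point permutations, which is consistent with the invariance claim above.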
Pre-trained language models achieve superior performance, but they are computationally expensive due to their large size. Techniques such as pruning and knowledge distillation (KD) have been developed to reduce their size and latency. In most structural pruning methods, the pruning units, such as attention heads and feed-forward hidden dimensions, only span a small model structure space and limit the structures that the pruning algorithm can explore. In this work, we propose Gradient-based Intra-attention pruning (GRAIN), which inspects fine intra-attention structures, and allows different heads to have different sizes. Intra-attention pruning greatly expands the searching space of model structures and yields highly heterogeneous structures. We further propose structure regularization to encourage generating more regular structures, which achieves higher speedups than heterogeneous ones. We also integrate KD into the pruning process with a gradient separation strategy to reduce the interference of KD with the pruning process. GRAIN is evaluated on a variety of tasks. Results show that it notably outperforms other methods at the same or similar model size. Even under extreme compression where only $3\%$ weights in transformers remain, the pruned model is still competitive.
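Gradient-based intra-attention pruning scores structures finer than whole heads; one common way to obtain such scores is a first-order Taylor (weight-times-gradient) saliency per head dimension, as in the hedged sketch below (names and grouping are illustrative, not the GRAIN code):

```python
import torch

def intra_attention_importance(weight: torch.Tensor, num_heads: int) -> torch.Tensor:
    """First-order Taylor importance of each per-head hidden dimension.

    weight: a query/key/value projection matrix of shape (hidden, hidden)
            whose .grad has been populated by backprop on a calibration batch.
    Returns a (num_heads, head_dim) score tensor: pruning the lowest-scoring
    entries removes individual dimensions inside a head, so different heads
    can end up with different sizes.
    """
    hidden = weight.shape[0]
    head_dim = hidden // num_heads
    # |w * dL/dw| approximates the loss change caused by zeroing a weight.
    saliency = (weight * weight.grad).abs()
    # Sum saliency over the input dimension, then group output rows by head.
    per_row = saliency.sum(dim=1)               # (hidden,)
    return per_row.view(num_heads, head_dim)    # (num_heads, head_dim)
```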
Recent methods for neural surface representation and rendering, for example NeuS, have demonstrated remarkably high-quality reconstruction of static scenes. However, training NeuS takes an extremely long time (8 hours), which makes it almost impossible to apply to dynamic scenes with thousands of frames. We propose a fast neural surface reconstruction approach, called NeuS2, which achieves a two-orders-of-magnitude speedup without compromising reconstruction quality. To accelerate training, we integrate multi-resolution hash encodings into a neural surface representation and implement the whole algorithm in CUDA. We also present a lightweight calculation of second-order derivatives tailored to our networks (i.e., ReLU-based MLPs), which achieves a factor-of-two speedup. To further stabilize training, we propose a progressive learning strategy that optimizes the multi-resolution hash encodings from coarse to fine. In addition, we extend our method to reconstruct dynamic scenes with an incremental training strategy. Experiments on various datasets demonstrate that NeuS2 significantly outperforms the state of the art in both surface reconstruction accuracy and training speed. The video is available at https://vcai.mpi-inf.mpg.de/projects/NeuS2/ .
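The coarse-to-fine progressive learning over multi-resolution hash encodings can be pictured as a schedule that activates finer hash levels as training proceeds. A hedged sketch with illustrative level counts, not the CUDA implementation used in the paper:

```python
import torch

def progressive_level_mask(step: int, total_steps: int,
                           num_levels: int = 16, start_levels: int = 4) -> torch.Tensor:
    """Coarse-to-fine schedule for a multi-resolution hash encoding.

    Early in training only the coarsest `start_levels` levels are active; finer
    levels are switched on linearly as training progresses. The returned mask
    is multiplied onto the per-level feature slices of the encoding output.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    active = start_levels + int(progress * (num_levels - start_levels))
    mask = torch.zeros(num_levels)
    mask[:active] = 1.0
    return mask
```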
Frame Semantic Role Labeling (FSRL) identifies arguments and labels them with frame semantic roles defined in FrameNet. Previous research tends to divide FSRL into argument identification and role classification. Such methods usually model role classification as naive multi-class classification and treat arguments individually, which neglects label semantics and interactions between arguments, hindering the performance and generalization of models. In this paper, we propose a query-based framework named ArGument Extractor with Definitions in FrameNet (AGED) to mitigate these problems. Definitions of frames and frame elements (FEs) in FrameNet can be used to query arguments in text. Encoding text-definition pairs can guide models to learn label semantics and strengthen argument interactions. Experiments show that AGED outperforms the previous state of the art by up to 1.3 F1-score on two FrameNet datasets and demonstrate its generalization power in zero-shot and few-shot scenarios. Our code and technical appendix are available at https://github.com/PKUnlp-icler/AGED.
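One way to picture the query-based formulation is as extractive span prediction where the frame-element definition acts as the query over the sentence. The sketch below uses a generic Hugging Face span extractor whose QA head is untrained here, so it is purely illustrative of the text-definition pairing, not the released AGED model; the sentence and FE definition are made up:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-cased")  # QA head is randomly initialized

sentence = "The committee awarded the prize to the young novelist."
fe_definition = "Recipient: the person to whom the Theme is transferred."  # FrameNet-style FE definition

# Encode the (definition, sentence) pair; the definition plays the role of a query.
inputs = tokenizer(fe_definition, sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Highest-scoring start/end positions give the predicted argument span
# (a trained model would also constrain end >= start).
start = outputs.start_logits.argmax(-1).item()
end = outputs.end_logits.argmax(-1).item()
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))
```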
Depth estimation is usually ill-posed and ambiguous for monocular camera-based 3D multi-person pose estimation. Since LiDAR can capture accurate depth information in long-range scenes, it can benefit both the global localization of individuals and the 3D pose estimation by providing rich geometry features. Motivated by this, we propose a monocular camera and single LiDAR-based method for 3D multi-person pose estimation in large-scale scenes, which is easy to deploy and insensitive to lighting. Specifically, we design an effective fusion strategy to take advantage of the multi-modal input data, including images and point clouds, and make full use of temporal information to guide the network toward learning natural and coherent human motions. Without relying on any 3D pose annotations, our method exploits the inherent geometric constraints of point clouds for self-supervision and utilizes 2D keypoints on images for weak supervision. Extensive experiments on public datasets and our newly collected dataset demonstrate the superiority and generalization capability of our proposed method.
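A typical building block of such camera-LiDAR fusion is projecting the point cloud into the image plane so that per-point image features can be gathered; a hedged sketch with hypothetical calibration matrices:

```python
import numpy as np

def project_points(points_lidar: np.ndarray, T_cam_lidar: np.ndarray, K: np.ndarray):
    """Project Nx3 LiDAR points into pixel coordinates.

    T_cam_lidar: 4x4 extrinsic transform from the LiDAR frame to the camera frame.
    K:           3x3 camera intrinsic matrix.
    Returns (N, 2) pixel coordinates and a boolean mask of points in front of the camera.
    """
    pts_h = np.hstack([points_lidar, np.ones((points_lidar.shape[0], 1))])  # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 0
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]          # perspective division
    return uv, in_front
```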
Pre-trained Language Models (PLMs) have become representative foundation models in the natural language processing field. Most PLMs are trained with linguistically agnostic pre-training tasks on the surface form of the text, such as masked language modeling (MLM). To further empower PLMs with richer linguistic features, in this paper we propose a simple but effective way to learn linguistic features for pre-trained language models. We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original MLM pre-training task, using a linguistically-informed pre-training (LIP) strategy. We carried out extensive experiments on ten Chinese NLU tasks, and the results show that LERT brings significant improvements over various comparable baselines. Furthermore, we conduct analytical experiments on various linguistic aspects, and the results confirm that the design of LERT is valid and effective. Resources are available at https://github.com/ymcui/LERT
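The linguistically-informed pre-training can be viewed as a multi-task objective that adds token-level classification losses for the extra linguistic features to the MLM loss; the feature types (POS, NER, dependency relations) and loss weights below are illustrative assumptions, not the exact LERT recipe:

```python
import torch
import torch.nn as nn

class LinguisticHeads(nn.Module):
    """Token-level classification heads for auxiliary linguistic features,
    trained jointly with MLM on top of a shared Transformer encoder."""

    def __init__(self, hidden: int, n_pos: int, n_ner: int, n_dep: int):
        super().__init__()
        self.pos = nn.Linear(hidden, n_pos)
        self.ner = nn.Linear(hidden, n_ner)
        self.dep = nn.Linear(hidden, n_dep)
        self.ce = nn.CrossEntropyLoss(ignore_index=-100)   # -100 marks unlabeled tokens

    def forward(self, hidden_states, pos_labels, ner_labels, dep_labels,
                mlm_loss, weights=(1.0, 1.0, 1.0)):
        # hidden_states: (B, T, hidden) from the shared encoder.
        losses = [
            self.ce(self.pos(hidden_states).flatten(0, 1), pos_labels.flatten()),
            self.ce(self.ner(hidden_states).flatten(0, 1), ner_labels.flatten()),
            self.ce(self.dep(hidden_states).flatten(0, 1), dep_labels.flatten()),
        ]
        # Total pre-training objective: MLM plus weighted linguistic losses.
        return mlm_loss + sum(w * l for w, l in zip(weights, losses))
```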
Multi-view representation learning has developed rapidly over the past decades and has been applied in many fields. However, most previous works assume that each view is complete and aligned, which leads to an inevitable deterioration in performance when encountering practical problems such as missing or unaligned views. To address the challenge of representation learning on partially aligned multi-view data, we propose a new cross-view graph contrastive learning framework that integrates multi-view information to align data and learn latent representations. Compared with current approaches, the proposed method has the following merits: (1) it is an end-to-end framework that simultaneously performs view-specific representation learning via view-specific autoencoders and cluster-level data alignment by combining multi-view information with cross-view graph contrastive learning; (2) owing to the way the cross-view graph contrastive learning is devised, it is easy to apply our model to explore information from three or more modalities/sources. Extensive experiments conducted on several real datasets demonstrate the effectiveness of the proposed method on clustering and classification tasks.
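At the core of cross-view contrastive learning is pulling together the latent codes of the same sample seen from different views while pushing apart the rest; a minimal InfoNCE-style sketch, simplified relative to the paper's graph-based formulation:

```python
import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5):
    """InfoNCE-style loss between two views.

    z1, z2: (N, d) latent codes of the same N samples produced by the
    view-specific autoencoders; row i of z1 and row i of z2 are positives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                   # (N, N) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetric loss: view1 -> view2 and view2 -> view1.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```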
Recently, a large amount of work has been devoted to the study of Markov chain stochastic gradient methods (MC-SGMs), which mainly focuses on their convergence analysis for solving minimization problems. In this paper, we provide a comprehensive generalization analysis of MC-SGMs through the lens of algorithmic stability in the framework of statistical learning theory. For empirical risk minimization (ERM) problems, we establish optimal population risk bounds for both the smooth and non-smooth cases by introducing on-average argument stability. For minimax problems, we establish a quantitative connection between on-average argument stability and generalization error, which extends the existing results for uniform stability \cite{lei2021Staritibal}. We further develop the first nearly optimal convergence rates for convex-concave problems, both in expectation and with high probability, which, combined with our stability results, show that optimal generalization bounds can be attained in both the smooth and non-smooth cases. To the best of our knowledge, this is the first generalization analysis of SGMs when the gradients are sampled from a Markov process.
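The object of study, a stochastic gradient method whose sampled indices follow a Markov chain rather than being drawn i.i.d., can be written in a few lines; the index chain below (stay at the current index or jump uniformly) is purely illustrative:

```python
import numpy as np

def markov_chain_sgd(grad, n, dim, steps=1000, eta=0.01, p_stay=0.5, rng=None):
    """SGD where the index of the sampled example follows a Markov chain.

    grad(w, i): gradient of the loss on example i at parameters w.
    The chain either stays at the current index or jumps uniformly, so
    consecutive stochastic gradients are correlated, unlike i.i.d. sampling.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    w = np.zeros(dim)
    i = rng.integers(n)
    for _ in range(steps):
        if rng.random() > p_stay:        # transition of the index chain
            i = rng.integers(n)
        w = w - eta * grad(w, i)         # standard SGD update
    return w
```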
In this paper, by introducing a low-noise condition, we study the privacy and utility (generalization) performance of differentially private stochastic gradient descent (SGD) algorithms in the setting of stochastic convex optimization (SCO). For pointwise learning, we establish excess risk bounds of order $\mathcal{O}\big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ and $\mathcal{O}\big(n^{-\frac{1+\alpha}{2}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for the $(\epsilon,\delta)$-differentially private SGD algorithm for strongly smooth and $\alpha$-Hölder smooth losses, respectively, where $n$ is the sample size and $d$ is the dimension. For pairwise learning, inspired by \cite{lei2020sharper,lei2021Generalization}, we propose a simple gradient-perturbation-based private SGD algorithm that satisfies $(\epsilon,\delta)$-differential privacy and develop a novel utility analysis for the proposed algorithm. In particular, we prove that our algorithm can achieve excess risk rates of $\mathcal{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ with gradient complexity $\mathcal{O}(n)$ and $\mathcal{O}\big(n^{\frac{2-\alpha}{1+\alpha}}+n\big)$ for strongly smooth and $\alpha$-Hölder smooth losses, respectively. Furthermore, faster learning rates are established in the low-noise setting for both smooth and non-smooth losses. To the best of our knowledge, this is the first utility analysis that provides excess population risk bounds better than $\mathcal{O}\big(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}\big)$ for privacy-preserving pairwise learning.
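The gradient-perturbation mechanism behind such private SGD algorithms is the standard clip-and-add-Gaussian-noise step; the sketch below shows one update with illustrative constants, leaving aside the actual $(\epsilon,\delta)$ noise calibration used in the paper:

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, eta, clip=1.0, sigma=1.0, rng=None):
    """One SGD step with gradient perturbation for differential privacy.

    per_example_grads: (m, d) array of per-example gradients at the current w.
    Each gradient is clipped to L2 norm `clip`; Gaussian noise proportional to
    sigma * clip is added to the averaged gradient. sigma must be calibrated to
    the target (epsilon, delta) and the number of iterations (not done here).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
    noisy = clipped.mean(axis=0) + rng.normal(0.0, sigma * clip / len(clipped), size=w.shape)
    return w - eta * noisy
```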
We propose a new method for learning a generalizable and animatable neural human representation from sparse multi-view images of multiple persons. The learned representation can be used to synthesize novel view images of an arbitrary person from a sparse set of cameras, and further animate them with the user's pose control. While existing methods can either generalize to new persons or synthesize animations with user control, none of them can achieve both at the same time. We attribute this accomplishment to the use of a 3D proxy for a shared multi-person human model, and to warping the spaces of different poses into a shared canonical pose space, in which we learn a neural field and predict person- and pose-dependent deformations, as well as appearance, from the features extracted from the input images. To cope with the complexity of the large variations in body shapes, poses, and clothing deformations, we design our neural human model with disentangled geometry and appearance. Furthermore, we utilize image features at both the spatial points and the surface points of the 3D proxy to predict person- and pose-dependent properties. Experiments show that our method significantly outperforms the state-of-the-art methods on both tasks. The video and code are available at https://talegqz.github.io/neural_novel_actor .