The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D data. In this paper, we curate a large-scale and complementary dataset extending both the aforementioned ones by associating all objects mentioned in a referential sentence to their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA in both the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we experiment with competitive baselines and recent methods for the task of language generation and show that, as with neural listeners, 3D neural speakers can also noticeably benefit by training with ScanEnts3D, including improving the SoTA by 13.2 CIDEr points on the Nr3D benchmark. Overall, our carefully conducted experimental studies strongly support the conclusion that, by learning on ScanEnts3D, commonly used visio-linguistic 3D architectures can become more efficient and interpretable in their generalization without needing to provide these newly collected annotations at test time. The project's webpage is https://scanents3d.github.io/ .
translated by 谷歌翻译
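Creating and editing the shape and color of 3D objects requires tremendous human effort and expertise. Compared with direct manipulation in 3D interfaces, 2D interactions such as sketches and scribbles are usually more natural and intuitive for users. In this paper, we propose a generic multi-modal generative model that couples 2D modalities and implicit 3D representations through shared latent spaces. With the proposed model, versatile 3D generation and manipulation are enabled by simply propagating an edit from a specific 2D controlling modality through the latent spaces. Examples include editing a 3D shape by drawing a sketch, re-coloring a 3D surface by painting color scribbles on its 2D rendering, or generating 3D shapes of a given category from one or a few reference images. Unlike prior work, our model does not require re-training or fine-tuning for each editing task. It is also conceptually simple, easy to implement, robust to input domain shifts, and able to produce diverse reconstructions from partial 2D inputs. We evaluate our framework on two representative 2D modalities, grayscale line sketches and rendered color images, and demonstrate that our method enables a variety of shape manipulation and generation tasks with these 2D modalities.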
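Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, owing to their promising results and their support for cross-modal synthesis. A key desideratum in conditional synthesis is high correspondence between the conditioning input and the generated output. Most existing methods learn this relationship implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route: we strengthen the input-output connection by maximizing their mutual information using contrastive learning. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to incorporate it effectively into the denoising process. We formulate CDCD by connecting it with the conventional variational objectives. We demonstrate the efficacy of our approach in evaluations on three diverse, multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we achieve state-of-the-art or higher synthesis quality and improve the input-output correspondence. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and thereby significantly increasing inference speed.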
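The contrastive idea in the abstract above (pulling a conditioning input toward its matching output and away from mismatched ones) can be illustrated with a generic InfoNCE loss. This is only a minimal sketch under stated assumptions: it is not the CDCD loss from the paper, and the function name `info_nce_loss`, the cosine similarity, and the `temperature` value are all hypothetical choices made for illustration.

```python
import math


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor embedding.

    Illustrative only: a standard contrastive objective, NOT the paper's
    CDCD loss. All embeddings are plain lists of floats.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Index 0 holds the positive pair; the rest are negative pairs.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))

    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Hypothetical 2-D embeddings: a condition and candidate outputs.
condition = [1.0, 0.0]
matched = info_nce_loss(condition, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
mismatched = info_nce_loss(condition, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

In this toy setup the loss is small when the positive output aligns with the condition and large when a negative aligns better, which is the behavior a contrastive term is meant to reward.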
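We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates complex musical samples conditioned on dance videos. Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input. Most existing conditional music generation works produce specific types of mono-instrumental sound using symbolic audio representations (e.g., MIDI) and usually rely on pre-defined music synthesizers. In contrast, in this work we generate music in complex styles (e.g., pop, breaking) by using a vector-quantized (VQ) audio representation, which leverages the high abstraction capacity of its symbolic and continuous counterparts. By performing extensive experiments on multiple datasets and following a comprehensive evaluation protocol, we assess the generative quality of our proposal against alternatives. The quantitative results, which measure music consistency, beat correspondence, and music diversity, demonstrate the effectiveness of our proposed method. Last but not least, we curate a challenging dance-music dataset of in-the-wild TikTok videos, which we use to further demonstrate the efficacy of our approach in real-world settings, and which we hope will serve as a starting point for relevant future research.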
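The recent explosion of research on Neural Radiance Fields (NeRF) shows the encouraging potential of representing complex scenes with neural networks. One major drawback of NeRF is its inference time: rendering a single pixel requires querying the NeRF network hundreds of times. To address this, existing efforts mainly attempt to reduce the number of required sample points. However, the problem of iterative sampling remains. On the other hand, the Neural Light Field (NeLF) offers a more direct representation than NeRF for novel view synthesis: rendering a pixel amounts to a single forward pass, with no ray marching. In this work, we present a deep residual MLP network (88 layers) to learn the light field effectively. We show that the key to successfully learning such a deep NeLF network is to have sufficient data, for which we transfer knowledge from a pre-trained NeRF model via data distillation. Extensive experiments on both synthetic and real-world scenes show the merits of our method over other counterpart algorithms. On synthetic scenes, we achieve a 26-35x reduction in FLOPs (per camera ray) and a 28-31x runtime speedup, while delivering better rendering quality than NeRF (1.4-2.8 dB average PSNR improvement), without any customized parallelism requirement.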
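We present a novel method for acquiring object representations from online image collections, capturing high-quality geometry and material properties of arbitrary objects from photographs with varying cameras, illumination, and backgrounds. This enables various object-centric rendering applications, such as novel-view synthesis, relighting, and harmonized background composition, from challenging in-the-wild input. Using a multi-stage approach that extends neural radiance fields, we first infer the surface geometry and refine the coarsely estimated initial camera parameters, while leveraging coarse foreground object masks to improve training efficiency and geometry quality. We also introduce a robust normal estimation technique that eliminates the effect of geometric noise while retaining important details. Finally, we extract surface material properties and ambient illumination, represented in spherical harmonics, with extensions that handle transient elements such as sharp shadows. The combination of these components results in a highly modular and efficient object acquisition framework. Extensive evaluations and comparisons demonstrate the advantages of our approach in capturing high-quality geometry and appearance properties useful for rendering applications.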
While the capabilities of autonomous systems have been steadily improving in recent years, these systems still struggle to rapidly explore previously unknown environments without the aid of GPS-assisted navigation. The DARPA Subterranean (SubT) Challenge aimed to fast-track the development of autonomous exploration systems by evaluating their performance in real-world underground search-and-rescue scenarios. Subterranean environments present a plethora of challenges for robotic systems, such as limited communications, complex topology, visually degraded sensing, and harsh terrain. The presented solution enables long-term autonomy with minimal human supervision by combining a powerful and independent single-agent autonomy stack with higher-level mission management operating over a flexible mesh network. The autonomy suite deployed on quadruped and wheeled robots was fully independent, freeing the human supervisor to loosely supervise the mission and make high-impact strategic decisions. We also discuss lessons learned from fielding our system at the SubT Final Event, relating to vehicle versatility, system adaptability, and re-configurable communications.
Language models have become increasingly popular in recent years for tasks like information retrieval. As use cases become oriented toward specific domains, fine-tuning has become the default way to reach standard performance. Fine-tuning these models for specific tasks and datasets requires careful tuning of the model's hyperparameters and training techniques. In this paper, we present an in-depth analysis of the performance of four transformer-based language models on the task of biomedical information retrieval. The models we consider are DeepMind's RETRO (7B parameters), GPT-J (6B parameters), GPT-3 (175B parameters), and BLOOM (176B parameters). We compare their performance on the basis of relevance, accuracy, and interpretability, using a large corpus of 480,000 research papers on protein structure/function prediction as our dataset. Our findings suggest that smaller models, with <10B parameters and fine-tuned on domain-specific datasets, tend to outperform larger language models on highly specific questions in terms of accuracy, relevance, and interpretability by a significant margin (+50% on average). However, larger models do provide generally better results on broader prompts.
Recent methods demonstrate that data augmentation using counterfactual knowledge can teach models the causal structure of a task, leading to robust and generalizable models. However, such counterfactual data often has a limited scale and diversity if crowdsourced and is computationally expensive to extend to new perturbation types if generated using supervised methods. To address this, we introduce a new framework called DISCO for automatically generating high-quality counterfactual data at scale. DISCO engineers prompts to generate phrasal perturbations with a large general language model. Then, a task-specific teacher model filters the generation to distill high-quality counterfactual data. We show that learning with this counterfactual data yields a comparatively small student model that is 6% (absolute) more robust and generalizes 5% better across distributions than baselines on various challenging evaluations. This model is also 15% more sensitive in differentiating original and counterfactual examples, on three evaluation sets written by human workers and via human-AI collaboration.
Multi-document summarization (MDS) has traditionally been studied assuming a set of ground-truth topic-related input documents is provided. In practice, the input document set is unlikely to be available a priori and would need to be retrieved based on an information need, a setting we call open-domain MDS. We experiment with current state-of-the-art retrieval and summarization models on several popular MDS datasets extended to the open-domain setting. We find that existing summarizers suffer large reductions in performance when applied as-is to this more realistic task, though training summarizers with retrieved inputs can reduce their sensitivity to retrieval errors. To further probe these findings, we conduct perturbation experiments on summarizer inputs to study the impact of different types of document retrieval errors. Based on our results, we provide practical guidelines to help facilitate a shift to open-domain MDS. We release our code and experimental results alongside all data or model artifacts created during our investigation.