Agents that can follow language instructions are expected to be useful in a variety of situations such as navigation. However, training neural network-based agents requires numerous paired trajectories and language instructions. This paper proposes using multimodal generative models for semi-supervised learning in instruction-following tasks. The models learn a shared representation of the paired data and enable semi-supervised learning by reconstructing unpaired data through that representation. Key challenges in applying the models to sequence-to-sequence tasks, including instruction following, are learning a shared representation of variable-length multimodal data and incorporating attention mechanisms. To address these problems, this paper proposes a novel network architecture that absorbs the difference in sequence lengths of the multimodal data. In addition, to further improve performance, this paper shows how to combine the generative model-based approach with an existing semi-supervised method called the speaker-follower model, and proposes a regularization term that improves inference using unpaired trajectories. Experiments on the BabyAI and Room-to-Room (R2R) environments show that the proposed method improves instruction-following performance by leveraging unpaired data, and improves the performance of the speaker-follower model by 2% to 4% in R2R.
Translated by Google Translate
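Multimodal learning is a framework for building models that make predictions based on different types of modalities. Important challenges in multimodal learning are inferring shared representations from arbitrary modalities and generating across modalities via these representations; achieving this, however, requires taking the heterogeneous nature of multimodal data into account. In recent years, deep generative models, i.e., generative models parameterized by deep neural networks, have attracted much attention. Variational autoencoders in particular are well suited to the challenges above because they can account for heterogeneity and infer good representations of data. Consequently, various multimodal generative models based on variational autoencoders, called multimodal deep generative models, have been proposed in recent years. In this paper, we provide a categorized survey of studies on multimodal deep generative models.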
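Vision-and-Language Navigation (VLN) is a task in which an agent follows language instructions to navigate to a goal position, relying on continuous interaction with the environment while moving. Recent Transformer-based VLN methods have made great progress, benefiting from the direct connections between visual observations and language instructions provided by the multimodal cross-attention mechanism. However, these methods usually represent temporal context as a fixed-length vector, either by using an LSTM decoder or by using manually designed hidden states to build a recurrent Transformer. Considering that a single fixed-length vector is often insufficient to capture long-term temporal context, in this paper we introduce the Multimodal Transformer with Variable-length Memory (MTVM), which models temporal context explicitly. Specifically, MTVM enables the agent to keep track of the navigation trajectory by directly storing previous activations in a memory bank. To further boost performance, we propose a memory-aware consistency loss that helps learn a better joint representation of temporal context with randomly masked instructions. We evaluate MTVM on the popular R2R and CVDN datasets; our model improves success rate by 2% on the R2R unseen validation and test sets, and reduces Goal Process by 1.6 m on the CVDN test set.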
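The speaker-follower model has proven effective in vision-and-language navigation, where a speaker model is used to synthesize new instructions that augment the training data for a follower navigation model. However, in many previous methods, the generated instructions are not directly trained to optimize the performance of the follower. In this paper, we present FOAM, a FOllower-Aware speaker Model that is constantly updated given the follower's feedback, so that the generated instructions are better suited to the follower's current learning state. Specifically, we optimize the speaker using a bi-level optimization framework and obtain its training signals by evaluating the follower on labeled data. Experimental results on the Room-to-Room and Room-across-Room datasets demonstrate that our method can outperform strong baseline models across settings. Analyses also reveal that our generated instructions are of higher quality than those of the baselines.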
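Vision-and-language navigation requires an agent to follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it challenging for the agent to generalize well. Previous studies propose data augmentation methods that mitigate data bias explicitly or implicitly and provide improvements in generalization. However, they try to memorize augmented trajectories and ignore the distribution shift under unseen environments at test time. In this paper, we propose Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency. Specifically, we devise: 1) a semi-supervised framework, DAVIS, that leverages visual consistency signals across similar semantic observations; 2) a two-stage learning procedure that encourages adaptation to the test-time distribution. The framework enhances the basic mixture of imitation and reinforcement learning with Momentum Contrast to encourage stable decision-making on similar observations in both the joint training stage and the test-time adaptation stage. Extensive experiments show that DAVIS achieves model-agnostic improvements over previous state-of-the-art VLN baselines on the R2R and RxR benchmarks. Our source code and data are in the supplementary materials.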
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
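Deep generative models with latent variables have recently been used to learn joint representations and generative processes from multimodal data. However, these two learning mechanisms can conflict with each other, and the representations can fail to embed information about the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but some modalities and labels required for downstream tasks are missing. We show that, in this scenario, the variational lower bound limits the mutual information between joint representations and missing modalities. To counteract these problems, we introduce a novel conditional multimodal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function that maximizes the mutual information between joint representations and missing modalities. Extensive experiments demonstrate the benefits of our proposed model; empirical results show that it achieves state-of-the-art results on representative problems such as downstream classification, acoustic inversion, and annotation generation.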
Making sense of multiple modalities can yield a more comprehensive description of real-world phenomena. However, learning the co-representation of diverse modalities remains a long-standing challenge in emerging machine learning applications and research. Previous generative approaches for multimodal input approximate the joint-modality posterior by combining uni-modality posteriors as a product-of-experts (PoE) or mixture-of-experts (MoE). We argue that these approximations lead to a defective bound for the optimization process and a loss of semantic connection among modalities. This paper presents a novel variational method on sets called the Set Multimodal VAE (SMVAE) for learning a multimodal latent space while handling the missing modality problem. By modeling the joint-modality posterior distribution directly, the proposed SMVAE learns to exchange information between multiple modalities and compensate for the drawbacks caused by factorization. On public datasets from various domains, the experimental results demonstrate that the proposed method is applicable to order-agnostic cross-modal generation while achieving outstanding performance compared to state-of-the-art multimodal methods. The source code for our method is available online at https://anonymous.4open.science/r/SMVAE-9B3C/.
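As a rough illustration of the product-of-experts (PoE) and mixture-of-experts (MoE) approximations that the abstract above contrasts with SMVAE, the sketch below fuses one-dimensional Gaussian unimodal posteriors. It is not the SMVAE method. The function names are hypothetical, the experts are assumed to be scalar Gaussians, and no prior expert is included.

```python
def product_of_gaussian_experts(means, variances):
    """Fuse 1-D Gaussian experts N(mean_i, var_i) by multiplying densities.

    The product is (up to normalisation) Gaussian: its precision is the sum
    of the experts' precisions and its mean is the precision-weighted mean.
    """
    if not means or len(means) != len(variances):
        raise ValueError("need one variance per mean, and at least one expert")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision


def mixture_of_gaussian_experts_moments(means, variances):
    """Mean and variance of an equally weighted mixture of the same experts."""
    if not means or len(means) != len(variances):
        raise ValueError("need one variance per mean, and at least one expert")
    n = len(means)
    mix_mean = sum(means) / n
    second_moment = sum(v + m * m for m, v in zip(means, variances)) / n
    return mix_mean, second_moment - mix_mean ** 2


if __name__ == "__main__":
    # Two modalities that disagree about the latent value.
    print(product_of_gaussian_experts([0.0, 2.0], [1.0, 1.0]))
    print(mixture_of_gaussian_experts_moments([0.0, 2.0], [1.0, 1.0]))
```

In this toy setting a missing modality is handled by leaving its expert out of the lists. The PoE result is sharper than either expert, while the MoE moments are broader, which is one way to picture the trade-off between the two factorized approximations.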
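In principle, applying variational autoencoders (VAEs) to sequential data offers a method for controlled sequence generation, manipulation, and structured representation learning. However, training sequence VAEs is challenging: autoregressive decoders can often explain the data without using the latent space, a failure known as posterior collapse. To mitigate this, state-of-the-art models weaken the powerful decoder by applying uniformly random dropout to the decoder input. We show theoretically that this removes pointwise mutual information provided by the decoder input, which is compensated for by utilizing the latent space. We then propose an adversarial training strategy to achieve information-based stochastic dropout. Compared with uniform dropout on standard text benchmark datasets, our targeted approach increases both sequence modeling performance and the information captured in the latent space.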
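The uniform dropout baseline mentioned in the abstract above can be pictured with a minimal sketch: each decoder input token is independently replaced by an unknown-token placeholder with a fixed probability. This is only the uniform baseline, not the adversarial information-based strategy the paper proposes. The function name, the `<unk>` symbol, and the token-list interface are assumptions made for illustration.

```python
import random


def word_dropout(tokens, rate, unk="<unk>", rng=None):
    """Replace each decoder input token with `unk` with probability `rate`.

    Hypothetical helper: a real model would apply this to the shifted
    decoder inputs during training only, never to the reconstruction targets.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    rng = rng or random.Random()
    return [unk if rng.random() < rate else tok for tok in tokens]


if __name__ == "__main__":
    sentence = ["<bos>", "walk", "past", "the", "sofa", "and", "stop"]
    print(word_dropout(sentence, rate=0.3, rng=random.Random(0)))
```

With `rate=0.0` the decoder sees the full history, and with `rate=1.0` it must rely entirely on the latent code, which is the mechanism by which dropout pushes information into the latent space.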
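Variational autoencoder-based voice conversion (VAE-VC) has the advantage of requiring only utterances and speaker labels for training. Unlike the majority of research on VAE-VC, which focuses on utilizing auxiliary losses or discrete variables, this paper investigates how increasing model expressiveness benefits and affects VAE-VC. Specifically, we first analyze VAE-VC from a rate-distortion perspective and point out that model expressiveness is significant for VAE-VC because rate and distortion reflect the similarity and naturalness of converted speech. Based on this analysis, we propose a novel VC method using a deep hierarchical VAE, which has high model expressiveness as well as fast conversion speed thanks to its non-autoregressive decoder. Our analysis also reveals another problem: similarity can degrade when the latent variable of the VAE contains redundant information. We address this problem by controlling the information contained in the latent variable using the $\beta$-VAE objective. In experiments using the VCTK corpus, the proposed method achieved mean opinion scores higher than 3.5 on both naturalness and similarity in inter-gender settings, which are higher than the scores of existing autoencoder-based VC methods.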
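To make the rate-distortion reading of the $\beta$-VAE objective concrete, the sketch below computes distortion plus $\beta$ times rate, where the rate is the KL divergence from a diagonal Gaussian posterior to a standard normal prior. It assumes a single flat latent vector and a distortion value computed elsewhere; the paper's hierarchical model is not reproduced here, and the function names are hypothetical.

```python
import math


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats (closed form)."""
    return 0.5 * sum(
        m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def beta_vae_loss(distortion, mu, log_var, beta=1.0):
    """Distortion (negative reconstruction log-likelihood) plus beta * rate."""
    rate = gaussian_kl_to_standard_normal(mu, log_var)
    return distortion + beta * rate


if __name__ == "__main__":
    # A larger beta penalises information stored in the latent more heavily.
    for beta in (0.5, 1.0, 4.0):
        print(beta, beta_vae_loss(3.0, mu=[1.0, 0.0], log_var=[0.0, 0.0], beta=beta))
```

Setting `beta=1.0` recovers the usual negative evidence lower bound, while larger values trade reconstruction quality for a lower rate.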
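We propose to improve the cross-target and cross-scene generalization of visual navigation by learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the agent's current observations and the target view. Our generative model is learned by optimizing a variational objective that encompasses two key designs. First, the latent distribution is conditioned on the current observations and the target view, leading to model-based, target-driven navigation. Second, the latent space is modeled with a mixture of Gaussians conditioned on the current observation and the next best action. Our use of a mixture-of-posteriors prior effectively alleviates the problem of an over-regularized latent space, thus significantly improving model generalization to new targets and novel scenes. Moreover, NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms state-of-the-art models in terms of success rate, data efficiency, and generalization.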
The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.
The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to apply transformers to reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL with transformers (transformer-based RL, or TRL) in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation, and autonomous driving. For architecture enhancement, these methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework; they model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". For trajectory optimization, these methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework; they are able to extract policies from static datasets and make full use of the transformer's long-sequence modeling capability. Given these advancements, extensions and challenges in TRL are reviewed and proposals for future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
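Vision-and-Language Navigation (VLN) is a task in which an agent navigates an embodied indoor environment under human instructions. Previous works ignore the distribution of sample difficulty, and we argue that this potentially degrades their agents' performance. To tackle this issue, we propose a novel curriculum-based training paradigm for VLN tasks that balances human prior knowledge with the agent's learning progress on training samples. We develop principles of curriculum design and rearrange the benchmark Room-to-Room (R2R) dataset to make it suitable for curriculum training. Experiments show that our method is model-agnostic and can significantly improve the performance, generalizability, and training efficiency of current state-of-the-art navigation agents without increasing model complexity.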
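Learning concise data representations without supervisory signals is a fundamental challenge in machine learning. A prominent approach to this goal is likelihood-based models such as variational autoencoders (VAEs), which learn latent representations based on a meta-prior, a general premise assumed to be beneficial for downstream tasks (e.g., disentanglement). However, such approaches often deviate from the original likelihood architecture in order to apply the introduced meta-prior, causing undesirable changes in their training. In this paper, we propose a novel representation learning method, Gromov-Wasserstein Autoencoders (GWAE), which directly matches the latent and data distributions. Instead of a likelihood-based objective, GWAE models are trained by minimizing the Gromov-Wasserstein (GW) metric. The GW metric measures the structure-oriented discrepancy between distributions supported on incomparable spaces, e.g., spaces with different dimensionalities. By restricting the family of the trainable prior, we can introduce meta-priors to control latent representations for downstream tasks. Empirical comparison with existing VAE-based methods shows that GWAE models can learn representations based on meta-priors by changing the prior family without further modifying the GW objective.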
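Understanding spatial and visual information is essential for a navigation agent that follows natural language instructions. Current Transformer-based VLN agents entangle orientation and vision information, which limits the gain from learning each information source. In this paper, we design a neural agent with explicit orientation and vision modules. These modules learn to ground the spatial information and landmark mentions in the instructions to the visual environment. To strengthen the agent's spatial reasoning and visual perception, we design specific pre-training tasks to feed and better utilize the corresponding modules in our final navigation model. We evaluate our approach on the Room2Room (R2R) and Room4Room (R4R) datasets and achieve state-of-the-art results on both benchmarks.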
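Vision-and-language navigation (VLN) is a challenging task in the field of artificial intelligence. Although massive progress has been made on this task over the past few years thanks to breakthroughs in deep vision and language models, it remains difficult to build VLN models that generalize as well as humans. In this paper, we provide a new perspective for improving VLN models. Based on our discovery that snapshots of the same VLN model behave significantly differently even when their success rates are roughly the same, we propose a snapshot-based ensemble solution that leverages the predictions of multiple snapshots. Built on snapshots of the existing state-of-the-art (SOTA) model $\circlearrowright$BERT and our past-action-aware modification, our proposed ensemble achieves new SOTA performance in Navigation Error (NE) and Success weighted by Path Length (SPL).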
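For readers unfamiliar with the SPL metric named in the abstract above, here is a minimal sketch of its standard definition: each successful episode contributes the ratio of the shortest-path length to the longer of the shortest-path length and the path actually taken, and failed episodes contribute zero. The function name and list-based interface are hypothetical, and shortest-path lengths are assumed to be strictly positive.

```python
def success_weighted_by_path_length(successes, shortest_lengths, path_lengths):
    """Compute SPL = mean_i( S_i * l_i / max(p_i, l_i) ).

    successes        -- iterable of booleans, S_i (episode reached the goal)
    shortest_lengths -- shortest-path distance l_i from start to goal (> 0)
    path_lengths     -- length p_i of the path the agent actually took
    """
    successes = list(successes)
    if not successes:
        raise ValueError("need at least one episode")
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if shortest <= 0:
            raise ValueError("shortest-path lengths must be positive")
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


if __name__ == "__main__":
    # One optimal success, one success with a 2x detour, one failure.
    print(success_weighted_by_path_length(
        [True, True, False], [10.0, 10.0, 10.0], [10.0, 20.0, 5.0]))
```

The example illustrates why SPL is stricter than plain success rate: two of three episodes succeed, but the detour halves the credit for one of them.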
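Equipping robots with the ability to infer human intent is a vital precondition for effective collaboration. Most computational approaches to this goal employ probabilistic reasoning to recover a distribution of "intent" conditioned on the robot's perceived state. However, these approaches typically assume that task-specific notions of human intent (e.g., labeled goals) are known a priori. To overcome this limitation, we propose the Disentangled Sequence Clustering Variational Autoencoder (DiSCVAE), a clustering framework that can be used to learn such a distribution of intent in an unsupervised manner. DiSCVAE leverages recent advances in unsupervised learning to derive a disentangled latent representation of sequential data, separating time-varying local features from time-invariant global aspects. Unlike previous disentanglement frameworks, however, the proposed variant also infers a discrete variable to form a latent mixture model and enable clustering of global sequence concepts, e.g., intentions from observed human behavior. To evaluate DiSCVAE, we first validate its capacity to discover classes from unlabeled sequences using video datasets of bouncing digits and 2D animations. We then report results from a real-world human-robot interaction experiment conducted on a robotic wheelchair. Our findings offer insights into how the inferred discrete variable coincides with human intent and thus serves to improve assistance in collaborative settings, such as shared control.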
Normalizing flows, autoregressive models, variational autoencoders (VAEs), and deep energy-based models are among competing likelihood-based frameworks for deep generative learning. Among them, VAEs have the advantage of fast and tractable sampling and easy-to-access encoding networks. However, they are currently outperformed by other models such as normalizing flows and autoregressive models. While the majority of the research in VAEs is focused on the statistical challenges, we explore the orthogonal direction of carefully designing neural architectures for hierarchical VAEs. We propose Nouveau VAE (NVAE), a deep hierarchical VAE built for image generation using depth-wise separable convolutions and batch normalization. NVAE is equipped with a residual parameterization of Normal distributions and its training is stabilized by spectral regularization. We show that NVAE achieves state-of-the-art results among non-autoregressive likelihood-based models on the MNIST, CIFAR-10, CelebA 64, and CelebA HQ datasets and it provides a strong baseline on FFHQ. For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ as shown in Fig. 1. To the best of our knowledge, NVAE is the first successful VAE applied to natural images as large as 256×256 pixels. The source code is available at https://github.com/NVlabs/NVAE. 34th Conference on Neural Information Processing Systems (NeurIPS 2020).
Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised-Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of "posterior collapse" (where the latents are ignored when they are paired with a powerful autoregressive decoder), typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
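The vector quantisation step that gives VQ-VAE its discrete codes can be sketched as a nearest-neighbour lookup: the encoder output is replaced by the closest entry of a codebook, and the index of that entry is the discrete code. The sketch below covers only this forward lookup. The straight-through gradient, the codebook and commitment losses, and the learnt prior are omitted, and the function name and plain-list representation are assumptions made for illustration.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Nearest is measured by squared Euclidean distance; ties go to the
    lowest index.
    """
    if not codebook:
        raise ValueError("codebook must not be empty")

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(
        range(len(codebook)),
        key=lambda k: squared_distance(vector, codebook[k]),
    )
    return index, codebook[index]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
    for encoder_output in ([0.9, 1.2], [3.0, 3.0]):
        print(encoder_output, "->", quantize(encoder_output, codebook))
```

Because the decoder only ever sees codebook entries, the latent code cannot be ignored in the way a continuous latent can be, which is the intuition behind the abstract's claim about avoiding posterior collapse.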