We present a simple approach that turns a ViT encoder into an efficient video model which works seamlessly with both image and video inputs. By sparsely sampling the inputs, the model can be trained and run at inference time on images and videos alike. The model scales easily and can be adapted to large-scale pre-trained ViTs without requiring full fine-tuning. It achieves state-of-the-art results, and the code will be open-sourced.
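The abstract does not spell out the sampling scheme, but the core idea, sampling the input sparsely so that images and videos map to the same kind of token sequence a ViT already consumes, can be illustrated with a small sketch. This is a minimal NumPy illustration under assumed patch sizes and strides, not the paper's implementation:

```python
import numpy as np

def sparse_video_tokens(video, patch=16, t_stride=8, s_stride=32):
    """Sparsely sample spatio-temporal patches from a video.

    video: array of shape (T, H, W, C). A single image can be passed as
    (1, H, W, C), so images and videos share the same tokenization path.
    Returns a (num_tokens, patch*patch*C) array of flattened patches.
    """
    T, H, W, C = video.shape
    tokens = []
    for t in range(0, T, t_stride):                    # sparse in time
        for y in range(0, H - patch + 1, s_stride):    # sparse in space
            for x in range(0, W - patch + 1, s_stride):
                tokens.append(video[t, y:y + patch, x:x + patch, :].reshape(-1))
    return np.stack(tokens)

# The same function covers both input types:
image = np.random.rand(1, 224, 224, 3)   # an image is a one-frame video
clip = np.random.rand(32, 224, 224, 3)   # a short video clip
print(sparse_video_tokens(image).shape)  # (49, 768): image-sized token count
print(sparse_video_tokens(clip).shape)   # (196, 768): still a modest count
```

Treating an image as a one-frame video is what lets a single encoder handle both input types in this sketch.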
Effective scaling and a flexible task interface enable large language models to excel at many tasks. PaLI generates text based on visual and textual input, and uses this interface to perform many vision, language, and multimodal tasks, in many languages. To train PaLI, we make use of large pre-trained encoder-decoder language models and Vision Transformers (ViTs). This allows us to capitalize on their existing capabilities and leverage the substantial cost of training them. We find that joint scaling of the vision and language components is important. Since existing language Transformers are much larger than their vision counterparts, we train the largest ViT to date (ViT-e) to quantify the benefits from even larger-capacity vision models. To train PaLI, we create a large multilingual mix of pre-training tasks based on a new image-text training set containing 10B images and texts in over 100 languages. PaLI achieves state-of-the-art results on multiple vision and language tasks (such as captioning, visual question answering, and scene-text understanding), while retaining a simple, modular, and scalable design.
We present a pre-training approach for vision and language transformer models that is based on a mixture of diverse tasks. We explore the use of image-text captioning data in pre-training, which does not require additional supervision, as well as object-aware strategies for pre-training the model. We evaluate the method on a number of text-generative vision+language tasks, such as visual question answering, visual entailment, and captioning, and demonstrate large gains over standard pre-training methods.
Video question answering is a challenging task that requires joint understanding of the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video. In this paper, we propose a novel multi-stream video encoder for video question answering that uses multiple video inputs and a new video-text iterative co-tokenization approach to answer a variety of questions related to videos. We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA, outperforming the previous state-of-the-art by large margins. At the same time, our model reduces the required GFLOPs from 150-360 to only 67, yielding a highly efficient video question answering model.
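The abstract names the mechanism (iterative video-text co-tokenization) without detailing it; the toy sketch below shows one plausible reading, pooling a large set of video features into a handful of text-conditioned tokens and refining them over a few iterations. The shapes, the random stand-in for learned queries, and the function names are assumptions for illustration, not the authors' architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def co_tokenize(video_feats, text_feats, num_tokens=8, num_iters=2, rng=None):
    """Toy text-conditioned token pooling, iterated a few times.

    video_feats: (N, D) features pooled from one or more video streams.
    text_feats:  (M, D) features of the question tokens.
    Returns (num_tokens, D) fused tokens.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    D = video_feats.shape[1]
    # Hypothetical learned query vectors; random here for illustration.
    queries = rng.standard_normal((num_tokens, D))
    text_ctx = text_feats.mean(axis=0)                        # crude text summary
    tokens = None
    for _ in range(num_iters):
        cond = queries + text_ctx                             # condition queries on text
        attn = softmax(cond @ video_feats.T / np.sqrt(D))     # (num_tokens, N)
        tokens = attn @ video_feats                           # pool a few video tokens
        queries = tokens                                      # refine queries next round
    return tokens

fused = co_tokenize(np.random.rand(512, 64), np.random.rand(12, 64))
print(fused.shape)  # (8, 64): far fewer tokens than the 512 video features
```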
Stacking increases storage efficiency on shelves, but the lack of visibility and accessibility makes the mechanical search problem of revealing and extracting target objects difficult for robots. In this paper, we extend the lateral-access mechanical search problem to shelves with stacked items and introduce two novel policies that use destacking and recovery actions: Distribution Area Reduction for Stacked Scenes (DARSS) and Monte Carlo Tree Search for Stacked Scenes (MCTSSS). MCTSSS improves on prior lookahead policies by considering future states after each potential action. Experiments in 1200 simulated trials and 18 physical trials with a robot equipped with a blade and a suction cup show that destacking and recovery actions can reveal the target object with a success rate of 82-100% in simulation and 66-100% in physical experiments, and are critical for searching densely packed shelves. In the simulation experiments, both policies outperform a baseline and achieve similar success rates, but take more steps than an oracle policy with full state information. In simulation and physical experiments, DARSS outperforms MCTSSS in the median number of steps to reveal the target, but MCTSSS has a higher success rate in the physical experiments, suggesting robustness to perception noise. See https://sites.google.com/berkeley.edu/stax-ray for supplementary material.
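Neither DARSS nor MCTSSS is specified in the abstract; as rough intuition for the lookahead idea (considering future states after each candidate action), here is a generic Monte-Carlo rollout sketch over a toy shelf model. The state representation, transition model, and action names are all hypothetical:

```python
import random

# Toy shelf state: a list of stacks, each a list of object ids; 'T' is the target.
ACTIONS = ["destack", "recover", "suction"]

def simulate(state, action, rng):
    """Hypothetical transition model: returns (next_state, target_revealed)."""
    state = [list(s) for s in state]
    occupied = [s for s in state if s]
    if occupied:
        stack = rng.choice(occupied)
        if action in ("destack", "suction"):
            stack.pop()          # remove the top object; 'recover' is a no-op here
    return state, any(s and s[-1] == "T" for s in state)

def rollout_value(state, action, rng, depth=3, n=20):
    """Average reward of n random rollouts after taking `action` (lookahead)."""
    total = 0.0
    for _ in range(n):
        s, revealed = simulate(state, action, rng)
        d = 0
        while not revealed and d < depth:
            s, revealed = simulate(s, rng.choice(ACTIONS), rng)
            d += 1
        total += 1.0 if revealed else 0.0
    return total / n

rng = random.Random(0)
shelf = [["A", "B"], ["T", "C"], ["D"]]   # target 'T' buried under 'C'
best = max(ACTIONS, key=lambda a: rollout_value(shelf, a, rng))
print("best first action:", best)
```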
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning. In contrast to previous works that use contrastive or generative captioning training, we propose a novel and simple recipe to pre-train a vision-language joint model, which is itself multi-task. The pre-training uses only noisy image captioning data and is formulated to use the entire architecture end-to-end, with both a strong language encoder and decoder. Our results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results across a variety of question answering tasks. Our multi-task mixture training learns from tasks of various question intents and thus generalizes better, including on zero-shot vision-language tasks. We conduct experiments in the challenging multi-task and open-vocabulary settings and across a variety of datasets and tasks, such as VQA2.0, SNLI-VE, NLVR2, and GQA. We observe that the proposed approach is able to generalize to unseen tasks and that more diverse mixtures lead to higher accuracy in both known and novel tasks.
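The multi-task mixture training is only described at a high level; a minimal sketch of weighted task sampling, assuming hypothetical task names, mixture weights, and stub datasets, might look like this:

```python
import random

# Hypothetical task mixture; datasets are stubbed as lists of (image_id, text) pairs.
TASKS = {
    "vqa":        [("img0", "what color is the car?")],
    "entailment": [("img1", "the man is riding a horse")],
    "reasoning":  [("img2", "is the left object larger than the right one?")],
}
MIXTURE_WEIGHTS = {"vqa": 0.5, "entailment": 0.25, "reasoning": 0.25}

def sample_batch(batch_size, rng):
    """Draw a mixed batch; each example carries its task name so a single
    text-generating model can be trained on all tasks at once."""
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[n] for n in names]
    batch = []
    for _ in range(batch_size):
        task = rng.choices(names, weights=weights, k=1)[0]
        image, text = rng.choice(TASKS[task])
        batch.append({"task": task, "image": image, "input_text": text})
    return batch

print(sample_batch(4, random.Random(0)))
```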
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks, including referring expression comprehension, text-based localization, and object detection. Key to our architecture is an efficient multi-scale fusion module that unifies the disparate localization requirements across tasks. In addition, we discover that a standard object detector is surprisingly effective in unifying these tasks without the need for task-specific design, losses, or pre-computed detections. Our end-to-end trainable framework responds flexibly and accurately to a wide range of referring expression, localization, or detection queries for zero, one, or multiple objects. Jointly trained on these tasks, FindIt outperforms the state of the art on referring expression comprehension and text-based localization, and shows competitive performance on object detection. Finally, FindIt generalizes better to out-of-distribution data and novel categories compared with strong single-task baselines. All of this is accomplished by a single, unified, and efficient model. The code will be released.
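The abstract attributes the unification to a multi-scale fusion module without describing it; the sketch below is a deliberately crude stand-in that conditions each feature level on a pooled text embedding via an element-wise product. The shapes and the fusion operation are assumptions, not FindIt's actual module:

```python
import numpy as np

def fuse_multiscale(feature_maps, text_embedding):
    """Toy multi-scale fusion: broadcast a text embedding over each feature
    level and mix it in, producing text-conditioned maps that a standard
    detector head could consume.

    feature_maps: list of (H_i, W_i, D) arrays at different scales.
    text_embedding: (D,) pooled embedding of the query text.
    """
    fused = []
    for fmap in feature_maps:
        # Element-wise product as a cheap placeholder for learned fusion.
        fused.append(fmap * text_embedding[None, None, :])
    return fused

levels = [np.random.rand(64, 64, 32), np.random.rand(32, 32, 32), np.random.rand(16, 16, 32)]
query = np.random.rand(32)
print([f.shape for f in fuse_multiscale(levels, query)])
```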
In this paper, we introduce a novel visual representation learning approach that relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks. Instead of relying on hand-designed splitting strategies to obtain visual tokens and processing a large number of densely sampled patches for attention, our approach learns to mine important tokens in visual data. This results in efficiently and effectively finding a few important visual tokens, and enables modeling of pairwise attention between such tokens, over a longer temporal horizon for videos, or over the spatial content in images. Our experiments demonstrate strong performance on several challenging benchmarks for both image and video recognition tasks. Importantly, because our tokens are adaptive, we accomplish competitive results at significantly reduced computational cost. We obtain results comparable to the state-of-the-art on ImageNet while being computationally more efficient, and we establish new state-of-the-art results on multiple video datasets, including Kinetics-400, Kinetics-600, Charades, and AViD. Code is available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_learner
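The adaptive tokenization idea, learning a few spatial attention maps that each pool the feature map into one token, can be sketched compactly. The sketch below uses NumPy with a random stand-in for the learned projection; it illustrates the general mechanism rather than reproducing the released TokenLearner code:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_learner(feature_map, weights):
    """Adaptive tokenization sketch: each learned spatial attention map
    pools the feature map into a single token.

    feature_map: (H, W, D) array.
    weights: (D, num_tokens) hypothetical learned projection.
    Returns (num_tokens, D).
    """
    H, W, D = feature_map.shape
    flat = feature_map.reshape(H * W, D)       # (HW, D)
    attn = softmax(flat @ weights, axis=0)     # (HW, num_tokens): one map per token
    return attn.T @ flat                       # (num_tokens, D)

rng = np.random.default_rng(0)
fmap = rng.standard_normal((14, 14, 64))
tokens = token_learner(fmap, rng.standard_normal((64, 8)))
print(tokens.shape)   # (8, 64): a few adaptive tokens instead of 196 patches
```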
We present a novel approach for unsupervised learning of depth and ego-motion from monocular video. Unsupervised learning removes the need for separate supervisory signals (depth or ego-motion ground truth, or multi-view video). Prior work in unsupervised depth learning uses pixel-wise or gradient-based losses, which only consider pixels in small local neighborhoods. Our main contribution is to explicitly consider the inferred 3D geometry of the whole scene and enforce consistency of the estimated 3D point clouds and ego-motion across consecutive frames. This is a challenging task and is solved by a novel (approximate) backpropagation algorithm for aligning 3D structures. We combine this novel 3D-based loss with 2D losses based on the photometric quality of frame reconstructions using estimated depth and ego-motion from adjacent frames. We also incorporate validity masks to avoid penalizing areas in which no useful information exists. We test our algorithm on the KITTI dataset and on a video dataset captured with an uncalibrated mobile phone camera. Our proposed approach consistently improves depth estimates on both datasets and outperforms the state-of-the-art for both depth and ego-motion. Because we only require simple video, learning depth and ego-motion on large and varied datasets becomes possible. We demonstrate this by training on the low-quality uncalibrated video dataset and evaluating on KITTI, ranking among the top-performing prior methods that are trained on KITTI itself.
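As a rough illustration of the two loss families described, a masked photometric reconstruction loss and a 3D consistency term between point clouds of consecutive frames, here is a simplified sketch. The nearest-neighbour distance below merely stands in for the paper's alignment-based backpropagation; all shapes and names are illustrative:

```python
import numpy as np

def masked_photometric_loss(frame, reconstruction, validity_mask):
    """L1 photometric loss, ignoring pixels flagged invalid (e.g. occlusions)."""
    diff = np.abs(frame - reconstruction) * validity_mask[..., None]
    return diff.sum() / max(validity_mask.sum(), 1)

def point_cloud_consistency(cloud_a, cloud_b):
    """Crude 3D consistency: mean nearest-neighbour distance between the
    point cloud of frame t (transformed by the estimated ego-motion) and
    the cloud of frame t+1. A placeholder for the alignment-based loss."""
    d = np.linalg.norm(cloud_a[:, None, :] - cloud_b[None, :, :], axis=-1)
    return d.min(axis=1).mean()

rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
recon = frame + 0.05 * rng.standard_normal((8, 8, 3))
mask = (rng.random((8, 8)) > 0.1).astype(float)
print(masked_photometric_loss(frame, recon, mask))
print(point_cloud_consistency(rng.random((50, 3)), rng.random((50, 3))))
```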