Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple yet effective strategy for open-vocabulary temporal action detection that utilizes pretrained image-text co-embeddings. Despite being trained on static images rather than videos, image-text co-embeddings enable open-vocabulary performance competitive with fully supervised models. We show that performance can be further improved by ensembling the image-text features with features encoding local motion, such as optical-flow-based features, or with other modalities, such as audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet dataset, in which the category splits are based on similarity rather than random assignment.
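A minimal sketch of the scoring idea, assuming frame and class-name embeddings come from some pretrained image-text model (CLIP-style); the function names and fusion weight are illustrative, not the paper's interface:

```python
import numpy as np

def openvocab_frame_scores(frame_embs: np.ndarray,
                           text_embs: np.ndarray) -> np.ndarray:
    """Cosine similarity between T frame embeddings and C class-name embeddings."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=-1, keepdims=True)
    return f @ t.T  # (T, C) per-frame, per-class scores

# Optional late fusion with a motion stream (e.g. optical-flow features),
# assuming `flow_scores` has the same (T, C) shape as the image scores.
def ensemble(image_scores: np.ndarray, flow_scores: np.ndarray, w: float = 0.5):
    return w * image_scores + (1.0 - w) * flow_scores
```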
Traditional automated metrics for evaluating conditional natural language generation use pairwise comparisons between a single generated text and the best-matching gold-standard ground-truth text. When multiple ground truths are available, scores are aggregated using an average or max operation across references. While this approach works well when diversity in the ground-truth data (i.e., dispersion of the distribution of conditional texts) can be attributed to noise, such as in automatic speech recognition, it does not allow for robust evaluation in cases where diversity in the ground truths represents signal for the model. In this work, we argue that existing metrics are not appropriate for domains such as visual description or summarization, where ground truths are semantically diverse and where the diversity of those captions captures useful additional information about the context. We propose a new paradigm for multi-candidate evaluation of conditional language generation models, along with a new family of metrics that compare the distributions of reference and model-generated caption sets using small sample sets of each. We demonstrate the utility of our approach with a case study in visual description: we show that existing models optimize for single-description quality over diversity, and gain insights into how sampling methods and temperature affect description quality and diversity.
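One plausible instantiation of such a set-to-set comparison (assumed here for illustration, not the paper's exact metric) is a maximum mean discrepancy between sentence-embedding sets:

```python
import numpy as np

def rbf_kernel(x: np.ndarray, y: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """RBF kernel matrix between row sets x (N, D) and y (M, D)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(ref_embs: np.ndarray, gen_embs: np.ndarray, gamma: float = 1.0) -> float:
    """Squared maximum mean discrepancy between reference and generated sets."""
    kxx = rbf_kernel(ref_embs, ref_embs, gamma).mean()
    kyy = rbf_kernel(gen_embs, gen_embs, gamma).mean()
    kxy = rbf_kernel(ref_embs, gen_embs, gamma).mean()
    return kxx + kyy - 2.0 * kxy
```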
Self-supervised learning has become increasingly important for leveraging the abundance of unlabeled data available on platforms like YouTube. Whereas most existing approaches learn low-level representations, we propose a joint visual-linguistic model to learn high-level features without any explicit supervision. In particular, inspired by its recent success in language modeling, we build upon the BERT model to learn bidirectional joint distributions over sequences of visual and linguistic tokens, derived from vector quantization of video data and off-the-shelf speech recognition outputs, respectively. We use VideoBERT in numerous tasks, including action classification and video captioning. We show that it can be applied directly to open-vocabulary classification, and confirm that large amounts of training data and cross-modal information are critical to performance. Furthermore, we outperform the state of the art on video captioning, and quantitative results verify that the model learns high-level semantic features.
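A hedged sketch of the visual-token step, assuming frame-level features are quantized against pretrained k-means centroids; all names are illustrative:

```python
import numpy as np

def quantize(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Assign each frame-level feature (T, D) to its nearest centroid (K, D)."""
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # (T,) discrete "visual word" ids

# The resulting token ids can be interleaved with ASR word tokens and fed
# to a standard BERT-style masked-language-modeling objective.
```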
Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on fewer than 50 pre-existing labeled images. Our framework avoids the practical drawbacks of prior work by unifying the field of GAN inversion with dataset generation. We generate datasets with rich pixel-wise labels in multiple challenging domains such as faces, cars, full-body human poses, and urban driving scenes. Our method achieves state-of-the-art performance in semantic segmentation, keypoint detection, and depth estimation compared to prior dataset generation approaches and transfer learning baselines. We additionally showcase its ability to address broad challenges in model development which stem from fixed, hand-annotated datasets, such as the long-tail problem in semantic segmentation.
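A rough sketch of the label-head idea under stated assumptions: per-pixel generator features for the inverted training images have already been extracted, and a simple classifier stands in for the paper's label head:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_label_head(feats: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """feats: (num_pixels, feat_dim) generator features; labels: (num_pixels,) ids."""
    head = LogisticRegression(max_iter=1000)
    head.fit(feats, labels)
    return head

# At generation time: sample a latent, run the generator, extract the same
# per-pixel features, and apply head.predict to label every new image.
```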
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing, and speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies in open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
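As a hedged illustration of how such a model can emit robot actions as discrete tokens (uniform per-dimension binning is a common recipe, assumed here rather than quoted from the paper):

```python
import numpy as np

def discretize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray,
                      bins: int = 256) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, bins)."""
    scaled = (action - low) / (high - low)       # normalize to [0, 1]
    return np.clip((scaled * bins).astype(int), 0, bins - 1)
```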
Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art.
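A minimal sketch of a self-supervised photometric loss of the kind that can stand in for ground-truth error as a search metric (nearest-neighbor warping for brevity; float images assumed):

```python
import numpy as np

def photometric_loss(im1: np.ndarray, im2: np.ndarray, flow: np.ndarray) -> float:
    """im1, im2: (H, W, 3) float images; flow: (H, W, 2) forward flow im1 -> im2."""
    h, w = im1.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip((xs + flow[..., 0]).round().astype(int), 0, w - 1)
    yt = np.clip((ys + flow[..., 1]).round().astype(int), 0, h - 1)
    warped = im2[yt, xt]                 # im2 warped back toward im1
    return float(np.abs(im1 - warped).mean())
```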
The process of screening molecules for desirable properties is a key step in several applications, ranging from drug discovery to material design. During the process of drug discovery specifically, protein-ligand docking, or chemical docking, is a standard in-silico scoring technique that estimates the binding affinity of molecules with a specific protein target. Recently, however, as the number of virtual molecules available to test has rapidly grown, these classical docking algorithms have created a significant computational bottleneck. We address this problem by introducing Deep Surrogate Docking (DSD), a framework that applies deep learning-based surrogate modeling to accelerate the docking process substantially. DSD can be interpreted as a formalism of several earlier surrogate prefiltering techniques, adding novel metrics and practical training techniques. Specifically, we show that graph neural networks (GNNs) can serve as fast and accurate estimators of classical docking algorithms. Additionally, we introduce FiLMv2, a novel GNN architecture which we show outperforms existing state-of-the-art GNN architectures, attaining more accurate and stable performance by allowing the model to filter out irrelevant information from data more efficiently. Through extensive experimentation and analysis, we show that the DSD workflow combined with the FiLMv2 architecture provides a 9.496x speedup in molecule screening with a <3% recall error rate on an example docking task. Our open-source code is available at https://github.com/ryienh/graph-dock.
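A sketch of the surrogate-prefiltering loop, under the assumption that lower predicted docking scores indicate better binders; `surrogate_scores` would come from the trained GNN:

```python
import numpy as np

def surrogate_prefilter(mol_ids: list, surrogate_scores: np.ndarray,
                        keep_frac: float = 0.1) -> list:
    """Keep the top keep_frac of molecules by predicted docking score."""
    k = max(1, int(len(mol_ids) * keep_frac))
    order = np.argsort(surrogate_scores)   # lower predicted score = better binder
    return [mol_ids[i] for i in order[:k]]

# The retained subset is then rescored with the classical docking engine,
# trading a small recall loss for a large reduction in expensive docking calls.
```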
Molecular shape and geometry dictate key biophysical recognition processes, yet many graph neural networks disregard 3D information for molecular property prediction. Here, we propose a new contrastive-learning procedure for graph neural networks, Molecular Contrastive Learning from Shape Similarity (MolCLaSS), that implicitly learns a three-dimensional representation. Rather than directly encoding or targeting three-dimensional poses, MolCLaSS matches a similarity objective based on Gaussian overlays to learn a meaningful representation of molecular shape. We demonstrate how this framework naturally captures key aspects of three-dimensionality that two-dimensional representations cannot and provides an inductive framework for scaffold hopping.
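A hedged sketch of a similarity-matching objective of this kind, regressing the pairwise similarity of learned graph embeddings onto precomputed 3D shape-overlap targets (e.g. Gaussian-overlay scores); the exact loss is illustrative:

```python
import numpy as np

def similarity_matching_loss(embs: np.ndarray, shape_sim: np.ndarray) -> float:
    """embs: (N, D) graph-level embeddings; shape_sim: (N, N) target overlaps."""
    z = embs / np.linalg.norm(embs, axis=-1, keepdims=True)
    pred = z @ z.T                                   # cosine similarity of embeddings
    return float(((pred - shape_sim) ** 2).mean())   # match the 3D shape targets
```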
Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pretraining on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely sampled 360-degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale imitation learning and the development of synthetic instruction generation capabilities.
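A minimal sketch of the imitation-learning objective, assuming plain behavior cloning (cross-entropy against the demonstrated action at each step); shapes are illustrative:

```python
import numpy as np

def behavior_cloning_loss(logits: np.ndarray, demo_actions: np.ndarray) -> float:
    """logits: (T, num_actions) agent outputs; demo_actions: (T,) expert ids."""
    logits = logits - logits.max(-1, keepdims=True)          # numerical stability
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return float(-logp[np.arange(len(demo_actions)), demo_actions].mean())
```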
To be effective partners for humans, robots must become increasingly comfortable making contact with their environment. Unfortunately, it is difficult for robots to distinguish between "enough" and "too much" force: some force is needed to complete a task, but too much may damage equipment or injure humans. Traditional approaches to designing compliant feedback controllers (e.g., stiffness control) require hand-tuning of control parameters, making it difficult to build safe and effective robot collaborators. In this paper, we propose a novel, easy-to-implement force-feedback controller that uses control barrier functions (CBFs) to derive the combined controller directly from user specifications of maximum allowable forces and torques. We compare against a traditional stiffness-control approach to demonstrate the potential advantages of our control architecture, and we demonstrate the effectiveness of our controller in a human-robot collaboration task: cooperative manipulation of a bulky object.
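A minimal 1-D sketch of a CBF-style force filter, assuming the applied force along one axis can be commanded through its rate (F_dot = u). With barrier h = F_max - F, the CBF condition h_dot >= -alpha * h reduces to u <= alpha * (F_max - F), giving a closed-form clamp. The paper's controller handles full wrenches; this is only illustrative:

```python
def cbf_force_filter(u_des: float, force: float, f_max: float,
                     alpha: float = 10.0) -> float:
    """Largest command not exceeding the CBF bound u <= alpha * (f_max - F)."""
    return min(u_des, alpha * (f_max - force))
```

The clamp leaves the desired command untouched far from the force limit and smoothly overrides it as the measured force approaches F_max, with alpha controlling how aggressively the bound is approached.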