文本VQA旨在回答需要了解图像中文本提示的问题。尽管现有的文本VQA方法取得了长足的进步,但它们的性能仍遭受了人类标记的问题解答(QA)对不足。但是,我们观察到,通常在现有数据集中没有完全利用场景文本 - 每个图像中只有一小部分文本参与了带注释的QA活动。这导致大量有用的信息浪费。为了解决这种缺陷,我们开发了一种新方法来通过明确利用每个图像的场景上下文中可用的现有文本来生成高质量和多样化的质量质量对。具体而言,我们建议,TAG是一种文本感知的视觉问题 - 答案生成的结构,该结构学会使用多模式变压器来生成有意义且准确的QA样品。该体系结构通过将生成的QA对与初始培训数据相结合,从而利用了未充满激光的场景文本信息,并增强了文本VQA模型的场景理解。对两个众所周知的Text-VQA基准(TextVQA和ST-VQA)的广泛实验结果表明,我们提议的标签有效地扩大了训练数据,有助于提高文本VQA性能而无需额外的标签努力。此外,我们的模型优于预先通过大规模数据进行训练的最先进方法。代码将公开可用。
translated by 谷歌翻译
我们提出了Clip-Lite,一种通过与文本注释的特征对齐方式进行视觉表示学习的信息有效方法。与先前提出的剪辑模型相比,剪辑液在优化其对比学学习目标期间只需要一个负图像文本样本对。我们通过利用信息有效的较低限制来实现这一点,以最大化两个输入模态之间的相互信息。这允许剪辑Lite培训,在获得比夹子的更好的性能的同时具有显着减少的数据和批量尺寸。我们通过在Coco-Tablions数据集上预先绘制来评估剪贴画并对其他数据集进行测试传输。 Clip-Lite在Pascal VOC分类上获得+ 15.4%的映射绝对增益,并在ImageNet上获得A + 22.1%的前1个精度增益,同时与其他更复杂,文本监督模型相当或优越。 Clip-Lite还优于剪辑图像和文本检索,零拍分类和视觉接地。最后,通过在表示学习期间执行显式图像文本对齐,我们显示Clip-Lite可以利用语言语义来鼓励可以在下游任务中使用的无偏见的视觉表示。
translated by 谷歌翻译
由于存在对象的自然时间转换,视频是一种具有自我监督学习(SSL)的丰富来源。然而,目前的方法通常是随机采样用于学习的视频剪辑,这导致监督信号差。在这项工作中,我们提出了预先使用无监督跟踪信号的SSL框架,用于选择包含相同对象的剪辑,这有助于更好地利用对象的时间变换。预先使用跟踪信号在空间上限制帧区域以学习并通过在Grad-CAM注意图上提供监督来定位模型以定位有意义的物体。为了评估我们的方法,我们在VGG-Sound和Kinetics-400数据集上培训势头对比(MOCO)编码器,预先使用预先。使用Previts的培训优于Moco在图像识别和视频分类下游任务中独自学习的表示,从而获得了行动分类的最先进的性能。预先帮助学习更强大的功能表示,以便在背景和视频数据集上进行背景和上下文更改。从大规模未婚视频中学习具有预算的大规模未能视频可能会导致更准确和强大的视觉功能表示。
translated by 谷歌翻译
Large-scale vision and language representation learning has shown promising improvements on various vision-language tasks. Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens. Because the visual tokens and word tokens are unaligned, it is challenging for the multimodal encoder to learn image-text interactions. In this paper, we introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention, which enables more grounded vision and language representation learning. Unlike most existing methods, our method does not require bounding box annotations nor high-resolution images. To improve learning from noisy web data, we propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model. We provide a theoretical analysis of ALBEF from a mutual information maximization perspective, showing that different training tasks can be interpreted as different ways to generate views for an image-text pair. ALBEF achieves state-of-the-art performance on multiple downstream visionlanguage tasks. On image-text retrieval, ALBEF outperforms methods that are pre-trained on orders of magnitude larger datasets. On VQA and NLVR 2 , ALBEF achieves absolute improvements of 2.37% and 3.84% compared to the state-ofthe-art, while enjoying faster inference speed. Code and models are available at https://github.com/salesforce/ALBEF.
translated by 谷歌翻译
We propose a technique for producing 'visual explanations' for decisions from a large class of Convolutional Neural Network (CNN)-based models, making them more transparent and explainable.Our approach -Gradient-weighted Class Activation Mapping (Grad-CAM), uses the gradients of any target concept (say 'dog' in a classification network or a sequence of words in captioning network) flowing into the final convolutional layer to produce a coarse localization map highlighting the important regions in the image for predicting the concept.Unlike previous approaches, Grad-CAM is applicable to a wide variety of CNN model-families: (1) CNNs with fullyconnected layers (e.g. VGG), (2) CNNs used for structured outputs (e.g. captioning), (3) CNNs used in tasks with multimodal inputs (e.g. visual question answering) or reinforcement learning, all without architectural changes or re-training. We combine Grad-CAM with existing fine-grained visualizations to create a high-resolution class-discriminative vi-
translated by 谷歌翻译
The recent increase in public and academic interest in preserving biodiversity has led to the growth of the field of conservation technology. This field involves designing and constructing tools that utilize technology to aid in the conservation of wildlife. In this article, we will use case studies to demonstrate the importance of designing conservation tools with human-wildlife interaction in mind and provide a framework for creating successful tools. These case studies include a range of complexities, from simple cat collars to machine learning and game theory methodologies. Our goal is to introduce and inform current and future researchers in the field of conservation technology and provide references for educating the next generation of conservation technologists. Conservation technology not only has the potential to benefit biodiversity but also has broader impacts on fields such as sustainability and environmental protection. By using innovative technologies to address conservation challenges, we can find more effective and efficient solutions to protect and preserve our planet's resources.
translated by 谷歌翻译
A Digital Twin (DT) is a simulation of a physical system that provides information to make decisions that add economic, social or commercial value. The behaviour of a physical system changes over time, a DT must therefore be continually updated with data from the physical systems to reflect its changing behaviour. For resource-constrained systems, updating a DT is non-trivial because of challenges such as on-board learning and the off-board data transfer. This paper presents a framework for updating data-driven DTs of resource-constrained systems geared towards system health monitoring. The proposed solution consists of: (1) an on-board system running a light-weight DT allowing the prioritisation and parsimonious transfer of data generated by the physical system; and (2) off-board robust updating of the DT and detection of anomalous behaviours. Two case studies are considered using a production gas turbine engine system to demonstrate the digital representation accuracy for real-world, time-varying physical systems.
translated by 谷歌翻译
We consider infinite horizon Markov decision processes (MDPs) with fast-slow structure, meaning that certain parts of the state space move "fast" (and in a sense, are more influential) while other parts transition more "slowly." Such structure is common in real-world problems where sequential decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. Examples include: (1) service allocation for a multi-class queue with (slowly varying) stochastic costs, (2) a restless multi-armed bandit with an environmental state, and (3) energy demand response, where both day-ahead and real-time prices play a role in the firm's revenue. Models that fully capture these problems often result in MDPs with large state spaces and large effective time horizons (due to frequent decisions), rendering them computationally intractable. We propose an approximate dynamic programming algorithmic framework based on the idea of "freezing" the slow states, solving a set of simpler finite-horizon MDPs (the lower-level MDPs), and applying value iteration (VI) to an auxiliary MDP that transitions on a slower timescale (the upper-level MDP). We also extend the technique to a function approximation setting, where a feature-based linear architecture is used. On the theoretical side, we analyze the regret incurred by each variant of our frozen-state approach. Finally, we give empirical evidence that the frozen-state approach generates effective policies using just a fraction of the computational cost, while illustrating that simply omitting slow states from the decision modeling is often not a viable heuristic.
translated by 谷歌翻译
While the capabilities of autonomous systems have been steadily improving in recent years, these systems still struggle to rapidly explore previously unknown environments without the aid of GPS-assisted navigation. The DARPA Subterranean (SubT) Challenge aimed to fast track the development of autonomous exploration systems by evaluating their performance in real-world underground search-and-rescue scenarios. Subterranean environments present a plethora of challenges for robotic systems, such as limited communications, complex topology, visually-degraded sensing, and harsh terrain. The presented solution enables long-term autonomy with minimal human supervision by combining a powerful and independent single-agent autonomy stack, with higher level mission management operating over a flexible mesh network. The autonomy suite deployed on quadruped and wheeled robots was fully independent, freeing the human supervision to loosely supervise the mission and make high-impact strategic decisions. We also discuss lessons learned from fielding our system at the SubT Final Event, relating to vehicle versatility, system adaptability, and re-configurable communications.
translated by 谷歌翻译
Machine learning is the dominant approach to artificial intelligence, through which computers learn from data and experience. In the framework of supervised learning, for a computer to learn from data accurately and efficiently, some auxiliary information about the data distribution and target function should be provided to it through the learning model. This notion of auxiliary information relates to the concept of regularization in statistical learning theory. A common feature among real-world datasets is that data domains are multiscale and target functions are well-behaved and smooth. In this paper, we propose a learning model that exploits this multiscale data structure and discuss its statistical and computational benefits. The hierarchical learning model is inspired by the logical and progressive easy-to-hard learning mechanism of human beings and has interpretable levels. The model apportions computational resources according to the complexity of data instances and target functions. This property can have multiple benefits, including higher inference speed and computational savings in training a model for many users or when training is interrupted. We provide a statistical analysis of the learning mechanism using multiscale entropies and show that it can yield significantly stronger guarantees than uniform convergence bounds.
translated by 谷歌翻译