From a visual scene containing multiple people, humans are able to distinguish each individual given context descriptions about what happened before, their mental/physical states, intentions, etc. This ability relies heavily on human-centric commonsense knowledge and reasoning. For example, if asked to identify the "person who needs healing" in an image, we first need to know that such a person usually has injuries or a suffering expression, then find the corresponding visual clues before finally grounding the person. We present a new commonsense task, Human-centric Commonsense Grounding, that tests a model's ability to ground individuals given context descriptions about what happened before and their mental/physical states or intentions. We further create a benchmark, HumanCog, a dataset with 130k grounded commonsensical descriptions annotated on 67k images, covering diverse types of commonsense and visual scenes. We set up a context-object-aware method as a strong baseline that outperforms previous pre-trained and non-pre-trained models. Further analysis demonstrates that rich visual commonsense and powerful integration of multi-modal commonsense are essential, which sheds light on future work. Data and code will be available at https://github.com/Hxyou/HumanCog.
Deep neural networks still struggle on long-tailed image datasets, and one reason is that the imbalance of training data across categories leads to an imbalance in the trained model parameters. Motivated by the empirical finding that trained classifiers yield larger weight norms for head classes, we propose to reformulate the recognition probabilities through included angles without re-balancing the classifier weights. Specifically, we calculate the angles between the data feature and the class-wise classifier weights to obtain angle-based prediction results. Inspired by the performance improvement from this reformulation of the predictive form and the strong performance of the widely used two-stage learning framework, we explore the properties of this angular prediction and propose novel modules to improve different components of the framework. Our method obtains the best performance among peer methods without pre-training on CIFAR10/100-LT and ImageNet-LT. Source code will be made publicly available.
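A minimal sketch of the angle-based prediction idea described above, assuming features and class-wise classifier weights are given as tensors; the function and variable names are illustrative, not the paper's implementation:

```python
# Angle-based prediction for long-tailed recognition: instead of inner-product
# logits w_c^T x (which favour head classes with large ||w_c||), scores are
# derived from the included angle between the feature and each class weight,
# which is invariant to the weight norm.
import torch
import torch.nn.functional as F

def angular_logits(features: torch.Tensor, class_weights: torch.Tensor) -> torch.Tensor:
    """features: (B, D), class_weights: (C, D). Returns (B, C) angle-based scores."""
    cos = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).T  # cosine in [-1, 1]
    angles = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))                       # angle in [0, pi]
    return -angles  # smaller included angle -> higher score

# Example: the class with the smallest included angle is predicted.
feats = torch.randn(4, 128)
weights = torch.randn(10, 128)
pred = angular_logits(feats, weights).argmax(dim=1)
```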
Visual commonsense understanding requires Vision-Language (VL) models to not only understand the image and text but also cross-reference between them to fully integrate and comprehend the visual scene described. Recently, various approaches have been developed and have achieved high performance on visual commonsense benchmarks. However, it is unclear whether the models really understand the visual scene and the underlying commonsense knowledge, owing to limited evaluation data resources. To provide an in-depth analysis, we present a Multimodal Evaluation (ME) pipeline that automatically generates question-answer pairs to test models' understanding of the visual scene, the text, and related knowledge. We then take a step further and show that training with the ME data boosts the model's performance on the standard VCR evaluation. Lastly, our in-depth analysis and comparison reveal interesting findings: (1) semantically low-level information can assist the learning of high-level information, but not the reverse; (2) visual information is generally under-utilized compared with text.
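A toy illustration of the general idea of automatically generating probing question-answer pairs from grounded annotations; the schema and templates below are assumptions for illustration only, not the actual ME pipeline:

```python
# Generate simple yes-answer probes from a (hypothetical) object-annotation schema,
# so low-level scene understanding can be checked automatically.
from typing import List, Dict, Tuple

def generate_probes(objects: List[Dict]) -> List[Tuple[str, str]]:
    """objects: [{'name': 'person', 'attributes': ['smiling']}, ...] (hypothetical schema)."""
    qa_pairs = []
    for obj in objects:
        qa_pairs.append((f"Is there a {obj['name']} in the image?", "yes"))
        for attr in obj.get("attributes", []):
            qa_pairs.append((f"Is the {obj['name']} {attr}?", "yes"))
    return qa_pairs

print(generate_probes([{"name": "person", "attributes": ["smiling"]}]))
```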
Contrastive language-image pretraining (CLIP) links the vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain-shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In this work, we seek to answer these questions through two key contributions. First, we introduce an evaluation protocol that includes Visual Commonsense Reasoning (VCR), Visual Entailment (SNLI-VE), and Visual Question Answering (VQA), across a variety of data-availability constraints and conditions of domain shift. Second, we propose an approach, named CLIP Targeted Distillation (CLIP-TD), to intelligently distill knowledge from CLIP into existing architectures using a dynamically weighted objective applied to adaptively selected tokens per instance. Experiments demonstrate that our proposed CLIP-TD leads to exceptional gains in the low-shot (up to 51.9%) and domain-shifted (up to 71.3%) conditions of VCR, while simultaneously improving performance under standard fully-supervised conditions (up to 2%), achieving state-of-the-art performance on VCR compared to other single models pretrained with image-text data only. On SNLI-VE, CLIP-TD produces significant gains in low-shot conditions (up to 6.6%) as well as fully-supervised conditions (up to 3%). On VQA, CLIP-TD provides improvements in low-shot (up to 9%) and fully-supervised (up to 1.3%) conditions. Finally, CLIP-TD outperforms concurrent works utilizing CLIP for finetuning, as well as baseline naive distillation approaches. Code will be made available.
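A hedged sketch of token-wise distillation in the spirit of the objective described above: student token features are pulled toward teacher (CLIP) token features with per-token weights and per-instance token selection. The selection/weighting criterion used here (teacher-student cosine disagreement) is an illustrative assumption, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weighted_token_distillation(student_tokens, teacher_tokens, top_k=8):
    """student_tokens, teacher_tokens: (B, T, D). Returns a scalar distillation loss."""
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    disagreement = 1.0 - (s * t).sum(-1)              # (B, T): large where student deviates from teacher
    weights = torch.softmax(disagreement, dim=-1)     # dynamic per-token weights
    topk = disagreement.topk(top_k, dim=-1).indices   # adaptively select tokens per instance
    mask = torch.zeros_like(disagreement).scatter_(1, topk, 1.0)
    return (weights * mask * disagreement).sum(-1).mean()

loss = weighted_token_distillation(torch.randn(2, 32, 512), torch.randn(2, 32, 512))
```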
Answering complex questions about images is an ambitious goal of machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as strong reasoning ability. Recently, multimodal Transformers have made great progress by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not exploit the rich structure of the scene and the interactions between objects, which are essential for answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs into commonsense reasoning. To exploit the scene-graph structure, at the model-structure level we propose a multi-hop graph transformer that regularizes attention among hops. As for pre-training, a scene-graph-aware pre-training method is proposed to leverage the structural knowledge extracted from visual scene graphs. Moreover, we introduce a method to train on and generate domain-relevant visual scene graphs using textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with state-of-the-art methods and demonstrate the efficacy of each proposed component.
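An illustrative sketch, not the exact SGEITL architecture, of one way scene-graph structure can regularize attention: edges are turned into a k-hop reachability mask so that object tokens mainly attend to their graph neighbours within a transformer layer:

```python
import torch

def khop_attention_mask(num_objects: int, edges, k: int = 2) -> torch.Tensor:
    """edges: list of (i, j) pairs from the scene graph. Returns an (N, N) boolean
    mask of k-hop reachability to gate attention between object tokens."""
    adj = torch.eye(num_objects, dtype=torch.bool)
    for i, j in edges:
        adj[i, j] = adj[j, i] = True
    reach = adj.clone()
    for _ in range(k - 1):
        reach = reach | ((reach.float() @ adj.float()) > 0)
    return reach  # disallowed pairs would receive -inf before the attention softmax

mask = khop_attention_mask(5, [(0, 1), (1, 2), (3, 4)], k=2)
```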
We present a framework for learning calibrated uncertainties under domain shift. We consider the setting where the source (training) distribution differs from the target (test) distribution. We detect such domain shifts with a binary domain classifier, which is integrated with the task network and trained jointly end-to-end. The binary domain classifier yields a density ratio that reflects the closeness of a target (test) sample to the source (training) distribution. We employ it to adjust the uncertainty of the task network's predictions. This use of density ratios is grounded in the distributionally robust learning (DRL) framework, which accounts for domain shift via adversarial risk minimization. We demonstrate that our method produces calibrated uncertainties that benefit many downstream tasks, such as unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), where methods like self-training and FixMatch use uncertainties to select confident pseudo-labels for re-training. Our experiments show that introducing DRL leads to significant improvements in cross-domain performance. We also demonstrate that the estimated density ratios agree with human selection frequencies, suggesting a positive correlation with a proxy of human-perceived uncertainty.
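A minimal sketch of using a binary domain classifier as a density-ratio estimator to temper task predictions. The particular way the ratio adjusts uncertainty below (logit scaling) is an illustrative assumption, not necessarily the paper's mechanism:

```python
import torch
import torch.nn.functional as F

def density_ratio(domain_logit: torch.Tensor) -> torch.Tensor:
    """domain_logit: (B,) logit of P(source | x). Returns p_s(x)/p_t(x), assuming equal priors."""
    p_src = torch.sigmoid(domain_logit)
    return p_src / (1.0 - p_src).clamp_min(1e-6)

def calibrated_probs(task_logits: torch.Tensor, domain_logit: torch.Tensor) -> torch.Tensor:
    """task_logits: (B, C). Samples far from the source get flatter (less confident) predictions."""
    w = density_ratio(domain_logit).clamp(max=1.0).unsqueeze(1)  # in (0, 1]
    return F.softmax(task_logits * w, dim=1)

probs = calibrated_probs(torch.randn(4, 10), torch.randn(4))
```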
The robustness of signal temporal logic not only assesses whether a signal adheres to a specification but also provides a measure of how much a formula is satisfied or violated. The computation of robustness is based on evaluating the robustness of the underlying predicates. However, the robustness of predicates is usually defined in a model-free way, i.e., without incorporating the system dynamics. Moreover, it is often nontrivial to precisely define the robustness of complex predicates. To address these issues, we propose the notion of model predictive robustness, which provides a more systematic way of evaluating robustness than previous approaches by considering model-based predictions. In particular, we use Gaussian process regression to learn the robustness from precomputed predictions so that robustness values can be computed efficiently online. We evaluate our approach on an autonomous-driving use case with predicates used in formalized traffic rules on a recorded dataset, which highlights the advantage of our approach over traditional ones in terms of expressiveness. By incorporating our robustness definition into a trajectory planner, autonomous vehicles obey traffic rules more robustly than the human drivers in the dataset.
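A sketch of the learning step under stated assumptions: a Gaussian process regressor is fit on state features paired with precomputed model-predictive robustness values, so robustness for a new state can be queried online. The features, kernel choice, and data below are placeholders, not the paper's setup:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical offline data: state features (e.g., relative distance, velocity) and
# the robustness of a traffic-rule predicate computed via model-based prediction.
states = np.random.rand(200, 2)
robustness = np.random.rand(200)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(1e-3),
                              normalize_y=True)
gp.fit(states, robustness)

# Online: predicted robustness (and its uncertainty) for a new state.
mean, std = gp.predict(np.array([[0.4, 0.7]]), return_std=True)
```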
Large-scale multimodal contrastive pre-training has demonstrated transferable capabilities on a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, a separate encoder is employed for each modality. However, recent work has shown that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of shared parameters along a spectrum. Under the studied conditions, we observe that a mostly unified encoder for visual and language signals outperforms all other variants that separate more parameters. In addition, we find that modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13% relative in zero-shot classification (pre-trained on YFCC-100M) while supporting a reduction in parameters. Moreover, our approach outperforms vanilla CLIP in linear probing on a collection of 24 downstream vision tasks. Furthermore, we find that sharing parameters leads to semantic concepts from different modalities being encoded closer together in the embedding space, facilitating the transfer of common semantic structures (e.g., attention patterns) from language to vision. Code is available at https://github.com/hxyou/msclip.
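A toy illustration of the modality-sharing idea: one transformer encoder processes both image patch tokens and text tokens, while the embeddings remain modality-specific. All dimensions and module choices are illustrative, not MS-CLIP's actual architecture:

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=256, vocab=10000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)        # modality-specific
        self.patch_embed = nn.Linear(patch_dim, dim)      # modality-specific
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=6)  # shared across modalities

    def encode_text(self, token_ids):                     # (B, L) -> (B, dim)
        return self.shared(self.text_embed(token_ids)).mean(dim=1)

    def encode_image(self, patches):                      # (B, P, patch_dim) -> (B, dim)
        return self.shared(self.patch_embed(patches)).mean(dim=1)

model = SharedEncoder()
txt = model.encode_text(torch.randint(0, 10000, (2, 16)))
img = model.encode_image(torch.randn(2, 49, 768))
```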
Video scene graph generation (VidSGG) aims to parse video content into scene graphs, which involves modeling the spatio-temporal contextual information in the video. However, due to the long-tailed training data in datasets, the generalization performance of existing VidSGG models can be affected by the spatio-temporal conditional bias problem. In this work, from the perspective of meta-learning, we propose a novel Meta Video Scene Graph Generation (MVSGG) framework to address this bias problem. Specifically, to handle various types of spatio-temporal conditional biases, our framework first constructs a support set and a group of query sets, where the data distribution of each query set differs from that of the support set w.r.t. a type of conditional bias. Then, by performing a novel meta training-and-testing process that optimizes the model to achieve good testing performance on these query sets after training on the support set, our framework can effectively guide the model to learn to generalize well against biases. Extensive experiments demonstrate the efficacy of our proposed framework.
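A hedged, simplified sketch of a generic meta train/test step of the kind described above: the model takes an inner gradient step on the support set, is then evaluated on query sets whose conditional distributions differ from the support set, and the query losses form the outer objective. The inner-update and loss details are illustrative simplifications, not the exact MVSGG procedure:

```python
import torch
import torch.nn as nn

def meta_step(model, loss_fn, support_batch, query_batches, inner_lr=1e-2):
    # Inner step: adapt a differentiable copy of the parameters on the support set.
    params = dict(model.named_parameters())
    support_loss = loss_fn(model(support_batch["inputs"]), support_batch["targets"])
    grads = torch.autograd.grad(support_loss, list(params.values()), create_graph=True)
    adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}

    # Outer objective: perform well on query sets with shifted conditional distributions.
    meta_loss = 0.0
    for q in query_batches:
        out = torch.func.functional_call(model, adapted, (q["inputs"],))
        meta_loss = meta_loss + loss_fn(out, q["targets"])
    return meta_loss / len(query_batches)  # backpropagate this with the outer optimizer

# Example on a hypothetical regression task:
model, loss_fn = nn.Linear(4, 1), nn.MSELoss()
support = {"inputs": torch.randn(8, 4), "targets": torch.randn(8, 1)}
queries = [{"inputs": torch.randn(8, 4), "targets": torch.randn(8, 1)} for _ in range(3)]
meta_step(model, loss_fn, support, queries).backward()
```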
The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model that has attracted increasing attention in the computer vision community. Benefiting from its huge image-text training set, the CLIP model has learned outstanding capabilities in zero-shot learning and image-text matching. To improve CLIP's recognition performance on certain target visual concepts, it is often desirable to further update the CLIP model by fine-tuning on additional training data of interest. However, this operation raises an important concern: will the update hurt CLIP's zero-shot learning or image-text matching capability, i.e., cause catastrophic forgetting? If so, could existing continual learning algorithms be adapted to alleviate the risk of catastrophic forgetting? To answer these questions, this work conducts a systematic study of the continual learning problem of the CLIP model. We construct evaluation protocols to measure the impact of fine-tuning updates and explore different ways to upgrade existing continual learning methods to mitigate the forgetting problem of the CLIP model. Our study reveals the particular challenges of CLIP continual learning and lays a foundation for further research. Moreover, we propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which demonstrates clear effectiveness in alleviating the forgetting problem of the CLIP model.
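A hedged sketch of distillation with a replayed vocabulary: frozen text features of a sampled word list act as pseudo-classifiers, and the fine-tuned image encoder is regularized to keep the same distribution over them as the frozen original CLIP. The sampling and loss details are illustrative assumptions, not necessarily those of VR-LwF:

```python
import torch
import torch.nn.functional as F

def replayed_vocab_loss(img_feat_new, img_feat_old, vocab_text_feat, tau=0.01):
    """img_feat_new / img_feat_old: (B, D) image features from the updated / frozen model.
    vocab_text_feat: (V, D) frozen text features of replayed vocabulary words."""
    v = F.normalize(vocab_text_feat, dim=-1)
    logits_new = F.normalize(img_feat_new, dim=-1) @ v.T / tau
    logits_old = F.normalize(img_feat_old, dim=-1) @ v.T / tau
    # KL distillation: keep the new model's distribution over the vocabulary close to the old one.
    return F.kl_div(F.log_softmax(logits_new, dim=-1),
                    F.softmax(logits_old, dim=-1), reduction="batchmean")

loss = replayed_vocab_loss(torch.randn(4, 512), torch.randn(4, 512), torch.randn(1000, 512))
```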