Self-supervised learning (SSL) aims to produce useful feature representations without access to human-labeled annotations. The problem has gained popularity with the success of recent contrastive-learning-based SSL methods such as SimCLR. Most current contrastive learning approaches append a parametrized projection head to the end of some backbone network to optimize the InfoNCE objective and then discard the learned projection head after training. This raises a fundamental question: why is a learnable projection head required if it is to be discarded after training? In this work, we first perform a systematic study of SSL training behavior, focusing on the role of the projection head layers. By formulating the projection head as a parametric component of the InfoNCE objective rather than a part of the network, we present an alternative optimization scheme for training contrastive-learning-based SSL frameworks. Our experimental study on multiple image classification datasets demonstrates the effectiveness of the proposed approach over alternatives in the SSL literature.
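For context, below is a minimal sketch of the standard SimCLR-style setup this abstract refers to: a projection head appended to the backbone, trained with a (simplified) InfoNCE loss that uses only cross-view negatives. All names are illustrative, not the authors' code or their proposed alternative scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Two-layer MLP head that is typically discarded after SSL training."""
    def __init__(self, dim_in: int, dim_out: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_in, dim_in), nn.ReLU(inplace=True),
            nn.Linear(dim_in, dim_out),
        )

    def forward(self, h):
        return self.net(h)

def info_nce(z1, z2, temperature: float = 0.5):
    """InfoNCE over a batch of positive pairs (z1[i], z2[i])."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (N, N) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    # Diagonal entries are positives; all off-diagonal entries act as negatives.
    return F.cross_entropy(logits, labels)
```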
We present NeRFEditor, an efficient learning framework for 3D scene editing, which takes a video captured over 360° as input and outputs a high-quality, identity-preserving stylized 3D scene. Our method supports diverse types of editing, such as editing guided by reference images, text prompts, and user interactions. We achieve this by encouraging a pre-trained StyleGAN model and a NeRF model to learn from each other. Specifically, we use a NeRF model to generate numerous image-angle pairs to train an adjustor, which can adjust the StyleGAN latent code to generate high-fidelity stylized images for any given angle. To extrapolate editing to GAN out-of-domain views, we devise another module that is trained in a self-supervised manner. This module maps novel-view images to the hidden space of StyleGAN, allowing StyleGAN to generate stylized images for novel views. Together, the two modules produce guided images over 360° of views, which are used to fine-tune the NeRF to realize the stylization effect; a stable fine-tuning strategy is proposed to achieve this. Experiments show that NeRFEditor outperforms prior work on benchmark and real-world scenes with better editability, fidelity, and identity preservation.
Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition, while the text prompt provides control over the overall image semantics. However, we note that prior works in this direction suffer from an intrinsic domain-shift problem, wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework that addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation can be achieved with just a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention-based correspondence between the input text tokens and the user stroke painting, the user can also control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. The project page for our paper is available at https://1jsingh.github.io/gradop.
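To make the "single pass of the reverse diffusion process" idea concrete, here is the generic stroke-to-image baseline such methods build on (noise the user painting to an intermediate timestep, then denoise once under text guidance), expressed with the real `diffusers` img2img API. This is the baseline the paper improves upon, not its constrained-optimization solver.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

painting = Image.open("user_strokes.png").convert("RGB")  # coarse color strokes
out = pipe(
    prompt="a cottage in a lush forest, detailed oil painting",
    image=painting,
    strength=0.8,         # how far toward pure noise the painting is pushed
    guidance_scale=7.5,   # classifier-free guidance weight for the text prompt
).images[0]
out.save("synthesized.png")
```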
Color and structure are the two pillars that combine to form an image. Interested in the structures critical for neural network recognition, we isolate the influence of color by limiting the color space to just a few bits, and find the structures that enable network recognition under such constraints. To this end, we propose a color quantization network, ColorCNN, which learns to structure images in a limited color space by minimizing the classification loss. Building on the architecture and insights of ColorCNN, we introduce ColorCNN+, which supports configurations with multiple color-space sizes and addresses the previous problems of poor recognition accuracy and undesirable visual fidelity under large color spaces. Via a novel imitation-learning approach, ColorCNN+ learns to cluster colors like traditional color quantization methods. This reduces overfitting and helps both visual fidelity and recognition accuracy under large color spaces. Experiments verify that ColorCNN+ achieves very competitive results in most cases, preserving both the key structures for network recognition and visual fidelity with accurate colors. We further discuss the differences between key structures and accurate colors, and their specific contributions to network recognition. As a potential application, we show that ColorCNN can be used as an image compression method for network recognition.
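A schematic sketch of the differentiable color-quantization idea described above (names and details are illustrative assumptions, not the released ColorCNN code): a small network predicts, per pixel, a distribution over N palette slots; palette colors are the probability-weighted averages of input pixels, and the soft-quantized image would then be fed to a classifier so the classification loss can train the quantizer end to end.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftColorQuantizer(nn.Module):
    def __init__(self, num_colors: int = 4):
        super().__init__()
        self.assign = nn.Sequential(             # per-pixel palette logits
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_colors, 1),
        )

    def forward(self, x):                        # x: (B, 3, H, W) in [0, 1]
        m = F.softmax(self.assign(x), dim=1)     # (B, N, H, W) soft assignment
        # Palette: probability-weighted mean color of each slot, per image.
        w = m.unsqueeze(2)                       # (B, N, 1, H, W)
        palette = (w * x.unsqueeze(1)).sum(dim=(3, 4)) / (
            w.sum(dim=(3, 4)) + 1e-8)            # (B, N, 3)
        # Soft-quantized image: mixture of palette colors at every pixel.
        return torch.einsum("bnhw,bnc->bchw", m, palette)
```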
Semi-supervised semantic segmentation requires rich and robust supervision on unlabeled data. Consistency learning enforces that the same pixel has similar features in different augmented views, which is a robust signal but neglects relationships with other pixels. In contrast, contrastive learning considers rich pairwise relationships, but assigning binary positive-negative supervision signals to pixel pairs can be a conundrum. In this paper, we take the best of both worlds and propose multi-view correlation consistency (MVCC) learning: it considers the rich pairwise relationships in self-correlation matrices and matches them across views to provide robust supervision. Together with this correlation-consistency loss, we propose a view augmentation strategy that guarantees pixel-pixel correspondence between different views. Across a series of semi-supervised settings on two datasets, we report accuracy competitive with state-of-the-art methods. Notably, on Cityscapes we reach 76.8% mIoU with 1/8 labeled data, only 0.6% below the fully supervised oracle.
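A hedged sketch of the correlation-consistency idea: compute the pixel self-correlation matrix of each view's feature map and penalize their disagreement. The normalization and loss choices below are assumptions for illustration, not the authors' exact MVCC formulation.

```python
import torch
import torch.nn.functional as F

def self_correlation(feat):
    """feat: (B, C, H, W) -> (B, HW, HW) cosine self-correlation matrix."""
    f = F.normalize(feat.flatten(2), dim=1)      # (B, C, HW), unit features
    return f.transpose(1, 2) @ f                 # pairwise cosine similarities

def correlation_consistency_loss(feat_a, feat_b):
    """Match the self-correlation matrices of two pixel-aligned views."""
    return F.mse_loss(self_correlation(feat_a), self_correlation(feat_b))
```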
Generalization and invariance are two fundamental properties of any machine learning model. Generalization captures a model's ability to classify unseen data, while invariance measures the consistency of a model's predictions under transformations of the data. Existing research suggests a positive relationship: a model that generalizes well should be invariant to certain visual factors. Building on this qualitative implication, we make two contributions. First, we introduce effective invariance (EI), a simple and reasonable measure of model invariance that does not rely on image labels. Given predictions on a test image and its transformed version, EI measures how well the predictions agree and with what level of confidence. Second, using invariance scores computed with EI, we conduct a large-scale quantitative correlation study between generalization and invariance, focusing on rotation and grayscale transformations. From a model-centric view, we observe that the generalization and invariance of different models exhibit a strong linear relationship on both in-distribution and out-of-distribution datasets. From a dataset-centric view, we find that a given model's accuracy and invariance are linearly correlated across different test sets. Beyond these major findings, other minor but interesting insights are also discussed.
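A sketch of an effective-invariance-style score as the abstract describes it: agreement between the two predictions, weighted by confidence, with no labels required. The exact formula below is an assumption for illustration and may differ from the paper's precise definition of EI.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def effective_invariance(model, x, x_t):
    """x: original images; x_t: transformed versions (e.g. rotated/grayscale)."""
    p = F.softmax(model(x), dim=1)               # (B, K) predictive dists
    p_t = F.softmax(model(x_t), dim=1)
    conf, pred = p.max(dim=1)                    # confidence + label, original
    conf_t, pred_t = p_t.max(dim=1)              # same for the transformed input
    agree = (pred == pred_t).float()             # score is 0 on disagreement
    return (agree * torch.sqrt(conf * conf_t)).mean()
```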
Recent advances in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We find that, when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient while taking only a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialize, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets and significantly outperform state-of-the-art approaches. At inference time, our model with a ResNet50 backbone approaches real-time performance on a single GPU.
The problem of vanishing and exploding gradients is a long-standing obstacle to the effective training of neural networks. Despite the various tricks and techniques adopted in practice to alleviate the problem, satisfactory theories or provable solutions are still lacking. In this paper, we address the problem from the perspective of high-dimensional probability theory. We provide a rigorous result showing how, under mild conditions, the vanishing/exploding gradients problem disappears if the neural network is sufficiently wide. Our main idea is to constrain both forward and backward signal propagation in nonlinear neural networks through a new class of activation functions, namely Gaussian-Poincaré normalized functions, together with orthogonal weight matrices. Experiments on both synthetic and real-world data validate our theory and confirm its effectiveness for very deep neural networks when applied in practice.
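A simplified sketch of the two ingredients named above: (i) rescaling an activation so its second moment under a standard Gaussian input equals 1 (a Monte Carlo stand-in for only part of the paper's Gaussian-Poincaré normalization, which also constrains the derivative's second moment), and (ii) exactly orthogonal weight matrices. All names are illustrative.

```python
import torch
import torch.nn as nn

def second_moment_scale(act, n_samples: int = 200_000):
    """Return c such that E[(act(Z)/c)^2] = 1 for Z ~ N(0, 1)."""
    z = torch.randn(n_samples)
    return act(z).pow(2).mean().sqrt()

class NormalizedActLinear(nn.Module):
    """Orthogonal linear layer followed by a moment-normalized activation."""
    def __init__(self, dim: int, act=torch.tanh):
        super().__init__()
        self.linear = nn.Linear(dim, dim, bias=False)
        nn.init.orthogonal_(self.linear.weight)  # exactly orthogonal weights
        self.act = act
        self.scale = second_moment_scale(act)    # fixed rescaling constant

    def forward(self, x):
        return self.act(self.linear(x)) / self.scale
```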
A robot that can carry out a natural-language instruction has been a dream since before the Jetsons cartoon series imagined a life of leisure mediated by a fleet of attentive robot helpers. It is a dream that remains stubbornly distant. However, recent advances in vision and language methods have made incredible progress in closely related areas. This is significant because a robot interpreting a natural-language navigation instruction on the basis of what it sees is carrying out a vision and language process that is similar to Visual Question Answering. Both tasks can be interpreted as visually grounded sequence-to-sequence translation problems, and many of the same methods are applicable. To enable and encourage the application of vision and language methods to the problem of interpreting visually-grounded navigation instructions, we present the Matterport3D Simulator, a large-scale reinforcement learning environment based on real imagery [11]. Using this simulator, which can in future support a range of embodied vision and language tasks, we provide the first benchmark dataset for visually-grounded natural language navigation in real buildings: the Room-to-Room (R2R) dataset (https://bringmeaspoon.org). Example instruction: "Head upstairs and walk past the piano through an archway directly in front. Turn right when the hallway ends at pictures and table. Wait by the moose antlers hanging on the wall."
Top-down visual attention mechanisms have been used extensively in image captioning and visual question answering (VQA) to enable deeper image understanding through fine-grained analysis and even multiple steps of reasoning. In this work, we propose a combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions. This is the natural basis for attention to be considered. Within our approach, the bottom-up mechanism (based on Faster R-CNN) proposes image regions, each with an associated feature vector, while the top-down mechanism determines feature weightings. Applying this approach to image captioning, our results on the MSCOCO test server establish a new state-of-the-art for the task, achieving CIDEr / SPICE / BLEU-4 scores of 117.9, 21.5 and 36.9, respectively. Demonstrating the broad applicability of the method, applying the same approach to VQA we obtain first place in the 2017 VQA Challenge.
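A compact sketch of the top-down attention step described above: given bottom-up region features (e.g. from Faster R-CNN) and a task context vector (e.g. the captioning decoder's hidden state), compute soft additive-attention weights over regions and pool. Dimensions and names are illustrative, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownAttention(nn.Module):
    def __init__(self, feat_dim: int, ctx_dim: int, hidden: int = 512):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, hidden)
        self.proj_h = nn.Linear(ctx_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, regions, context):
        # regions: (B, K, feat_dim) bottom-up features for K detected regions
        # context: (B, ctx_dim) top-down query, e.g. decoder hidden state
        e = self.score(torch.tanh(
            self.proj_v(regions) + self.proj_h(context).unsqueeze(1)))
        alpha = F.softmax(e, dim=1)              # (B, K, 1) region weights
        return (alpha * regions).sum(dim=1)      # attended image feature
```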