Recent advances in theoretical deep learning have introduced geometric properties that occur during training past the interpolation threshold, where the training error reaches zero. We inquire into Neural Collapse in the intermediate layers of the network, and highlight the inner workings of nearest class-center mismatch inside the deepnet. We further show that these processes occur in both vision and language model architectures. Lastly, we propose a Stochastic Variability-Simplification Loss (SVSL) that encourages better geometric features in intermediate layers, improving both train metrics and generalization.
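One concrete way to probe the nearest class-center behavior described above is to check, at a given intermediate layer, how often a sample's feature vector lies closest to its own class mean. The sketch below is a minimal illustration of that measurement, not the paper's implementation; `features` and `labels` are hypothetical inputs.

```python
# Minimal sketch: nearest class-center (NCC) agreement at an intermediate layer.
import torch

def ncc_accuracy(features: torch.Tensor, labels: torch.Tensor) -> float:
    """features: (N, D) activations from one layer; labels: (N,) integer classes."""
    classes = labels.unique()
    # Class means ("class centers") in this layer's feature space.
    centers = torch.stack([features[labels == c].mean(dim=0) for c in classes])
    # Fraction of samples whose nearest class center matches their true label.
    pred = classes[torch.cdist(features, centers).argmin(dim=1)]
    return (pred == labels).float().mean().item()
```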
We present a method for semantically transferring the visual appearance of one natural image to another. Specifically, our goal is to generate an image in which objects in a source structure image are "painted" with the visual appearance of their semantically related objects in a target appearance image. Our method works by training a generator given only a single structure/appearance image pair as input. To integrate semantic information into our framework, a key component in solving this task, our main idea is to leverage a pre-trained and fixed Vision Transformer (ViT) model as an external semantic prior. Specifically, we derive novel representations of structure and appearance extracted from deep ViT features, disentangling them from the learned self-attention modules. We then establish an objective function that splices the desired structure and appearance representations, interweaving them in the space of ViT features. Our framework, which we term "Splice", does not involve adversarial training, nor does it require any additional input information such as semantic segmentation or correspondences, and it can generate high-resolution results, e.g., work in HD. We demonstrate high-quality results on a variety of in-the-wild image pairs, under significant variations in the number of objects, their pose, and appearance.
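As a rough illustration of how structure and appearance might be read out of deep ViT features, the sketch below loads a self-supervised DINO-ViT via torch.hub and derives a global appearance code from the [CLS] token and a structure code from patch-token self-similarity. Both choices are illustrative stand-ins for the paper's exact representations.

```python
# Minimal sketch: "structure" and "appearance" descriptors from DINO-ViT features.
import torch
import torch.nn.functional as F

vit = torch.hub.load('facebookresearch/dino:main', 'dino_vits16').eval()

@torch.no_grad()
def descriptors(img: torch.Tensor):
    """img: (1, 3, H, W), ImageNet-normalized, with H and W divisible by 16."""
    tokens = vit.get_intermediate_layers(img, n=1)[0]   # (1, 1+P, D)
    cls_tok, patches = tokens[:, 0], tokens[:, 1:]
    # Appearance: a single global code (here, the [CLS] token).
    appearance = cls_tok
    # Structure: self-similarity between patch tokens, which discards
    # appearance information while keeping the spatial layout.
    patches = F.normalize(patches, dim=-1)
    structure = patches @ patches.transpose(1, 2)       # (1, P, P)
    return structure, appearance
```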
We leverage deep features extracted from a pre-trained Vision Transformer (ViT) as dense visual descriptors. We demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties: (i) the features encode powerful high-level information at high spatial resolution, i.e., they capture semantic object parts at fine spatial granularity, and (ii) the encoded semantic information is shared across related, yet different, object categories (i.e., super-categories). These properties allow us to design powerful dense ViT descriptors that facilitate a variety of applications, including co-segmentation, part co-segmentation, and correspondences, all achieved by applying lightweight methodologies to deep ViT features (e.g., binning/clustering). We take these applications further into the realm of inter-class tasks, demonstrating how objects of related categories can be commonly segmented into semantic parts under significant pose and appearance variations. Our methods, evaluated extensively both qualitatively and quantitatively, achieve state-of-the-art part co-segmentation results, as well as results competitive with recent supervised methods trained specifically for co-segmentation and correspondences.
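A minimal example of the lightweight-clustering recipe: extract dense DINO-ViT patch descriptors from several images and cluster them jointly, so that shared cluster ids yield a crude co-segmentation. The specific model, patch size, and k-means clustering are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: co-segmentation by jointly clustering dense ViT patch descriptors.
import torch
import numpy as np
from sklearn.cluster import KMeans

vit = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').eval()

@torch.no_grad()
def cosegment(imgs, k=2, patch=8):
    """imgs: list of (1, 3, H, W) ImageNet-normalized tensors -> per-image label maps."""
    descs, shapes = [], []
    for im in imgs:
        tok = vit.get_intermediate_layers(im, n=1)[0][0, 1:]   # (P, D) patch tokens
        descs.append(tok.cpu().numpy())
        shapes.append((im.shape[-2] // patch, im.shape[-1] // patch))
    # Cluster patches from all images together, so cluster ids are shared.
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.concatenate(descs))
    maps, i = [], 0
    for (h, w), d in zip(shapes, descs):
        maps.append(labels[i:i + len(d)].reshape(h, w))
        i += len(d)
    return maps
```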
GANs are able to perform generation and manipulation tasks when trained on a single video. However, these single-video GANs require an unreasonable amount of time to train on one video, rendering them almost impractical. In this paper, we question the necessity of a GAN for generation from a single video, and introduce a non-parametric baseline for a variety of generation and manipulation tasks. We revive classical space-time patch nearest-neighbor approaches and adapt them into a scalable, unconditional generative model, without any learning. This simple baseline surprisingly outperforms single-video GANs in visual quality and realism (confirmed by quantitative and qualitative evaluations), and is disproportionately faster (runtime reduced from several days to seconds). Beyond diverse video generation, we demonstrate other applications using the same framework, including video analogies and spatio-temporal retargeting. Our proposed approach easily scales to Full-HD videos. These observations show that classical approaches, if adapted correctly, significantly outperform heavy deep learning machinery for these tasks. This sets a new baseline for single-video generation and manipulation tasks, and, no less important, makes diverse generation from a single video practically possible for the first time.
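To make the patch nearest-neighbor idea concrete, here is a minimal single-step sketch: every patch of a candidate video is replaced with its closest patch from the source video. For simplicity it uses non-overlapping per-frame patches and brute-force search; the actual method works with overlapping space-time patches in a coarse-to-fine scheme with approximate nearest-neighbor search.

```python
# Minimal sketch of one patch nearest-neighbor replacement step.
import torch
import torch.nn.functional as F

def pnn_step(candidate: torch.Tensor, source: torch.Tensor, p: int = 5) -> torch.Tensor:
    """candidate, source: (T, C, H, W) videos with H and W divisible by p."""
    T, C, H, W = candidate.shape
    def to_patches(v):
        u = F.unfold(v, kernel_size=p, stride=p)         # (T, C*p*p, L)
        return u.transpose(1, 2).reshape(-1, C * p * p)  # (T*L, C*p*p)
    cand, src = to_patches(candidate), to_patches(source)
    # Brute-force nearest source patch for every candidate patch.
    nn = torch.cdist(cand, src).argmin(dim=1)
    out = src[nn].reshape(T, -1, C * p * p).transpose(1, 2)  # (T, C*p*p, L)
    # Fold the (non-overlapping) patches back into frames.
    return F.fold(out, output_size=(H, W), kernel_size=p, stride=p)
```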
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to input text prompts, while consistent with input images. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes.
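As a sketch of the object-masking idea (proposing inpainting masks from detected objects during training), the snippet below uses an off-the-shelf torchvision detector to turn one detected box into a binary mask. This is an illustrative stand-in, not Imagen Editor's actual pipeline.

```python
# Minimal sketch: propose an inpainting mask from a detected object box.
import random
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

detector = fasterrcnn_resnet50_fpn(weights='DEFAULT').eval()

@torch.no_grad()
def object_mask(img: torch.Tensor, score_thresh: float = 0.5) -> torch.Tensor:
    """img: (3, H, W) in [0, 1]; returns a binary (H, W) inpainting mask."""
    out = detector([img])[0]
    boxes = out['boxes'][out['scores'] > score_thresh]
    mask = torch.zeros(img.shape[1:], dtype=torch.bool)
    if len(boxes) > 0:
        # Mask one randomly chosen detected object region.
        x0, y0, x1, y1 = boxes[random.randrange(len(boxes))].int().tolist()
        mask[y0:y1, x0:x1] = True
    return mask
```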
Image segmentation is a fundamental task in computer vision. Data annotation for training supervised methods can be labor-intensive, motivating unsupervised methods. Some existing approaches extract deep features from pre-trained networks and build a graph to apply classical clustering methods (e.g., $k$-means and normalized-cuts) as a post-processing stage. These techniques reduce the high-dimensional information encoded in the features to pair-wise scalar affinities. In this work, we replace classical clustering algorithms with a lightweight Graph Neural Network (GNN) trained to achieve the same clustering objective function. However, in contrast to existing approaches, we feed the GNN not only the pair-wise affinities between local image features but also the raw features themselves. Maintaining this connection between the raw features and the clustering goal allows us to perform part semantic segmentation implicitly, without requiring additional post-processing steps. We demonstrate how classical clustering objectives can be formulated as self-supervised loss functions for training our image segmentation GNN. Additionally, we use the Correlation-Clustering (CC) objective to perform clustering without defining the number of clusters ($k$-less clustering). We apply the proposed method to object localization, segmentation, and semantic part segmentation tasks, surpassing state-of-the-art performance on multiple benchmarks.
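As one concrete instance of formulating a classical clustering objective as a self-supervised loss, the sketch below writes a relaxed normalized-cut criterion over the GNN's soft cluster assignments. This is a standard relaxation for illustration; the exact objective used in the paper may differ.

```python
# Minimal sketch: a relaxed Normalized-Cut objective as a self-supervised loss.
import torch

def ncut_loss(S: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """S: (N, k) soft cluster assignments (rows sum to 1, e.g. a softmax output
    of the GNN); A: (N, N) non-negative affinity matrix."""
    d = A.sum(dim=1)                              # node degrees
    assoc = torch.einsum('nk,nm,mk->k', S, A, S)  # S_k^T A S_k per cluster
    degree = torch.einsum('nk,n,nk->k', S, d, S)  # S_k^T D S_k per cluster
    # Maximizing the normalized association is equivalent to minimizing the cut.
    return -(assoc / (degree + 1e-9)).sum()
```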
Generative models are becoming ever more powerful, being able to synthesize highly realistic images. We propose an algorithm for taming these models - changing the probability that the model will produce a specific image or image category. We consider generative models that are powered by normalizing flows, which allows us to reason about the exact likelihood of generating a given image. Our method is general purpose, and we exemplify it using models that generate human faces, a subdomain with many interesting privacy and bias considerations. Our method can be used in the context of privacy, e.g., removing a specific person from the output of a model, and also in the context of de-biasing by forcing a model to output specific image categories according to a given target distribution. Our method uses a fast fine-tuning process without retraining the model from scratch, achieving the goal in less than 1% of the time taken to initially train the generative model. We evaluate qualitatively and quantitatively, to examine the success of the taming process and output quality.
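A minimal sketch of what a taming fine-tuning step could look like: push down the flow's likelihood on a "forget" set while preserving it on reference data. The `flow.log_prob` interface, batch names, and the simple penalty weighting are assumptions, not the paper's actual objective.

```python
# Minimal sketch: one fine-tuning step that lowers the likelihood of a target set.
import torch

def taming_step(flow, opt, x_keep, x_forget, lam=1.0):
    """x_keep, x_forget: image batches; flow.log_prob(x) -> (B,) exact log-likelihoods."""
    opt.zero_grad()
    keep_nll = -flow.log_prob(x_keep).mean()    # preserve likelihood on kept data
    forget_ll = flow.log_prob(x_forget).mean()  # push down likelihood on the target set
    loss = keep_nll + lam * forget_ll
    loss.backward()
    opt.step()
    return loss.item()
```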
This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge. The OOV challenge introduces an important aspect that is usually not studied by Optical Character Recognition (OCR) models, namely the recognition of scene-text instances unseen at training time. The competition compiles a collection of public scene-text datasets containing 326,385 images with 4,864,405 scene-text instances, thus covering a wide range of data distributions. A new, independent validation and test set was formed with scene-text instances that are out of vocabulary at training time. The competition was conducted in two tasks, end-to-end recognition and cropped-word recognition, respectively. A thorough analysis of the results of the baselines and the different participants is presented. Interestingly, current state-of-the-art models show a significant performance gap under the newly studied setting. We conclude that the OOV dataset proposed in this challenge will be an important area to explore in order to develop scene-text models that achieve more robust and generalized predictions.
The staleness problem is a well-known issue when processing dynamic data and arises from long periods without events: since a node's memory is updated only when the node participates in an event, its memory becomes stale. Typically this corresponds to a lack of events, e.g., the temporal deactivation of a social account. To overcome the memory staleness problem, information from the memories of a node's neighbors is used in addition to the node's own memory. Inspired by this, we design an updated embedding module that incorporates the most similar node in addition to the node's neighbors. Our method obtains results similar to TGN, with a slight improvement. This may indicate room for further gains after fine-tuning our hyperparameters, especially the time threshold, and using a learnable similarity measure.
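A minimal sketch of the embedding module described above: a node's embedding combines its own memory, its neighbors' memories, and the memory of its most similar node. Cosine similarity and mean pooling are illustrative stand-ins; as noted above, a learnable similarity measure is a candidate improvement.

```python
# Minimal sketch: embed a node from its memory, its neighbors' memories,
# and the memory of its most similar node.
import torch
import torch.nn.functional as F

def embed(node: int, memory: torch.Tensor, neighbors: list) -> torch.Tensor:
    """memory: (num_nodes, D) memory matrix; neighbors: neighbor ids of `node`."""
    own = memory[node]
    # Most similar node under (fixed) cosine similarity, excluding the node itself.
    sim = F.cosine_similarity(own.unsqueeze(0), memory, dim=1)
    sim[node] = -float('inf')
    most_similar = int(sim.argmax())
    # Pool the neighbors' memories together with the most similar node's memory.
    pool = memory[neighbors + [most_similar]]
    return torch.cat([own, pool.mean(dim=0)])   # (2*D,) combined embedding
```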
Network classification aims to classify networks (or graphs) into different categories based on their structure. We study the connection between the classification of a network and that of its constituent nodes, and ask whether nodes from different groups of networks are distinguishable based on structural node features such as centrality and the clustering coefficient. We demonstrate, using a variety of network datasets and random network models, that a classifier can be trained to accurately predict the network category of a given node (without seeing the whole network), implying that complex networks display distinct structural patterns even at the node level. Finally, we discuss two applications of node-level network classification: (i) whole-network classification from small samples of nodes, and (ii) network bootstrapping.
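The setup above lends itself to a compact illustration: compute structural features per node and train a classifier to predict which network family a node came from. The features, the classifier, and the toy graph families below are illustrative choices, not the paper's exact experimental setup.

```python
# Minimal sketch: node-level network classification from structural features.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def node_features(G: nx.Graph) -> np.ndarray:
    """Per-node structural features: degree centrality, clustering, betweenness."""
    deg = nx.degree_centrality(G)
    clu = nx.clustering(G)
    btw = nx.betweenness_centrality(G)
    return np.array([[deg[v], clu[v], btw[v]] for v in G.nodes()])

# Two toy "network categories": Erdos-Renyi vs. Barabasi-Albert graphs.
graphs = [(nx.erdos_renyi_graph(200, 0.05), 0) for _ in range(10)] + \
         [(nx.barabasi_albert_graph(200, 5), 1) for _ in range(10)]
X = np.concatenate([node_features(G) for G, _ in graphs])
y = np.concatenate([[label] * G.number_of_nodes() for G, label in graphs])
# The classifier predicts a node's network category without seeing the whole graph.
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```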