To satisfy diverse user requirements, various subtasks of graphic layout generation have been explored intensively in recent years. Existing studies usually propose task-specific methods with diverse input-output formats, dedicated model architectures, and different learning methods. However, these specialized approaches make adaptation to unseen subtasks difficult, hinder knowledge sharing across subtasks, and run counter to the trend of devising general-purpose models. In this work, we propose UniLayout, which handles the different subtasks of graphic layout generation in a unified manner. First, we uniformly represent the diverse inputs and outputs of the subtasks as sequences of tokens. Then, based on this unified sequence format, we naturally leverage an identical encoder-decoder Transformer architecture for the different subtasks. Moreover, building on these two kinds of unification, we further develop a single model that supports all subtasks concurrently. Experiments on two public datasets demonstrate that, despite its simplicity, UniLayout significantly outperforms previous task-specific methods.
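To make the unification concrete, here is a minimal, hypothetical sketch of serializing layout elements into a flat token sequence; the element categories, the quantization grid, and the special tokens are illustrative assumptions rather than UniLayout's actual vocabulary.

```python
# Hypothetical serialization of a graphic layout into a flat token sequence.
# Each element becomes: <category> <x> <y> <w> <h>, with coordinates quantized
# to a fixed grid so that every field is a discrete token.

GRID = 128  # assumed quantization resolution

def quantize(value, size):
    """Map a coordinate in [0, size) to a discrete bin in [0, GRID)."""
    return min(GRID - 1, int(value / size * GRID))

def serialize_layout(elements, canvas_w, canvas_h):
    tokens = ["<bos>"]
    for e in elements:
        tokens += [
            e["category"],
            f"x{quantize(e['x'], canvas_w)}",
            f"y{quantize(e['y'], canvas_h)}",
            f"w{quantize(e['w'], canvas_w)}",
            f"h{quantize(e['h'], canvas_h)}",
        ]
    tokens.append("<eos>")
    return tokens

layout = [
    {"category": "title", "x": 40, "y": 30, "w": 400, "h": 60},
    {"category": "image", "x": 40, "y": 110, "w": 400, "h": 300},
]
print(serialize_layout(layout, canvas_w=480, canvas_h=640))
```

Once both the constraints and the layouts are flattened this way, every subtask reduces to plain sequence-to-sequence prediction.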
Creating visual layouts is an important step in graphic design. Automatic generation of such layouts is essential as we seek scalable and diverse visual designs. Prior works on automatic layout generation focus on unconditional generation, where the models generate layouts while ignoring user requirements for specific problems. To advance conditional layout generation, we introduce BLT, a bidirectional layout transformer. BLT differs from autoregressive decoding in that it first generates a layout that satisfies the user inputs and then iteratively refines it. We verify the proposed model on multiple benchmarks with various fidelity metrics. Our results demonstrate two key advances over state-of-the-art layout transformer models. First, our model empowers layout transformers to fulfill controllable layout generation. Second, our model cuts the linear inference time of autoregressive decoding down to constant complexity, achieving a 4x-10x speedup in generating a layout at inference time.
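A hedged sketch of the non-autoregressive, iterative decoding loop the abstract describes: fill every masked field in parallel, then repeatedly re-mask the least-confident predictions (never the user-specified ones) and predict again. `model` stands in for a bidirectional layout transformer; the masking schedule is an illustrative assumption.

```python
import torch

MASK_ID = 0   # assumed id of the [MASK] token
VOCAB = 32    # toy vocabulary size for the demo below

def iterative_refine(model, tokens, known, num_iters=4):
    """Non-autoregressive decoding sketch: predict all masked positions in
    parallel, then repeatedly re-mask the least-confident predictions
    (never touching the user-specified `known` positions) and predict again."""
    tokens = tokens.clone()
    for step in range(num_iters):
        probs = model(tokens).softmax(-1)   # (seq_len, vocab), all positions at once
        conf, pred = probs.max(-1)
        fill = tokens == MASK_ID
        tokens[fill] = pred[fill]           # parallel fill
        if step == num_iters - 1:
            break
        # re-mask the lowest-confidence generated positions for the next pass
        n_remask = int(fill.sum().item() * (1 - (step + 1) / num_iters))
        if n_remask > 0:
            cand = conf.masked_fill(known, float("inf"))
            tokens[cand.topk(n_remask, largest=False).indices] = MASK_ID
    return tokens

# toy demo with a random "model" just to exercise the loop
dummy_model = lambda t: torch.randn(t.size(0), VOCAB)
tokens = torch.tensor([5, MASK_ID, MASK_ID, 9, MASK_ID])
known = torch.tensor([True, False, False, True, False])
print(iterative_refine(dummy_model, tokens, known))
```

Because the number of refinement passes is fixed, decoding cost no longer grows with sequence length, which is the source of the constant-complexity claim above.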
Layout generation is a novel task in computer vision, which combines challenges from object localization and aesthetic appraisal and is widely used in advertisement, poster, and slide design. An accurate and pleasing layout should consider both the intra-domain relationships within the layout elements and the inter-domain relationships between the layout elements and the image. However, most previous methods simply focus on image-content-agnostic layout generation, without exploiting the complex visual information in the image. To this end, we explore a novel paradigm called image-conditioned layout generation, which aims to add text overlays to an image in a semantically coherent manner. Specifically, we propose an Image-Conditioned Variational Transformer (ICVT) that generates diverse layouts in an image. First, a self-attention mechanism is adopted to model the contextual relationships within the layout elements, while a cross-attention mechanism is used to fuse the visual information of the conditional image. Subsequently, we take them as the building blocks of a conditional variational autoencoder (CVAE), which exhibits appealing diversity. Second, to alleviate the gap between the layout-element domain and the visual domain, we design a geometry alignment module, in which the geometric information of the image is aligned with the layout representation. In addition, we construct a large-scale advertisement poster layout design dataset with delicate layout and saliency map annotations. Experimental results show that our model can adaptively generate layouts in the non-intrusive regions of an image, resulting in harmonious layout designs.
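A minimal sketch of the building block described above: self-attention over the layout elements followed by cross-attention into patch features of the conditioning image. Dimensions are illustrative, and the surrounding CVAE and geometry alignment module are omitted.

```python
import torch
import torch.nn as nn

class LayoutImageBlock(nn.Module):
    """Self-attention over layout elements, then cross-attention into image
    patch features, as a building block for image-conditioned generation."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, layout_tokens, image_feats):
        x = layout_tokens
        x = x + self.self_attn(x, x, x)[0]                        # intra-layout context
        x = x + self.cross_attn(x, image_feats, image_feats)[0]   # fuse the image condition
        return x + self.ffn(x)

layout_tokens = torch.randn(1, 6, 256)   # six layout elements
image_feats = torch.randn(1, 196, 256)   # 14x14 patch features of the poster image
print(LayoutImageBlock()(layout_tokens, image_feats).shape)   # torch.Size([1, 6, 256])
```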
Automatic generation of floorplans given user inputs has great potential in architectural design and has recently been explored in the computer vision community. However, most existing methods synthesize floorplans in a rasterized image format, which is difficult to edit or customize. In this paper, we aim to synthesize floorplans as sequences of 1-D vectors, which simplifies user interaction and design customization. To generate high-fidelity vectorized floorplans, we propose a novel two-stage framework consisting of a draft stage and a multi-round refinement stage. In the first stage, we encode the user's room-connectivity graph input with a graph convolutional network (GCN), and then apply an autoregressive transformer network to generate an initial floorplan. To polish the initial design and produce more visually appealing floorplans, we further propose a novel panoptic refinement network (PRN) composed of a GCN and a transformer network. The PRN takes the initially generated sequence as input and refines the floorplan design, while encouraging correct room connectivity with our proposed geometric loss. We have conducted extensive experiments on a real-world floorplan dataset, and the results show that our method achieves state-of-the-art performance under different settings and evaluation metrics.
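A small sketch of the first-stage idea, encoding the user's room-connectivity graph with graph convolutions so that the resulting per-room features can condition an autoregressive transformer decoder. The layer sizes and message-passing rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph-convolution step: average neighbour features, then project."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # adj: (num_rooms, num_rooms) adjacency matrix with self-loops
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        return torch.relu(self.proj(adj @ x / deg))

class RoomGraphEncoder(nn.Module):
    """Embeds room types and propagates connectivity information; the per-room
    features would then condition an autoregressive transformer decoder that
    emits the vectorized floorplan sequence."""
    def __init__(self, num_room_types, dim=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(num_room_types, dim)
        self.gcn = nn.ModuleList([SimpleGCNLayer(dim) for _ in range(layers)])

    def forward(self, room_types, adj):
        x = self.embed(room_types)
        for layer in self.gcn:
            x = layer(x, adj)
        return x  # (num_rooms, dim) conditioning features

# toy example: living room connected to kitchen and bedroom
types = torch.tensor([0, 1, 2])
adj = torch.tensor([[1., 1., 1.], [1., 1., 0.], [1., 0., 1.]])
print(RoomGraphEncoder(num_room_types=10)(types, adj).shape)  # torch.Size([3, 128])
```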
We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora using innovative self-supervised objectives and diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 9 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark (DUE).
Transfer learning, where a model is first pre-trained on a data-rich task before being finetuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
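The text-to-text format boils down to prefixing each task with a short instruction and emitting the answer as plain text. The task prefixes below follow the convention described in the paper; the helper function itself is just an illustration.

```python
def to_text_to_text(task, **fields):
    """Cast heterogeneous NLP tasks into a single string-in / string-out form."""
    if task == "translation":
        return f"translate English to German: {fields['text']}"
    if task == "summarization":
        return f"summarize: {fields['text']}"
    if task == "nli":  # e.g., MNLI entailment classification
        return f"mnli premise: {fields['premise']} hypothesis: {fields['hypothesis']}"
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("translation", text="That is good."))
# -> "translate English to German: That is good."
# The target side is likewise plain text, e.g. "Das ist gut." or "entailment".
```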
With the success of vision pre-training, we have witnessed the state of the art being pushed forward in multi-modal understanding and generation. However, current pre-training paradigms either cannot target all modalities at once (e.g., text generation and image generation), or require multiple well-designed tasks, which significantly limits scalability. We demonstrate that a unified modal model can be learned with a prefix language modeling objective over text and image sequences. Thanks to this simple yet powerful pre-training paradigm, our proposed model, DaVinci, is very easy to train, scalable to huge data, and adaptable to a variety of downstream tasks across modalities (language / vision / vision+language), types (understanding / generation), and settings (e.g., zero-shot, fine-tuning, linear evaluation) with a single unified architecture. DaVinci achieves competitive performance on a wide range of 26 understanding / generation tasks, and outperforms previous unified vision-language models on most of them, including ImageNet classification (+1.6%), VQAv2 (+1.4%), COCO caption generation (BLEU@4 +1.1%, CIDEr +1.5%) and COCO image generation (+0.9%, FID -1.0%), at comparable model and data scales. Furthermore, we provide a clear benchmark for future research by reporting performance at different scales of the pre-training dataset with heterogeneous and broad distribution coverage. Our results establish new, stronger baselines for future comparisons at different data scales and shed light on the difficulty of comparing VLP models more broadly.
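The prefix language modeling objective relies on an attention mask that is bidirectional over the prefix (e.g., image tokens or a text prefix) and causal over the continuation; a small generic sketch of such a mask follows (not DaVinci's exact implementation).

```python
import torch

def prefix_lm_mask(prefix_len, total_len):
    """Boolean attention mask (True = may attend): every position sees the
    full prefix bidirectionally, while the suffix is decoded causally."""
    mask = torch.zeros(total_len, total_len, dtype=torch.bool)
    mask[:, :prefix_len] = True                     # everyone attends to the prefix
    causal = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    mask[:, prefix_len:] = causal[:, prefix_len:]   # causal attention over the suffix
    return mask

# e.g., 4 prefix tokens (image) followed by 3 continuation tokens (caption)
print(prefix_lm_mask(prefix_len=4, total_len=7).int())
```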
The crux of text-to-image synthesis stems largely from the difficulty of maintaining cross-modal semantic consistency between the input text and the synthesized image. Typical methods, which attempt to model the text-to-image mapping directly, can only capture keywords in the text that indicate common objects or actions but fail to learn their spatial distribution patterns. An effective way to circumvent this limitation is to generate an image layout as guidance, which is attempted by a few methods. Nevertheless, due to the diversity of input texts and object locations, these methods fail to generate practically effective layouts. In this paper, we push for effective modeling in both text-to-layout generation and layout-to-image synthesis. Specifically, we formulate text-to-layout generation as a sequence-to-sequence modeling task, and build our model upon Transformers to learn the spatial relationships among objects by modeling the sequential dependencies between them. In the layout-to-image synthesis stage, we focus on learning the textual-visual alignment for each object in the layout, so as to precisely incorporate the input text into the layout-to-image synthesis process. To evaluate the quality of the generated layouts, we design a new metric called the Layout Quality Score, which considers both the absolute distribution errors of the bounding boxes in the layout and their mutual spatial relations. Extensive experiments on three datasets demonstrate the superiority of our method over the state of the art in both predicting layouts and synthesizing images from given texts.
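The abstract describes the Layout Quality Score only at a high level. The sketch below is one hypothetical way to combine its two stated ingredients, absolute box-distribution error and agreement of pairwise spatial relations; the actual weighting and relation set used in the paper may differ.

```python
def box_center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)

def relation(box_a, box_b):
    """Coarse spatial relation between two boxes, by comparing centers."""
    (xa, ya), (xb, yb) = box_center(box_a), box_center(box_b)
    return ("left" if xa < xb else "right", "above" if ya < yb else "below")

def layout_quality_score(pred, gt, alpha=0.5):
    """Illustrative combination of absolute box error and pairwise relation
    agreement; `pred` and `gt` are aligned lists of (x, y, w, h) boxes in
    normalized coordinates."""
    abs_err = sum(sum(abs(p - g) for p, g in zip(pb, gb)) for pb, gb in zip(pred, gt))
    abs_err /= max(1, 4 * len(pred))
    pairs = [(i, j) for i in range(len(pred)) for j in range(i + 1, len(pred))]
    rel_acc = sum(relation(pred[i], pred[j]) == relation(gt[i], gt[j]) for i, j in pairs)
    rel_acc /= max(1, len(pairs))
    return alpha * (1 - abs_err) + (1 - alpha) * rel_acc

pred = [(0.10, 0.10, 0.30, 0.20), (0.50, 0.40, 0.30, 0.30)]
gt   = [(0.12, 0.10, 0.30, 0.20), (0.55, 0.45, 0.30, 0.30)]
print(layout_quality_score(pred, gt))
```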
Generating controllable and editable human motion sequences is a key challenge in 3D Avatar generation. It has been labor-intensive to generate and animate human motion for a long time until learning-based approaches have been developed and applied recently. However, these approaches are still task-specific or modality-specific\cite{ahuja2019language2pose}\cite{ghosh2021synthesis}\cite{ferreira2021learning}\cite{li2021ai}. In this paper, we propose "UDE", the first unified driving engine that enables generating human motion sequences from natural language or audio sequences (see Fig.~\ref{fig:teaser}). Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQVAE that represents a continuous motion sequence as discrete latent codes\cite{van2017neural}, 2) a modality-agnostic transformer encoder\cite{vaswani2017attention} that learns to map modality-aware driving signals to a joint space, 3) a unified token transformer (GPT-like\cite{radford2019language}) network that predicts the quantized latent code indices in an auto-regressive manner, and 4) a diffusion motion decoder that takes the motion tokens as input and decodes them into motion sequences with high diversity. We evaluate our method on the HumanML3D\cite{Guo_2022_CVPR} and AIST++\cite{li2021learn} benchmarks, and the experiment results demonstrate our method achieves state-of-the-art performance. Project website: \url{https://github.com/zixiangzhou916/UDE/}
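The motion-quantization step (component 1 above) reduces to a nearest-codebook lookup in the VQ-VAE latent space; a minimal generic sketch, with codebook size and dimensions chosen for illustration rather than taken from UDE.

```python
import torch

def vq_quantize(latents, codebook):
    """Map continuous latents (T, D) to discrete code indices via the nearest
    entry of a learned codebook (K, D), as in a VQ-VAE."""
    dists = torch.cdist(latents, codebook)   # (T, K) pairwise distances
    indices = dists.argmin(dim=-1)           # discrete motion tokens
    quantized = codebook[indices]            # quantized latent values
    return indices, quantized

codebook = torch.randn(512, 256)   # assumed: 512 codes of dimension 256
latents = torch.randn(60, 256)     # e.g., 60 encoded motion frames
indices, quantized = vq_quantize(latents, codebook)
print(indices.shape, quantized.shape)   # torch.Size([60]) torch.Size([60, 256])
```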
In this paper, we propose UNICORN, a vision-language (VL) model that unifies text generation and bounding-box prediction into a single architecture. Specifically, we quantize each box into four discrete box tokens and serialize them as a sequence, which can be integrated with text tokens. We formulate all VL problems as a generation task, where the target sequence consists of the integrated text and box tokens. We then train a transformer encoder-decoder to predict the target in an autoregressive manner. With such a unified framework and input-output format, UNICORN achieves performance comparable to the task-specific state of the art on 7 VL benchmarks, covering visual grounding, grounded captioning, visual question answering, and image captioning tasks. When trained with multi-task fine-tuning, UNICORN can approach different VL tasks with a single set of parameters, thus crossing downstream task boundaries. We show that having a single model not only saves parameters but also further boosts model performance on certain tasks. Finally, UNICORN shows the capability of generalizing to new tasks such as ImageNet object localization.
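A minimal sketch of the box-to-token idea: quantizing the four box coordinates into discrete bins so that a box can be serialized alongside ordinary text tokens. The bin count and token spelling are illustrative assumptions.

```python
NUM_BINS = 1000  # assumed size of the coordinate vocabulary

def box_to_tokens(box, img_w, img_h):
    """Quantize (x1, y1, x2, y2) into four discrete box tokens."""
    x1, y1, x2, y2 = box
    scaled = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [f"<bin_{min(NUM_BINS - 1, int(v * NUM_BINS))}>" for v in scaled]

# a grounded-captioning target might interleave text tokens and box tokens
caption = ["a", "dog", "on", "the", "grass"]
target = caption + box_to_tokens((48, 120, 300, 420), img_w=640, img_h=480)
print(target)
```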
Generalist models, which are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model, have been explored recently. Being, hopefully, an alternative to approaching general-purpose AI, existing generalist models are still at an early stage, where modality and task coverage is limited. To empower multi-modal task-scaling and speed up this line of research, we release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction. At the core of OFASys is the idea of decoupling multi-modal task representations from the underlying model implementations. In OFASys, a task involving multiple modalities can be defined declaratively even with just a single line of code. The system automatically generates task plans from such instructions for training and inference. It also facilitates multi-task training for diverse multi-modal workloads. As a starting point, we provide presets of 7 different modalities and 23 highly-diverse example tasks in OFASys, with which we also develop a first-in-kind, single model, OFA+, that can handle text, image, speech, video, and motion data. The single OFA+ model achieves 95% performance in average with only 16% parameters of 15 task-finetuned models, showcasing the performance reliability of multi-modal task-scaling provided by OFASys. Available at https://github.com/OFA-Sys/OFASys
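To illustrate the idea of a declarative, single-line multi-modal task definition, here is a hypothetical instruction string and parser; this is a sketch of the concept only, not OFASys's actual instruction grammar or API (see the linked repository for those).

```python
import re

# Hypothetical single-line, declarative multi-modal task description:
# each slot names a modality and a field; "->" separates inputs from targets.
INSTRUCTION = "[IMAGE:img] what does the image describe? -> [TEXT:caption]"

def parse_instruction(instruction):
    """Split a declarative instruction into input/target slots and plain text,
    so a generic system could derive preprocessing and a training plan from it."""
    source, target = (part.strip() for part in instruction.split("->"))
    slot = re.compile(r"\[([A-Z]+):(\w+)\]")
    return {
        "input_slots": slot.findall(source),     # [('IMAGE', 'img')]
        "prompt": slot.sub("", source).strip(),  # 'what does the image describe?'
        "target_slots": slot.findall(target),    # [('TEXT', 'caption')]
    }

print(parse_instruction(INSTRUCTION))
```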
Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Although generative models emerge to make design automation no longer utopian, it remains non-trivial to customize designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground contents. In this study, we propose LayoutDETR, which inherits the high quality and realism from generative modeling, while reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal elements in a layout. Experiments validate that our solution yields new state-of-the-art performance for layout generation on public benchmarks and on our newly-curated ads banner dataset. For practical usage, we build our solution into a graphical system that facilitates user studies. We demonstrate that our designs attract more subjective preference than baselines by significant margins. Our code, models, dataset, graphical system, and demos are available at https://github.com/salesforce/LayoutDETR.
We propose a novel multimodal architecture for Scene Text Visual Question Answering (STVQA), named Layout-Aware Transformer (LaTr). The STVQA task requires models to reason over different modalities. Thus, we first investigate the impact of each modality and reveal the importance of the language module, especially when enriched with layout information. Taking this into account, we propose a single-objective pre-training scheme that requires only text and spatial cues. We show that applying this pre-training scheme on scanned documents has certain advantages despite the domain gap. Scanned documents are easy to procure, text-dense, and have a variety of layouts, helping the model learn various spatial cues (e.g., left-of, below, etc.) by tying together language and layout information. Compared to existing approaches, our method performs vocabulary-free decoding and, as shown, generalizes beyond the training vocabulary. We further demonstrate that LaTr improves robustness towards OCR errors, a common cause of failure in STVQA. In addition, by leveraging a vision transformer, we eliminate the need for an external object detector. LaTr outperforms state-of-the-art STVQA methods on multiple datasets. In particular, +7.6% on TextVQA, +10.8% on ST-VQA, and +4.0% on OCR-VQA (all absolute accuracy numbers).
Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequence as compared to recurrent networks e.g., Long short-term memory (LSTM). Different from convolutional networks, Transformers require minimal inductive biases for their design and are naturally suited as set-functions. Furthermore, the straightforward design of Transformers allows processing multiple modalities (e.g., images, videos, text and speech) using similar processing blocks and demonstrates excellent scalability to very large capacity networks and huge datasets. These strengths have led to exciting progress on a number of vision tasks using Transformer networks. This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline. We start with an introduction to fundamental concepts behind the success of Transformers i.e., self-attention, large-scale pre-training, and bidirectional feature encoding. We then cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering, visual reasoning, and visual grounding), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution, image enhancement, and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works. We hope this effort will ignite further interest in the community to solve current challenges towards the application of transformer models in computer vision.
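Since self-attention is the first of the fundamental concepts the survey introduces, here is the standard scaled dot-product attention in a few lines, as a generic textbook sketch rather than any particular model from the survey.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # pairwise similarities
    weights = scores.softmax(dim=-1)                # attention distribution per query
    return weights @ v                              # weighted sum of values

x = torch.randn(10, 64)                        # 10 sequence elements, 64-dim features
out = scaled_dot_product_attention(x, x, x)    # self-attention: Q = K = V = x
print(out.shape)                               # torch.Size([10, 64])
```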
Color is a critical design factor for web pages, affecting important factors such as viewer emotions and the overall trust and satisfaction of a website. Effective coloring requires design knowledge and expertise, but if this process could be automated through data-driven modeling, efficient exploration and alternative workflows would be possible. However, this direction remains underexplored due to the lack of a formalization of the web page colorization problem, datasets, and evaluation protocols. In this work, we propose a new dataset consisting of e-commerce mobile web pages in a tractable format, which are created by simplifying the pages and extracting canonical color styles with a common web browser. The web page colorization problem is then formalized as a task of estimating plausible color styles for a given web page content with a given hierarchical structure of the elements. We present several Transformer-based methods that are adapted to this task by prepending structural message passing to capture hierarchical relationships between elements. Experimental results, including a quantitative evaluation designed for this task, demonstrate the advantages of our methods over statistical and image colorization methods. The code is available at https://github.com/CyberAgentAILab/webcolor.
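A hedged sketch of "prepending structural message passing" to a Transformer: element embeddings are first mixed with their parents' along the page hierarchy, then passed to a standard encoder. Dimensions and the propagation rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class HierarchyMessagePassing(nn.Module):
    """Mixes each element's embedding with its parent's embedding before the
    elements are handed to a standard Transformer encoder."""
    def __init__(self, dim=64, steps=2):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)
        self.steps = steps

    def forward(self, x, parent):
        # x: (num_elements, dim); parent[i] = index of i's parent (root points to itself)
        for _ in range(self.steps):
            x = torch.relu(self.proj(torch.cat([x, x[parent]], dim=-1)))
        return x

dim = 64
elements = torch.randn(5, dim)            # e.g., body, header, logo, nav, button
parent = torch.tensor([0, 0, 1, 1, 3])    # a small DOM-like tree
structured = HierarchyMessagePassing(dim)(elements, parent)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), 1)
print(encoder(structured.unsqueeze(0)).shape)   # torch.Size([1, 5, 64])
```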
Document Information Extraction (DIE) has attracted increasing attention due to its various advanced applications in the real world. Although recent literature has achieved competitive results, these approaches usually fail when dealing with complex documents that have noisy OCR results or mutative layouts. This paper proposes a Generative Multi-modal Network (GMN) for real-world scenarios to address these problems, a robust multi-modal generation method without predefined label categories. With a carefully designed spatial encoder and a modality-aware mask module, GMN can handle complex documents that are hard to serialize into a sequential order. Moreover, GMN tolerates errors in OCR results and requires no character-level annotation, which is vital because fine-grained annotation of numerous documents is laborious and even requires annotators with specialized domain knowledge. Extensive experiments show that GMN achieves new state-of-the-art performance on several public DIE datasets and surpasses other methods by a large margin, especially in realistic scenarios.
The layout of a mobile screen is a critical data source for UI design research and for semantic understanding of the screen. However, UI layouts in existing datasets are often noisy, have mismatches with their visual representation, or consist of generic or app-specific types that are difficult to analyze and model. In this paper, we propose the CLAY pipeline, which uses a deep learning approach to denoise UI layouts, allowing us to automatically improve existing mobile UI layout datasets at scale. Our pipeline takes both a screenshot and a raw UI layout, and annotates the raw layout by removing incorrect nodes and assigning a semantically meaningful type to each node. To experiment with our data-cleaning pipeline, we create the CLAY dataset of 59,555 human-annotated screen layouts, based on screenshots and raw layouts from Rico, a public mobile UI corpus. Our deep models achieve high accuracy, with an F1 score of 82.7% for detecting layout objects that have no valid visual representation and 85.9% for recognizing object types, significantly outperforming a heuristic baseline. Our work lays the foundation for creating large-scale, high-quality UI layout datasets for data-driven mobile UI research, and reduces the need for costly manual labeling efforts.
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation; vision-and-language tasks, such as region captioning and referring expression comprehension; and natural language processing tasks, such as question answering and paraphrasing. Developing a single unified model raises unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture on over 80 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks such as NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, SWiG, VizWiz, BoolQ, and SciTail, without any task- or benchmark-specific fine-tuning. A demo of Unified-IO is available at https://unified-io.allenai.org.
Conventional methods for image-text generation tasks mainly tackle the naturally bidirectional generation tasks separately, focusing on designing task-specific frameworks to improve the quality and fidelity of the generated samples. Recently, vision-language pre-training models have greatly improved the performance of image-to-text generation tasks, but large-scale pre-trained models for text-to-image synthesis remain under-developed. In this paper, we propose ERNIE-ViLG, a unified generative pre-training framework for bidirectional image-text generation with transformer models. Based on an image quantization model, we formulate both image generation and text generation as autoregressive generative tasks conditioned on the text/image input. The bidirectional image-text generative modeling eases the semantic alignment across vision and language. For the text-to-image generation process, we further propose an end-to-end training method that jointly learns the visual sequence generator and the image reconstructor. To explore the landscape of large-scale pre-training for bidirectional text-image generation, we train a 10-billion-parameter ERNIE-ViLG model on a large-scale dataset of 145 million (Chinese) image-text pairs, which achieves state-of-the-art performance on both text-to-image and image-to-text tasks, obtaining an FID of 7.9 on MS-COCO for text-to-image synthesis and the best results on COCO-CN and AIC-ICC for image captioning.
Sketching is a natural and effective visual communication medium commonly used in creative processes. Recent developments in deep-learning models have dramatically improved machines' ability to understand and generate visual content. An exciting area of development explores deep-learning approaches for modeling human sketches, opening up opportunities for creative applications. This chapter describes three fundamental steps in developing deep-learning-driven creativity-support tools that consume and generate sketches: 1) a data-collection effort that generates a new paired dataset between sketches and mobile user interfaces; 2) a sketch-based user-interface retrieval system adapted from state-of-the-art computer vision techniques; and 3) a conversational sketching system that supports a novel interaction of a natural-language-based sketching/critiquing authoring process. In this chapter, we survey related prior work in both the deep-learning and human-computer-interaction communities, document the data-collection process and the architecture of the systems in detail, present qualitative and quantitative results, and paint the landscape of several future research directions in this exciting area.