Geometry Aligned Variational Transformer for Image-conditioned Layout Generation

Yunning Cao^{1,2 $*$}, Ye Ma^{2 $*$}, Min Zhou², Chuanbin Liu¹, Hongtao Xie^{1 $†$}, Tiezheng Ge², Yuning Jiang²
¹University of Science and Technology of China ²Alibaba Group
cynasd@mail.utsc.edu.cn, {maye.my,yunqi.zm}@alibaba-inc.com,
{liucb92, htxie}@ustc.edu.cn, {tiezheng.gtz,mengzhu.jyn}@alibaba-inc.com

Abstract.

Layout generation is a novel task in computer vision that combines the challenges of object localization and aesthetic appraisal, and it is widely used in the design of advertisements, posters, and slides. An accurate and pleasant layout should consider both the intra-domain relationship within layout elements and the inter-domain relationship between layout elements and the image. However, most previous methods focus on image-content-agnostic layout generation and do not leverage the complex visual information in the image. To this end, we explore a novel paradigm entitled image-conditioned layout generation, which aims to add text overlays to an image in a semantically coherent manner. Specifically, we propose an Image-Conditioned Variational Transformer (ICVT) that autoregressively generates various layouts in an image. First, a self-attention mechanism is adopted to model the contextual relationship within layout elements, while a cross-attention mechanism is used to fuse the visual information of conditional images. We then take them as building blocks of a conditional variational autoencoder (CVAE), which demonstrates appealing diversity. Second, in order to alleviate the gap between the layout element domain and the visual domain, we design a Geometry Alignment module, in which the geometric information of the image is aligned with the layout representation. In addition, we construct a large-scale advertisement poster layout dataset with delicate layout and saliency map annotations. Experimental results show that our model can adaptively generate layouts in the non-intrusive area of the image, resulting in a harmonious layout design.

* Equal contribution. † Corresponding author.

Keywords: image-conditioned layout generation, conditional variational autoencoder, Transformer, cross attention

1. Introduction

Figure 1. Examples of image-conditioned layout generation task. Given a background image with a subject presented (left), the layout consisting of multiple design elements needs to be generated (middle) following design principles. Hence, designers or even automatic rendering programs could utilize generated layouts to render the output images (right).

Layout design aims at arranging design elements such as images, text and shapes in order to catch readers’ attention and convey information in a visually appealing way. It is widely used in many graphic design scenarios such as document typesetting, web page layout, poster, slides, billboards and user interface design, where a high aesthetic quality layout is crucial and fundamental. Therefore, the ability to automatically generate a high quality layout design without involving human designers is beneficial and has great value for future applications.

Recently, layout design generation based on deep generative models such as Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Lopez et al., 2018) has attracted increasing interest. Existing work has demonstrated the effectiveness of deep learning in generating novel documents (Zheng et al., 2019; Li et al., 2019; Gupta et al., 2021), user interfaces (UI) (Li et al., 2021; Lee et al., 2020; Arroyo et al., 2021) and natural scenes (Gupta et al., 2021; Jyothi et al., 2019). These methods use generative models to automatically produce plausible and diverse layouts, either by directly predicting elements’ bounding box locations (Jyothi et al., 2019; Lee et al., 2020; Gupta et al., 2021; Arroyo et al., 2021), or by synthesizing RGB images rendered from layouts (Li et al., 2019, 2021; Zheng et al., 2019).

However, most prior papers only focus on the layout of foreground elements, ignoring the existence of background images. It may work well in some traditional situations such as document typesetting or web page layout, which usually presents a pure color background. But when it comes to the design of posters or advertisement billboards, whose background often contains the subject to be highlighted, the importance of background image arises if we want the generated design to be more natural and comparable with that of designers. Professional poster designers usually start from a given image with the subject centered and then consider where to put all other design elements, making the poster more native and natural. This differs from most existing work which only considers the relationship of design elements and fails to involve the semantic constraints between image context and generated layout elements.

Thus, we claim that for layout design conditioned on an image background, the layout elements should be added to the image in a semantically coherent manner, such that the overall appearance of design elements and the image subject is harmonious and appealing. We define this problem as image-conditioned layout generation. Note that some prior work mentions conditional layout generation, which aims to generate layouts conditioned on either a partial layout input (Gupta et al., 2021) or the given number and types of elements (Jyothi et al., 2019). These settings differ from our problem because the methods are image-content-agnostic.

In this paper, we study the problem of image-conditioned layout generation. As shown in Figure 1, there exists an appealing background image with other layout elements including text, text substrates, and logos. The positions of layout elements are determined by both the image content and basic rules of layout design. For example, layouts should not occlude the salient subject in the image and text elements should not overlap with each other. Therefore, image-conditioned layout generation is more challenging than unconditional layout generation because it requires jointly modeling the intra-domain relationship within layout elements and the inter-domain relationship between layout elements and the background image.

To solve this problem, we first parameterize layout elements by their categorical and geometric parameters (bounding boxes with classes). Then, we project each element of the layout into an embedding vector and represent the whole layout as a sequence of vectors. Thus, we formulate the problem as a conditional autoregressive sequence prediction problem. Consequently, we propose a transformer-based conditional variational autoencoder (CVAE) to jointly model the layout sequence and the conditional image. We adopt self-attention layers to capture the intra-domain relationship within layout elements and cross-attention layers to learn the inter-domain relationship between image and layout elements. Then we take them as building blocks of the CVAE. The VAE framework plays an important role in modeling the real data distribution with its latent space, allowing us to generate various layouts by latent space sampling.

Furthermore, what most distinguishes image-conditioned layouts from unconditional ones is the relationship between the visual domain (the background image) and the geometric parameter domain (bounding box locations), so feature alignment turns out to be a fundamental and critical problem. Specifically, we need to extract the geometric information of the image and project it into the same domain as the parameterized layout elements. Prior work tries to solve this by adding a geometry enhancement module to transformers (Meng et al., 2021; Liu et al., 2021). However, these methods only focus on the geometry prior of input queries, without considering the relationship between images and queries. Therefore, we propose a novel geometry alignment module to extract the image geometry information and align it with the layout elements. We divide the image into grids and represent each grid cell by its bounding box, which is aligned to the representation of layout bounding boxes. After that, it is much easier to model the geometric relationship between the image and layout bounding boxes.

To validate our proposed method, we construct a large-scale advertisement poster design layout dataset, consisting of 117,624 poster images designed by professional designers, annotated with rich in-image layout annotations and saliency map of the background product subject. Extensive experiments on the proposed dataset demonstrate the effectiveness of the proposed ICVT model. Compared with previous unconditioned layout generation models, our ICVT model shows much better results in image-conditioned layout generation, achieving the state-of-the-art performance.

Our main contributions can be summarized as follows:

We propose the Image-Conditioned Variational Transformer (ICVT) for visual-textual layout design generation, which is, to the best of our knowledge, one of the first deep learning based models for image-conditioned layout generation.
We propose a novel geometry alignment module to enhance inter-domain feature fusion by decoupling and projecting image geometry information into the same domain of layout elements.
We construct a large-scale advertisement poster design layout dataset, consisting of 117,624 poster images designed by designers with rich annotations of layouts.

2. Related Work

Recently, layout generation is increasingly attracting interest in the research community. We summarize existing work according to how they represent layout elements and divide them into two types: One is rendering layout elements into visual images, which we refer to as the visual domain. The other is using geometry and attribute parameters to express any layout elements, which we refer to as the geometric parameter domain.

Visual domain. To the best of our knowledge, LayoutGAN (Li et al., 2019) is the first to apply generative adversarial network (GAN) to layout generation. To leverage the CNN discriminator’s advantage of distinguishing visual patterns, they propose a novel differentiable wireframe rendering layer to rasterize structured data of layout elements into wireframe images. Therefore, the layout can be directly optimized in the visual domain. However, they simply parameterize all elements and render them into wireframe images, without any visual information. Following LayoutGAN, Attribute-Aware LayoutGAN (Li et al., 2021) adds editable attribute conditions such as element area, aspect ratio, and order. Although involving image elements, they simply represent the image as a bounding box and attribute descriptions, without involving any visual information from the image. Besides, (Zheng et al., 2019) directly takes an image-based representation for layouts, without involving any layout parameters (i.e. bounding box). Then they train a VAE-GAN conditioned on images, keywords, and attributes of the layout, which generates content-aware graphic design layouts.

Visual representation has the potential for cross-domain modeling (e.g. jointly modeling image and parameterized layouts), as it directly converts the layout parameters to the visual domain, naturally achieving domain alignment. However, representing layouts in the visual domain could not scale well when generating layouts of bigger size and resolution, where the computational cost and sophisticated post-processing algorithms such as contour or smoothing need to be considered.

Geometric parameter domain. Parameterized layout representation (e.g. bounding boxes) is a much more popular choice. LayoutVAE (Jyothi et al., 2019) first proposes an autoregressive model based on VAE (Lopez et al., 2018) to generate various layouts. It generates layouts from a set of labels, i.e. the categories of layout elements. However, it is built with LSTMs, which struggle with long sequences, making it difficult to model a large number of layout elements. As the transformer (Vaswani et al., 2017) shows great power in sequence modeling, subsequent work has gradually adopted transformer architectures instead of LSTMs. LayoutTransformer (Gupta et al., 2021) addresses the layout generation problem using an autoregressive language model based on transformers. They find that self-attention can learn relationships between layout elements, yielding competitive layout quality. Furthermore, to increase the diversity of generated layouts, Variational Transformer Network (VTN) (Arroyo et al., 2021) is proposed. VTN integrates an autoregressive transformer decoder into the VAE framework to generate various layouts by sampling from the latent space. Neural Design Network (NDN) (Lee et al., 2020) uses a graph neural network (Scarselli et al., 2008) to model the relations among layout elements. Compared to the visual domain, parameterized layout representation is much more flexible. However, the domain gap between parameterized layouts and images obstructs cross-domain modeling.

Geometry enhancement is an important method for improving the performance of cross-attention. Detection Transformer (DETR) (Carion et al., 2020) directly fuses queries with visual features without geometry enhancement, which results in extremely slow training convergence. To solve the problem, conditional DETR (Meng et al., 2021) designs a 2D positional embedding $(x, y)$ as a geometry prior for the queries, which improves both convergence speed and accuracy. Furthermore, DAB-DETR (Liu et al., 2021) proposes a 4D positional embedding $(x, y, w, h)$ for queries to learn a dynamic geometric prior. However, the above methods focus on enhancing geometry information for queries, without considering the alignment of geometry information between queries and the image. Besides, in image captioning, geometry enhancement is used to help model image spatial information (Guo et al., 2020; Zhang et al., 2021). The geometry-aware model (Guo et al., 2020) calculates the relative geometry relation among different objects to benefit visual reasoning. With the development of transformers, RSTN (Zhang et al., 2021) utilizes a similar method to model the relative geometry relation among grid features. However, these methods focus on enhancing geometric information in the visual domain, without cross-domain alignment.

3. Approach

Figure 2. Overall architecture of Image-Conditioned Variational Transformer (ICVT). During training (both gray and red arrows), the masked image $X$ is passed through the visual backbone and fed into the ICVT encoder and decoder as a condition, while the bounding boxes $Y$ are fed into the ICVT encoder-decoder. Attention Average Pooling (AAP) aggregates the encoder output. The model objective is the combination of reconstruction loss and KL loss. During inference (red arrows only), we sample latent representations $z$ from the prior and autoregressively generate layouts.

3.1. Problem Formulation

We are interested in jointly modeling the intra-domain relationships within layout elements and the inter-domain relationships between layout elements and the image, and furthermore generating diverse and appealing layouts on the given image. The problem can be defined as follows:

Given a dataset of image-layout pairs $D = \{(X_{i}, Y_{i})\}_{i = 1}^{| D |}$ , $X_{i}$ represents an image and $Y_{i} = [Y_{i}^{1}, Y_{i}^{2}, \dots, Y_{i}^{T}]$ represents the $T$ layout elements on the corresponding image. Specifically, a layout element is defined by its category and bounding box $(c_{i}^{j}, x_{i}^{j}, y_{i}^{j}, w_{i}^{j}, h_{i}^{j})$ . We project each of these attributes to the same $d$-dimensional space and concatenate them into a layout vector $Y_{i}^{j} \in R^{5 d}$ . Besides, we use a latent variable $z$ to control the prediction and introduce diversity.

Subsequently, our problem can be formulated as modeling the distribution of layout elements $Y$ conditioned on the image $X$ and a random vector $z$ (the latent vector in CVAE), where $\hat{Y}$ denotes the generated layout:

(1)		$\hat{Y} = \arg\max_{Y} P (Y ∣ z, X)$

3.2. Conditional Variational Autoencoders

The problem of image-conditioned layout generation requires generating various layouts from a single input image, which can be formulated as one-to-many mapping. Consequently, we adopt Conditional Variational Autoencoder (Sohn et al., 2015; Walker et al., 2016) as it naturally models the one-to-many mapping from input to output, with no need for an explicitly specified structure of the output distribution.

The goal of CVAE is to approximate the conditional data distribution $p_{θ} (Y ∣ X)$ by maximizing the conditional data log-likelihood $\log p_{θ} (Y ∣ X)$ . Since posterior inference is intractable, a $ϕ$-parameterized encoder is introduced to approximate the true posterior $p_{θ} (z ∣ X, Y) \propto p_{θ} (Y ∣ z, X) p (z ∣ X)$ with a variational distribution $q_{ϕ} (z ∣ X, Y)$ . Variational inference is employed for CVAE learning with the following evidence lower bound (ELBO):

(2)		$\log p_{θ} (Y ∣ X) \geq \mathbb{E}_{q_{ϕ} (z ∣ X, Y)} [\log p_{θ} (Y ∣ z, X)] - \mathrm{KL} (q_{ϕ} (z ∣ X, Y) \parallel p (z ∣ X))$

where $p (z ∣ X)$ is the prior distribution and $q_{ϕ} (z ∣ X, Y)$ is the posterior distribution.

3.3. ICVT Architecture

As shown in Figure 2, our proposed ICVT model contains three main components: a visual backbone to extract image features, a transformer-based encoder for generating the latent code, and a transformer decoder as a conditional generator.

3.3.1. Visual Backbone

The visual backbone learns visual features from the image and uses them as a condition to guide the process of layout generation. Specifically, we utilize a vision transformer (ViT) (Dosovitskiy et al., 2020) as it has a better relation modeling ability. Given an image $X \in R^{H \times W \times C}$ , we reshape it into a sequence of flattened 2D $P \times P$ patches. We then feed them into a ViT model, generating a visual feature $f \in R^{L \times d}$ , where $L = \frac{H W}{P^{2}}$ is the length of the patch sequence and $d$ is the embedding dimension of the visual backbone. Besides, to align the embedding dimension of the visual backbone with that of the variational transformer, we add a transformer encoder after the visual backbone.

3.3.2. ICVT Encoder

The ICVT encoder models the posterior distribution $q_{ϕ} (z ∣ X, Y)$ in the general CVAE framework. Thus, we need to encode both the image $X$ and layouts $Y$ into a joint conditional posterior distribution, which is a challenging problem that distinguishes our model from other unconditional layout generation models like VTN (Arroyo et al., 2021) and LayoutVAE (Jyothi et al., 2019). There are several common fusion methods like directly concatenating the visual features with the input layout embeddings, aggregating visual features into a single vector and fusing it with the latent vector $z$ , etc. However, after some comparisons, cross-attention mechanism comes to be the best choice to fuse the visual features with layout features, because it can not only keep the most spatial information of the image but also naturally model the inter-domain relationship between visual features and layout features.

For these reasons, we implement our ICVT encoder with a standard transformer decoder (Vaswani et al., 2017), which consists of a self-attention layer, a cross-attention layer, and a feed-forward network (FFN). In this module, the layout features are taken as the input sequence. As the coordinates of layout bounding boxes already contain spatial position information, we do not add positional encoding to the layout features. The visual features are taken as keys and values of the cross-attention layers, with positional encoding added to the keys. This yields the joint representation of layouts and images, which is a sequence of vectors whose length equals that of the input layout sequence. To thoroughly utilize the information in the joint representation, we design an attention average pooling (AAP) layer to aggregate the sequence of vectors into a single vector. Specifically, the AAP layer is implemented with a self-attention layer, where the query $Q \in R^{1 \times D}$ is a single learnable vector while the keys and values $K = V \in R^{L \times D}$ are the sequence of vectors from the ICVT encoder. Next, we use the aggregated vector to model the posterior distribution.
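The pooling step described above can be illustrated with a minimal single-head sketch in plain Python. It is not the implementation used in the experiments: the learned linear projections and multi-head structure of a real attention layer are omitted, and the function name `attention_average_pool` and the toy inputs are assumptions made only for illustration.

```python
import math

def attention_average_pool(query, sequence):
    """Pool a sequence of D-dim vectors into one vector with a single query.

    query: list of D floats (stands in for the learnable query Q).
    sequence: list of L vectors, used as both keys and values (K = V).
    """
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in sequence]
    # Numerically stable softmax over the sequence dimension.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The weighted sum of the values gives the pooled vector.
    return [sum(w * vec[i] for w, vec in zip(weights, sequence))
            for i in range(d)]

joint = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy joint representation, L = 3
pooled = attention_average_pool([0.0, 0.0], joint)  # zero query: uniform weights
```

With an all-zero query the weights are uniform and the layer reduces to plain average pooling; a trained query would instead weight the sequence positions unequally.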

The posterior distribution $q_{ϕ} (z ∣ X, Y)$ is parameterized as a multivariate normal distribution with a diagonal covariance matrix, i.e. $N (μ, σ^{2} I)$ , where $I$ is the identity matrix. We compute the mean vector $μ$ and the $\log σ$ vector by projecting the aggregated vector with linear layers. As for the prior distribution, we compare the standard normal distribution $N (0, I)$ with a learnable prior and empirically adopt the non-learned prior; the reason is discussed in the ablation study. Besides, to enable backpropagation through the encoder, the reparameterization trick (Kingma et al., 2015) is used to allow gradients to pass through Gaussian sampling.
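As a small worked sketch of this parameterization, the snippet below draws $z = μ + σ ε$ with $ε \sim N (0, I)$ and evaluates the closed-form KL divergence between a diagonal Gaussian and the standard normal prior. It uses plain Python without automatic differentiation, so it only illustrates the sampling form and the KL value, not gradient flow; the function names and example numbers are illustrative assumptions.

```python
import math
import random

def reparameterize(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0)
            for m, ls in zip(mu, log_sigma)]

def kl_to_standard_normal(mu, log_sigma):
    """KL( N(mu, sigma^2 I) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(2.0 * ls) + m * m - 1.0 - 2.0 * ls
                     for m, ls in zip(mu, log_sigma))

rng = random.Random(0)                      # fixed seed, for repeatability only
mu, log_sigma = [0.5, -1.0], [0.0, -0.5]    # hypothetical encoder outputs
z = reparameterize(mu, log_sigma, rng)
kl = kl_to_standard_normal(mu, log_sigma)
```

The KL value is zero exactly when $μ = 0$ and $\log σ = 0$, i.e. when the posterior coincides with the prior.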

During the training process, the latent code $z$ is sampled from posterior distribution while in inference process, the latent code is sampled directly from prior distribution.

3.3.3. Autoregressive Decoder

The ICVT decoder $p_{θ} (Y ∣ z, X)$ shares the same architecture with the encoder. The only difference is that the decoder acts as an autoregressive generator, i.e. $p_{θ} (Y ∣ z, X) = \prod_{i = 1}^{l} p_{θ} (Y_{i} ∣ Y_{< i}, z, X)$ , where $l$ is the number of bounding boxes in a layout. We take the latent code $z$ as the beginning-of-sequence (BOS) token. The generated tokens are projected by five independent fully-connected layers to predict $cls, x, y, w, h$ separately.

3.4. Geometry Alignment

In our model, cross-attention modules play an important role in jointly modeling the relationship between the layouts and the conditional image. However, the layout features are embeddings of class and geometric parameters (bounding boxes), while the conditions are image visual features, so there is a domain gap between them. We find that vanilla cross-attention is not good enough to bridge this gap. For example, the dot product of the query $Q$ and the key $K$ is meaningless if $Q$ and $K$ come from different representation spaces.

To solve the problem of domain gap, we propose a Geometry Alignment (GA) Module as shown in Figure 3. Considering that the visual features (taken as $K$ and $V$ ) contain no explicit geometric information, our GA module calculates geometric parameters $(x_{j}, y_{j}, w_{j}, h_{j})$ for each patch feature of the image and projects them into the embedding space $R^{D}$ . In the following, we propose three geometry fusion methods between visual features and layout features.
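To make the per-patch geometric parameters concrete, the sketch below enumerates one box per image patch in row-major order. The text does not state the coordinate convention, so normalized center coordinates $(cx, cy, w, h)$ in $[0, 1]$ are an assumption here, as are the function name `patch_grid_boxes` and the toy image size.

```python
def patch_grid_boxes(height, width, patch):
    """Return one normalized (cx, cy, w, h) box per image patch, row-major."""
    rows, cols = height // patch, width // patch
    w, h = patch / width, patch / height    # every patch has the same size
    boxes = []
    for r in range(rows):
        for c in range(cols):
            cx = (c + 0.5) * patch / width   # patch center, normalized by width
            cy = (r + 0.5) * patch / height  # patch center, normalized by height
            boxes.append((cx, cy, w, h))
    return boxes

# A toy 64 x 96 image with 32 x 32 patches gives L = HW / P^2 = 6 boxes.
boxes = patch_grid_boxes(height=64, width=96, patch=32)
```

Each box can then be embedded in the same way as a layout bounding box, so that queries and keys carry geometry in a shared representation.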

Figure 3. Geometry Alignment Module (orange) and Transformer Block (blue). We take the concat fusion method as an example.

3.4.1. Adding

A natural fusion method is simply adding geometry information $Q_{g} (K_{g})$ into content information $Q_{c} (K_{c})$ as shown in the following Equation (3), where $Q$ represents layout features, $K$ and $V$ are both visual features.

(3)		$Q = Q_{c} + Q_{g}, \quad K = K_{c} + K_{g}, \quad V = V_{c}$

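After that, we can calculate the cross-attention through Equation (4), where $σ$ represents the softmax operator. In the second and third lines of Equation (4), we expand the expression and divide it into three parts: the content term $Q_{c} K_{c}^{T}$ , the geometry term $G = Q_{g} K_{g}^{T}$ , and the cross term $\mathrm{CrossTerm} = Q_{c} K_{g}^{T} + Q_{g} K_{c}^{T}$ .

(4)		$\begin{aligned} \mathrm{CrossAttn} &= σ (\frac{Q K^{T}}{\sqrt{d_{k}}}) V_{c} \\ &= σ (\frac{Q_{c} K_{c}^{T} + Q_{g} K_{g}^{T} + Q_{c} K_{g}^{T} + Q_{g} K_{c}^{T}}{\sqrt{d_{k}}}) V_{c} \\ &= σ (\frac{Q_{c} K_{c}^{T} + G + \mathrm{CrossTerm}}{\sqrt{d_{k}}}) V_{c} \end{aligned}$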

3.4.2. Concat

Another fusion way is concatenation. Based on the above equation, we observe that the content and geometry information are coupled in the cross terms, which is redundant. Thus, we propose to decouple the content and geometry information by concatenating them along the embedding dimension as shown in Equation (5):

(5)		$Q = \mathrm{concat} (Q_{c}, Q_{g}), \quad K = \mathrm{concat} (K_{c}, K_{g}), \quad V = V_{c}$

Feeding these concatenated features into cross-attention decouples the content and geometry information, so that the content query and the geometry query contribute only to the content and geometry attention weights respectively.

(6)		$\mathrm{CrossAttn} = σ (\frac{Q K^{T}}{\sqrt{d_{k}}}) V_{c} = σ (\frac{Q_{c} K_{c}^{T} + Q_{g} K_{g}^{T}}{\sqrt{d_{k}}}) V_{c} = σ (\frac{Q_{c} K_{c}^{T} + G}{\sqrt{d_{k}}}) V_{c}$
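The difference between the two fusion methods can be checked numerically on a single query-key pair. The sketch below uses small hand-picked vectors (purely illustrative values) and compares the pre-softmax attention logit under addition and under concatenation with the content, geometry, and cross terms.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

q_c, q_g = [1.0, 2.0], [0.5, -1.0]   # content / geometry parts of one query
k_c, k_g = [3.0, 0.0], [2.0, -1.0]   # content / geometry parts of one key

# Addition: Q = Q_c + Q_g, K = K_c + K_g.
q_add = [a + b for a, b in zip(q_c, q_g)]
k_add = [a + b for a, b in zip(k_c, k_g)]
logit_add = dot(q_add, k_add)

# Concatenation: Q = concat(Q_c, Q_g), K = concat(K_c, K_g).
logit_cat = dot(q_c + q_g, k_c + k_g)

content = dot(q_c, k_c)                  # Q_c K_c^T
geometry = dot(q_g, k_g)                 # G = Q_g K_g^T
cross = dot(q_c, k_g) + dot(q_g, k_c)    # cross term, present only for addition
```

Here `logit_add` equals `content + geometry + cross`, while `logit_cat` equals `content + geometry`: concatenation removes the cross term without any extra computation beyond a wider dot product.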

3.4.3. Manually designed geometry relation

Furthermore, inspired by the above analysis, we propose to replace the geometry term $G$ with any manually designed geometry term, which explicitly calculates the relative geometry relations. For example, we imitate the computation of region geometry features in (Guo et al., 2020; Zhang et al., 2021) to obtain the relative geometry features between layouts and image patches. Our methods differ from them because they calculate relations within image features while we calculate the inter-relationship between layout features and image features.

(7)		$r_{i j} = \begin{pmatrix} \log (\frac{| c x_{i} - c x_{j} |}{w_{i}}) \\ \log (\frac{| c y_{i} - c y_{j} |}{h_{i}}) \\ \log (\frac{w_{i}}{w_{j}}) \\ \log (\frac{h_{i}}{h_{j}}) \end{pmatrix}$
(8)		$G_{i j} = \mathrm{ReLU} (w_{g}^{T} \, \mathrm{FC} (r_{i j}))$

where $i$ and $j$ refer to layout bounding boxes and image grids respectively, $\mathrm{FC}$ is a fully-connected layer, and $w_{g}$ is a learnable weight matrix.
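The relative geometry feature $r_{i j}$ can be computed as in the following sketch, which takes boxes in center format $(cx, cy, w, h)$. The small `eps` clamp is an added assumption to keep the logarithm finite when two centers coincide; it is not part of the formula above, and the learned layers that map $r_{i j}$ to $G_{i j}$ are omitted.

```python
import math

def relative_geometry(layout_box, patch_box, eps=1e-6):
    """r_ij for a layout box i and an image grid box j, both as (cx, cy, w, h)."""
    cx_i, cy_i, w_i, h_i = layout_box
    cx_j, cy_j, w_j, h_j = patch_box
    return (
        math.log(max(abs(cx_i - cx_j), eps) / w_i),  # horizontal offset term
        math.log(max(abs(cy_i - cy_j), eps) / h_i),  # vertical offset term
        math.log(w_i / w_j),                         # width ratio term
        math.log(h_i / h_j),                         # height ratio term
    )

# Hypothetical normalized boxes: one layout element and one image patch.
r = relative_geometry((0.5, 0.5, 0.4, 0.2), (0.7, 0.6, 0.2, 0.2))
```

Because one such vector is needed for every layout-patch pair, the cost grows with the product of the number of layout elements and the number of patches.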

3.5. ICVT Optimization

Since our ICVT is a CVAE model, we take the standard ELBO defined in Equation (2) as the optimization objective.

However, with an autoregressive decoder, the optimization suffers from the well-known posterior collapse problem (Bowman et al., 2016). Concretely, during training, the decoder tends to ignore the information in the latent code and to depend only on the masked input sequence. At the same time, the $z$ learned by the encoder perfectly matches the prior, transmitting no information to the decoder. To solve the problem, we follow the $β$-VAE (Higgins et al., 2016) objective in Equation (9), which adds a weight $β$ to the KL term in Equation (2):

(9)		$\mathcal{L}_{β} = \mathbb{E}_{q_{ϕ} (z ∣ X, Y)} [\log p_{θ} (Y ∣ z, X)] - β \, \mathrm{KL} (q_{ϕ} (z ∣ X, Y) \parallel p (z ∣ X))$

and adopt a cyclic annealing schedule (Fu et al., 2019) for $β$ . Specifically, we set $β$ close to zero at the early steps of a cycle, and linearly increase it till it reaches the predefined $β$ weight. The above schedules are repeated periodically.

An intuitive explanation for the method is as follows. In the early training period, with $β$ close to zero, the KL term is negligible, so the model behaves like a plain autoencoder and learns an informative latent representation $z$ . When $β$ increases to a non-negligible value, the previously learned $z$ already conveys enough information to the decoder, which alleviates the posterior collapse.

4. Experiments

4.1. Dataset

Training our ICVT model requires a large and diverse dataset of background image and design layout pairs. However, existing public layout datasets (Zhong et al., 2019; Lin et al., 2014) only focus on layout element locations and ignore the background image, and hence are not applicable to our problem. Consequently, we build a large-scale advertisement poster layout dataset. It consists of 117,624 visual-textual advertisement poster images designed by professional designers, with both the image background of the product subject and rich in-image design layout annotations. Bounding boxes of design texts, text substrates, and logos are annotated through outsourced tasks. Besides, an image saliency map is produced by a state-of-the-art matting model and attached to each image. We split our dataset into a training set and a validation set in a 9:1 ratio, and all hyper-parameters are selected on the validation set.

For testing, we construct a new test set with 166 background images, similar to training images while without any layout elements (i.e. images before design). Similarly, image saliency maps are processed with the same matting model and fed into models for inference.

In the following experiments, we use boxes of different colors to identify layout elements of different classes, with red for text, green for text substrates, and blue for logos.

4.2. Implementation Details

Our ICVT is a transformer model with a ViT (Dosovitskiy et al., 2020) visual backbone and an encoder-decoder structure based on the standard transformer components described in (Vaswani et al., 2017). For the visual backbone, we adopt ViT-S/32, i.e. the ViT small model with a patch size of $32 \times 32$ , initialized with ImageNet-pretrained parameters. The ICVT decoder is a stack of 4 transformer decoder layers, with the head number set to 8. The embedding dimension of each component of a bounding box (class, x, y, w, h) is 96, summing up to 480 as the model dimension. The dimension of the FFN is set to 2048 and the dropout probability is 0.1. The ICVT encoder shares the same architecture as the decoder.

We use AdamW (Loshchilov and Hutter, 2019) to optimize the model, where we set encoder-decoder’s learning rate to $5 \times 10^{- 5}$ , while a smaller learning rate $1 \times 10^{- 5}$ for the pretrained visual backbone. Besides, weight decay is set to $1 \times 10^{- 2}$ and batch size is set to 128. We train the model for two cycles with a cyclic annealing schedule for $β$ parameter. Concretely, the cyclic schedule is formulated as follows:

(10)		$β_{t} = \begin{cases} 0.001, & 0 \leq t \leq T / 2 \\ f (t), & T / 2 < t < 3 T / 4 \\ 0.3, & 3 T / 4 \leq t < T \end{cases}$

where $t$ is the iteration number within a period, $T$ is the total number of iterations in a period, and $f (t)$ is the linear function determined by the two points $(T / 2, 0.001)$ and $(3 T / 4, 0.3)$ .
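The schedule in Equation (10) can be written as a short function. This is a sketch rather than the training code: the name `beta_at` is hypothetical, and wrapping the global step with a modulo is one assumed way to repeat the schedule across cycles.

```python
def beta_at(step, period, beta_min=0.001, beta_max=0.3):
    """Cyclic beta schedule: hold low, ramp linearly, hold high, then repeat."""
    t = step % period                       # position inside the current cycle
    start, end = period / 2, 3 * period / 4
    if t <= start:
        return beta_min                     # first half of the cycle
    if t < end:
        frac = (t - start) / (end - start)  # linear ramp f(t)
        return beta_min + frac * (beta_max - beta_min)
    return beta_max                         # last quarter of the cycle

# Two cycles with a toy period of 100 iterations.
schedule = [beta_at(s, period=100) for s in range(200)]
```

With `period=100`, the weight stays at 0.001 for steps 0 to 50, rises linearly until step 75, stays at 0.3 until step 99, and resets at step 100.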

Also, we use data augmentation including color jitter for the input image and random flip for both image and layouts. Finally, our ICVT is implemented in PyTorch framework (Paszke et al., 2019) and is trained with 16 NVIDIA V100 GPUs.

Method	Geometry Embedding	Output Rate $↑$	Overlap $↓$	Alignment $↓$	Occlusion $↓$
baseline	—	97.8%	0.033	0.015	0.184
baseline (w/o PE)	—	97.6%	0.030	0.018	0.264
adding	learned	98.8%	0.036	0.014	0.181
adding	sine	98.2%	0.031	0.014	0.175
concat	learned	98.5%	0.055	0.014	0.161
concat	sine	98.8%	0.046	0.013	0.154
manually	—	98.4%	0.048	0.016	0.165

Table 1. Ablation study on Geometry Alignment

4.3. Evaluation Metrics

Layout design follows some basic rules. For example, layout elements should be aligned with each other as much as possible and should not overlap with each other. When it comes to image-conditioned layout generation, layout elements should also not occlude the salient object in the background image. Evaluating the quality of image-conditioned layouts therefore involves two main aspects: the relationship among bounding boxes and the relationship between the image and the bounding boxes. First, we follow previous work (Li et al., 2019; Lee et al., 2020) and take alignment and overlap as metrics for evaluating the quality of layout bounding boxes. Then, different from unconditional layout generation, in order to measure the relationship between the background image and the bounding boxes, we design a new metric, occlusion, which is defined as the overlap area of the background saliency map and the design element bounding boxes over the total bounding box area. A lower occlusion value indicates a better layout. Besides, we also evaluate the global visual quality by calculating the Fréchet Inception Distance (FID) (Heusel et al., 2017), following the setting of NDN (Lee et al., 2020). Last but not least, we calculate the model output rate (inference samples with at least one valid box over all samples) as an indicator of the consistency and stability of models.
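The occlusion metric can be sketched as follows. The details are assumptions made for illustration: the saliency map is treated as a binary mask, boxes are given in end-exclusive pixel coordinates, and areas are summed per box, so overlapping boxes are counted more than once. The evaluation code used in the experiments may differ on each of these points.

```python
def occlusion(saliency, boxes):
    """Salient area covered by boxes divided by the total box area.

    saliency: 2D list of 0/1 values indexed as saliency[y][x].
    boxes: list of (x0, y0, x1, y1) pixel boxes, end-exclusive.
    """
    covered, total = 0, 0
    for x0, y0, x1, y1 in boxes:
        total += (x1 - x0) * (y1 - y0)
        covered += sum(saliency[y][x]
                       for y in range(y0, y1)
                       for x in range(x0, x1))
    return covered / total if total else 0.0

saliency = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
# Two 2 x 2 boxes, each covering one salient pixel: 2 / 8 = 0.25.
score = occlusion(saliency, [(0, 0, 2, 2), (2, 2, 4, 4)])
```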

4.4. Quantitative results

4.4.1. Ablation study

As shown in Table 1, we perform ablation studies to investigate the effectiveness of different model structures. We present an ICVT model without geometry alignment as a baseline. First, we study the importance of image positional encoding by completely removing it from the baseline model. We find that the occlusion rate significantly increases from 0.184 to 0.264, while other metrics almost remain unchanged. It demonstrates that the positional encoding is critical for learning spatial information of the image.

Then, we study the effect of the three fusion methods mentioned in Section 3.4, along with different embedding methods for geometry information. Experiments show that concatenation is a better fusion method than addition. For both methods, the sine geometry embedding performs slightly better than the learned geometry embedding in occlusion rate. The manually designed geometry relation also performs well, with an occlusion rate slightly higher than that of the concatenation method. However, we do not adopt this method because the calculation of the relative geometry relation and the subsequent fully-connected layer are computationally expensive.

Finally, we study the performance of a learned prior and a non-learned prior on the best ICVT model chosen from the above ablation study. As shown in Table 2, the non-learned prior performs better in output rate and occlusion rate, with nearly the same overlap and alignment performance. This result is unusual, since in most conditional variational autoencoders a learned prior that includes conditional information leads to better results. One possible reason is that our task depends heavily on the spatial information of the conditional image, while the learned prior could harm the cross-attention between queries and keys. Consequently, we use the standard normal prior in the following experiments.

\toprule
Prior	Output Rate $↑$	Overlap $↓$	Alignment $↓$	Occlusion $↓$
\midrule
non-learned	98.8%	0.046	0.013	0.154
learned	97.6%	0.045	0.014	0.165
\bottomrule

Table 2. Ablation study on prior distribution.
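The two priors compared in Table 2 differ in the KL term of the CVAE objective. The sketch below is illustrative only: it assumes a diagonal Gaussian posterior and prior parameterized by mean and log-variance, which is a common CVAE choice that this section does not spell out, and `gaussian_kl` is a hypothetical helper. With the non-learned prior, the prior mean and log-variance are zeros; with a learned prior they would be predicted from the conditional image.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p=None, logvar_p=None):
    # KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ) for diagonal Gaussians.
    # Leaving mu_p / logvar_p as None selects the standard normal (non-learned) prior.
    mu_q = np.asarray(mu_q, dtype=float)
    logvar_q = np.asarray(logvar_q, dtype=float)
    mu_p = np.zeros_like(mu_q) if mu_p is None else np.asarray(mu_p, dtype=float)
    logvar_p = np.zeros_like(mu_q) if logvar_p is None else np.asarray(logvar_p, dtype=float)
    kl = 0.5 * (logvar_p - logvar_q
                + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
                - 1.0)
    return float(kl.sum())
```

The KL term is zero when the posterior equals the prior, and with the standard normal prior it grows as the posterior mean moves away from the origin.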

4.4.2. Comparison with prior art

Based on the above ablation studies, we choose the ICVT model with concatenated sine geometry embedding to perform the following experiments.

To compare with previous models, we reproduce the content-aware GAN (Zheng et al., 2019), Layout Transformer (Gupta et al., 2021), and VTN (Arroyo et al., 2021) and train them on our proposed dataset. Note that VTN and Layout Transformer take no image input, while an extracted image embedding is fed to content-aware GAN. As shown in Table 3, compared with VTN and Layout Transformer, our model shows similar performance on the output rate, overlap, and alignment metrics. However, VTN and Layout Transformer yield a much higher occlusion rate than our ICVT model (0.254 and 0.252 versus 0.154) due to the lack of visual information, which also leads to a higher FID. As for content-aware GAN, although it involves visual information, our ICVT model performs better on alignment, occlusion rate, and FID. It is worth mentioning that the overlap metric of content-aware GAN is the smallest because it generates rendered images instead of the parameters of geometric boxes. The box coordinates are then obtained via a contour algorithm and border smoothing, which gives it the leading position on the overlap metric.

Overall, the results demonstrate that our ICVT model achieves state-of-the-art performance in jointly modeling layout elements and visual information.

\toprule
Method	Output Rate $↑$	Overlap $↓$	Alignment $↓$	Occlusion $↓$	FID $↓$
\midrule
contentGAN (Zheng et al., 2019)	100.0%	0.016	0.015	0.334	85.81
LayoutTransformer (Gupta et al., 2021)	100.0%	0.038	0.017	0.252	71.58
VTN (Arroyo et al., 2021)	98.4%	0.045	0.015	0.254	79.77
ICVT (ours)	98.8%	0.046	0.013	0.154	62.04
\bottomrule

Table 3. Quantitative comparison with the prior art.

4.5. Qualitative results

4.5.1. Image-conditioned layout generation and completion.

We present both image-conditioned layout generation and completion as qualitative results. As shown in Figure 4, we choose images (masked with the saliency map) with different salient object locations to show that our ICVT model can adaptively generate layouts on the non-invasive regions of different images, resulting in a proper and visually pleasing layout design. Furthermore, an autoregressive model can naturally complete partial layouts. As shown in Figure 5, taking a partial layout as input, our ICVT model can complete layouts based on the given initial elements.

Figure 4. Image-conditioned layout generation. From left to right, the salient object is located at the top, the bottom, and the middle of the image, or randomly distributed over it. Our model can adaptively generate proper layouts on different images.

Figure 5. Multiple Completions from the same initial element layout.

4.5.2. Comparison with prior art

As shown in Figure 6, we compare the generated results of our ICVT model with those of content-aware GAN (Zheng et al., 2019), VTN (Arroyo et al., 2021), and Layout Transformer (Gupta et al., 2021). Although VTN generates high-quality layouts, they severely occlude the salient objects of the image. The layouts generated by our ICVT model seldom occlude salient objects.

Figure 6. Comparison with content-aware GAN (Zheng et al., 2019) (1st row), VTN (Arroyo et al., 2021) (2nd row), and Layout Transformer (Gupta et al., 2021) (3rd row). Our method (4th row) produces bounding boxes around the salient object of the image without occlusion.

4.5.3. Diversity and fidelity

Diversity and fidelity are two basic requirements for the task of image-conditioned layout generation. We need to generate diverse layouts while meeting the requirements of fidelity. The decoder should learn decoupled representations of the two input variables, the latent vector $z$ and the conditional image $X$. Therefore, we test our model by changing these two variables independently. As shown in Figure 7, we change the conditional image $X$ with the latent vector $z$ fixed, and change $z$ with $X$ fixed. We find that with the latent vector fixed (in the same row), the model still keeps fidelity under different image conditions. Similarly, with the conditional image fixed (in the same column), the model generates diverse layouts with different $z$.

5. Conclusion

In this paper, we define and formulate the task of image-conditioned layout generation. We propose a novel transformer-based conditional variational autoencoder for image-conditioned layout generation, and construct a large-scale advertisement poster design layout dataset. We first parameterize layouts and extract information. Then we apply a transformer-based CVAE for jointly modeling the relationship between layouts and image. To alleviate the domain gap, we further design a geometry alignment module for enhancing spatial relationships. Finally, we perform extensive quantitative and qualitative experiments to demonstrate the effectiveness of our model.

Acknowledgements

This work is supported by Alibaba Group through the Alibaba Innovation Research Program, the National Natural Science Foundation of China (62121002, 62022076, U1936210), the Fundamental Research Funds for the Central Universities under Grant WK3480000011, the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Y2021122), the China Postdoctoral Science Foundation (2021M703081), and the Fundamental Research Funds for the Central Universities (WK2100000026).

References

D. M. Arroyo, J. Postels, and F. Tombari (2021) Variational transformer networks for layout generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13642–13652.
S. Bowman, L. Vilnis, O. Vinyals, A. Dai, R. Jozefowicz, and S. Bengio (2016) Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 10–21.
N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations.
H. Fu, C. Li, X. Liu, J. Gao, A. Celikyilmaz, and L. Carin (2019) Cyclical annealing schedule: a simple approach to mitigating KL vanishing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 240–250.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. Advances in neural information processing systems 27.
L. Guo, J. Liu, X. Zhu, P. Yao, S. Lu, and H. Lu (2020) Normalized and geometry-aware self-attention network for image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10327–10336.
K. Gupta, J. Lazarow, A. Achille, L. S. Davis, V. Mahadevan, and A. Shrivastava (2021) Layouttransformer: layout generation and completion with self-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1004–1014.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30.
I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-vae: learning basic visual concepts with a constrained variational framework.
A. A. Jyothi, T. Durand, J. He, L. Sigal, and G. Mori (2019) Layoutvae: stochastic scene layout generation from a label set. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9895–9904.
D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28.
H. Lee, L. Jiang, I. Essa, P. B. Le, H. Gong, M. Yang, and W. Yang (2020) Neural design network: graphic layout generation with constraints. In European Conference on Computer Vision, pp. 491–506.
J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu (2019) Layoutgan: generating graphic layouts with wireframe discriminators. 7th International Conference on Learning Representations, ICLR 2019, pp. 1–16.
J. Li, J. Yang, J. Zhang, C. Liu, C. Wang, and T. Xu (2021) Attribute-conditioned layout gan for automatic graphic design. IEEE Transactions on Visualization and Computer Graphics 27, pp. 4039–4048.
T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755.
S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, J. Zhu, and L. Zhang (2021) DAB-detr: dynamic anchor boxes are better queries for detr. In International Conference on Learning Representations.
R. Lopez, J. Regier, M. I. Jordan, and N. Yosef (2018) Information constraints on auto-encoding variational bayes. Advances in Neural Information Processing Systems 31.
I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang (2021) Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3651–3660.
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE transactions on neural networks 20 (1), pp. 61–80.
K. Sohn, H. Lee, and X. Yan (2015) Learning structured output representation using deep conditional generative models. Advances in neural information processing systems 28.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30.
J. Walker, C. Doersch, A. Gupta, and M. Hebert (2016) An uncertain future: forecasting from static images using variational autoencoders. In European Conference on Computer Vision, pp. 835–851.
X. Zhang, X. Sun, Y. Luo, J. Ji, Y. Zhou, Y. Wu, F. Huang, and R. Ji (2021) RSTNet: captioning with adaptive attention on visual and non-visual words. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15465–15474.
X. Zheng, X. Qiao, Y. Cao, and R. W. Lau (2019) Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG) 38 (4), pp. 1–15.
X. Zhong, J. Tang, and A. J. Yepes (2019) Publaynet: largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1015–1022.

Appendix A Dataset Analysis

The large-scale advertisement poster layout design dataset consists of 117,624 poster images designed by professional designers, annotated with rich in-image layout annotations. As shown in Figure 9, we present some samples of the dataset. Layout elements including texts, text substrates, and logos are annotated with red, green, and blue bounding boxes, respectively. Besides, we use a saliency map to indicate the position of the product subject, which guides the model to place layout elements on the non-invasive area in a semantically coherent manner.

Furthermore, we analyze the dataset from different aspects, such as the number of bounding boxes per image, the coordinates of the centers of bounding boxes, the width and height of bounding boxes, and the width-height ratio, as shown in Figure 10.

Appendix B Implementation Details

We first provide a supplementary description of the implementation of our ICVT model. Then, since we reproduce the content-aware GAN (Zheng et al., 2019), Layout Transformer (Gupta et al., 2021), and VTN (Arroyo et al., 2021) and train them on our dataset, we describe the details of these reimplementations.

In our method, the model takes a masked image and corresponding layout design (sequence of bounding boxes) as input and reconstructs the layout conditioned on the image. We order layout elements according to their positions, from top to bottom and from left to right. We find that an ordered sequence is helpful for the model performance, although the layout elements can also be represented as an unordered set. For the input image, we resize the original product image into the size of $480 \times 704$ and multiply it with a binary saliency mask, which highlights the product subject to guide the layout generation.
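The two preprocessing steps described above, ordering the layout elements and masking the image with the saliency map, can be sketched as follows. This is a minimal illustration under stated assumptions, not our training pipeline: boxes are assumed to be (x, y, w, h) tuples with (x, y) the top-left corner, the sort key (top edge first, then left edge) is one plausible reading of "top to bottom and left to right", resizing is omitted, and the function names are hypothetical.

```python
import numpy as np

def order_elements(boxes):
    # boxes: list of (x, y, w, h); (x, y) is the top-left corner (assumed format).
    # Sort from top to bottom, breaking ties from left to right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))

def mask_image(image, saliency):
    # image: (H, W, 3) array; saliency: (H, W) binary mask of the product subject.
    # Multiplying keeps the salient region and zeroes out everything else.
    return image * saliency[..., None].astype(image.dtype)
```

Sorting gives the autoregressive model a fixed generation order, so the same layout always maps to the same sequence.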

We implement the content-aware GAN (Zheng et al., 2019) model with their open-source code and downsample our input images to their proposed size of $45 \times 60$. In our practice, a larger resolution leads to training failure. We implement Layout Transformer (Gupta et al., 2021) with their open-source code. We implement VTN (Arroyo et al., 2021) following the description in their paper. For a fair comparison, the model dimension is set to 480, which differs from the dimension of 512 in the original VTN model.

Appendix C Supplementary Experimental Results

We present more uncurated random experimental results to fully evaluate the model performance.

C.1. Layout Generation and Completion

Samples of layout generation and layout completion are shown in Figure 11 and Figure 12. For layout completion, we take three different kinds of layout elements including logos, texts, and substrates as initial layouts and generate various layout completions. We find that our model can generate proper layouts conditioned both on the initial layout and images in most cases. Also, there are a few failure cases that produce overlapped layout elements.

C.2. Comparison and User Study

We also provide more results comparing our model with content-aware GAN (Zheng et al., 2019), LayoutTransformer (Gupta et al., 2021), and VTN (Arroyo et al., 2021). As shown in Figure 13, the occlusion problem in content-aware GAN (Zheng et al., 2019), LayoutTransformer (Gupta et al., 2021), and VTN (Arroyo et al., 2021) is much more serious, while most layout elements generated by our model surround the product subject.

Furthermore, we perform a user study to compare the design quality of different models. We randomly choose 20 images and 3 design results from different models for each conditional image. Subsequently, we let 51 volunteers vote for the best design for each image. We then compute the percentage of votes received by each model. As shown in Figure 8, our model is voted the best most often.

Figure 8. User Study. The histogram presents the percentage of times that the design results are voted to be the best.

Figure 9. Showcase for the proposed dataset.

Figure 10. Layout bounding box distribution. Image (a) shows the distribution of the number of bounding boxes per image. In most cases, there are around 5 bounding boxes in an image. Images (b) and (c) show the coordinates of the center points of bounding boxes. Most bounding boxes tend to be at the top and bottom of the image. Images (d) and (e) show the width and height of bounding boxes. The width of most bounding boxes is 0.6 times the image width and the height is 0.2 times the image height. Image (f) shows the width-height ratio of bounding boxes, with the most probable value of around 5.

C.3. Diversity and Fidelity

Diversity and fidelity are two basic aspects for evaluating generative models. To thoroughly examine the performance of our model, we first formulate our problem as a conditional maximum likelihood estimation problem:

$$\hat{Y} = \arg\max_{Y} P(Y \mid z, X) \qquad (11)$$

where $Y$, $z$, and $X$ represent the layout elements, the latent vector, and the conditional image, respectively. According to Equation 11, the prediction $\hat{Y}$ is determined by $z$ and $X$. We evaluate our model by sampling in the 2-dimensional parameter space spanned by $z$ and $X$. As shown in Figure 14, we change the latent vector $z$ along the vertical axis to show the diversity on the same image. Then, we fix the latent vector $z$ and change the conditional image along the horizontal axis to show the fidelity across different conditional images. Experimental results show that our model performs well in both diversity and fidelity.
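The sampling procedure above amounts to evaluating the decoder on a grid of latent vectors and conditional images. The sketch below only illustrates the grid layout; `decode` is a hypothetical stand-in for a trained decoder that maps a latent vector and a conditional image to a layout, and `sample_grid` is likewise a name introduced here.

```python
def sample_grid(decode, images, latents):
    # Rows share one latent vector z (vertical axis), columns share one
    # conditional image X (horizontal axis).
    # decode: hypothetical callable (z, X) -> layout prediction.
    return [[decode(z, x) for x in images] for z in latents]
```

Reading along a row shows fidelity across images for a fixed $z$, and reading down a column shows diversity across latent vectors for a fixed $X$.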

Figure 11. Layout Generation. We present uncurated random samples from the generation results of our ICVT model. In most cases, our model can generate a proper layout design in the non-invasive area of the product image. There are also failure cases with overlap and misalignment problems.

Figure 12. Layout Completion. The first column contains three different initial layouts of logos, texts, and text substrates. We generate various layout completions for each of them. In most cases, our model generates proper layouts conditioned on both the initial layout and image.

Figure 13. Comparison. We compare our model with three existing methods to demonstrate that our model generates a better in-image layout design, with less occlusion of the product subject.

Figure 14. Diversity and Fidelity. The problem can be formulated as $\arg\max_{Y} P(Y \mid z, X)$, where the prediction is determined by $z$ and $X$. We build a coordinate system $XOz$ to show the predictions given different $z$ and $X$. By fixing $X$ and changing $z$, we demonstrate the model diversity. By fixing $z$ and changing $X$, we demonstrate the model fidelity.