Laplacian Pyramid-Like Autoencoder

Sangjun Han (1), Taeil Hur (2), Youngmi Hur (3)

(1) School of Mathematics & Computing (Mathematics), Yonsei University, Seoul, South Korea, qkqhwl4@yonsei.ac.kr
(2) JENTI Inc., Seoul, South Korea, taeil.hur@jenti.ai
(3) Department of Mathematics, Yonsei University, Seoul, South Korea, yhur@yonsei.ac.kr

Abstract

In this paper, we develop the Laplacian pyramid-like autoencoder (LPAE) by incorporating the Laplacian pyramid (LP) concept, which is widely used to analyze images in signal processing. LPAE decomposes an image into an approximation image and a detail image in the encoder part and then tries to reconstruct the original image in the decoder part from the two components. We use LPAE in experiments on classification and super-resolution. Using the detail image and the smaller-sized approximation image as inputs to a classification network, our LPAE makes the model lighter. Moreover, we show that the performance of the connected classification networks remains substantially high. For super-resolution, we show that the decoder part produces a high-quality reconstruction when it is set up to resemble the structure of LP. Consequently, LPAE improves the original results by combining the decoder part of the autoencoder with the super-resolution network.

Keywords: Deep Learning, Autoencoder, Laplacian Pyramid, Classification, Acceleration, Super-resolution

1 Introduction

Deep neural networks are standard machine learning methods for diverse image processing tasks such as object classification, image transformation, and image recognition. The networks vary greatly in architectures, algorithms, and processes. The autoencoder is one of these varieties.

The autoencoder encodes given data into a representation in a latent space, usually compressed relative to the input, using a few layers. It then decodes this representation, using different layers, into a reconstruction with the desired properties. The encoder has the advantage of analyzing the data in a low-dimensional space, in a way similar to principal component analysis. Also, the simplicity of the model structure makes it easy to adapt to various purposes such as unsupervised pre-training, denoising, and image restoration.

Our paper is motivated by the article [6]. The authors develop the wavelet-like autoencoder (WAE) and use it for acceleration in classification networks. WAE decomposes the input image into two down-scaled images, low-frequency information $I_{L}$ and high-frequency information $I_{H}$, through the encoder and reconstructs the original image at the decoder. Their use of the prefix “wavelet-like” is due to the sparsity condition imposed on $I_{H}$ and to the reconstruction process, which adds the convolution-filtered versions of $I_{L}$ and $I_{H}$. In WAE, to accelerate the classification model, they feed $I_{L}$ as the mainstream input and $I_{H}$ as a helper to the classification networks (e.g., VGG16, ResNet50) instead of the original image. This change gives the network a smaller computational complexity; hence the entire process takes less time. Besides, the complementary analysis using the helper $I_{H}$ keeps the network competitive in terms of accuracy.

Although WAE is good at accelerating the basic classification networks, it is not satisfactory in some crucial aspects. First, contrary to the wavelet, WAE does not impose any condition on the low-frequency information. Without such a condition, the approximation $I_{L}$ struggles to reflect the original image, which can lower classification performance. Second, such a low-frequency image can lower the quality of the reconstructed image. The WAE paper does not pay attention to the reconstructed result because its primary concern is the acceleration problem, which requires only the two decomposed images in the classification networks. This limited architecture makes it hard to use in other areas that require a reconstruction. Third, the autoencoder is named “wavelet-like,” but the model is missing a critical feature of the wavelet, namely the multi-scale property. Consequently, WAE has difficulty decomposing an image multiple times, which confines the model to the acceleration task.

Considering the preceding, we propose a new model named the Laplacian pyramid-like autoencoder (LPAE). We impose an extra condition on the low-frequency part of WAE and obtain an autoencoder with a hierarchy, similar in shape to the Laplacian pyramid (LP) introduced in [5]. As a result, LPAE produces an approximation image of better quality. Using this approximation image, we not only obtain higher classification performance but also extend the model to super-resolution problems with $2^{k}$ magnification for various $k$, since LPAE can decompose and reconstruct an image multiple times.

The datasets used for classification are ImageNet2012 (ImageNet) from [33] and Intel Image Classification (Natural Scene) from [18]. We combine two base networks, VGG16 (VGG) and ResNet50 (ResNet), with WAE and LPAE. LPAE shows better classification performance than WAE. LPAE still shortens test times substantially, although its acceleration is slightly smaller than that of WAE. In some cases, LPAE even has a shorter total training time than WAE. For the super-resolution problem, we use three datasets, CelebA [26], DIV2K [35], and Set5 [4]. After training on CelebA, the test is done on CelebA. After training on DIV2K, the test is carried out on Set5, following the convention in the field. The base network is WaveletSRNet (WaveSR) introduced in [14]. Since WaveSR uses the wavelet packet transform for image reconstruction, we can directly see the effectiveness of LPAE for super-resolution by replacing the wavelet part with LPAE. Because classification and super-resolution are different types of problems, this result shows the potential usefulness of the proposed LPAE for other problems in deep learning.

2 Related Works

Autoencoder

Recently, most research about the autoencoder concentrates on its connection to other applications rather than developing it independently. Generative autoencoder [31, 10] occupies a significant portion as a base model for application. In [17], variational autoencoder (VAE) solves a problem of insufficient data by producing synthetic data from a given sample data. [27] develops a reference-based SR approach using VAE, which can take any image as a reference. [36] suggests a novel generative model with hierarchical structure based on VAE. By exploiting unusual convolutions such as depthwise separable convolution and regular convolution, authors get high performance without modifying the loss function of VAE. Other existing autoencoders, such as sparse autoencoder and denoising autoencoder, are used to solve specific problems. For example, [15] composes stacked sparse denoising autoencoder for detecting electricity theft and [13] exploits a sparse autoencoder for landslide susceptibility prediction. Because the autoencoder extracts the feature efficiently and reconstructs data with a simple structure, it is a popular tool for developing a framework combined with machine learning. Our model, LPAE, is a model possessing the properties of LP. So LPAE can provide new approaches to diverse problems where LP properties are helpful.

Network Acceleration

After the substantial progress of the convolutional neural network (CNN), a lot of research has focused on accelerating networks while keeping high performance. There are several approaches to accelerating networks. [6, 3] propose methods that modify an architecture or overall structure. [3] suggests an accelerator based on the observation that the convolution layer can be computed similarly to the fully connected layer. Moreover, to accelerate model training, some studies present a new training framework. [41] builds an assistant model alongside the main model to remove trivial instances during training and then obtains the results with the new training algorithm. [25] suggests a new training plan for fine-tuning. It accelerates fine-tuning by adaptively freezing the nearly converged portion of the network. Since many natural language processing and classification models rely on fine-tuning a pre-trained network, this method is attractive. Furthermore, work on harmonizing hardware and software is actively pursued [30, 16]. Considering rapidly advancing hardware, both propose a hardware-based approach to speed up a CNN and reduce the energy required for computation.

Single Image Super-Resolution

In the super-resolution area, many of the studies try to build a deep convolution-based model using up-sampling. For example, [19] introduces the way of the deep network for super-resolution by a residual learning and high initial learning rate, while [22] suggests enhancing the power of the deep network by removing the batch normalization and setting a training pipeline. [12] makes a model to repeat the up-scaling and down-scaling iteratively and to reflect the feedback of error. In addition to the construction of deep models, some researchers are focusing on other points. By learning the feature correlations of hidden layers by second-order attention, [8] improves the expressional power of CNN. [34] trains the content-adaptive resampling (CAR) model with the main super-resolution (SR) network to create a low-resolution image. Since unsupervised training is conducted on CAR, CAR makes a down-scaled image to keep important information for super-resolution impartially. The primary concern of [7, 20] is the speed of the SR. While both of them construct a new network structure and framework, neural architecture search is a noticeable characteristic of [7]. The LPAE that we propose in this paper is an assistant model to reconstruct images trained with the main SR model, similar to the approach in [34]. Also, LPAE learns the correlation between the components for super-resolution and can improve the reconstructing power of the main SR model similar to [8]. However, LPAE is an autoencoder with a straightforward structure and is easy to connect with diverse architectures without too much modification.

3 Laplacian Pyramid in Neural Network

Idea and Structure of LP

The Laplacian pyramid (LP) is introduced in [5] as a technique for compact image encoding. This technique has its root in subtracting a low-pass filtered image from the original image. Such subtraction reduces redundant information by decreasing the correlation between neighboring pixels in an image, in other words, data compression. Moreover, this compression process can be repeated on low-pass filtered images having different scales. Then the repetitions form a pyramid-like structure and accelerate the reduction. Since this process is similar to the Laplacian operator sampling on diverse scales, the pyramid is named the Laplacian pyramid.

The following is the overall process of establishing LP. For an input image $I_{0}$, a filtered image $I_{1}$ with decreased resolution is obtained by some low-pass filter. A filtered image $I_{2}$ is similarly obtained from $I_{1}$, and proceeding successively results in a sequence $\{I_{k}\}_{k = 0}^{K}$. For each fixed $I_{k}$ with $k \geq 1$, the image $\tilde{I}_{k}$ with the same size as $I_{k - 1}$ is defined by expanding $I_{k}$ by interpolation. Subtracting $\tilde{I}_{k}$ from $I_{k - 1}$ produces $d_{k}$ with the same size as $I_{k - 1}$ but with compressed image information. The sequence $\{d_{k}\}_{k = 1}^{K}$ of such differences corresponds to the pyramid of LP.
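The construction above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the low-pass filter is taken to be 2x2 block averaging and the expansion is nearest-neighbour interpolation, neither of which is prescribed by [5] or by this paper, and the function names (`downsample`, `expand`, `build_pyramid`, `reconstruct`) are hypothetical.

```python
def downsample(img):
    """Low-pass filter and halve the resolution (assumed: 2x2 block averaging)."""
    h, w = len(img), len(img[0])
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w // 2)]
            for r in range(h // 2)]


def expand(img):
    """Double the resolution (assumed: nearest-neighbour interpolation)."""
    return [[img[r // 2][c // 2] for c in range(2 * len(img[0]))]
            for r in range(2 * len(img))]


def subtract(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def build_pyramid(i0, levels):
    """Return (I_K, [d_1, ..., d_K]) with d_k = I_{k-1} - expand(I_k)."""
    details, current = [], i0
    for _ in range(levels):
        smaller = downsample(current)
        details.append(subtract(current, expand(smaller)))
        current = smaller
    return current, details


def reconstruct(top, details):
    """Invert the pyramid: I_{k-1} = expand(I_k) + d_k, from coarse to fine."""
    current = top
    for d in reversed(details):
        current = add(expand(current), d)
    return current


image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
top, details = build_pyramid(image, levels=2)
restored = reconstruct(top, details)
```

Because each $d_{k}$ stores exactly what the expansion of $I_{k}$ misses, the reconstruction recovers the input regardless of which filter and interpolation are chosen.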

Strengths of LP and Applications on Neural Net

According to the above procedure, LP decomposes an image $I$ into a low-pass filtered image $I_{c}$ and a difference $I_{d}$ after one stage. The decomposition result of LP is redundant compared to the wavelet transform, which is often used as an analysis tool, because the $I_{d}$ part of LP has the same size as the original image. But some advantages originate from the structure of LP. For example, the implementation is straightforward because it requires only low-pass filtering and subtraction. Another strength of LP is that there is no scrambled frequency [9] derived from high-pass filtering. In addition, the structure of LP enables perfect reconstruction of the original image. Such advantages have encouraged many uses of LP as a tool in traditional image processing.

In addition, there have been many efforts in machine learning to use LP, both independently and as a concept within a model. For the super-resolution problem, [2] and [43] get competitive results by introducing the style of LP in the attention module and by considering a generator of the conditional generative adversarial network, respectively. On the other hand, [23, 21] use LP itself to make input images. After generating the image pyramid from the original image, [23] feeds the components into two modules for high-quality transfer: a drafting module that transfers artistic style from the low-resolution image and a revision module that refines the transformed image with a high-resolution image. On top of these, there are applications to diverse problems such as object detection [42] and image compression [37]. These studies show the possibility of a connection between LP and deep learning, presenting satisfactory results. As will be seen later in this paper, our LPAE presents convincing results, which add to the above possibility.

4 Laplacian Pyramid-like Autoencoder

We propose a simple autoencoder model, the Laplacian pyramid-like autoencoder (LPAE), designed to have properties of LP such as a hierarchical structure and separate analysis and reconstruction parts. LPAE is then connected to classification networks and super-resolution networks. These connections are made to show the effectiveness of LPAE.

Proposed Model

LPAE has two parts, the encoding (analysis) part and the decoding (reconstruction) part. Fig. 1 shows the overall structure of LPAE. In the analysis part, LPAE decomposes the original image $I$ into images $I_{c}$ and $I_{d}$. $I_{c}$ is the down-sampled approximation image that is low-pass filtered, and $I_{d}$ is the detail image representing the difference between the original image and the prediction from the approximation. Both outputs pass through $4$ convolution layers. When we downsample images, we use a convolution layer with stride $2$ instead of a pooling layer. In the reconstruction part, the output image $I^{'}$ is reconstructed by an element-wise sum of the detail image $I_{d}$ and the prediction image $ϕ (I_{c})$, where $ϕ$ is a deconvolution process. The process $ϕ$ consists of up-sampling with a $4 \times 4$ transposed convolution layer and filtering with $3$ convolution layers. In our LPAE, except for the output layers, we use convolution layers with filter size $3 \times 3$ and 16 output channels for simplicity and efficiency, but the network can be made deeper or more redundant.
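The element-wise sum in the decoder only makes sense if $ϕ (I_{c})$ and $I_{d}$ have the same spatial size. The sketch below traces the sizes through one LPAE stage using the usual output-size formulas for convolutions and transposed convolutions. The padding values (1 for the $3 \times 3$ convolutions and 1 for the $4 \times 4$ transposed convolution) are assumptions chosen for illustration, since the text does not state them, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution (no output padding, dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel


def lpae_shapes(size):
    """Trace spatial sizes through one LPAE stage under the assumed paddings."""
    # I_c: downsampled by a stride-2 3x3 convolution (padding 1 assumed).
    approx = conv_out(size, kernel=3, stride=2, padding=1)
    # I_d: stride-1 3x3 convolutions keep the input resolution (padding 1 assumed).
    detail = conv_out(size, kernel=3, stride=1, padding=1)
    # phi(I_c): 4x4 transposed convolution with stride 2 (padding 1 assumed).
    prediction = deconv_out(approx, kernel=4, stride=2, padding=1)
    return approx, detail, prediction


shapes_imagenet = lpae_shapes(224)  # crop size used for ImageNet
shapes_natural = lpae_shapes(128)   # crop size used for Natural Scene
```

With these assumed paddings, the prediction and the detail image share the input resolution, so they can be summed directly.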

Loss for LPAE

Our loss function consists of three components that make the autoencoder similar to LP. Since we want LPAE to reconstruct the original image as much as possible, we define our first loss function to be the reconstruction loss,

$$l_{r} = \frac{1}{|I|} \| I - I^{'} \|_{1} .$$

So that the approximation image represents the low-frequency channel of the original image well, we prepare a reference approximation image $I_{↓}$ by bicubic interpolation and apply the mean square error (MSE). We then define our second loss function as the energy (or approximation) loss,

$$l_{e} = \frac{1}{|I_{c}|} \| I_{c} - I_{↓} \|_{2}^{2} .$$

For the approximation image $I_{↓}$ , many other methods, including the wavelet transform by CDF 9/7 filters, can also be used, but we find no significant difference in the loss for using different approximation methods. Hence in this paper, we fix the bicubic interpolation to get $I_{↓}$ . The bicubic interpolation makes a natural connection with the super-resolution problem.

To constrain the detail image to be sparse, we set our last loss function to be the sparsity loss,

$$l_{s} = \frac{1}{|I_{d}|} \| I_{d} \|_{2}^{2} .$$

This sparsity loss makes the detail image carry a high-frequency channel of the original image and provide textures to the connected network.

The overall loss of LPAE is defined as the weighted sum of the three losses:

$$l_{total} = α l_{r} + β l_{e} + γ l_{s},$$

where $α = γ = 1$ and $β = 0.8$, which gives less weight to the approximation loss than to the other losses.
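As a minimal sketch of how the three terms combine, the snippet below evaluates the losses on images represented as flat lists of pixel values. The flat-list representation and the function names are assumptions made for illustration only; an actual training setup would compute the same quantities on tensors.

```python
def mean_abs_error(a, b):
    """(1/|a|) * ||a - b||_1 over flattened pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_sq_error(a, b):
    """(1/|a|) * ||a - b||_2^2 over flattened pixel values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def lpae_loss(image, recon, approx, bicubic_down, detail,
              alpha=1.0, beta=0.8, gamma=1.0):
    """Weighted sum alpha*l_r + beta*l_e + gamma*l_s with the paper's weights."""
    l_r = mean_abs_error(image, recon)          # reconstruction loss
    l_e = mean_sq_error(approx, bicubic_down)   # energy (approximation) loss
    zeros = [0.0] * len(detail)
    l_s = mean_sq_error(detail, zeros)          # sparsity loss on the detail image
    return alpha * l_r + beta * l_e + gamma * l_s


# Toy values only, chosen so each term is easy to check by hand.
total = lpae_loss(image=[0.0, 1.0], recon=[0.0, 0.5],
                  approx=[0.5], bicubic_down=[0.0],
                  detail=[1.0, 0.0])
```

In the toy call, $l_{r} = 0.25$, $l_{e} = 0.25$, and $l_{s} = 0.5$, so the total is $0.25 + 0.8 \cdot 0.25 + 0.5 = 0.95$.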

Fig. 2 shows the results of LPAE and WAE applied to two different images from ImageNet. For the approximation image in Fig. 2(b), the result of LPAE is clearly more vivid and closer to the original. Besides, Fig. 2(c), which shows the detail images, demonstrates that this precise approximation makes the detail image sparse. Although sparse, the detail contains much information about the original texture since, unlike in WAE, it has the same spatial size as the original image. As a result, LPAE gets a sharper reconstruction, as seen from the images in Fig. 2(d). These differences support the validity of our loss function.

Image Classification Problem

In [6], the authors show that WAE can accelerate classification networks while keeping the accuracy almost the same by using the two outputs of the encoding part. We show that LPAE can accelerate classification networks at a similar level, even with a slight improvement in accuracy. We think that using both the approximation and the detail, similar to [6], contributes to getting comparable accuracy even though the input to VGG is not the original image. In this sense, we speculate that LPAE improves accuracy because it produces a better approximation and a higher-resolution detail than WAE.

To be comparable with the experiments in [6], we set the same structure except for the autoencoder part, replacing WAE with LPAE. Below we describe the use of the autoencoder only for VGG, as the process for ResNet is similar. Recall that the encoder part of LPAE decomposes an original image $I$ into an approximation image $I_{c}$ and a detail image $I_{d}$ . The features $f_{c}$ for $I_{c}$ are extracted from the feature extraction part of VGG. The features $f_{d}$ for $I_{d}$ are extracted from another lighter feature extraction part consisting of convolution layers with only a quarter of the output channels for $f_{c}$ . Using fully connected layers, we obtain the classification score $s_{c}$ from $f_{c}$ , and $s_{d}$ from the concatenation of $f_{c}$ and $f_{d}$ . The final score $s$ is the average of the two scores $s_{c}$ and $s_{d}$ .

From the acceleration perspective, according to [6], the total complexity of all the convolutional layers can be represented by

$$O(N), \quad \text{where} \quad N = \sum_{l = 1}^{d} n_{l - 1} \cdot s_{l}^{2} \cdot n_{l} \cdot m_{l}^{2} . \tag{1}$$

Here, $d$ is the number of convolution layers. For the $l$-th layer, $n_{l - 1}$ and $n_{l}$ are the numbers of input and output channels, $s_{l}$ is the size of the kernels, and $m_{l} \times m_{l}$ is the spatial size of the output features.

Since LPAE has $\frac{1}{2}$ size of feature maps for the approximation, the complexity $N$ in (1) is $\frac{1}{4}$ of the original network. However, unlike WAE, there is no change in the size of feature maps for the detail. Still following the setup of WAE for the detail, the number of the output channels for the LPAE’s detail becomes $\frac{1}{4}$ compared to the approximation case, so the complexity becomes $\frac{1}{16}$ of the original network. As a result, LPAE has $\frac{5}{16}$ total complexity compared to the original. Thus the acceleration rate is about $1 / \frac{5}{16} = 3.2$ . This number is comparable to $3.76$ , the acceleration rate of WAE.
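The rough $\frac{5}{16}$ estimate can be checked numerically by evaluating $N$ from (1) on a small layer stack. The stack below is a toy example, not the actual VGG configuration, and the sketch assumes that the input channels of every detail layer are quartered as well, which ignores the fixed 3-channel input of the first layer.

```python
from fractions import Fraction


def conv_complexity(layers):
    """N from (1): sum over layers of n_{l-1} * s_l^2 * n_l * m_l^2."""
    return sum(n_in * s * s * n_out * m * m for n_in, s, n_out, m in layers)


# Toy stack of (input channels, kernel size, output channels, feature-map side).
base = [(64, 3, 64, 224), (64, 3, 128, 112), (128, 3, 256, 56)]

# Approximation branch: same channels, feature maps halved in each direction.
approx = [(n_in, s, n_out, m // 2) for n_in, s, n_out, m in base]

# Detail branch: same feature-map size, a quarter of the channels (assumed for all layers).
detail = [(n_in // 4, s, n_out // 4, m) for n_in, s, n_out, m in base]

ratio = Fraction(conv_complexity(approx) + conv_complexity(detail),
                 conv_complexity(base))
acceleration = float(1 / ratio)
```

Halving $m_{l}$ scales every term by $\frac{1}{4}$, and quartering both $n_{l - 1}$ and $n_{l}$ scales every term by $\frac{1}{16}$, so the ratio is $\frac{1}{4} + \frac{1}{16} = \frac{5}{16}$ independently of the chosen stack.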

Applications to Super-Resolution Problem

As observed earlier, LPAE’s approximation image in Fig. 2 tends to be similar to some other low-pass filtered images, such as the approximation image of the wavelet transform or the bicubic interpolation. This approximation image helps the detail to be more sparse. Besides, LPAE can make the hierarchical structure because the approximation carries low-frequency information sufficiently. Based on these observations, we try to expand the application domain of LPAE to super-resolution.

There are many algorithms for the super-resolution problem. Some of them are hard to connect to LPAE directly in a natural way. So we choose models that have room for substitution, and WaveletSRNet (WaveSR) in [14] is one such model. The basic concept of the network is the use of the wavelet packet transform (WPT) as a reconstructor. For example, to train WaveSR for a magnification of $4$, an input image is decomposed into two levels using WPT. Then the network takes the approximation image of $\frac{1}{4}$ size and extracts features from it through the embedding part. From these features, the network forms one new approximation image and as many new details as the reconstruction process needs (15 for a magnification of 4). Finally, through WPT, the network creates the high-resolution image at a magnification of 4. In this procedure, the authors set the loss function of WaveSR to make the new details and approximation similar to the decomposition results of WPT.

In this algorithm, we insert LPAE as a substitute for WPT. The separate encoding and decoding parts can take the roles of the existing decomposition and reconstruction, respectively. Moreover, the hierarchical structure of LPAE does not restrict its use to a magnification of 2 but allows an arbitrary magnification of $2^{k}$. Although LPAE produces larger-sized details than WPT, the redundancy of information is helpful for reconstruction. And the fact that it has a smaller number of detail images makes the network efficient. In addition, we expect the substitute to improve the high-resolution result because, in contrast with WPT, the reconstruction of LPAE can be improved by fine-tuning on various datasets.
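The trade-off between the number and the size of the detail images can be made concrete with a short count. The sketch assumes a full wavelet packet decomposition, which yields $4^{k}$ equally sized subbands at level $k$ (consistent with the 15 details quoted above for a magnification of 4), and one LPAE detail per level at the resolution of that level's input. The function names are hypothetical.

```python
from fractions import Fraction


def wpt_detail_count(k):
    """Full 2-D wavelet packet transform: 4**k subbands, one of them the approximation."""
    return 4 ** k - 1


def wpt_detail_pixels(k):
    """Total detail pixels as a fraction of the high-resolution image."""
    return Fraction(4 ** k - 1, 4 ** k)


def lpae_detail_count(k):
    """One detail image per LPAE level (assumed from the hierarchical structure)."""
    return k


def lpae_detail_pixels(k):
    """Level-i detail has the size of the level-i input: 1, 1/4, 1/16, ... of the image."""
    return sum(Fraction(1, 4 ** i) for i in range(k))


k = 2  # magnification 2**k = 4
summary = {
    "wpt": (wpt_detail_count(k), wpt_detail_pixels(k)),
    "lpae": (lpae_detail_count(k), lpae_detail_pixels(k)),
}
```

For a magnification of 4, this gives 15 small details covering $\frac{15}{16}$ of the image size for WPT, against 2 larger details covering $\frac{5}{4}$ of the image size for LPAE, which is the redundancy mentioned above.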

We modify the loss function given in [14] to fit LPAE’s situation when we train WaveSR substituted with LPAE (named LPSR). The reconstruction loss is defined by using the L1 distance between the reconstruction $I^{'}$ and the original high-resolution image $I$ ,

$$l_{rec} = \frac{1}{|I|} \| I - I^{'} \|_{1} .$$

The use of L1 loss for the whole procedure can reduce the smoothing effect and enhance the quality of the image [44]. Based on this observation, we define the pyramid loss by

$$l_{p} = \frac{1}{|I_{c}|} \| I_{c} - I_{c}^{'} \|_{1} + \sum_{i} λ_{i} \frac{1}{|I_{d_{i}}|} \| I_{d_{i}} - I_{d_{i}}^{'} \|_{1} ,$$

where $\{λ_{i}\}$ are weights for the detail part. For instance, we take $λ_{1} = 0.8, λ_{2} = 1.2$ in the case of a magnification of 4.

Finally, the loss function of LPSR is the weighted sum of two losses:

$$l_{total} = γ l_{rec} + δ l_{p} .$$

If the new details $I_{d_{i}}^{'}$ and approximation $I_{c}^{'}$ produced by LPSR are sufficiently close to the outputs $\{I_{d_{i}}, I_{c}\}$ of LPAE, the reconstruction using the new ones will have high quality at the high-resolution stage. From this point of view, we focus on the closeness between the outputs of LPAE and LPSR. Hence in all of our experiments, weights $γ = 1$ and $δ = 10$ are used.
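The LPSR objective can be sketched in the same way as the LPAE loss. The snippet below assumes images stored as flat lists of pixel values and pairs each detail with its weight by list position; which pyramid level corresponds to $λ_{1}$ and which to $λ_{2}$ is not specified in the text, so the ordering here is purely illustrative, as are the function names.

```python
def l1(a, b):
    """(1/|a|) * ||a - b||_1 over flattened pixel values."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def pyramid_loss(approx, approx_pred, details, details_pred, lambdas):
    """l_p: L1 on the approximation plus weighted L1 on each detail image."""
    loss = l1(approx, approx_pred)
    for lam, d, d_pred in zip(lambdas, details, details_pred):
        loss += lam * l1(d, d_pred)
    return loss


def lpsr_loss(hr, hr_pred, approx, approx_pred, details, details_pred,
              lambdas, gamma=1.0, delta=10.0):
    """l_total = gamma * l_rec + delta * l_p with the weights used in the paper."""
    l_rec = l1(hr, hr_pred)
    l_p = pyramid_loss(approx, approx_pred, details, details_pred, lambdas)
    return gamma * l_rec + delta * l_p


# Toy values only; lambdas follow the magnification-4 setting (0.8, 1.2).
toy_pyramid = pyramid_loss([1.0], [0.5],
                           [[0.0, 0.0], [1.0]], [[0.5, 0.5], [0.0]],
                           lambdas=(0.8, 1.2))
toy_total = lpsr_loss([1.0, 1.0], [1.0, 0.0], [1.0], [0.5],
                      [[0.0, 0.0], [1.0]], [[0.5, 0.5], [0.0]],
                      lambdas=(0.8, 1.2))
```

With $δ = 10$, the pyramid term dominates the objective, which matches the stated focus on keeping the LPSR outputs close to the LPAE outputs.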

Details on Experiments

In classification, as mentioned in our paper, we use two backbones, VGG16 and ResNet50, and two datasets, ImageNet2012 (ImageNet) and Intel Image Classification (Natural Scene). ImageNet is a large dataset with approximately 1.2 million training images and 50k validation images. The images are classified into 1000 labels and have varying sizes. In comparison to ImageNet, Natural Scene is a tiny dataset of 6 categories. It has 14k training images and 10k images for validation and test. Moreover, the spatial size of each image is fixed at $150 \times 150$.

We mostly follow the training strategies in [6] to train the autoencoders for comparison with WAE. In particular, for both WAE and LPAE, the Xavier algorithm is used for initialization of parameters, and the SGD algorithm with momentum 0.9 and weight decay of 0.0005 is used. However, some options are chosen differently because of the dataset size. Although we keep a batch of 4 as in [6] for Natural Scene, we choose 256 for ImageNet to shorten training time. Also, our training epochs are fixed as 100 for Natural Scene and 20 for ImageNet. Considering the difference between the two autoencoders, we choose the initial learning rate differently. For WAE, it is set as 0.000001 with the decay factor of 0.1 after every 10 epochs as in [6], but for LPAE as 0.01 with the same decaying strategy.
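The decay strategy described above is a plain step schedule. A minimal sketch, assuming zero-based epoch indices and that the decay is applied at the start of every tenth epoch (the function name `step_lr` is hypothetical):

```python
def step_lr(initial_lr, epoch, step=10, factor=0.1):
    """Learning rate at a zero-based epoch when multiplied by `factor` every `step` epochs."""
    return initial_lr * factor ** (epoch // step)


# LPAE on ImageNet: initial rate 0.01, decay 0.1 every 10 epochs, 20 epochs in total.
lpae_schedule = [step_lr(0.01, e) for e in range(20)]

# WAE uses the same strategy starting from 0.000001.
wae_schedule = [step_lr(0.000001, e) for e in range(20)]
```

Under this schedule, LPAE trains at 0.01 for the first 10 epochs and at 0.001 for the remaining 10, while WAE stays four orders of magnitude lower throughout.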

To train the connected classification network, we randomly crop Natural Scene images to $128 \times 128$, and we resize ImageNet images to $256 \times 256$ and randomly crop them to $224 \times 224$. The only data augmentation we use is the random horizontal flip. We choose a batch size of 256 regardless of the dataset and use the SGD algorithm with the same options. But the decay strategy and the number of epochs differ by dataset. For ImageNet, training runs for 20 epochs, and the learning rate is multiplied by 0.1 after 10 epochs. For Natural Scene, the learning rate is multiplied by 0.5 after every 10 epochs during 100 training epochs.

In the super-resolution, we train on the DIV2K dataset and test on the Set5 dataset. Additionally, we use the CelebA dataset for comparison with the original task of WaveSR. DIV2K has 800 diverse images of the large size of 2k resolution, but CelebA has 162,770 images of center-arranged (to some degree) face with the size of $178 \times 218$ . Because of this substantial dissimilarity, we perform different data augmentations on two datasets.

The autoencoders’ training options are chosen similarly to the classification case. For DIV2K, we randomly crop the high-resolution image to $192 \times 192$ with random horizontal/vertical flips. We train LPAE using the Adam algorithm with a batch of $4$ and an initial learning rate of $0.001$ that is divided by $2$ after every $50$ epochs during $400$ epochs. To train LPSR, we set a batch of $8$ and select the Adam algorithm with an initial learning rate of $0.001$ that is halved after every $50$ epochs for $300$ training epochs. For the other dataset, CelebA, the training image is resized to $144 \times 144$ and then randomly cropped to $128 \times 128$ with a random horizontal flip only. LPSR is trained for $40$ epochs using the Adam algorithm with a batch of $256$ and an initial learning rate of $0.01$ that is multiplied by $0.1$ after every $10$ epochs. Options for the other networks, WaveSR and WSR, are chosen to be identical to those of LPSR. All of our code is available at https://github.com/sangjun7/LPAE.

5 Experimental Results

In this section, we examine our LPAE performances. All of the detailed options on experiments are reported in Supplementary Material. We first make a comparison between LPAE and WAE using the PSNR value in Table 1. Then we join two autoencoders to the classification networks, VGG and ResNet. These networks represent the efficiency and the power of LPAE by inference time and accuracy. The following super-resolution results show the versatility of LPAE. By checking PSNR and SSIM values, we show that replacing original reconstruction parts with LPAE enhances super-resolution abilities. All of the experiments are conducted on 6 GPUs of GeForce RTX 3090 except for measuring test time. We measure the test time of classification networks by a GPU of Tesla T4 on Google Colab.

Autoencoder

LPAE is motivated by WAE. However, there is a considerable difference between the two autoencoders in their network structure and loss formulation. As we can see in Fig. 2, LPAE gets a high-resolution detail image compared to the detail image of WAE. The large-sized detail image carries more information about textures and high-frequency data of the original image. Furthermore, LPAE produces a more accurate approximation image than WAE, as seen in Fig. 2. This comes from the constraint that the approximation image resemble the bicubic interpolated image. Also, changing the loss between the input image and the reconstructed image from MSE to L1, together with a large learning rate, yields a high-quality reconstruction [44]. Table 1 shows the difference between the two autoencoders. The PSNR values in the table are calculated between the original image and the reconstructed image. Although PSNR alone cannot determine the quality of reconstruction, many researchers consider a large PSNR value an essential indication of better reconstruction. For Bicubic, we reduce the spatial size to half and then restore the original size using bicubic interpolation. For ImageNet, LPAE gets a PSNR value of 47.89 dB, far superior to the others, i.e., 28.57 dB for WAE and 28.44 dB for Bicubic. On the DIV2K dataset, LPAE produces a reconstruction even closer to the original than on ImageNet. The PSNR value of LPAE is 54.73 dB, and that of WAE is 19.90 dB, which is a significant difference. LPAE produces one compressed image and one sparse image, and based on the above results, we see that LPAE restores the original image much better. Thus LPAE can serve as an accelerator comparable to WAE and be extended to the super-resolution problem.

Dataset	Metric	LPAE	WAE	Bicubic
ImageNet	Train Loss	0.0041	0.0019	-
ImageNet	Test Loss	0.0042	0.0019	-
ImageNet	PSNR (dB)	47.89	28.57	28.44
DIV2K	Train Loss	0.0023	0.0114	-
DIV2K	Test Loss	0.0024	0.0109	-
DIV2K	PSNR (dB)	54.73	19.90	26.19

Table 1: Comparison of power for restoring the original image after encoding and decoding between WAE, LPAE and bicubic interpolation (Bicubic).
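For reference, the PSNR values in Table 1 follow the standard definition $10 \log_{10} (\mathrm{MAX}^{2} / \mathrm{MSE})$. The sketch below is a generic implementation on flat lists of pixel values; the peak value of 1.0 is an assumption, since the pixel range used in the experiments is not stated here.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: a uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01.
toy_psnr = psnr([0.5, 0.5, 0.5, 0.5], [0.6, 0.6, 0.6, 0.6])
```

A uniform error of 0.1 corresponds to about 20 dB, which gives a sense of scale for the gap between 28.57 dB and 47.89 dB on ImageNet.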

Classification

Dataset	Network	Model	Top 5 Accuracy (%)	Train Time (hr)	Trainable Parameters
ImageNet	VGG	Basic	86.94	30.11	138,365,992
ImageNet	VGG	WAE	84.96	27.48	132,914,278
ImageNet	VGG	LPAE	85.32	27.12	150,220,870
ImageNet	ResNet	Basic	84.01	29.23	25,575,784
ImageNet	ResNet	WAE	79.95	27.68	29,628,406
ImageNet	ResNet	LPAE	80.31	27.69	29,633,494
Natural Scene	VGG	Basic	89.97	0.55	138,365,992
Natural Scene	VGG	WAE	88.73	0.35	132,914,278
Natural Scene	VGG	LPAE	89.97	0.38	150,220,870

Table 2: Comparison of accuracy and training times between classification networks connected with WAE or LPAE.

As mentioned in Section 4, we expect that linking LPAE to a classification network reduces the computational cost and accelerates the algorithm with only a slight drop in accuracy. To show this, we train the basic classification networks (VGG, ResNet) and those connected with WAE and LPAE, using two datasets (ImageNet, Natural Scene). We use our own code for WAE. Although we are not able to reproduce the results exactly, we observe the acceleration tendency presented in [6]. Table 2 shows a comparison of performance between the networks. In all cases, classification networks connected with LPAE (LPVGG, LPResNet) achieve accuracy at least as high as those connected with WAE (WVGG, WResNet). For Natural Scene, with 89.97 $%$, LPVGG even has the same accuracy as the original VGG. We speculate that the additional information about the original image retained after LPAE’s encoding helps classification. Table 2 also reports the number of trainable parameters for each case for reference. The training times of LPVGG/LPResNet are reduced to a level similar to those of WVGG/WResNet. For ImageNet, WVGG saves about 2.63 hr compared to the basic network, and LPVGG saves about 2.99 hr, which is about 10 $%$ of the whole training time of the basic network. For Natural Scene, LPVGG reduces the training time by about 31 $%$ compared to the basic network.

ImageNet
		FLOPs	Compl.	Test Time (ms) for Batch
		(B)	(B)	1	20	50
VGG	B	31.02	15.47	14.45	127.07	274.55
	W	8.95	4.46	10.84	57.17	117.25
	L	11.01	5.48	12.67	80.77	171.98
ResNet	B	7.77	4.06	7.40	61.09	140.10
	W	2.74	1.40	8.75	40.86	88.27
	L	3.68	1.88	9.26	56.90	127.46
Natural Scene
		FLOPs (B)	Compl. (B)	Test Time (ms), Batch 1	Test Time (ms), Batch 20	Test Time (ms), Batch 50
VGG	B	10.15	5.06	6.96	41.78	85.88
	W	2.95	1.47	7.03	21.03	39.45
	L	3.62	1.80	7.56	27.80	57.68

Table 3: Comparison of FLOPs, complexity (Compl.), and test times between the basic VGG or ResNet and those connected with WAE or LPAE. B, W, and L denote Basic, WAE, and LPAE, respectively, as in Table 2. FLOPs and Compl. are given in billions (B).

Table 3 shows each network’s FLOPs, complexity (Compl.), and test times for different batch sizes. The complexity here is calculated similarly to $N$ in (1) for convolution layers and fully connected layers. For instance, with VGG on ImageNet, we get an acceleration rate of $\frac{15.47}{5.48} \approx 2.82$ for LPAE, and $\frac{15.47}{4.46} \approx 3.47$ for WAE. These numbers are similar to the rough computation of the acceleration rate in Section 4, i.e., $3.2$ for LPAE and $3.76$ for WAE. For the FLOPs of VGG, we get 31.02 billion (B) for the basic network, 11.01 B for LPVGG, and 8.95 B for WVGG. Thus the FLOPs of LPVGG and WVGG are about 0.35 and 0.29 of those of VGG, respectively. Hence, the computational cost of LPVGG, like that of WVGG, is vastly reduced. We believe this is why LPVGG is sufficiently accelerated at test time. LPVGG takes 12.67 ms with batch size 1, which is 1.78 ms less than the basic VGG. If we raise the batch size to 20 or 50 (cf. [39]), the rate of decrease becomes larger: LPVGG takes 80.77 ms instead of 127.07 ms for size 20, and 171.98 ms instead of 274.55 ms for size 50. Although the connection with LPAE accelerates VGG less than the connection with WAE does, there is a meaningful difference in test time between LPVGG and VGG. In the other cases with batch size 1, the results are unexpected: the basic network takes the shortest test time. We think these unexpected results are due to the small size of the input. For larger batch sizes, the test time of the networks connected with LPAE again becomes smaller than that of the original networks.
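The ratios discussed above can be recomputed from the VGG/ImageNet rows of Table 3. The following Python sketch does so; the names `vgg_imagenet` and `ratios` are hypothetical helpers introduced only for this illustration.

```python
# VGG on ImageNet, copied from Table 3.
# FLOPs and complexity are in billions; test times are in ms per batch size.
vgg_imagenet = {
    "Basic": {"flops": 31.02, "compl": 15.47,
              "ms": {1: 14.45, 20: 127.07, 50: 274.55}},
    "WAE": {"flops": 8.95, "compl": 4.46,
            "ms": {1: 10.84, 20: 57.17, 50: 117.25}},
    "LPAE": {"flops": 11.01, "compl": 5.48,
             "ms": {1: 12.67, 20: 80.77, 50: 171.98}},
}


def ratios(table, variant):
    """Compare one variant ("WAE" or "LPAE") against the basic network."""
    basic, var = table["Basic"], table[variant]
    return {
        # Acceleration rate: basic complexity divided by variant complexity.
        "acceleration": basic["compl"] / var["compl"],
        # Fraction of the basic FLOPs that the variant still needs.
        "flops_fraction": var["flops"] / basic["flops"],
        # Relative test-time reduction for each batch size.
        "time_reduction": {b: 1 - var["ms"][b] / basic["ms"][b]
                           for b in basic["ms"]},
    }


for variant in ("WAE", "LPAE"):
    r = ratios(vgg_imagenet, variant)
    reductions = ", ".join(f"batch {b}: {100 * v:.1f}%"
                           for b, v in r["time_reduction"].items())
    print(f"{variant}: acceleration {r['acceleration']:.2f}, "
          f"FLOPs fraction {r['flops_fraction']:.2f}, {reductions}")
```

For LPAE the relative test-time reduction grows with the batch size, which matches the observation that the gain over the basic VGG becomes more pronounced for batches of 20 and 50.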

Super-resolution

Dataset	Scale	Metric	WaveSR	WSR	LPSR
CelebA	$\times 2$	PSNR (dB)	30.87	31.93	36.04
	$\times 2$	SSIM	0.920	0.940	0.967
	$\times 4$	PSNR (dB)	27.71	17.96	29.15
	$\times 4$	SSIM	0.840	0.513	0.871
	$\times 8$	PSNR (dB)	24.34	15.05	24.86
	$\times 8$	SSIM	0.707	0.431	0.735
Set5	$\times 2$	PSNR (dB)	32.03	25.70	37.62
	$\times 2$	SSIM	0.914	0.781	0.955
	$\times 4$	PSNR (dB)	28.60	15.28	32.25
	$\times 4$	SSIM	0.852	0.381	0.899
	$\times 8$	PSNR (dB)	24.01	13.85	27.07
	$\times 8$	SSIM	0.650	0.336	0.782

Table 4: PSNR and SSIM results for the super-resolution networks based on WaveletSRNet.
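For readers unfamiliar with the metric, PSNR in dB is commonly defined as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below implements this standard definition in plain Python. The peak value of 255 and the nested-list image format are assumptions for illustration; the exact evaluation protocol behind Table 4 (color space, cropping, data range) is not restated here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized images.

    Images are nested lists of pixel values (rows of numbers).
    `peak` is the assumed maximum pixel value (255 for 8-bit images).
    """
    ref = [p for row in reference for p in row]
    est = [p for row in estimate for p in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# Toy 2x2 example: every pixel is off by exactly one gray level.
original = [[10, 20], [30, 40]]
restored = [[11, 21], [31, 41]]
print(f"{psnr(original, restored):.2f} dB")
```

Higher values indicate a reconstruction closer to the reference, which is why the larger PSNR entries in Table 4 correspond to better super-resolution results.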

Network	Set5 $\times 2$	Set5 $\times 4$	Set5 $\times 8$
CAR [34]	38.94	33.88	-
DRLN+ [2]	38.34	32.74	27.46
ABPN [29]	-	32.69	27.25
HBPN [28]	38.13	32.55	27.17
DBPN-RES-MR64-3 [12]	38.08	32.65	27.51
MWCNN [24]	37.91	32.12	-
CARN [1]	37.76	32.13	-
LFFN-S [38]	37.66	31.79	-
CSRCNN [40]	37.45	31.01	25.74
IKC [11]	36.62	31.52	-
DeepRED [32]	-	30.72	26.04
LPSR	37.62	32.25	27.07

Table 5: Comparison of PSNR values (dB) of LPSR with state-of-the-art (SOTA) networks on Set5.

We evaluate the super-resolution networks (WaveSR, WSR, LPSR) on two datasets (CelebA, Set5) using the PSNR and SSIM metrics. For WaveSR, we used our own code and could not reproduce the results reported in [14] despite using the same options. The values in Table 4 are obtained by applying our code in the same environment to the three WaveletSRNet-based networks, namely, WaveSR, WSR, and LPSR. For CelebA, LPSR has the highest values among the three networks at all scales. In particular, the gap from changing WPT to LPAE is most significant at the $\times 2$ scale, where PSNR is 36.04 dB for LPSR and 30.87 dB for WaveSR. This shows the power of LPAE, which reconstructs the image well and whose parameters are learned to fit the data distribution. Comparing WSR and LPSR, the PSNR of WSR drops sharply as the scale increases. At the $\times 2$ scale, WSR reaches 31.93 dB, which is even higher than the PSNR of WaveSR, but at the $\times 4$ scale it reaches only 17.96 dB. This result reveals a limitation of WAE: it does not consider multi-scale analysis and reconstruction of the image (cf. Section 1). The same tendency appears in the results for the Set5 dataset. LPSR obtains the highest PSNR and SSIM values among the three networks. Its PSNR values are 37.62 dB at the $\times 2$ scale, 32.25 dB at the $\times 4$ scale, and 27.07 dB at the $\times 8$ scale. For WSR, since Set5 is dissimilar to CelebA, which consists of center-aligned images of constant size, the PSNR of the reconstruction by WAE falls to 25.70 dB at the $\times 2$ scale. At the $\times 4$ and $\times 8$ scales, WSR hardly reconstructs the image at all.

Table 5 compares our LPSR on Set5 with SOTA networks. Our PSNR values rank around the middle on average, which is a good result considering that LPSR is obtained from WaveSR by a simple change using LPAE. Fig. 2(o) shows the reconstructions of WaveSR, WSR, and LPSR for two images of Set5. WaveSR works well at the $\times 2$ scale, but as the scale grows, its reconstructed images become blurred and show a checkerboard pattern. In contrast, LPSR produces high-quality reconstructions at all scales.
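To make the claim about the ranking on Set5 concrete, the sketch below ranks LPSR among the networks that report a PSNR value at each scale in Table 5. The dictionary `table5` and the helper `rank_of` are illustrative names; `None` marks scales for which Table 5 lists no value.

```python
# PSNR (dB) on Set5, copied from Table 5 as (x2, x4, x8).
table5 = {
    "CAR": (38.94, 33.88, None),
    "DRLN+": (38.34, 32.74, 27.46),
    "ABPN": (None, 32.69, 27.25),
    "HBPN": (38.13, 32.55, 27.17),
    "DBPN-RES-MR64-3": (38.08, 32.65, 27.51),
    "MWCNN": (37.91, 32.12, None),
    "CARN": (37.76, 32.13, None),
    "LFFN-S": (37.66, 31.79, None),
    "CSRCNN": (37.45, 31.01, 25.74),
    "IKC": (36.62, 31.52, None),
    "DeepRED": (None, 30.72, 26.04),
    "LPSR": (37.62, 32.25, 27.07),
}
scales = ("x2", "x4", "x8")


def rank_of(name, column):
    """Return (rank, number of networks) for `name` at one scale column.

    Rank 1 is the highest PSNR; networks without a value are skipped.
    """
    scores = sorted(
        (row[column] for row in table5.values() if row[column] is not None),
        reverse=True,
    )
    return scores.index(table5[name][column]) + 1, len(scores)


for column, scale in enumerate(scales):
    rank, total = rank_of("LPSR", column)
    print(f"{scale}: LPSR ranks {rank} of {total}")
```

Because the number of networks reporting a value differs per scale, the rank is given together with the size of the comparison set rather than as a single averaged position.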

6 Conclusion and Future Works

We propose LPAE, which helps with a range of problems from network acceleration to super-resolution. It reflects the structure of LP and consists of an encoder and a decoder. Using the encoder, we decompose an image into the approximation image, a low-resolution, low-frequency channel, and the detail image, a high-frequency channel. The decoder accurately reconstructs the original image from the approximation and the detail. Three types of loss (approximation loss, sparsity loss, and reconstruction loss) enable us to obtain clear decomposed images and to achieve better reconstruction. The experiments in this paper show that LPAE reduces the computational cost of existing classification networks while largely preserving their original accuracy. For super-resolution, it achieves better performance than the established wavelet-based model. In the future, we plan to explore applications to other problems such as generative models, image compression, and character recognition.

7 Acknowledgments

$†$ This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-00023, Developing a lightweight Korean text detection and recognition technology for complex disaster situations).

$‡$ This work was supported in part by the National Research Foundation of Korea (NRF) [Grant Numbers 2015R1A5A1009350 and 2021R1A2C1007598], and by the ‘Ministry of Science and ICT’ and NIPA via the “HPC Support” Project.

References

[1] Ahn, N.; Kang, B.; and Sohn, K.-A. 2018. Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network. In Proceedings of the European Conference on Computer Vision (ECCV), 252–268.
[2] Anwar, S.; and Barnes, N. 2020. Densely Residual Laplacian Super-Resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1.
[3] Ardakani, A.; Condo, C.; Ahmadi, M.; and Gross, W. J. 2018. An Architecture to Accelerate Convolution in Deep Neural Networks. IEEE Transactions on Circuits and Systems I: Regular Papers, 65(4): 1349–1362.
[4] Bevilacqua, M.; Roumy, A.; Guillemot, C.; and Alberi-Morel, M.-L. 2012. Low-Complexity Single-Image Super-Resolution based on Nonnegative Neighbor Embedding. In Proceedings of the British Machine Vision Conference, 135.1–135.10. BMVA Press. ISBN 1-901725-46-4.
[5] Burt, P. J.; and Adelson, E. H. 1987. The Laplacian Pyramid as a Compact Image Code. Readings in Computer Vision, 671–679.
[6] Chen, T.; Lin, L.; Zuo, W.; Luo, X.; and Zhang, L. 2018. Learning a Wavelet-Like Auto-Encoder to Accelerate Deep Neural Networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 6722–6729.
[7] Chu, X.; Zhang, B.; Ma, H.; Xu, R.; and Li, Q. 2021. Fast, Accurate and Lightweight Super-Resolution with Neural Architecture Search. In 2020 25th International Conference on Pattern Recognition (ICPR), 59–64.
[8] Dai, T.; Cai, J.; Zhang, Y.; Xia, S.-T.; and Zhang, L. 2019. Second-Order Attention Network for Single Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11065–11074.
[9] Do, M. N.; and Vetterli, M. 2003. Framing Pyramids. IEEE Transactions on Signal Processing, 51(9): 2329–2342.
[10] Doersch, C. 2021. Tutorial on Variational Autoencoders. arXiv:1606.05908.
[11] Gu, J.; Lu, H.; Zuo, W.; and Dong, C. 2019. Blind Super-Resolution With Iterative Kernel Correction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1604–1613.
[12] Haris, M.; Shakhnarovich, G.; and Ukita, N. 2020. Deep Back-Projection Networks for Single Image Super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence (Early Access), 1–1.
[13] Huang, F.; Zhang, J.; Zhou, C.; Wang, Y.; Huang, J.; and Zhu, L. 2020. A deep learning algorithm using a fully connected sparse autoencoder neural network for landslide susceptibility prediction. Landslides, Journal of the International Consortium on Landslides, 17: 217–229.
[14] Huang, H.; He, R.; Sun, Z.; and Tan, T. 2017. Wavelet-SRNet: A Wavelet-based CNN for Multi-scale Face Super Resolution. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), 1689–1697.
[15] Huang, Y.; and Xu, Q. 2021. Electricity theft detection based on stacked sparse denoising autoencoder. International Journal of Electrical Power & Energy Systems, 125: 106448.
[16] Imani, M.; Garcia, R.; Gupta, S.; and Rosing, T. 2019. Hardware-Software Co-design to Accelerate Neural Network Applications. ACM Journal on Emerging Technologies in Computing Systems, 15(21): 1–18.
[17] Islam, Z.; Abdel-Aty, M.; Cai, Q.; and Yuan, J. 2021. Crash data augmentation using variational autoencoder. Accident Analysis & Prevention, 151(1): 105950.
[18] Kaggle (Photo by Jan Bottinger on Unsplash). 2018. Intel Image Classification. https://www.kaggle.com/puneet6060/intel-image-classification.
[19] Kim, J.; Lee, J. K.; and Lee, K. M. 2016. Accurate Image Super-Resolution Using Very Deep Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1646–1654.
[20] Kong, X.; Zhao, H.; Qiao, Y.; and Dong, C. 2021. ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 12016–12025.
[21] Liang, J.; Zeng, H.; and Zhang, L. 2021. High-Resolution Photorealistic Image Translation in Real-Time: A Laplacian Pyramid Translation Network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 9392–9400.
[22] Lim, B.; Son, S.; Kim, H.; Nah, S.; and Lee, K. M. 2017. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 136–144.
[23] Lin, T.; Ma, Z.; Li, F.; He, D.; Li, X.; Ding, E.; Wang, N.; Li, J.; and Gao, X. 2021. Drafting and Revision: Laplacian Pyramid Network for Fast High-Quality Artistic Style Transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 5141–5150.
[24] Liu, P.; Zhang, H.; Zhang, K.; Lin, L.; and Zuo, W. 2018. Multi-Level Wavelet-CNN for Image Restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 773–782.
[25] Liu, Y.; Agarwal, S.; and Venkataraman, S. 2021. AutoFreeze: Automatically Freezing Model Blocks to Accelerate Fine-tuning. arXiv:2102.01386.
[26] Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep Learning Face Attributes in the Wild. In Proceedings of International Conference on Computer Vision (ICCV).
[27] Liu, Z.-S.; Siu, W.-C.; and Wang, L.-W. 2021. Variational AutoEncoder for Reference Based Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 516–525.
[28] Liu, Z.-S.; Wang, L.-W.; Li, C.-T.; and Siu, W.-C. 2019a. Hierarchical Back Projection Network for Image Super-Resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[29] Liu, Z.-S.; Wang, L.-W.; Li, C.-T.; Siu, W.-C.; and Chan, Y.-L. 2019b. Image Super-Resolution via Attention Based Back Projection Networks. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), 3517–3525.
[30] Mahmoud, M.; Edo, I.; Zadeh, A. H.; Awad, O. M.; Pekhimenko, G.; Albericio, J.; and Moshovos, A. 2020. TensorDash: Exploiting Sparsity to Accelerate Deep Neural Network Training. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 781–795.
[31] Makhzani, A.; Shlens, J.; Jaitly, N.; Goodfellow, I.; and Frey, B. 2016. Adversarial Autoencoders. arXiv:1511.05644.
[32] Mataev, G.; Milanfar, P.; and Elad, M. 2019. DeepRED: Deep Image Prior Powered by RED. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
[33] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; Berg, A. C.; and Fei-Fei, L. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3): 211–252.
[34] Sun, W.; and Chen, Z. 2020. Learned Image Downscaling for Upscaling Using Content Adaptive Resampler. IEEE Transactions on Image Processing, 29: 4027–4040.
[35] Timofte, R.; Agustsson, E.; Gool, L. V.; Yang, M.-H.; Zhang, L.; Lim, B.; et al. 2017. NTIRE 2017 Challenge on Single Image Super-Resolution: Methods and Results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[36] Vahdat, A.; and Kautz, J. 2021. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898.
[37] Wang, J.; Duan, Y.; Tao, X.; Xu, M.; and Lu, J. 2021. Semantic Perceptual Image Compression With a Laplacian Pyramid of Convolutional Networks. IEEE Transactions on Image Processing, 30: 4225–4237.
[38] Yang, W.; Wang, W.; Zhang, X.; Sun, S.; and Liao, Q. 2019. Lightweight Feature Fusion Network for Single Image Super-Resolution. IEEE Signal Processing Letters, 26(4): 538–542.
[39] Yapıcı, M. M.; Tekerek, A.; and Topaloglu, N. 2019. Performance Comparison of Convolutional Neural Network Models on GPU. In 2019 IEEE 13th International Conference on Application of Information and Communication Technologies (AICT), 1–4.
[40] Zhang, J.; Wang, Z.; Zheng, Y.; and Zhang, G. 2021a. Cascaded Convolutional Neural Network for Image Super-Resolution. In Sun, X.; Zhang, X.; Xia, Z.; and Bertino, E., eds., Advances in Artificial Intelligence and Security, 361–373. Cham: Springer International Publishing.
[41] Zhang, J.; Yu, H.-F.; and Dhillon, I. S. 2019. AutoAssist: A Framework To Accelerate Training Of Deep Neural Networks. In NIPS’19: Proceedings of the 33rd International Conference on Neural Information Processing Systems, 539, 5998–6008.
[42] Zhang, W.; Jiao, L.; Li, Y.; Huang, Z.; and Wang, H. 2021b. Laplacian Feature Pyramid Network for Object Detection in VHR Optical Remote Sensing Images. IEEE Transactions on Geoscience and Remote Sensing, 1–14.
[43] Zhang, X.; Song, H.; Zhang, K.; Qiao, J.; and Liu, Q. 2020. Single image super-resolution with enhanced Laplacian pyramid network via conditional generative adversarial learning. Neurocomputing, 398: 531–538.
[44] Zhao, H.; Gallo, O.; Frosio, I.; and Kautz, J. 2016. Loss Functions for Image Restoration With Neural Networks. IEEE Transactions on Computational Imaging, 3(1): 47–57.