Practical Real Video Denoising with Realistic
Degradation Model

Jiezhang Cao^{1}, Qin Wang^{1}, Jingyun Liang^{1}, Yulun Zhang^{1}, Kai Zhang^{1}, Luc Van Gool^{1, 2}

^{1} Computer Vision Lab, ETH Zürich, Switzerland
^{2} KU Leuven, Belgium
{jiezhang.cao, qin.wang, jingyun.liang, yulun.zhang, kai.zhang, vangool}@vision.ee.ethz.ch
https://github.com/caojiezhang/ReViD

Abstract

Existing video denoising methods typically assume noisy videos are degraded from clean videos by adding Gaussian noise. However, deep models trained on such a degradation assumption will inevitably give rise to poor performance for real videos due to degradation mismatch. Although some studies attempt to train deep models on noisy and noise-free video pairs captured by cameras, such models only work well for specific cameras and do not generalize well to other videos. In this paper, we propose to lift this limitation and focus on the problem of general real video denoising, with the aim of generalizing well to unseen real-world videos. We tackle this problem by first investigating the common behaviors of video noise and observing two important characteristics: 1) downscaling helps to reduce the noise level in spatial space and 2) the information from adjacent frames helps to remove the noise of the current frame in temporal space. Motivated by these two observations, we propose a multi-scale recurrent architecture that makes full use of both characteristics. Secondly, we propose a synthetic real noise degradation model that randomly shuffles different noise types to train the denoising model. With a synthesized and enriched degradation space, our degradation model helps to bridge the distribution gap between training data and real-world data. Extensive experiments demonstrate that our proposed method achieves state-of-the-art performance and better generalization ability than existing methods on both synthetic Gaussian denoising and practical real video denoising.

1 Introduction

Video denoising, which aims to reduce the noise in a video and recover a clean video, has drawn increasing attention in the low-level computer vision community tassano2019dvdnet; tassano2020fastdvdnet; vaksman2021pacnet; davy2018vnlnet; chan2022basicvsrpp2; lee2021restore; maggioni2021efficient; huang2022neural; cao2022datsr; cao2022davsr; cao2021vsrt. Compared with image denoising, video denoising remains a largely underexplored domain. With the advance of deep learning ren2021adaptive; zheng2021deep; zamir2021multi, deep neural networks (DNNs) vaksman2021pacnet; tassano2020fastdvdnet; sheth2021unsupervised have become the dominant approach for video denoising. To push the envelope of video denoising, existing DNN-based methods mainly follow two directions, each resting on its own assumption.

Firstly, a line of studies tassano2019dvdnet; tassano2020fastdvdnet assumes that noisy videos are obtained by adding white Gaussian noise (AWGN) to clean videos. These methods perform well when tested on videos with the same degradation setting. However, their performance deteriorates significantly when tested on videos corrupted by other types of noise (e.g., video compression noise and camera sensor noise) due to the noise distribution mismatch zhang2022scunet. Training a separate model for each of these noise types is impractical. Moreover, noise in real-world videos is even more complex. Nevertheless, it is fair and necessary to train with AWGN and evaluate the effectiveness of different denoising methods in this simplified setup as a starting point.

Secondly, to relieve the degradation mismatch between synthetic training data and real videos, the other line of work claus2019videnn proposes to capture noisy-clean video pairs for training. However, the video capturing and alignment process is time-consuming and expensive, which limits the potential size of such datasets. Another important limitation is that the training data is often captured by one specific camera, whose degradation distribution may differ substantially from that of other cameras in other recording environments. Therefore, deep models claus2019videnn trained on such clean-noisy paired videos can suffer from poor generalization performance when tested on data collected from other cameras.

However, these two assumptions only consider limited types of degradations which rarely happen in real noisy videos. Such degradation mismatch between training videos and real test videos inevitably gives rise to poor generalization performance. To address this, we focus on a more general video denoising setup with the goal of training a deep model that generalizes well to unseen real-world videos, unlike the existing setups illustrated in Figure 2. To tackle this problem, we first take a closer look at the inherent properties of noisy videos in the spatial and temporal space. The statistics of clean patches in noisy images have been explored in some studies zontak2013separating. However, there is little work devoted to the analysis of noisy videos. In Figure 1 (a), we observe that downscaling removes part of the noise at different noise levels. Motivated by this observation, we propose to integrate multi-scale learnable downscaling into the denoising network. On the other hand, noise in a video often has random patterns in temporal space. Some pixels in the current frame may have much more noise, while pixels at the same position in adjacent frames can have less noise, as shown in Figure 1 (b). To restore clean videos, it is necessary to model temporal connections so as to utilize information from adjacent frames to remove noise in the current frame.

Figure 1: Two interesting properties of noisy videos. (a) Downscaling can remove part of noise in a video. (b) The pixels in a video have different degrees of occlusion. The high-quality adjacent pixels can help provide details to occluded pixels.

Motivated by these two properties, we design a new architecture for general real video denoising, which we refer to as ReViD. ReViD consists of multiple scales, each of which has learnable downscaling to remove spatial noise and recurrent modeling to separate the temporal signal from a noisy video. To handle the degradation mismatch between training data and real-world test videos, we propose a new degradation model that generates diverse noisy videos and bridges the distribution gap by using a randomized composition of a wide range of degradations.

The contributions of this paper can be summarized as follows:

We design a simple but effective real video denoising network by exploiting the inherent properties of a noisy video. Our method achieves the state-of-the-art performance on additive white Gaussian denoising and real-world video denoising tasks.
We make the first attempt at general real video denoising and propose a new noise degradation model. Our degradation model helps the denoiser generalize well to unseen and complex real-world videos. Moreover, we provide a theoretical analysis showing that training with our degradation model is equivalent to minimizing a regularized loss with a strong penalty. Our degradation model can generate diverse noisy videos with large variance to better match the distribution of real-world videos.
We conduct extensive experiments to demonstrate the effectiveness and superiority of our proposed method on both synthetic Gaussian denoising and practical real video denoising. We propose a new real video denoising test dataset consisting of different real-world noises. Our dataset can serve as a real video denoising benchmark for further studies.

Figure 2: Discussion on the difference of existing video denoising setups. (a) Non-blind denoising methods take an AWGN video and its noise level as input to synthesize a clean video. (b) Blind denoising methods aim to map a noisy video to a clean video without inputting the noise level. When a model is trained with noisy videos from a specific camera, it performs poorly (marked by the dotted line) on another camera. (c) Our general real denoising method first synthesizes different kinds of noisy videos with the degradation models, and then generalizes well to different real-world videos.

2 Related Work

Image denoising. The goal of image denoising is to remove noise from a noisy image kim2021noise2score; fu2021unfolding; luo2021functional; bodrito2021trainable. The well-known BM3D dabov2007bm3d uses block matching and collaborative filtering in a 3D transform domain. Alternatively, NLB lebrun2013nlb proposes a non-local Bayesian image denoising algorithm. However, the performance of these methods depends highly on the specific form of the prior and on hand-tuned parameters in the optimization. They also lack flexibility, as multiple models need to be trained for different levels of noise. To address this, recent methods exploit the benefits of deep neural networks, and continually improving networks have brought significant gains in denoising performance. This includes convolutional neural networks (CNNs) (e.g., DnCNN zhang2017dncnn, RBDN santhanam2017rbdn and FFDNet zhang2018ffdnet) and Transformers liu2021swin (e.g., SwinIR liang2021swinir and SCUNet zhang2022scunet). In addition, many image denoising models plotz2017benchmarking; brooks2019unprocessing; chang2020learning; kousha2022modeling are trained on real image pairs yue2020supervised; hasinoff2016burst captured by a single camera. However, these methods often perform poorly on other cameras. While image-based denoising methods can in theory serve as a baseline for real-world blind video denoising by treating each frame as a separate image, directly using them in our setup ignores the rich temporal connections between frames in a video and leads to relatively poor performance.

Video denoising. Video denoising aims at removing noise to synthesize clean video sequences. Based on BM3D dabov2007bm3d, VBM4D maggioni2012bm4d presents a video filtering algorithm that exploits the temporal and spatial redundancy of the video sequence. Some existing methods make use of Recurrent Neural Networks (RNNs) to capture this sequential information. DRNNs chen2016drnn first apply a deep RNN to video denoising on gray-scale images. However, the method seems difficult to extend to RGB images, probably due to the difficulties of training RNNs pascanu2013difficulty. Recently, BasicVSR++ chan2021basicvsrpp improves the second-order grid propagation and flow-guided deformable alignment in RNNs and has been extended from video super-resolution to video denoising chan2022basicvsrpp2. In addition, some denoising methods adopt an asymmetric loss function vogels2018kpnn to optimize the networks, or propose patch-based video denoising algorithms arias2018vnlb; davy2018vnlnet to exploit the correlations among patches. For example, VNLB arias2018vnlb is a patch-based empirical Bayesian video denoising algorithm. VNLnet davy2018vnlnet combines a patch-based framework with the DnCNN zhang2017dncnn architecture by proposing a non-local patch search module for video denoising and fusing features with a CNN. PaCNet vaksman2021patch combines a patch-based framework with a CNN by augmenting video sequences with patch-craft frames and feeding them into a CNN. To further improve over patch-based methods, DVDnet tassano2019dvdnet proposes spatial and temporal denoising blocks and trains them separately. To boost efficiency, FastDVDnet tassano2020fastdvdnet extends DVDnet tassano2019dvdnet by using two denoising steps in an architecture that is composed of modified multi-scale U-Nets ronneberger2015u, and it achieves fast runtimes. VRT liang2022vrt proposes a video restoration transformer with parallel frame prediction and achieves state-of-the-art performance in video denoising. However, this transformer-based method has a large model size and an expensive computational cost. Moreover, the above methods cannot be directly used in our real-world video denoising setup, as they only consider synthesized Gaussian noise. Recently, ViDeNN claus2019videnn proposes a blind video denoising method trained either on AWGN or on collected real-world videos. However, this method may have limited generalization ability, as the training only considers the specific noise types present in the training dataset. This can lead to potential issues when it is tested on different real-world videos captured by different sensors under different conditions.

Figure 3: Degradation mismatch in different setups. The highlighted part indicates the training data distribution and therefore good test performance. The dotted area means that the model generalizes poorly in these unseen areas. (a) Real test distribution of videos captured by different cameras. (b) Training with a Gaussian distribution cannot generalize well to most areas of the real distribution. (c) A model trained on a dataset collected from one camera performs poorly on other cameras. (d) Our noise degradation aims to synthesize large amounts of data to match the real distribution.

3 Proposed Method

3.1 General Real Video Denoising

In digital video processing, a video can be corrupted by some random noise process. Formally, given a clean video sequence $x$, a noisy video $x_{σ}$ is obtained by additive noise, i.e., $x_{σ} = x + z_{σ}$, where $z_{σ}$ is a variable sampled from some distribution with density $p (σ)$. For traditional Gaussian denoising, this distribution is a zero-mean Gaussian distribution with standard deviation $σ$, i.e., $N (0, σ^{2} I)$, where $σ$ represents the noise level in a video. However, real-world video noise is mostly unknown and can differ between videos due to differences in cameras, imaging setups, environments, etc. To improve the denoising performance on videos with unknown noise, we generalize the assumption on the noise and do not assume any pre-defined noise type. We call this new setup General Real Video Denoising. As shown in Figure 2, unlike previous blind video denoising methods claus2019videnn, which implicitly assume that the training and test data share the same noise, our proposed setup is more general and can be tested on videos with unknown noise. Formally, our goal is to learn a video denoiser $f_{π}$ that reduces noise and synthesizes a clean video sequence by solving the following problem, i.e.,

f_{π} = \arg\min_{f} L_{π} (f) := E_{σ \in π (σ)} [E_{(x_{σ}, x)} [∥ f (x_{σ}) - x ∥^{2}]],

(1)

where $E [\cdot]$ is an expectation w.r.t. the data or the noise distribution. To understand how to train a denoiser for testing videos with unknown noises, we first provide a Lemma motivated by gnanasambandam2020one.

Lemma

Assume that the training distribution $π$ and testing distribution $p$ are partly overlapped, and let $f_{p} = \arg\min_{f} L_{p} (f)$. The risk of $f_{p}$ is bounded by: $L_{p} (f_{p}) \leq L_{p} (f_{π}), \forall π$. $_{■}$

From this lemma, we can minimize Equation (1) such that the generalization error becomes small as long as $π$ and $p$ are partly overlapped. For the traditional Gaussian denoising problem, which considers multiple noise levels $σ$, this is achievable by training a denoiser on all noise levels: the testing distribution $p$ lies at a specific noise level, so $p$ is a subset of the training distribution $π$ and the two therefore overlap. For our proposed general real video denoising problem, the situation is more complicated because the testing distribution $p$ is unknown. To minimize the generalization error, we need to build a new noise degradation for training such that the training distribution $π$ partly overlaps with the unknown test distribution $p$. An illustration of the difference between Gaussian, blind, and our proposed general real denoising is provided in Figure 3.

Motivated by Lemma 3.1, we propose a new video denoising method that aims to tackle the general real video denoising problem. In this section, we first show how video noise properties can be exploited for network design to facilitate the optimization. We then propose a video degradation model to make the distributions of training data $π$ match better with real test videos $p$ .

3.2 Multi-Scale Recurrent Network for Video Denoising

We show how common properties of video noises can benefit network design in video denoising. The proposed architecture is provided in Figure 4.

Denoising in the spatial space.

As shown in Figure 1 (a), simple downscaling (e.g., bicubic) can suppress specific noise types (e.g., Gaussian noise). However, simple downscaling struggles to handle more complex noise (e.g., a combination of different kinds of noise) in real-world videos and can also induce serious blur artifacts. Therefore, we introduce a learnable convolution to downscale features and reduce different kinds of noise. Specifically, given an $n$-frame noisy video $x_{σ}$, we first deploy a convolutional layer to extract low-level features $\{\hat{g}_{1}, \dots, \hat{g}_{n}\}$. Here, $x_{σ}$ is the input, which combines the noisy video and the level map of the additive white Gaussian noise (AWGN) for the traditional denoising problem. For real video denoising, $x_{σ}$ is augmented by our proposed noise degradation model, which is discussed in the next section. Then, we use a spatial encoder $E_{spatial}$ to extract deep features and reduce the noise in space, i.e.,

g_{i}^{s} = E_{spatial} (g_{i}^{s - 1}),

(2)

where $g_{i}^{0} = \hat{g}_{i}$, and the spatial encoder $E_{spatial}$ can be modelled by multi-layered residual blocks.
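The noise-suppressing effect of simple downscaling can be checked numerically. The following is a minimal sketch, not the learnable downscaling used in the network: it assumes a constant gray frame, i.i.d. Gaussian noise, and a 2×2 box filter in place of bicubic interpolation, and all helper names are hypothetical. Averaging four independent noise samples should roughly halve the noise standard deviation.

```python
import random
import statistics


def add_gaussian_noise(frame, sigma, rng):
    """Add i.i.d. zero-mean Gaussian noise (std sigma) to every pixel."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in frame]


def box_downscale(frame, factor=2):
    """Downscale by averaging non-overlapping factor x factor blocks."""
    height = len(frame) - len(frame) % factor
    width = len(frame[0]) - len(frame[0]) % factor
    out = []
    for i in range(0, height, factor):
        row = []
        for j in range(0, width, factor):
            block = [frame[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def residual_std(frame, clean_value):
    """Standard deviation of the residual noise around a constant clean value."""
    return statistics.pstdev([p - clean_value for row in frame for p in row])


rng = random.Random(0)
clean = [[0.5] * 64 for _ in range(64)]  # constant frame (assumption)
noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)

std_full = residual_std(noisy, 0.5)
std_down = residual_std(box_downscale(noisy), 0.5)
print(f"noise std at full resolution: {std_full:.3f}")
print(f"noise std after 2x downscale: {std_down:.3f}")
```

The same averaging also blurs edges in real frames, which is the motivation stated above for replacing a fixed filter with learnable convolutions.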

Denoising in the temporal space.

Motivated by the temporal property and chan2021basicvsrpp, we follow the second-order Markov chain to propagate the features. Given a denoised spatial feature $g_{i}^{s}$ , we use the optical-flow-guided deformable alignment as our temporal encoder $E_{temporal}$ to compute the features

\hat{f}_{i, j}^{s} = E_{temporal} (g_{i}^{s}, f_{i - 1, j}^{s}, f_{i - 2, j}^{s}, o_{i \to i - 1}^{s}, o_{i \to i - 2}^{s}),

(3)

where $f_{i, j}^{s}$ is the feature at the $i$ -th timestep in the $j$ -th propagation branch at the $s$ -th scale, and $o_{i_{1} \to i_{2}}^{s}$ is the optical flow from $i_{1}$ -th frame to the $i_{2}$ -th frame at the $s$ -th scale. In practice, we implement $E_{temporal}$ by using the architecture of the flow-guided deformable alignment of chan2021basicvsrpp to predict offset and mask in DCN zhu2019dcnv2. More details are provided in the Supplementary. After reducing the temporal noise, we use another spatial encoder $E_{spatial}^{'}$ to further remove the noise in space, i.e.,

f_{i, j}^{s} = \hat{f}_{i, j}^{s} + E_{spatial}^{'} ([f_{i, j - 1}^{s}; \hat{f}_{i, j}^{s}]),

(4)

where $[\cdot; \cdot]$ is a concatenation along the channel dimension and $f_{i, 0}^{s} = g_{i}^{s}$ . Let $f_{i}^{s}$ be the feature in the last branch at the $s$ -th scale, the spatial decoder $D_{spatial}$ aggregates features with the skip connection,

h_{i}^{s} = f_{i}^{s} + D_{spatial} (h_{i}^{s + 1}),

(5)

where $h_{i}^{S} = f_{i}^{S}$ at the last scale $S$ and spatial decoder can be implemented by multi-layered residual blocks he2016resnet with PixelShuffle shi2016pixelshuffle in the experiment. Last, we use convolutional layers to produce residual noise. In the training, we first train a denoiser using L1 loss, and then we further train the model by minimizing a weighted combination of L1 loss, perceptual loss and GAN loss.
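The index structure of Equations (3)-(5) can be sketched as follows. This is a schematic only, under several assumptions: features are plain floats rather than tensors, optical flows are omitted, only a forward branch is shown, missing neighbours at the start of the sequence are set to zero, and the encoder and decoder arguments are hypothetical stand-ins for the learned modules $E_{temporal}$, $E_{spatial}^{'}$ and $D_{spatial}$.

```python
def propagate_branch(g, prev_branch, temporal_encoder, spatial_encoder):
    """One forward propagation branch j at a fixed scale s.

    g           -- spatial features g_i^s, one per frame
    prev_branch -- features f_{i,j-1}^s of the previous branch (g for j = 1)
    """
    f = []
    for i, g_i in enumerate(g):
        # Second-order Markov chain: look back one and two time steps.
        f_prev1 = f[i - 1] if i >= 1 else 0.0  # zero padding is an assumption
        f_prev2 = f[i - 2] if i >= 2 else 0.0
        f_hat = temporal_encoder(g_i, f_prev1, f_prev2)           # Eq. (3)
        f.append(f_hat + spatial_encoder(prev_branch[i], f_hat))  # Eq. (4)
    return f


def decode(f_scales, decoder):
    """Coarse-to-fine aggregation with skip connections, Eq. (5).

    f_scales[0] is the finest scale and f_scales[-1] the coarsest scale S.
    """
    h = f_scales[-1]                  # h^S = f^S
    for f_s in reversed(f_scales[:-1]):
        h = f_s + decoder(h)          # h^s = f^s + D_spatial(h^{s+1})
    return h


def average3(g_i, f_prev1, f_prev2):
    """Toy temporal encoder: average the current and two previous features."""
    return (g_i + f_prev1 + f_prev2) / 3.0


def no_refine(prev_feature, f_hat):
    """Toy spatial encoder that adds no refinement."""
    return 0.0


g = [3.0, 3.0, 3.0]
f = propagate_branch(g, prev_branch=g, temporal_encoder=average3,
                     spatial_encoder=no_refine)
h = decode([1.0, 2.0, 3.0], decoder=lambda x: 2.0 * x)
```

In the toy run each feature depends on the two features before it, which is the second-order dependency of Equation (3); the decoder then folds the coarsest scale back into the finer ones.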

Figure 4: The architecture of the proposed multi-scale recurrent network. Our network is motivated by video noise properties. For non-blind video denoising, we take the noisy video and noise level map as an input. For general real video denoising, we feed the noisy video augmented by our degradation models to train the network. At each scale, the network first removes spatial noise with learnable ResNet downscaling blocks and then removes temporal noise using a recurrent structure.

3.3 Real Noise Degradations

Unlike the Gaussian noise in traditional setups, real-world videos often contain unknown noise and blur, which differ from video to video. Such degradations are more complex and also harder to collect. A denoiser trained on one noise distribution can generalize poorly to real-world video noise because of the distribution mismatch between training and test data. Following the guidance from Lemma 3.1, we propose a general video denoising method with a new noise degradation for real-world videos. Different from traditional methods, which directly reduce noise, general video denoising is more practical because it is able to learn a residual and jointly remove noise and blur.

To better model the real-world distribution, we propose to use a randomized combination of a wide range of degradation types. Specifically, we randomly change the order of the different degradations during training. The distribution of training data augmented with the proposed randomized degradations can overlap more with the potential test data with unknown degradations. Formally, given a clean video $x$, we use a composition of $N$ shuffled degradations to synthesize noisy video sequences $x_{σ}$:

x_{σ} = g (x) = (g_{i_{1}} \circ g_{i_{2}} \circ \dots \circ g_{i_{N}}) (x), where \{i_{1}, \dots, i_{N}\} = ϕ (\{1, \dots, N\}),

(6)

where $ϕ$ is a shuffle function, $\circ$ is a function composition, and $g_{i_{n}}$ is a degradation model of the $i_{n}$ -th type. Motivated by bishop1995training, we prove the following theorem to understand our degradation.

Theorem (Effect of noise degradations)

Let $z_{σ} = g (x) - x$, and assume that the mean and variance of the noise distribution are $0$ and $η^{2} (z_{σ})$. Then the loss (1) can be written as

E_{σ} [E_{x} [∥ f (x_{σ}) - x ∥^{2}]] = E_{x} [∥ f (x) - x ∥^{2}] + η^{2} (z_{σ}) E_{x} [{∥ \frac{\partial f}{\partial x} ∥}^{2} + \frac{1}{2} {(f (x) - x)}^{⊤} \frac{\partial^{2} f}{\partial x^{2}} 1].

(7)

$_{■}$

From this theorem, the loss (1) trained with our noise degradations is equivalent to a normal loss with a regularization term. The parameter $η^{2} (z_{σ})$ is related to the amplitude or variance of the noise $z_{σ}$ and controls how strongly the regularization term influences the loss. Moreover, our degradation model makes $η^{2} (z_{σ})$ large (see Figure 11), which improves the generalization performance of our model.

Figure 5: An illustration of the proposed noise degradation pipeline. For a high-quality video, a randomly shuffled degradation sequence is applied to produce a noisy video.
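Equation (6) can be sketched in a few lines. This is a minimal illustration under stated assumptions: each degradation is an arbitrary callable, the shuffle is uniform over permutations, and the function name `shuffled_degradation` is hypothetical rather than taken from the released code.

```python
import random


def shuffled_degradation(degradations, rng):
    """Return g = g_{i_1} ∘ ... ∘ g_{i_N} for a random order (i_1, ..., i_N)."""
    order = list(range(len(degradations)))
    rng.shuffle(order)  # phi: a random permutation of the indices

    def g(x):
        # In a composition the right-most function g_{i_N} is applied first.
        for idx in reversed(order):
            x = degradations[idx](x)
        return x

    return g, order


# Toy scalar "degradations" that do not commute, so the order matters.
toy = [lambda x: x + 1.0, lambda x: x * 2.0, lambda x: x - 3.0]
g, order = shuffled_degradation(toy, random.Random(0))
print(order, g(5.0))
```

Because the toy operations do not commute, drawing a new order for every training sample yields different outputs from the same clean input, which is the diversity the shuffle is meant to provide.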

Noise. Noises in real-world videos come from different sources. To simulate such noises, we propose noise degradations, including Gaussian noise, Poisson noise, Speckle noise, Processed camera sensor noise, JPEG compression noise and video compression noise.

Gaussian noise. When there is no prior information about the noise, one can add Gaussian noise to a video sequence. Such Gaussian noise can be additive white Gaussian noise (AWGN) or gray-scale AWGN. Given a clean video $x$, the noisy video is synthesized with additive noise $z$, i.e., $g_{1} (x) = x + z$, where the noise $z$ is sampled from AWGN $N (0, σ I)$ or gray-scale AWGN $N (0, σ 1)$. Here, $σ$ is a covariance, $I$ is an identity matrix and $1$ is a $3 \times 3$ all-ones matrix.
Poisson noise. In electronics, Poisson noise is a type of shot noise which occurs in photon counting in optical devices. Such noise arises from the discrete nature of electric charge, and it can be modeled by a Poisson process. Given a clean video $x$ , we synthesize a noisy video by $g_{2} (x) = x + z$ , where $z = z^{'} - x$ and $z^{'} \sim P (10^{α} \cdot x) / 10^{α}$ .
Speckle noise. Speckle noise exists in the synthetic aperture radar (SAR), medical ultrasound and optical coherence tomography images. We simulate such noise by multiplying the clean image $x$ and Gaussian noise $z$ , i.e., $x * z$ . Then, we synthesize noisy video by $g_{3} (x) = x + x * z$ .
Processed camera sensor noise. In modern digital cameras, the processed camera sensor noise originates from the image signal processing (ISP) pipeline. Inspired by zhang2022scunet, a reverse ISP pipeline first recovers the raw image from an RGB image, and the forward pipeline then constructs a noisy image after adding noise to the raw image, which is denoted by $g_{4} (x) = forward (reverse (x))$.
JPEG compression noise. JPEG compression is widely used to reduce the storage of digital images thanks to its fast encoding and decoding zhang2021designing. We denote the synthesized frames with JPEG compression noise by $g_{5} (x) = Dec (Enc (x))$. JPEG compression often causes $8 \times 8$ blocking artifacts.
Video compression noise. Compression artifacts are present in videos encoded in different formats. We apply video compression with av, a Pythonic interface to FFmpeg, i.e., $g_{6} (x) = av (x)$.
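Three of the noise types listed above are simple enough to sketch per pixel. The snippet below is illustrative only: it treats `sigma` as a standard deviation, assumes RGB values in [0, 1] without clipping, and uses hypothetical function names. It shows in particular how gray-scale AWGN differs from AWGN by sharing one noise sample across the three colour channels.

```python
import random


def awgn(pixel, sigma, rng):
    """AWGN: an independent noise sample for each colour channel."""
    return tuple(c + rng.gauss(0.0, sigma) for c in pixel)


def gray_awgn(pixel, sigma, rng):
    """Gray-scale AWGN: one noise sample shared by all three channels."""
    z = rng.gauss(0.0, sigma)
    return tuple(c + z for c in pixel)


def speckle(pixel, sigma, rng):
    """Speckle noise g_3(x) = x + x * z with Gaussian z (signal dependent)."""
    return tuple(c + c * rng.gauss(0.0, sigma) for c in pixel)


rng = random.Random(0)
gray_pixel = (0.5, 0.5, 0.5)
print(awgn(gray_pixel, 0.1, rng))       # channels perturbed independently
print(gray_awgn(gray_pixel, 0.1, rng))  # channels perturbed identically
print(speckle((0.0, 0.5, 1.0), 0.1, rng))  # a zero-valued channel stays zero
```

Speckle noise scales with the signal, so dark regions are perturbed less than bright ones, whereas both Gaussian variants are independent of the pixel value.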

Apart from noise, most real-world videos inherently suffer from blurriness. Thus, we additionally consider two common blur degradations: Gaussian blur and resizing blur. For Gaussian blur, we synthesize a video as $g_{7} (x) = x * κ$, where ‘$*$’ is the convolution operator and $κ$ is the Gaussian kernel. For resizing blur, we first downscale a video by a factor of $s$ and then upscale it to the original size, i.e., $g_{8} (x) = {up}_{s} ({down}_{\frac{1}{s}} (x))$, where ${down}_{\frac{1}{s}}$ and ${up}_{s}$ are the downscaling and upscaling functions.
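The Gaussian blur $g_{7}$ can be sketched with a separable kernel. This is a minimal example under assumptions not specified in the text: a single-channel frame stored as nested lists, an isotropic kernel with hypothetical parameters `radius` and `std`, and border replication at the frame edges.

```python
import math


def gaussian_kernel(radius, std):
    """Normalised 1-D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(k * k) / (2.0 * std * std))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Convolve one row with the kernel, replicating the border pixels."""
    radius = len(kernel) // 2
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # clamp to valid indices
            acc += w * row[j]
        out.append(acc)
    return out


def gaussian_blur(frame, radius=2, std=1.0):
    """Separable blur g_7(x) = x * kappa: filter rows, then columns."""
    kernel = gaussian_kernel(radius, std)
    rows = [blur_row(r, kernel) for r in frame]
    cols = [blur_row(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
blurred = gaussian_blur(impulse)
```

Since the kernel is normalised, the blur spreads the impulse over its neighbourhood while leaving the total intensity and any constant region unchanged.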

4 Experiments

Dataset	$σ$	VBM4D maggioni2012bm4d	VNLB arias2018vnlb	DVDnet tassano2019dvdnet	FastDVDnet tassano2020fastdvdnet	VNLNet davy2018vnlnet	PaCNet vaksman2021pacnet	BasicVSR++ chan2022basicvsrpp2	VRT liang2022vrt	ReViD
DAVIS	10	37.58	38.85	38.13	38.71	39.56	39.97	40.13	40.82	41.03
	20	33.88	35.68	35.70	35.77	36.53	36.82	37.41	38.15	38.50
	30	31.65	33.73	34.08	34.04	-	34.79	35.74	36.52	36.97
	40	30.05	32.32	32.86	32.82	33.32	33.34	34.49	35.32	35.83
	50	28.80	31.13	31.85	31.86	-	32.20	33.45	34.36	34.90
Set8	10	36.05	37.26	36.08	36.44	37.28	37.06	36.83	37.88	38.07
	20	32.18	33.72	33.49	33.43	34.08	33.94	34.15	35.02	35.35
	30	30.00	31.74	31.79	31.68	-	32.05	32.57	33.35	33.78
	40	28.48	30.39	30.55	30.46	30.72	30.70	31.42	32.15	32.66
	50	27.33	29.24	29.56	29.53	-	29.66	30.49	31.22	31.77
Params. (M)		-	-	0.48	2.48	-	2.87	9.76	18.3	13.68
Runtime (s)		420.0	156.0	2.51	0.08	1.65	35.24	0.08	5.91	0.32

Table 1: Quantitative comparison (average RGB channel PSNR) with state-of-the-art methods for video denoising on the DAVIS khoreva2018davis and Set8 tassano2019dvdnet datasets. Best results are in bold.

Methods	$σ = 10$	$σ = 30$	$σ = 50$	Average
ViDeNN claus2019videnn	37.13	32.24	29.77	33.05
FastDVDnet tassano2020fastdvdnet	38.65	33.59	31.28	34.51
PaCNet vaksman2021pacnet	39.96	34.66	32.00	35.54
ReViD-blind	40.94	36.79	34.65	37.46
ReViD (Ours)	41.00	36.91	34.83	37.58

Table 2: Quantitative comparison in PSNR for denoising clipped Gaussian noise on DAVIS.

Methods	$σ = 15$	$σ = 25$	$σ = 50$	Average
BM3D vaksman2020lidia	29.00	28.64	26.50	28.05
Restormer zamir2021restormer	34.36	31.40	28.57	31.44
SwinIR liang2021swinir	34.87	32.37	29.19	32.14
SCUNet zhang2022scunet	34.82	32.34	29.14	32.10
ReViD (Ours)	36.47	34.49	31.77	34.24

Table 3: Quantitative comparison in PSNR for single image denoising on Set8 dataset.

Figure 6: Visual comparison of different methods on DAVIS khoreva2018davis under the noise level of 50.

4.1 Synthetic Gaussian Denoising

Datasets. We use DAVIS khoreva2018davis and Set8 tassano2019dvdnet in synthetic Gaussian denoising. Following the setting of liang2022vrt, we synthesize the noisy video sequences by adding AWGN with noise level $σ \in [0, 50]$ on the DAVIS khoreva2018davis training set. We then train the model by using the synthesized data and test it on the DAVIS testing set and Set8 tassano2019dvdnet with different Gaussian noise levels ${10, 20, 30, 40, 50}$ .
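The synthesis of AWGN training samples can be sketched as follows. The range $σ \in [0, 50]$ and the use of a noise level map for the non-blind input come from the text; the uniform sampling, the choice of one noise level per clip, the division by 255 to map $σ$ onto a [0, 1] intensity scale, and the function name are assumptions made for illustration.

```python
import random


def make_nonblind_sample(clean_frames, rng, sigma_max=50.0, peak=255.0):
    """Synthesize one AWGN training sample and its noise level map.

    clean_frames -- nested lists [frame][row][col] with values in [0, 1]
    Returns (noisy_frames, level_maps, sigma) with sigma on the [0, 1] scale.
    """
    sigma = rng.uniform(0.0, sigma_max) / peak  # one level per clip (assumption)
    noisy = [[[p + rng.gauss(0.0, sigma) for p in row] for row in frame]
             for frame in clean_frames]
    # The level map has the same spatial size as each frame and is constant.
    level_maps = [[[sigma] * len(row) for row in frame]
                  for frame in clean_frames]
    return noisy, level_maps, sigma


rng = random.Random(0)
clip = [[[0.5] * 8 for _ in range(8)] for _ in range(5)]  # 5 frames of 8 x 8
noisy, level_maps, sigma = make_nonblind_sample(clip, rng)
```

A blind variant would simply drop `level_maps` from the network input while keeping the same noisy frames.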

Figure 7: Runtime, PSNR, and model size.

Quantitative comparison. Tables 1-3 show the quantitative comparison in PSNR chan2021basicvsrpp between different methods on the test datasets DAVIS khoreva2018davis and Set8 tassano2019dvdnet. Our method achieves the best performance on both DAVIS and Set8 by a large margin. Specifically, our model outperforms BasicVSR++ chan2022basicvsrpp2 by an average PSNR of 1.21 dB and 1.24 dB on DAVIS and Set8, respectively. Moreover, we also train a blind model for clipped AWGN (ReViD-blind in Table 2), which also outperforms the compared methods. In Figure 7, our model achieves the best performance gains with similar model size and runtime. In particular, for the largest noise level of 50, our model outperforms VRT liang2022vrt with a smaller model size and faster inference time, yielding a PSNR improvement of 0.54 dB and 0.55 dB on DAVIS and Set8, respectively.
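For reference, the PSNR metric used in these tables can be computed as below. This is a generic sketch of the standard definition rather than the evaluation script behind the reported numbers; the flattening of frames into flat pixel sequences and the averaging over frame pairs are assumptions.

```python
import math


def psnr(reference, restored, peak=1.0):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak * peak / mse)


def average_psnr(pairs, peak=1.0):
    """Mean PSNR over (reference, restored) frame pairs."""
    values = [psnr(ref, out, peak) for ref, out in pairs]
    return sum(values) / len(values)
```

As a sanity check, a uniform error of 0.1 on a [0, 1] intensity scale corresponds to an MSE of 0.01 and hence a PSNR of about 20 dB.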

Qualitative comparison.

In Figure 6, we provide visual comparisons of different video denoising methods under the high noise level of 50. Our proposed denoiser restores better structures and preserves cleaner edges than previous state-of-the-art video denoising methods, even though the noise level is high. In particular, our model is able to restore the letters ‘Gebr’ in the first example and the piano texture in the second example of Figure 6. In contrast, VBM4D maggioni2012bm4d, DVDnet tassano2019dvdnet and FastDVDnet tassano2020fastdvdnet fail to remove severe noise from a video frame. BasicVSR++ chan2021basicvsrpp and VRT liang2022vrt only restore part of the textures.

Methods	VideoLQ			NoisyCity4
Methods	NIQE $↓$	BRISQUE $↓$	PIQE $↓$	NIQE $↓$	BRISQUE $↓$	PIQE $↓$
SCUNet zhang2022scunet	4.7797	39.6360	68.7677	5.1971	51.5672	85.2371
Restormer zamir2021restormer	4.3755	39.9023	69.6296	5.1884	52.7126	86.2248
ViDeNN claus2019videnn	4.2722	33.8539	60.7876	4.7613	42.5865	78.9111
BasicVSR++ chan2021basicvsrpp	4.0233	34.9458	51.4780	5.4899	52.1469	81.1234
BasicVSR++ $^{*}$ chan2021basicvsrpp	4.2879	29.1541	49.1658	4.4235	33.4198	47.5131
RealBasicVSR $^{*}$ chan2021realbasicvsr	4.2167	29.2103	48.0369	4.0578	26.3504	51.5825
ReViD-real	4.0205	29.0212	45.0768	3.8540	24.2025	48.2962

Table 4: Quantitative comparison of different methods on VideoLQ and NoisyCity4 for the practical video denoising task. For a fair comparison, we train BasicVSR++ and RealBasicVSR with the same proposed noise degradation pipeline, which is denoted by the suffix ‘$^{*}$’.

Figure 8: Visual comparison of different video denoising methods on NoisyCity4.

4.2 General Real Video Denoising

Figure 9: Examples of the NoisyCity4 dataset.

For real video denoising, we use REDS nah2019reds as the training set. Following the setting of wang2019edvr, we use 266 regrouped training clips in REDS nah2019reds, each with 100 consecutive frames. Specifically, we synthesize noisy video sequences on the REDS training set by using our proposed noise degradation model. To evaluate the generalizability of real-world video denoising methods, one can use VideoLQ chan2021realbasicvsr, which was downloaded from Flickr and YouTube and contains 50 video sequences, each with up to 100 frames. However, the VideoLQ dataset was mainly proposed for real-world video super-resolution and has a low level of noise. To address this, we additionally propose a new benchmark dataset for real-world video denoising, called the NoisyCity4 dataset. This dataset is collected from YouTube and contains four city street videos from decades ago. The videos in the proposed dataset contain real-world noise from different sources such as film grain, film scratches, flicker, etc. Examples of the NoisyCity4 videos are shown in Figure 9 and further provided in the supplementary material. Each video in NoisyCity4 contains a sequence of 100 frames with different noise.

Quantitative comparison. Table 4 provides the quantitative comparison of different methods on VideoLQ chan2021realbasicvsr and NoisyCity4. Here, we use three no-reference metrics, NIQE mittal2012niqe, BRISQUE mittal2011brisque and PIQE venkatanath2015piqe, because they are commonly used to measure image quality and ground-truth videos are not available. Our model achieves better performance than the other methods under almost all metrics; the exception is PIQE on NoisyCity4, where BasicVSR++ $^{*}$ obtains a lower score (47.5131 vs. 48.2962). In contrast, it is difficult for ViDeNN to reduce noise in real videos since the videos are captured by different cameras. With the help of our noise degradation model, the denoisers are able to reduce the real-world noise.

Qualitative comparison. As shown in Figure 8, our model achieves the best visual quality among the compared methods. By taking the spatial and temporal properties into account and using the proposed noise degradation model, our denoiser improves visual quality and produces cleaner details and edges than other methods. For instance, our model is able to recover the windows of the building. In contrast, it is hard for the image-based denoisers zamir2021restormer; zhang2022scunet and ViDeNN claus2019videnn to remove the noise well in a real-world video. These results demonstrate that our degradation model is able to improve the generalization ability.

Figure 10: Distributions of noise degradation without (Top) and with (Bottom) random shuffle.

4.3 Further Experiments

Effect of our degradation model. To study the effect of our degradation model, we show the distributions of the noise synthesized by our degradation model with and without the proposed random shuffle in Figure 10. The random shuffle strategy improves the diversity of the synthesized distributions. In addition, this strategy increases the noise variance, as shown in Figure 11. This shows that the proposed method can generate more diverse distributions during training.

Ablation study. We investigate the effectiveness of the spatial and temporal modules in Table 5. Specifically, we conduct experiments by removing each of these modules. The model without either module suffers a performance drop, which demonstrates their importance. In addition, we investigate the performance when increasing the number of downscaling steps to 3. The model has comparable PSNR but a larger model size. Thus, we downscale the videos twice in our experiments.

5 Conclusion

In this paper, we propose a practical and important setup in video denoising called general video denoising. Motivated by properties of video noise, we first propose a real video denoising network, called ReViD, which achieves state-of-the-art performance on synthetic Gaussian denoising and general real video denoising. Moreover, we make the first attempt to design a new noise degradation model for the real-world video denoising task, which considers different kinds of noise with random shuffle. In addition, we propose a new real video denoising dataset with different levels of noise. Extensive experiments demonstrate the effectiveness, superiority and practicability of our method. Besides, our model generalizes well to unseen real videos.

Figure 11: Variance of noise degradations.

Acknowledgements

This work was partly supported by Huawei Fund and the ETH Zürich Fund (OK).

References


Supplementary Materials: Practical Real Video Denoising with Realistic Degradation Model


Organization.

We organize our supplementary materials as follows. For the theory part, we provide detailed proofs of the lemma and theorems in Section A. In Section B, we provide more detailed formulations of spatial and temporal denoising. In Section C, we provide detailed settings of the noise degradation used in the experiments. In Section D, we provide more settings, details and results of the experiments. In Section E, we discuss the limitations and societal impacts of our proposed method.

Appendix A Theoretical Analysis

We first provide the proof of Lemma 3.1 in the paper.

Lemma 3.1 Assume that the training distribution $\pi$ and the testing distribution $p$ partly overlap, and let $f_{p} = \arg\min_{f} L_{p}(f)$. The risk of $f_{p}$ is bounded by: $L_{p}(f_{p}) \leq L_{p}(f_{\pi}), \forall \pi$.

Proof

According to the definition of $f_{p}$, $f_{p}$ is the minimizer of the loss $L_{p}(f)$, i.e., $L_{p}(f_{p}) = \inf_{f} L_{p}(f)$. Thus, we have $L_{p}(f_{p}) \leq L_{p}(f_{\pi})$ for any training distribution $\pi$ which overlaps with the testing distribution $p$. $\blacksquare$

We next relate the denoising problem to the general optimization problem of training with noise. Given a training pair $(x, y)$, the general optimization problem of training with some kind of noise $z_{\sigma}$, with noisy input $x_{\sigma} = x + z_{\sigma}$, can be written as:

	$\min_{f} \mathbb{E}_{\sigma} [\mathbb{E}_{(x_{\sigma}, y)} [\| f(x_{\sigma}) - y \|^{2}]].$	(8)

Based on the analysis of suppbishop1995training , we first provide the following theorem.

Theorem 3.3 (Effect of noise degradations) Let $z_{\sigma} = g(x) - x$, and assume that the mean and variance of the noise distribution are $0$ and $\eta^{2}(z_{\sigma})$. Then the loss (1) can be written as

	$\mathbb{E}_{z_{\sigma}} [\mathbb{E}_{(x, y)} [\| f(x + z_{\sigma}) - y \|^{2}]] =$	$\mathbb{E}_{(x, y)} [\| f(x) - y \|^{2}]$		
		$+ \eta^{2}(z_{\sigma}) \, \mathbb{E}_{(x, y)} \left[ \left\| \frac{\partial f}{\partial x} \right\|^{2} + \frac{1}{2} (f(x) - y)^{\top} \frac{\partial^{2} f}{\partial x^{2}} \mathbf{1} \right].$		(9)

Proof

Writing out the expectation over $x$, $y$ and $z_{\sigma}$, we have

	$\mathbb{E}_{z_{\sigma}} [\mathbb{E}_{(x, y)} [\| f(x + z_{\sigma}) - y \|^{2}]]$	(10)
$=$	$\int \int \int \| f(x + z_{\sigma}) - y \|^{2} \, p(x) \, p(y \mid x) \, p(z_{\sigma}) \, dx \, dy \, dz_{\sigma}$	(11)
$=$	$\int \int \int \sum_{k} (f_{k}(x + z_{\sigma}) - y_{k})^{2} \, p(x) \, p(y \mid x) \, p(z_{\sigma}) \, dx \, dy \, dz_{\sigma}$	(12)
$=$	$\int \int \sum_{k} (f_{k}(x) - y_{k})^{2} \, p(x) \, p(y \mid x) \, dx \, dy$	(13)
	$+ \eta^{2}(z_{\sigma}) \int \int \sum_{i, k} \left[ \left( \frac{\partial f_{k}}{\partial x_{i}} \right)^{2} + \frac{1}{2} (f_{k}(x) - y_{k}) \frac{\partial^{2} f_{k}}{\partial x_{i}^{2}} \right] p(x) \, p(y \mid x) \, dx \, dy$	(14)
$=$	$\mathbb{E}_{(x, y)} [\| f(x) - y \|^{2}] + \eta^{2}(z_{\sigma}) \, \mathbb{E}_{(x, y)} \left[ \left\| \frac{\partial f}{\partial x} \right\|^{2} + \frac{1}{2} (f(x) - y)^{\top} \frac{\partial^{2} f}{\partial x^{2}} \mathbf{1} \right],$	(15)

where Equations (13)-(14) follow from the assumptions on the noise $z_{\sigma}$, i.e.,

	$\int z_{i} \, p(z_{\sigma}) \, dz_{\sigma} = 0, \quad \int z_{i} z_{j} \, p(z_{\sigma}) \, dz_{\sigma} = \eta^{2}(z_{\sigma}) \, \delta_{ij},$	(16)

and from the Taylor expansion of $f_{k}$ around $z_{\sigma} = 0$, i.e.,

	$f_{k}(x + z_{\sigma}) = f_{k}(x) + \sum_{i} z_{i} \left. \frac{\partial f_{k}}{\partial x_{i}} \right|_{z_{\sigma} = 0} + \frac{1}{2} \sum_{i} \sum_{j} z_{i} z_{j} \left. \frac{\partial^{2} f_{k}}{\partial x_{i} \partial x_{j}} \right|_{z_{\sigma} = 0} + O(z_{\sigma}^{3}).$	(17)

$\blacksquare$
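As a sanity check of the structure of Theorem 3.3, the following minimal sketch considers the special case of a linear predictor $f(x) = w^{\top} x$, for which the second-derivative term vanishes and the noisy loss should equal the clean loss plus $\eta^{2} \| w \|^{2}$. The Gaussian inputs, the linear target, the weights and the sample count are all hypothetical choices made for illustration; they are not taken from our experiments.

```python
import random


def noisy_and_regularized_loss(w, eta, n_samples=100_000, seed=0):
    """Monte Carlo comparison for a linear predictor f(x) = w . x.

    Returns (loss trained with input noise, clean loss + eta^2 * ||w||^2).
    For a linear f the second-derivative term of the theorem is zero.
    """
    rng = random.Random(seed)
    d = len(w)
    noisy_sum = 0.0
    clean_sum = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Hypothetical target: a fixed linear function of x plus an offset.
        y = sum(0.5 * xi for xi in x) + 0.1
        # Zero-mean noise with variance eta^2, independent across dimensions.
        z = [rng.gauss(0.0, eta) for _ in range(d)]
        f_clean = sum(wi * xi for wi, xi in zip(w, x))
        f_noisy = sum(wi * (xi + zi) for wi, xi, zi in zip(w, x, z))
        clean_sum += (f_clean - y) ** 2
        noisy_sum += (f_noisy - y) ** 2
    noisy_loss = noisy_sum / n_samples
    clean_loss = clean_sum / n_samples
    regularizer = eta ** 2 * sum(wi ** 2 for wi in w)
    return noisy_loss, clean_loss + regularizer


noisy, predicted = noisy_and_regularized_loss(w=[1.5, -0.7, 0.2], eta=0.3)
print(f"noisy loss: {noisy:.4f}, clean loss + regularizer: {predicted:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which illustrates how the noise variance $\eta^{2}(z_{\sigma})$ scales the gradient penalty. The sketch does not exercise the second-derivative term, which only appears for nonlinear $f$.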

Note that when $x = y$, the general learning problem becomes the problem of learning an autoencoder. Based on Theorem 3.3, we restate the theorem for the case $x = y$ as follows.

Theorem (Effect of noise degradations) Let $z_{\sigma} = g(x) - x$, and assume that the mean and variance of the noise distribution are $0$ and $\eta^{2}(z_{\sigma})$. Then the loss (1) can be written as

	$\mathbb{E}_{\sigma} [\mathbb{E}_{x} [\| f(x_{\sigma}) - x \|^{2}]] =$	$\mathbb{E}_{x} [\| f(x) - x \|^{2}]$		
		$+ \eta^{2}(z_{\sigma}) \, \mathbb{E}_{x} \left[ \left\| \frac{\partial f}{\partial x} \right\|^{2} + \frac{1}{2} (f(x) - x)^{\top} \frac{\partial^{2} f}{\partial x^{2}} \mathbf{1} \right].$		(18)

Proof

Setting $y = x$ in Theorem 3.3 completes the proof. $\blacksquare$

From this theorem, the loss (1) trained with our noise degradations is equivalent to an autoencoder loss with a regularization term. The parameter $\eta^{2}(z_{\sigma})$ is related to the amplitude or variance of the noise $z_{\sigma}$ and controls how strongly the regularization term influences the loss.

Appendix B More Details of Spatial and Temporal Denoising

Spatial denoising. Given a feature, we use multi-layered residual blocks supphe2016resnet to implement the spatial encoder $E_{spatial}$ to extract deep features and reduce the spatial noise at each scale, i.e.,

	$E_{spatial}(g_{i}^{s-1}) = R_{N} \circ \dots \circ R_{1}(g_{i}^{s-1}),$	(19)

where $\circ$ denotes function composition, and each $R_{i}$ is a residual block. In the experiment, we set $N = 5$ and the number of feature channels to 64. Given a feature $g$, the residual block is formulated as

	$R_{i}(g) = g + \mathrm{Conv}_{2}(\mathrm{ReLU}(\mathrm{Conv}_{1}(g))),$	(20)

where $\mathrm{Conv}_{1}$ and $\mathrm{Conv}_{2}$ are convolutional layers, and $\mathrm{ReLU}$ is an activation.
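The composition in Equations (19) and (20) can be illustrated with a deliberately simplified sketch. It operates on a 1-D, single-channel signal in pure Python rather than on 2-D feature maps with 64 channels, and the kernel values are hypothetical placeholders rather than learned weights, so it only shows the structure of the residual blocks and their composition.

```python
from functools import reduce


def conv1d(signal, kernel):
    """Same-size 1-D correlation with zero padding (odd kernel length assumed)."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(signal):
    return [max(0.0, v) for v in signal]


def make_residual_block(kernel1, kernel2):
    """Build R(g) = g + Conv2(ReLU(Conv1(g))), following Eq. (20)."""
    def block(g):
        branch = conv1d(relu(conv1d(g, kernel1)), kernel2)
        return [a + b for a, b in zip(g, branch)]
    return block


def make_spatial_encoder(blocks):
    """Build E_spatial = R_N o ... o R_1: R_1 is applied first, R_N last."""
    def encoder(g):
        return reduce(lambda feat, block: block(feat), blocks, g)
    return encoder


# N = 5 blocks as stated in the text; the kernels are hypothetical.
blocks = [make_residual_block([0.1, 0.2, 0.1], [0.0, 0.5, 0.0]) for _ in range(5)]
encoder = make_spatial_encoder(blocks)
features = encoder([0.0, 1.0, 0.0, -1.0, 0.0])
print(features)
```

Because each block adds its convolutional branch to its input, a block whose second convolution is zero reduces to the identity, which is the usual motivation for stacking residual blocks.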

Temporal denoising. We implement $E_{temporal}$ by using the architecture of the flow-guided deformable alignment of suppchan2021basicvsrpp to predict offset and mask in DCN suppzhu2019dcnv2 . Given denoised spatial features $g_{i}^{s}$ , we use the optical-flow-guided deformable alignment as our temporal encoder $E_{temporal}$ to compute the features at the $j$ -th branch, i.e.,

	$\hat{f}_{i} =$	$E_{temporal}(g_{i}^{s}, f_{i-1,j}^{s}, f_{i-2,j}^{s}, o_{i \to i-1}^{s}, o_{i \to i-2}^{s})$		(21)
	$=$	$\mathrm{DCN}([f_{i-1}; f_{i-2}], [\tilde{o}_{i \to i-1}; \tilde{o}_{i \to i-2}], [m_{i \to i-1}; m_{i \to i-2}]),$		(22)

where the offsets and masks are formulated as

	$\tilde{o}_{i \to i-p}$	$= o_{i \to i-p} + \mathrm{Conv}([g_{i}; \bar{f}_{i-1}; \bar{f}_{i-2}]),$		(23)
	$m_{i \to i-p}$	$= \mathrm{Sigmoid}(\mathrm{Conv}([g_{i}; \bar{f}_{i-1}; \bar{f}_{i-2}])),$		(24)

where $p = 1, 2$ and $\bar{f}_{i-p}$ is the feature $f_{i-p}$ warped by the optical flow $o_{i \to i-p}$, i.e.,

	$\bar{f}_{i-1} = \mathrm{warp}(f_{i-1}, o_{i \to i-1}),$		(25)
	$\bar{f}_{i-2} = \mathrm{warp}(f_{i-2}, o_{i \to i-2}),$		(26)

where $\mathrm{warp}(\cdot)$ warps a feature according to the optical flow. After reducing the temporal noise, we use another spatial encoder $E_{spatial}^{'}$ with 7 residual blocks.
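To make the warping in Equations (25) and (26) concrete, the sketch below backward-warps a single-channel 2-D feature map with bilinear sampling. The convention that the output at pixel $(x, y)$ is sampled from the previous feature at $(x + o_{x}, y + o_{y})$, with zeros outside the map, is an assumption commonly used for flow-based warping; it is a pure-Python illustration, whereas a practical implementation would operate on multi-channel tensors.

```python
import math


def warp(feature, flow):
    """Backward-warp a 2-D feature map with a per-pixel flow field.

    feature: H x W nested list of floats (e.g. f_{i-1}).
    flow:    H x W nested list of (dx, dy) tuples in pixels (e.g. o_{i -> i-1}).
    Output pixel (y, x) is sampled bilinearly at (y + dy, x + dx);
    samples that fall outside the map contribute zero.
    """
    h, w = len(feature), len(feature[0])

    def at(yy, xx):
        return feature[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            sx, sy = x + dx, y + dy
            x0, y0 = math.floor(sx), math.floor(sy)
            ax, ay = sx - x0, sy - y0  # fractional parts for interpolation
            top = (1 - ax) * at(y0, x0) + ax * at(y0, x0 + 1)
            bottom = (1 - ax) * at(y0 + 1, x0) + ax * at(y0 + 1, x0 + 1)
            out[y][x] = (1 - ay) * top + ay * bottom
    return out


prev_feature = [[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]]
flow_right = [[(1.0, 0.0)] * 3 for _ in range(2)]
print(warp(prev_feature, flow_right))
```

With a zero flow the feature is returned unchanged, and a constant flow of one pixel shifts the content by one column, which is the alignment effect that the warped features $\bar{f}_{i-1}$ and $\bar{f}_{i-2}$ provide before the offsets and masks are predicted.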

Difference from BasicVSR++. Our architecture design differs from BasicVSR++ in the following aspects. First, our denoiser is built on the U-Net architecture with downscaling and upscaling, which is effective at capturing spatio-temporal information for video denoising. Specifically, in downscaling, features are extracted at different scales by both spatial denoising and temporal propagation. Multi-scale optical flows are used to guide the alignment, so as to deal with different motion magnitudes. In upscaling, we only perform spatial modelling to save computation cost. In contrast, BasicVSR++ does not use multi-scale modelling, which is important in video denoising as shown in Figure 1 (a) of the paper. Second, BasicVSR++ directly downsamples the inputs using bicubic interpolation. Such downsampling can remove part of the noise but also removes some useful texture information. In contrast, we propose to downscale with learnable parameters to remove noise while preserving the useful texture information. Since BasicVSR++ is designed for video super-resolution rather than video denoising, directly applying it to video denoising results in inferior performance, as shown in Table 1.

Appendix C Experiment Details of Noise Degradation

Noise. In the experiment, we consider 6 kinds of noise in the degradations, including Gaussian noise, Poisson noise, speckle noise, processed camera sensor noise, JPEG compression noise and video compression noise. To explore the properties of video denoising, we apply these noises in the default order listed below for Figure 1 (a) and Figure 10 (Top).

Gaussian noise. We uniformly sample noise levels $\sigma$ from $[2, 50]$. We randomly choose AWGN and grayscale AWGN with probabilities of 0.6 and 0.4, respectively.
Poisson noise. We add Poisson noise to color and grayscale images by sampling different noise levels. We first multiply the clean video by $10^{\alpha}$, where $\alpha$ is uniformly chosen from $[2, 4]$, then sample from the Poisson distribution with the scaled values as rates, and finally divide by $10^{\alpha}$.
Speckle noise. We sample the level of this noise from $[0, 50]$.
Processed camera sensor noise. Inspired by suppzhang2022scunet, the reverse ISP pipeline first obtains the raw image from an RGB image, then the forward pipeline constructs a noisy raw image by adding noise to the raw image.
JPEG compression noise. The JPEG quality factor is uniformly chosen from $[30, 95]$. JPEG compression introduces $8 \times 8$ blocking artifacts.
Video compression noise. We use av, the Pythonic binding for FFmpeg, to produce compression noise. We randomly select the codec from [‘libx264’, ‘h264’, ‘mpeg4’] and the bitrate from [1e4, 1e5] during training.
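The following minimal sketch illustrates how three of the noise types above could be applied in a randomly shuffled order. It works on a flat list of pixel intensities in $[0, 1]$ and uses only the Python standard library. The multiplicative form of the speckle noise, the division of noise levels by 255, the clipping after every step and the normal approximation used for large Poisson rates are assumptions made for brevity; camera sensor, JPEG and video compression noise are omitted because they require external tools.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson sample: Knuth's method for small rates, and a normal
    approximation (an assumption made for brevity) for large rates."""
    if lam <= 0.0:
        return 0
    if lam < 30.0:
        limit, k, prod = math.exp(-lam), 0, rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        return k
    return max(0, round(rng.gauss(lam, math.sqrt(lam))))


def add_gaussian(pixels, rng):
    sigma = rng.uniform(2.0, 50.0) / 255.0  # noise level from [2, 50]
    return [p + rng.gauss(0.0, sigma) for p in pixels]


def add_poisson(pixels, rng):
    scale = 10.0 ** rng.uniform(2.0, 4.0)  # alpha from [2, 4]
    return [sample_poisson(p * scale, rng) / scale for p in pixels]


def add_speckle(pixels, rng):
    sigma = rng.uniform(0.0, 50.0) / 255.0  # noise level from [0, 50]
    # Assumed multiplicative model: x + x * n with n ~ N(0, sigma^2).
    return [p + p * rng.gauss(0.0, sigma) for p in pixels]


def degrade(pixels, rng, shuffle=True):
    """Apply the noise types in default or randomly shuffled order."""
    steps = [("gaussian", add_gaussian),
             ("poisson", add_poisson),
             ("speckle", add_speckle)]
    if shuffle:
        rng.shuffle(steps)
    for _, fn in steps:
        # Clip to the valid intensity range after every degradation.
        pixels = [min(1.0, max(0.0, p)) for p in fn(pixels, rng)]
    return pixels, [name for name, _ in steps]


noisy_pixels, order = degrade([0.1, 0.5, 0.9] * 4, random.Random(0))
print(order)
print([round(p, 3) for p in noisy_pixels])
```

Calling `degrade(..., shuffle=False)` corresponds to the default order, while `shuffle=True` draws a new order for every call, which is the mechanism that enlarges the degradation space.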

Blur. In addition to noise, most real-world videos inherently suffer from blur introduced by the digital camera. Thus, we consider two blur degradations, including Gaussian blur and resizing blur.

Gaussian blur. We synthesize Gaussian blur with different kernels, including [‘iso’, ‘aniso’, ‘generalized_iso’, ‘generalized_aniso’, ‘plateau_iso’, ‘plateau_aniso’, ‘sinc’]. We randomly choose these kernels with the probabilities of [0.405, 0.225, 0.108, 0.027, 0.108, 0.027, 0.1]. The settings of these blurs are the same as in suppchan2021realbasicvsr .
Resizing blur. We randomly draw the resize scales from [0.5, 2], and choose the interpolation mode from [‘bilinear’, ‘area’, ‘bicubic’] with the same probability of $1 / 3$ .
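The sampling of the blur settings above can be sketched as follows. The function name and the dictionary layout are hypothetical; only the kernel types, their probabilities, the resize range and the interpolation modes come from the text, and the actual kernel generation and resizing are not shown.

```python
import random

KERNEL_TYPES = ["iso", "aniso", "generalized_iso", "generalized_aniso",
                "plateau_iso", "plateau_aniso", "sinc"]
KERNEL_PROBS = [0.405, 0.225, 0.108, 0.027, 0.108, 0.027, 0.1]
RESIZE_MODES = ["bilinear", "area", "bicubic"]


def sample_blur_settings(rng):
    """Draw one blur configuration: kernel type, resize scale, interpolation."""
    kernel = rng.choices(KERNEL_TYPES, weights=KERNEL_PROBS, k=1)[0]
    scale = rng.uniform(0.5, 2.0)
    mode = rng.choice(RESIZE_MODES)  # uniform, i.e. probability 1/3 each
    return {"kernel": kernel, "resize_scale": scale, "resize_mode": mode}


rng = random.Random(0)
for _ in range(3):
    print(sample_blur_settings(rng))
```

Since the kernel probabilities sum to one, `random.choices` can use them directly as weights.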

Appendix D More Experiments

D.1 More Details of Experiment Setting

We adopt the Adam optimizer suppkingma2014adam and the Cosine Annealing scheme supploshchilov2016sgdr to decay the learning rate from $1 \times 10^{-4}$ to $10^{-7}$. The patch size is $256 \times 256$, and the batch size is 8. The number of input frames is 15. All experiments are implemented in PyTorch 1.9.1. We train a denoising model on 8 A100 GPUs. We use the pre-trained SPyNet suppranjan2017optical to estimate the flow, and keep its parameters fixed during training. We train our video denoiser for 150k iterations. For synthetic Gaussian denoising, the learning rate of the generator is $1 \times 10^{-4}$. For real-world video denoising, the learning rates of the generator and discriminator are set to $5 \times 10^{-5}$ and $1 \times 10^{-4}$, respectively. The architecture of the generator is introduced in Section B. The architecture of the discriminator is the same as in Real-ESRGAN suppwang2021realesrgan. When training classic video denoising, we use the Charbonnier loss suppcharbonnier1994two due to its stability and good performance. For real video denoising, we first train a model with the Charbonnier loss, and then finetune the network with the pixel loss $L_{pix}$, the perceptual loss $L_{per}$ suppjohnson2016perceptual and the adversarial loss $L_{adv}$ suppgoodfellow2014gan, i.e., $L = L_{pix} + \lambda_{1} L_{per} + \lambda_{2} L_{adv}$, where $\lambda_{1} = 1$ and $\lambda_{2} = 5 \times 10^{-1}$. We set the values of $N$ and $L$ to 5 and 7, respectively.
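For reference, the learning-rate schedule and the Charbonnier loss mentioned above can be written in a few lines. The sketch assumes a single cosine cycle without restarts over the 150k iterations, and the value of the Charbonnier constant `eps` is a hypothetical choice, since neither detail is specified above.

```python
import math


def cosine_annealing_lr(step, total_steps=150_000, lr_max=1e-4, lr_min=1e-7):
    """Learning rate at a given step for one cosine cycle (no restarts assumed)."""
    cos_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos_term


def charbonnier_loss(pred, target, eps=1e-3):
    """Mean of sqrt((pred - target)^2 + eps^2); eps is a hypothetical value."""
    total = sum(math.sqrt((p - t) ** 2 + eps ** 2) for p, t in zip(pred, target))
    return total / len(pred)


for step in (0, 75_000, 150_000):
    print(step, cosine_annealing_lr(step))
print(charbonnier_loss([0.2, 0.8], [0.25, 0.7]))
```

The schedule starts at $1 \times 10^{-4}$, reaches the midpoint of the range halfway through training, and ends at $10^{-7}$. The Charbonnier loss behaves like an L1 loss for large errors while remaining smooth around zero.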

D.2 Training Loss and PSNR

To demonstrate the training efficiency of our model, we plot the training loss and PSNR in Figure 12. Every 10k iterations, the PSNR value is calculated on Set8 with a noise level of 10. Training runs for 150k iterations in total and takes 3 days. The training loss decreases rapidly in early iterations and stays steady in later iterations. The PSNR values on Set8 increase during training. These results demonstrate that our model is easy to train to good performance.
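For completeness, PSNR can be computed as in the following minimal sketch. It assumes 8-bit intensities with a peak value of 255 and flat lists of pixel values; the function name and signature are hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized signals."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [100.0, 120.0, 140.0]
denoised = [101.0, 118.0, 141.0]
print(f"{psnr(clean, denoised):.2f} dB")
```

A lower mean squared error gives a higher PSNR, so the rising curve in Figure 12 corresponds to a shrinking error on Set8.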

Figure 12: An illustration of training loss and PSNR.

D.3 Differences of Image and Video Degradations

Our video degradation significantly differs from the existing single image degradation. First, we consider blur degradation (Gaussian blur and resizing blur), which changes the statistics of the other noises and makes the noise more complex. Second, we consider different video compression noises, which usually require temporal information for better noise removal. As shown in Table 6, training without blur degradation or without video compression noise leads to inferior performance, which demonstrates the important role of both.

Types	NIQE	BRISQUE	PIQE
w/o Blur degradation	4.1643	34.8137	50.2962
w/o video compression noise	4.0537	31.8712	50.7835
Ours	4.0205	29.0212	45.0768

Table 6: Ablation study on noise types on VideoLQ.

D.4 Parameters for Noise Types

We determine the parameters for each noise type according to the common well-studied settings or experimental analysis. For example, some parameter settings in image-based denoising methods zhang2022scunet; zhang2021designing have been well-studied. We further conduct an analysis for different parameters of noise degradation. Here, we analyze the bitrate range of video compression noise due to its importance in our noise degradation in Table 7. The model achieves the best performance with a bitrate range of $[1 e 4, 1 e 5]$ , which accords with the setting in chan2021realbasicvsr.

Types	NIQE	BRISQUE	PIQE
$[1 e 3, 1 e 4]$	4.2317	30.1674	46.7984
$[1 e 4, 1 e 5]$	4.0205	29.0212	45.0768
$[1 e 5, 1 e 6]$	4.1276	31.3297	49.4122

Table 7: Performance of different bitrate ranges on VideoLQ.

Actually, the included speckle noise is already a type of spatially correlated noise. For certain unseen applications, there may exist other types of noises. Without prior knowledge, it is difficult to cover all types of unseen noises. Thus, this paper considers the most common and general noises in our degradation model. Certainly, if a noise type is dominant for a certain application, one can augment it into our degradation model to match the noise distribution.

D.5 Comparison on sRGB Dataset

We conduct experiments on real-world raw video (transformed to sRGB) denoising. Specifically, we test our denoiser on the indoor test videos (Scenes 7-11) of CRVD suppyue2020crvd in Table 8. We use reference-based evaluation and directly test the models on the indoor test set. For fair comparisons, we here compare with RealBasicVSR* because both methods are not trained on CRVD. Our model outperforms RealBasicVSR* by a large margin under the reference-based metrics.

Reference-based metric	RealBasicVSR $^{*}$	ReViD (Ours)
PSNR	27.41	29.61
SSIM	0.896	0.919

Table 8: Comparisons of ReViD and RealBasicVSR* on CRVD (indoor).

Processing an existing raw paired dataset into a paired sRGB dataset is a possible alternative. However, dealing with real-world videos (not videos synthesized from raw images) is the ultimate goal of real video denoising. For most images and videos we encounter in daily life and on the internet, we have access neither to their raw versions nor to the parameters of the different sensors and ISP pipelines. Therefore, for practical applications, the no-reference image quality assessment (IQA) metrics (e.g., NIQE, BRISQUE and PIQE) are important and widely used in existing real-world super-resolution/denoising methods zhang2022scunet; wang2021realesrgan; zhang2021designing. In addition to the no-reference IQA metrics, we also compare the visual quality of different methods on the real-world test videos, as shown in Figure 8 in the paper and Figure 14, which we believe also demonstrates the effectiveness of the proposed method.

In addition, our synthetic setting is meaningful for real-world applications. Real image denoising/super-resolution methods chan2021realbasicvsr; zhang2022scunet; zhang2021designing which use synthetic degradations have already shown promising results and have attracted increasing attention in the low-level computer vision community. The synthetic degradations aim to cover a wide range of reasonable noises from randomized pipelines. These methods have shown better generalization performance than training on datasets collected with a specific camera. Based on this, we make the first attempt to propose a new video noise degradation model for real video denoising. Extensive experiments verify the superiority of our method on real-world videos.

D.6 Ablation Study on DCN and Multiscale

We conduct ablation studies on DCN and multiscale in Table 9. Specifically, we train all architectures on DAVIS, and calculate the average PSNR over all testing noise levels on the DAVIS test set. Training our model without DCN or multiscale degrades the performance, which demonstrates the effectiveness of DCN and multiscale. In addition, training the model with more scales achieves slightly better performance but at the expense of a larger model size (29.02M). To balance performance and model size, we do not use more scales in our architecture.

Methods	w/o DCN	w/o multiscale	w/ more scales	ReViD (Ours)
Average PSNR	36.47	36.52	37.48	37.45

Table 9: Ablation study on DCN and multiscale on DAVIS test set.

D.7 Comparison on FLOPs

Model inference time is provided in Table 1 in the paper, which reflects the efficiency of the models. We compare the FLOPs and PSNR performance of different video denoising methods in Table 10. Here, the FLOPs are measured on a TITAN RTX GPU at a spatial resolution of $256 \times 256$. Our model achieves the best PSNR performance, although it has more FLOPs than BasicVSR++ due to its multi-scale design. Besides, our model outperforms VRT with much fewer FLOPs.

Methods	BasicVSR++	VRT	ReViD (Ours)
FLOPs (G)	42.8	721.9	172.8
PSNR (dB)	36.24	37.03	37.45

Table 10: Comparison with different methods on FLOPs.

D.8 Results of Video Deblurring

Our main goal is to propose a new realistic degradation model for effective real video denoising. The proposed degradation model and architecture can be further extended to other real-world video restoration tasks. We extend our model to the video deblurring task. Specifically, we train our model on the GoPro dataset suppnah2017gopro and show the results in Table 11. Compared with other competing methods, our model achieves the best PSNR and SSIM. These results further demonstrate the effectiveness and flexibility of our design.

Methods	EDVR suppwang2019edvr	STFAN suppzhou2019spatio	TSP supppan2020cascaded	BasicVSR++ suppchan2021basicvsrpp	ReViD (Ours)
PSNR/SSIM	26.83/0.843	28.59/0.861	31.67/0.928	34.01/0.952	34.23/0.958

Table 11: Performance on video deblurring on the GoPro test set.

D.9 Generalization of Real Video Denoising Model

To investigate the generalization performance, we compare the PSNR of our method with that of RealBasicVSR suppchan2021realbasicvsr on the REDS4 test set. Note that both methods are trained on our noise degradations. Specifically, we use REDS4 (4 test clips, i.e., 000, 011, 015 and 020) to synthesize Gaussian noise, Poisson noise, speckle noise, camera noise, JPEG compression noise and video compression noise using the same settings as in Figures 1 and 10. The levels of Gaussian and speckle noise are 10, the scale of Poisson noise is 0.05, the quality factor of JPEG compression noise is 80, and the codec and bitrate of video compression noise are ‘mpeg4’ and $1 e 5$ . In Table 12, our method achieves higher PSNR than RealBasicVSR suppchan2021realbasicvsr for every noise type, which indicates that our video denoiser generalizes better across noise types.

Methods	Gaussian noise	Poisson noise	Speckle noise	Camera noise	JPEG compression noise	Video compression noise
RealBasicVSR $^{*}$ suppchan2021realbasicvsr	26.57	26.63	26.15	26.92	26.19	25.13
Ours-real	28.03	28.17	28.14	28.63	28.18	26.82

Table 12: Generalization to different kinds of noise on REDS4.

D.10 More Qualitative Comparison

As shown in Figures 13 and 14, we provide more visual comparisons of different video denoising methods for synthetic Gaussian denoising and general real video denoising. Our proposed denoiser restores better structures and preserves cleaner edges than previous state-of-the-art video denoising methods, even when the noise level is high. In particular, our model is able to synthesize the side profile in the second row of Figure 13. For real video denoising, our model achieves the best visual quality among the compared methods. For example, our model can generate the feather texture of a bird in the third row of Figure 14. These results demonstrate that our degradation model is able to improve the generalization ability.

Appendix E Limitations and Societal Impacts

Our method achieves state-of-the-art performance in synthetic Gaussian denoising and practical real video denoising. This paper makes the first attempt to propose noise degradations for real video denoising. Our method can be used in applications with positive societal impacts. For example, it is able to restore old videos and remove compression noise from web videos. However, there are some limitations in practice. First, it is hard for our model to remove blur artifacts, which often occur in videos due to the exposure time of different cameras, because our degradation pipeline mainly contains different kinds of noise. Second, it is challenging to remove big spot noise. Third, our denoiser is trained with the GAN loss, and it may change the identity of details (e.g., human faces), especially when the input is severely degraded.

Figure 14: Visual comparison of different video denoising methods on VideoLQ suppchan2021realbasicvsr and NoisyCity4.

References

(1) Bishop, Chris M. Training with noise is equivalent to Tikhonov regularization. In Neural Computation, 1995.
(2) He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian. Deep residual learning for image recognition. In IEEE conference on computer vision and pattern recognition, 2016.
(3) Chan, Kelvin CK and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change. BasicVSR++: Improving video super-resolution with enhanced propagation and alignment. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
(4) Zhu, Xizhou and Hu, Han and Lin, Stephen and Dai, Jifeng. Deformable convnets v2: More deformable, better results. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
(5) Zhang, Kai and Li, Yawei and Liang, Jingyun and Cao, Jiezhang and Zhang, Yulun and Tang, Hao and Timofte, Radu and Van Gool, Luc. Practical Blind Denoising via Swin-Conv-UNet and Data Synthesis. In arXiv preprint arXiv:2203.13278, 2022.
(6) Chan, Kelvin CK and Zhou, Shangchen and Xu, Xiangyu and Loy, Chen Change. Investigating Tradeoffs in Real-World Video Super-Resolution. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
(7) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
(8) Ranjan, Anurag and Black, Michael J. Optical flow estimation using a spatial pyramid network. In IEEE conference on computer vision and pattern recognition, 2017.
(9) Wang, Xintao and Xie, Liangbin and Dong, Chao and Shan, Ying. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In IEEE International Conference on Computer Vision, 2021.
(10) Charbonnier, Pierre and Blanc-Feraud, Laure and Aubert, Gilles and Barlaud, Michel. Two deterministic half-quadratic regularization algorithms for computed imaging. In International Conference on Image Processing, 1994.
(11) Johnson, Justin and Alahi, Alexandre and Fei-Fei, Li. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, 2016.
(12) Goodfellow, Ian and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua. Generative adversarial nets. In Advances in neural information processing systems, 2014.
(13) Tassano, Matias and Delon, Julie and Veit, Thomas. Dvdnet: A fast network for deep video denoising. In IEEE International Conference on Image Processing, 2019.
(14) Maggioni, Matteo and Boracchi, Giacomo and Foi, Alessandro and Egiazarian, Karen. Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. In IEEE Transactions on Image Processing, 2012.
(15) Tassano, Matias and Delon, Julie and Veit, Thomas. Fastdvdnet: Towards real-time deep video denoising without flow estimation. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
(16) Liang, Jingyun and Cao, Jiezhang and Fan, Yuchen and Zhang, Kai and Ranjan, Rakesh and Li, Yawei and Timofte, Radu and Van Gool, Luc. Vrt: A video restoration transformer. In arXiv preprint arXiv:2201.12288, 2022.
(17) Khoreva, Anna and Rohrbach, Anna and Schiele, Bernt. Video object segmentation with language referring expressions. In Asian Conference on Computer Vision, 2018.
(18) Zamir, Syed Waqas and Arora, Aditya and Khan, Salman and Hayat, Munawar and Khan, Fahad Shahbaz and Yang, Ming-Hsuan. Restormer: Efficient Transformer for High-Resolution Image Restoration. In IEEE Conference on Computer Vision and Pattern Recognition, 2022.
(19) Loshchilov, Ilya and Hutter, Frank. Sgdr: Stochastic gradient descent with warm restarts. In arXiv preprint arXiv:1608.03983, 2016.
(20) Claus, Michele and van Gemert, Jan. Videnn: Deep blind video denoising. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
(21) Yue, Huanjing and Cao, Cong and Liao, Lei and Chu, Ronghe and Yang, Jingyu. Supervised Raw Video Denoising with a Benchmark Dataset on Dynamic Scenes. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.
(22) Nah, Seungjun and Hyun Kim, Tae and Mu Lee, Kyoung. Deep multi-scale convolutional neural network for dynamic scene deblurring. In IEEE conference on Computer Vision and Pattern Recognition, 2017.
(23) Wang, Xintao and Chan, Kelvin CK and Yu, Ke and Dong, Chao and Change Loy, Chen. EDVR: Video restoration with enhanced deformable convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
(24) Zhou, Shangchen and Zhang, Jiawei and Pan, Jinshan and Xie, Haozhe and Zuo, Wangmeng and Ren, Jimmy. Spatio-temporal filter adaptive network for video deblurring. In IEEE International Conference on Computer Vision, 2019.
(25) Pan, Jinshan and Bai, Haoran and Tang, Jinhui. Cascaded deep video deblurring using temporal sharpness prior. In IEEE Conference on Computer Vision and Pattern Recognition, 2020.

Methods	DAVIS	Set8
w/o spatial module	34.45	31.12
w/o temporal module	31.90	29.59
downscaling three times	34.91	31.75
ReViD (Ours)	34.90	31.77