Leveraging Synthetic Data to Learn Video Stabilization Under Adverse Conditions (the paper is under review)

Abdulrahman Kerim
School of Computing and Communications
Lancaster University, UK
a.kerim@lancaster.ac.uk
Washington L. S. Ramos
Computer Science Department
Universidade Federal de Minas Gerais, Brazil
washington.ramos@dcc.ufmg.br
Leandro Soriano Marcolino
School of Computing and Communications
Lancaster University, UK
l.marcolino@lancaster.ac.uk
Erickson R. Nascimento
Computer Science Department
Universidade Federal de Minas Gerais, Brazil
erickson@dcc.ufmg.br
Richard Jiang
School of Computing and Communications
Lancaster University, UK
r.jiang2@lancaster.ac.uk
Abstract

Video stabilization plays a central role in improving video quality. However, despite the substantial progress made by video stabilization methods, they have mainly been tested under standard weather and lighting conditions and may perform poorly under adverse conditions. In this paper, we propose a synthetic-aware, adverse-weather-robust algorithm for video stabilization that does not require real data and can be trained only on synthetic data. We also present Silver, a novel rendering engine that generates the required training data with an automatic ground-truth extraction procedure. Our approach uses our specially generated synthetic data to train an affine transformation matrix estimator, avoiding the feature extraction issues faced by current methods. Additionally, since no video stabilization datasets under adverse conditions are available, we propose the novel VSAC105Real dataset for evaluation. We compare our method to five state-of-the-art video stabilization algorithms using two benchmarks. Our results show that current approaches perform poorly in at least one weather condition and that, even when trained on a small dataset of synthetic data only, our method achieves the best performance in terms of stability average score, distortion score, success rate, and average cropping ratio when considering all weather conditions. Hence, our video stabilization model generalizes well to real-world videos and does not require large-scale synthetic training data to converge.

Keywords: Video Stabilization, Synthetic Data, Affine Transformation

1 Introduction

Over the past several years, we have witnessed an explosion of videos being recorded and shared on the Internet. However, most shared videos are unedited and shaky, which makes them unpleasant to watch. Therefore, video stabilization techniques have become an essential step in the video processing pipeline, gaining momentum as more unedited videos are created and shared. State-of-the-art video stabilization approaches perform well under standard conditions but struggle under adverse conditions. Furthermore, collecting training videos in these adverse conditions is hard, dangerous, and time-consuming.

The training data bottleneck mentioned above causes many video stabilization methods to be essentially non-learning-based, commonly adopting affine or homography matrix estimation in the camera motion estimation step to extract the camera trajectory. Usually, feature extraction, description, and matching are involved in this process. Feature extractors like SIFT [23] and learning-based ones like R2D2 [26] and ASLFeat [24] can perform well under standard weather conditions and sufficient illumination, but they may fail under challenging conditions such as foggy, rainy, and snowy weather, as well as night-time scenes and abrupt changes of illumination. For instance, rain drops and snow particles, as well as textureless scenes under fog or at night-time, make it hard to find sufficiently robust features. Failing to accurately estimate the actual camera trajectory in the motion estimation step causes the error to propagate to the later steps of the process, decreasing the quality of the stabilized video.

Figure 1: Our key idea is to use specially-designed synthetic data to train an affine transformation matrix estimation CNN.

Synthetic data has shown great promise in the field of computer vision [21, 14, 30, 31]. Its growing popularity has attracted many researchers to apply it to different computer vision problems. Most importantly, synthetic data seems a promising solution to overcome the lack of suitable data for training supervised learning models. However, we believe that the potential of synthetic data is not simply in the amount of available data for training. In fact, in this paper we propose a novel synthetic-aware video stabilization algorithm that leverages synthetic data and achieves state-of-the-art results using only a small-scale synthetic dataset. Figure 1 depicts the main steps of this approach. Leveraging the powerful tools of the Unity game engine, we built Silver, which creates three-dimensional, photo-realistic virtual worlds procedurally at run-time. The system can automatically diversify many essential scene attributes like weather conditions, time of the day, and crowdedness, to name a few. Most importantly, our system can generate the required ground-truth training data for our learning-based video stabilization method. Our novel video stabilization model can be trained with a small number of synthetic videos. With no training or fine-tuning on real data, our model is more robust and accurate than the state-of-the-art across different weather conditions. To the best of our knowledge, this is the first work to study video stabilization under adverse weather conditions and to utilize synthetic videos for video stabilization.

Although supervised learning-based approaches [32, 22, 35, 1] are able to learn parameters like cropping window and sensitivity, and even to extract discriminative features, there is not enough labeled data for obtaining high-quality results in any condition. Finding, collecting, and annotating relevant data is cumbersome, time-consuming, error-prone, expensive, and subject to privacy issues. It is widely accepted that collecting many videos under adverse conditions is not easy since various attributes need to be diversified, such as camera motion, scene type, time of the day, scene crowdedness, and recording resolution. Even assuming that videos under these attributes can be captured, there is still a problem capturing training data for this task. Using two cameras, one with a mechanical stabilizer and another without, is still not an ideal solution since the difference between the two camera locations will cause the scene to be captured from two different viewpoints. Thus, learning-based models may not be able to learn the video stabilization task well, especially under adverse conditions.

Hence, our main contributions are three-fold: i) a novel synthetic-aware video stabilization method that achieves state-of-the-art results on real videos while being trained only on synthetic videos. It is worth noting that our key contribution is the idea of using specially designed synthetic data with a simple yet powerful architecture for the task of video stabilization; ii) a new synthetic training data generator, Silver (a preliminary version of Silver [15] was presented as an unarchived workshop paper), which is able to generate an unlimited number of training videos for training our video stabilization algorithm under a wide set of attributes; iii) a new video stabilization dataset, VSAC105Real, composed of real videos spanning foggy, rainy, and snowy weather conditions, and night-time attributes. The implementation of our proposed video stabilization algorithm, our simulator, and the datasets are all available at https://github.com/A-Kerim/SyntheticData4VideoStabilization.

2 Related Work

Our work utilizes synthetic data for the task of video stabilization. Thus, video stabilization and synthetic data generation literature is briefly reviewed in this section.

2.1 Video Stabilization

Video stabilization methods can be categorized broadly into non-learning-based and learning-based approaches. Non-learning-based video stabilization methods do not perform training. For instance, Grundmann et al. [11] stabilize the shaky camera trajectory using L1-norm optimization under constraints. Similarly, Bradley et al. [4] address the stabilization task as a constrained convex optimization problem. These non-learning-based methods stabilize videos without the need for training data to tune the model’s parameters. However, they tend to give less pleasant results, can work only under predefined conditions, and the model’s parameters must be tuned manually.

The learning-based approaches, in turn, can be classified into unsupervised and supervised approaches. Unsupervised video stabilization methods require training videos but do not demand annotated pairs of shaky-stable videos. Deep Iterative FRame INTerpolation (DIFRINT) [6] is an unsupervised learning-based approach that can be trained in an end-to-end manner. It can stabilize videos without cropping original frames. Essentially, this method utilizes the frame interpolation technique to synthesize middle frames for video stabilization. Supervised learning approaches, on the other hand, require human-labeled ground-truth data, which is the main limitation of applying them to video stabilization. Some works approached this problem using a mechanical stabilizer to generate ground-truth stable videos with their shaky counterparts. As an example, StabNet [32] followed this procedure, generating videos for training CNNs for video stabilization. The network learns a warping transformation of multi-grids given the shaky video frames and the previously stabilized ones. On the other hand, Liu et al. [22] apply a learning-based hybrid-space fusion to compensate for optical flow inaccuracy. They synthesize stabilized frames by fusing the warped content estimated from neighboring frames. Yu et al. [35] learn to stabilize videos using optical flow. They compute the per-pixel warp field from the optical flow of the shaky video, which allows them to better handle moving objects and occlusion. In contrast to the work of Yu et al. [35], Ali et al. [1] propose a full-frame supervised video stabilization method that does not require optical flow. Their novel pipeline for dataset generation uses a linearly moving window on high-resolution images.

Despite the substantial progress made by these methods, they present a partial solution to the video stabilization problem since they are assumed to work under normal weather conditions and sufficient illumination. However, finding resilient features in adverse conditions is rather challenging. For example, rain particles, foggy weather conditions, and low illumination pose clear challenges to finding robust features. This leads to inaccurate motion estimation and low-quality video stabilization. At the same time, collecting diverse real-world videos for training video stabilization methods is cumbersome and time-consuming. Moreover, obtaining data with perfect ground-truth is not feasible given the physical limitation of recording the same scene from the same perspective via steady and shaky cameras.

Our video stabilization method belongs to the supervised learning-based category. However, unlike other methods, we use only synthetic data for training. To the best of our knowledge, this is the first work to utilize synthetic videos for video stabilization. No pre-training or fine-tuning on real data is required, and by using only a small-scale training dataset, our method is more robust than state-of-the-art methods.

2.2 Affine and Homography Transformation

Estimating the affine or the homography transformation between two images is a common way to align one image to another. There are different ways to find these matrices, such as applying a feature extractor (e.g., SIFT [23], ORB [28], SURF [3], and OAN [36]) and an outlier rejection algorithm (e.g., RANSAC [9] and MAGSAC [2]). At the same time, there are unsupervised [37] and supervised [7] methods that estimate, specifically, the homography matrix.

Although the traditional approach, i.e., feature extraction, does well under standard conditions, it performs poorly under challenging conditions such as adverse weather and low illumination. Moreover, supervised approaches cannot reflect scene parallax [33], and generating suitable training data is rather hard. Unsupervised approaches tried to solve these problems, but they still fail under large-baseline alignment, which makes them impractical for applications such as image stitching and video stabilization under sharp camera movements.

Unlike previous methods, our model learns the affine transformation in a supervised manner using synthetic training data specially generated for this task, and it does not need any real data at training time. Estimating the affine transformation is of special importance in video stabilization because it allows us to recover the camera translation, rotation, and scale. Although it is possible to decompose the homography matrix to extract this information, the result is not accurate. Moreover, training a model to estimate the affine transformation is much easier than training one to estimate the homography transformation. As has been discussed in other similar works [11, 18], although homography can better model the camera motion between frames for a limited number of frames, it starts to cause many artifacts like skew and perspective as the number of frames becomes larger. At the same time, transformations with more degrees of freedom (e.g., homography) overfit easily even with some regularization. Thus, a model based on the homography transformation is both harder to train (more parameters and easier to overfit) and more subject to artifacts.

2.3 Synthetic Data Generation

The use of synthetic data in the computer vision community has shown to be a promising solution to overcome the lack of suitable data for training supervised learning models [5, 27, 30, 21, 14, 31]. Adapting a specific video game to generate synthetic data with its corresponding ground-truth for the task of semantic segmentation was presented by Richter et al. [27], who modified the game Grand Theft Auto V for that purpose. At the same time, another work by Shafaei et al. [30] investigated the use of photo-realistic video games to generate synthetic data and their corresponding ground-truths for image segmentation and depth estimation. Using open-source animation movies was another approach discussed by Butler et al. [5], who showed how to obtain an optical flow large-scale dataset, MPI-Sintel, following a systematic and easy process. Dosovitskiy et al. [8] presented CARLA, a popular open-source simulator widely used for autonomous driving research. Recently, the UrbanScene3D simulator was proposed by Liu et al. [21] and developed for computer vision and robotics research. The system supports autonomous driving and flying research in different environments. GANcraft is another work by Hao et al. [12] for generating photorealistic images.

Despite the great advancements of the previous simulators, they present a partial solution to the data generation issue because of the lack of control over the environmental elements. They also fail to randomize the scene elements, which leads to clear repetitions of scene elements when a large-scale dataset has to be generated by these methods. In our simulator, we utilize procedural content generation to generate 3D virtual worlds. Additionally, our simulator can generate special training data for the task of video stabilization. In our experiments, we show that generating appropriate training data and creating a pipeline that utilizes synthetic data can achieve superior results. In fact, our algorithm can still provide better results on real videos even though our engine does not generate state-of-the-art photo-realistic videos. Additionally, our synthetic-aware algorithm and our specially designed synthetic data both contribute to teaching the model to accurately estimate the affine transformation while not overfitting to the synthetic data distribution it is trained on. Thus, our approach mitigates the domain gap and achieves satisfactory results on real data.

3 Methodology

Consider a shaky input video composed of a sequence of frames. The aim of our approach is to generate a stabilized version of this video while preserving the original camera movement made by the recorder and maintaining its trend. Our proposed method is composed of two major stages: i) Motion Estimation; and ii) Trajectory Smoothing. In the first stage, we train a motion estimation network using the ground-truth data from generated synthetic videos to estimate an affine transformation matrix for every pair of consecutive frames. Then, in the second stage, we calculate the camera trajectory from the parameters estimated for each pair of frames. After smoothing this trajectory, we warp and crop the frames using the smoothed transformations retrieved from the smoothed trajectory. An outline of our approach is shown in Figure 2.

Figure 2: Video stabilization pipeline. Our method estimates the translation, rotation, and scale for each pair of frames of the shaky video. After computing the camera trajectory, upper (red) and lower (green) bounds are found and averaged, and the Savitzky-Golay filter is applied to smooth the trajectory. Finally, warping and cropping are performed to generate the stabilized video.

3.1 Motion Estimation

The first stage of our pipeline consists in estimating the camera motion throughout the video. Most existing 2D-based stabilization approaches apply key-point feature extraction and tracking to solve this task [17, 11, 19]. However, both feature extraction and tracking may fail under adverse weather conditions due to repetitive textures and partial occlusions caused by rain and snow particles, or due to textureless objects under foggy weather and at night-time. To overcome this issue and properly recover the camera motion in the shaky video, we propose estimating four parameters (the x translation, the y translation, the rotation angle, and the scale) of an affine transformation matrix

(1)

for every pair of consecutive frames using deep neural networks with synthetic data. Thus, we dispense with the feature extraction procedure entirely since, using our proposed engine, we can generate the ground-truth affine transformation needed for training, as described in Section 3.3.
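As an illustration of the four-parameter matrix referred to in Equation (1), the sketch below composes a 2x3 matrix from a translation, a rotation angle, and a scale. The layout (uniform scale times rotation, plus translation) is the common 4-DOF convention and is an assumption here, as are the names `affine_matrix`, `apply_affine`, `tx`, `ty`, `theta`, and `s`; the paper's exact notation and sign convention may differ.

```python
import math


def affine_matrix(tx, ty, theta, s):
    """Compose a 2x3 matrix from translation, rotation (radians), and scale.

    Assumed convention: [s*R(theta) | t], a 4-degree-of-freedom transform.
    """
    c, n = s * math.cos(theta), s * math.sin(theta)
    return [[c, -n, tx],
            [n, c, ty]]


def apply_affine(m, x, y):
    """Map the pixel coordinate (x, y) through the 2x3 matrix m."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])
```

With a 90-degree rotation, a scale of 2, and a one-pixel shift in x, the point (1, 0) should land at approximately (1, 2).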

Two networks are utilized: one that estimates the x and y translations and one that outputs the rotation angle and scale values, with the input frames center cropped to equal width and height. It is important to note that the two networks share the same architecture, but not the same weights.

Both networks consist of a feature extractor implemented as four convolutional layers followed by a pooling and a dropout layer, and a regressor, which is a fully connected network composed of three linear layers that process the extracted features to estimate the parameters. The number of output channels of each layer is illustrated in Figure 2-a. For each training step, we feed the networks with an input composed of two consecutive grayscale frames from the input video and the optical flow map for that pair, obtained via FlowNet2 [13, 25]. Then, one network estimates the translations and the other estimates the rotation angle and scale for that pair.

To optimize the parameters of the two networks, we train each one separately using Mean Squared Error (MSE) loss functions:

(2)

where each loss is averaged over the training samples in a batch of randomly selected pairs of consecutive frames, and the targets are the ground-truth parameters of the affine matrix.
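For reference, a mean squared error over a batch of parameter vectors can be written in a few lines. This is a minimal sketch of the generic MSE, not the authors' training code; the name `mse_loss` and the choice to average over every element of the batch are assumptions, since the exact form of Equation (2) is not recoverable here.

```python
def mse_loss(predicted, target):
    """Mean squared error over a batch of parameter vectors.

    `predicted` and `target` are equally shaped lists of lists, e.g. one
    [tx, ty] or [theta, s] vector per pair of frames (hypothetical layout).
    """
    total, count = 0.0, 0
    for p_vec, t_vec in zip(predicted, target):
        for p, t in zip(p_vec, t_vec):
            total += (p - t) ** 2
            count += 1
    return total / count
```

Because the two networks are trained separately, such a function would be evaluated once on the translation outputs and once on the rotation and scale outputs.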

A key contribution of our approach is the usage of specially designed synthetic data to learn the affine transformation. For each frame of a generated synthetic video, we record the 2D coordinates of a set of mark points. Since these mark points are static in the world space across two consecutive frames, we can compute an affine transformation with four degrees of freedom from their coordinates in both frames. Following this strategy, we get the ground-truth x translation, y translation, rotation angle, and scale for a single pair of images. The detailed process of the ground-truth data generation is described in Section 3.3.
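One way to obtain such a four-parameter transform from matched mark points is a closed-form least-squares fit. The sketch below is an assumption about how this could be done, not the authors' procedure: it treats each 2D point as a complex number so that the transform becomes `w = a*z + b`, with `a` encoding scale and rotation and `b` the translation. The function name `fit_similarity` is hypothetical.

```python
import cmath


def fit_similarity(src, dst):
    """Least-squares 4-DOF fit mapping src points onto dst points.

    Points are (x, y) tuples; at least two distinct src points are needed.
    Returns (tx, ty, theta, s) with theta in radians.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den            # a = s * exp(i * theta)
    b = w_mean - a * z_mean  # b = tx + i * ty
    return b.real, b.imag, cmath.phase(a), abs(a)
```

When the correspondences are exact, as they are for points that are static in the world space, the fit recovers the generating parameters up to floating-point error.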

Finally, with the estimated parameters for each image pair in the video, we can compute the estimated camera trajectory from the parameters estimated for every pair of consecutive frames. It is important to note that, similar to Grundmann et al. [11], we do not use the estimated transformations directly to warp the shaky images; we warp the frames by applying a smoothed affine transformation composed using the smoothed translation, rotation, and scale parameters, as detailed in the following section.
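A common way to turn per-pair motion parameters into a trajectory is a running sum, computed independently for each parameter. The sketch below assumes that convention; the defining expression for the trajectory is not recoverable from the text, so both the cumulative-sum form and the name `accumulate_trajectory` are assumptions.

```python
from itertools import accumulate


def accumulate_trajectory(params):
    """Running sum of one per-pair parameter (e.g. the x translation).

    Returns one trajectory sample per pair of consecutive frames.
    """
    return list(accumulate(params))
```

For example, per-pair x translations of 1.0, -0.5, and 2.0 pixels would give trajectory samples of 1.0, 0.5, and 2.5.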

3.2 Trajectory Smoothing

After estimating the shaky camera trajectory, we need to smooth it. In contrast to other methods like the work of Grundmann et al. [11], which tackles camera trajectory smoothing as an optimization problem, we apply the Savitzky-Golay filter [29] to the averaged envelope of the shaky camera trajectory to smooth it, as described below.

Signal Envelope Calculation

Given the camera trajectory, we first locate its extrema by applying the first-order discrete derivative. Then, we interpolate the trajectory maxima and minima to extract the upper and lower envelopes, respectively. We experimented with linear, cubic, and quadratic interpolation; quadratic interpolation gave the best results in our experiments since it produces smooth interpolations and tends to stay within the range of the interpolation points. The final upper and lower signal envelopes are the outputs of this step.
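The envelope step can be sketched as follows. This is an illustrative simplification under stated assumptions: extrema are detected from sign changes of the first-order difference, and the envelopes are built with linear interpolation for brevity (the text reports that quadratic interpolation worked best). The function names and the choice to hold the first and last extremum value at the signal edges are hypothetical.

```python
def local_extrema(signal):
    """Indices of local maxima and minima found from first-order differences."""
    maxima, minima = [], []
    for i in range(1, len(signal) - 1):
        before = signal[i] - signal[i - 1]
        after = signal[i + 1] - signal[i]
        if before > 0 and after <= 0:
            maxima.append(i)
        elif before < 0 and after >= 0:
            minima.append(i)
    return maxima, minima


def interpolate(indices, signal):
    """Linearly interpolate signal[indices] over every sample position."""
    if not indices:
        return list(signal)
    out, k = [], 0
    for t in range(len(signal)):
        if t <= indices[0]:
            out.append(signal[indices[0]])    # hold value before first extremum
        elif t >= indices[-1]:
            out.append(signal[indices[-1]])   # hold value after last extremum
        else:
            while indices[k + 1] < t:
                k += 1
            i0, i1 = indices[k], indices[k + 1]
            w = (t - i0) / (i1 - i0)
            out.append((1 - w) * signal[i0] + w * signal[i1])
    return out


def average_envelope(signal):
    """Mean of the upper and lower envelopes of a 1D trajectory."""
    maxima, minima = local_extrema(signal)
    upper = interpolate(maxima, signal)
    lower = interpolate(minima, signal)
    return [(u + l) / 2 for u, l in zip(upper, lower)]
```

On a zig-zag signal that alternates between 0 and 2, the upper envelope sits at 2, the lower at 0, and the average envelope is the flat line at 1.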

Smoothing

After obtaining the upper and lower signal envelopes, we apply the Savitzky-Golay filter [29] on the average envelope to remove the unwanted sudden camera shakiness and create the smooth camera trajectory, as shown in Figure 2-b. The Savitzky-Golay filter smooths the digital signal by fitting a low-degree polynomial to consecutive signal points using linear least squares. This strategy has an advantage over other techniques as it preserves the signal tendency. Thus, the smoothed trajectory still maintains the properties of the original trajectory while ensuring smooth camera transitions over time. After that, we calculate the difference between both trajectories. The smoothed affine transformation parameters are then calculated from this difference and the estimated parameters, giving smoothed values of the x translation, y translation, rotation angle, and scale.
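To make the filtering step concrete, the sketch below applies the classic 5-sample, degree-2 Savitzky-Golay weights, (-3, 12, 17, 12, -3)/35, and then computes the difference between the smoothed and original trajectories. The window length, polynomial degree, and edge handling are assumptions chosen for brevity, as the text does not state the settings used; in practice a library routine such as `scipy.signal.savgol_filter` would normally be preferred.

```python
# Classic 5-point quadratic Savitzky-Golay smoothing weights.
SG_WEIGHTS = (-3.0, 12.0, 17.0, 12.0, -3.0)
SG_NORM = 35.0


def savgol5(signal):
    """Smooth with a 5-sample, degree-2 Savitzky-Golay window.

    The first and last two samples are copied unchanged, an edge-handling
    shortcut chosen for this sketch.
    """
    out = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        out[i] = sum(w * v for w, v in zip(SG_WEIGHTS, window)) / SG_NORM
    return out


def trajectory_difference(trajectory, smoothed):
    """Per-sample difference between the smoothed and original trajectories."""
    return [s - c for s, c in zip(smoothed, trajectory)]
```

Because the window fits a quadratic by least squares, any signal that is already a quadratic polynomial passes through unchanged, which is one way to see why the filter preserves the overall tendency of the trajectory.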

Warping and Cropping

At last, we warp the video frames and crop them to compose the final video. For each video frame, we compute its warped version by applying a transformation matrix to every pixel. Formally, we retrieve the smoothed parameters (x translation, y translation, rotation angle, and scale) from the smoothed transformations and use them to compose the smoothed affine matrix. Finally, we crop the warped frames using a predefined virtual cropping window, similar to Grundmann et al. [11], to generate the stabilized video.
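The warp-then-crop step can be illustrated on a tiny frame stored as a list of rows. This is a minimal sketch under several assumptions: the transform uses the common scale-rotation-translation layout, pixels are filled by nearest-neighbour inverse mapping, and the cropping window is modelled as a fixed `margin` removed from every border. None of these choices is stated in the text, and the name `warp_and_crop` is hypothetical.

```python
import math


def warp_and_crop(frame, tx, ty, theta, s, margin):
    """Warp a frame (list of rows) with a 4-parameter transform, then crop.

    Each output pixel is filled by inverse-mapping its coordinate into the
    input frame (nearest neighbour); pixels that map outside are set to 0.
    `margin` pixels are then removed from every border.
    """
    h, w = len(frame), len(frame[0])
    c, n = math.cos(theta) / s, math.sin(theta) / s
    warped = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = x - tx, y - ty
            # Inverse of [s*R | t]: undo the shift, rotate back, divide by s.
            sx = round(c * dx + n * dy)
            sy = round(-n * dx + c * dy)
            if 0 <= sx < w and 0 <= sy < h:
                warped[y][x] = frame[sy][sx]
    return [row[margin:w - margin] for row in warped[margin:h - margin]]
```

A larger margin hides more of the empty border introduced by warping at the cost of a smaller output frame, which is the trade-off behind the cropping ratio.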

Figure 3: Ground-truth Generation. Points are randomly sampled from the screen space (yellow circles in a). From each of these points we cast rays to infinity in the 3D scene space (dashed yellow lines in b), and create hypothetical objects at the intersection of these rays with the scene (red circles in c). We obtain the affine transformation matrix using the screen-space coordinates of the hypothetical objects in two consecutive frames, since they remain stationary in the scene from one frame to the next.

3.3 Ground-truth Generation

The goal of our network is to infer the affine transformation matrix given two consecutive frames. Thus, we need to generate the affine transformation ground-truth to supervise the training process. Our idea is two-fold: a) create stationary hypothetical labeled objects in the 3D world scene; b) record their coordinates in the screen space of the recording camera. In that way, we can guarantee that the coordinates of these objects in two consecutive frames correspond to exactly the same static elements in the 3D world seen by the recording camera at both frames. In other words, we create a number of invisible objects and save their coordinates in the camera space for each frame. In order to do so, for each frame a number of random points are sampled from the camera screen space. Then, a ray is cast from each of these points to infinity. At each ray’s intersection point with the scene, a hypothetical invisible object is created. The object remains stationary for a number of seconds before being destroyed. For each frame, the object position in world space is transferred to the camera space and saved in XML format. Figure 3 demonstrates how these hypothetical objects are created.

If the current camera view does not include the hypothetical object, or if the object exceeds its lifetime limit, it is removed. Each object is given a Unique Identifier (UID) over its lifetime. Later, in the post-processing stage, the affine transformation ground-truth is calculated for each pair of consecutive frames using the UIDs of these hypothetical objects and their screen locations, as described in Algorithm 1. It should be noted that generating millions of hypothetical objects is still a valid option. However, our algorithm creates objects only in the field of view of the recording agent, which improves the overall performance.
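The UID-based pairing in the post-processing stage can be sketched as a dictionary intersection. The representation below, where each frame is a mapping from UID to screen coordinates, is an assumption standing in for the parsed per-frame XML records, and the name `matched_points` is hypothetical.

```python
def matched_points(frame_a, frame_b):
    """Pair screen coordinates of objects whose UID appears in both frames.

    Each frame is a dict mapping UID -> (x, y) screen coordinates. Objects
    that were created or removed between the two frames are skipped.
    """
    shared = sorted(frame_a.keys() & frame_b.keys())
    src = [frame_a[uid] for uid in shared]
    dst = [frame_b[uid] for uid in shared]
    return src, dst
```

The two returned lists are index-aligned, so they can be passed directly to any routine that fits a transformation to point correspondences.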

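Require: total number of frames to generate; sampling period
Ensure: per-frame record containing the camera screen locations of a number of hypothetical objects stationary in scene space
while recording do
        for each frame to generate do
               if the frame falls on the sampling period then
                      sample K points on the camera screen space
                      for each sampled point do
                             cast a ray from the point to infinity
                             find the intersection point between the ray and the scene objects
                             create a hypothetical object at the intersection point
                             assign a unique identifier to the object
                             transfer the object coordinates from world to screen space
                             while the object is visible and has not exceeded its lifetime do
                                    record the object's screen coordinates for the current frame
               else
                      continue without sampling new points
Algorithm 1: Affine Transformation Ground-truth Generation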

4 Silver: Framework for Generating Synthetic Data for Computer Vision Tasks

There are many photo-realistic synthetic data generators like CARLA [8] and UrbanScene3D [21] that can be used to simulate photo-realistic, diverse, and visually complex 3D worlds. However, generating special data in such engines is cumbersome and, most importantly, they do not support generating training data for video stabilization. Silver, on the other hand, fills this gap and generates the required training data for this task. Additionally, it supports other computer vision tasks like semantic segmentation, instance segmentation, depth estimation, pose estimation, and surface normals estimation. However, in this paper, we limit our discussion to the usability of our engine for the video stabilization task.

In this work, we show that more vital than photo-realistic and diverse 3D scenes is designing computer vision models targeted at using synthetic data, and generating the appropriate synthetic data for these models. In virtual worlds, it is not only easy to control all scene aspects but also to generate more suitable training data for supervised learning algorithms.

Simulator

Our work deploys synthetic data to train a synthetic-aware video stabilization algorithm. For that aim, we developed Silver using the Unity game engine to generate our synthetic training datasets VSAC65Synth and VSNC35Synth. We employ the Procedural Content Generation (PCG) concept to create a full 3D virtual world at run-time, and the system is extensible thanks to the modular design we followed when building it from scratch. Although our simulator can provide clean, unbiased, and large-scale training and testing data for various computer vision tasks, we focus on the video stabilization task. A shaky synthetic video is recorded after procedurally creating a 3D virtual world sampled from a predefined set of 3D models, materials, and animations. Note that for each video, a new virtual world is created to diversify the training data. A simplified flowchart describing the scene creation process is shown in Figure 4.

Figure 4: Flowchart describing the scene creation in Silver.
Figure 5: Samples from the procedurally generated scenes using Silver.

Static Elements

Starting from the given parameters, Silver initially creates the static part of the 3D virtual world. In this part, first the street length and the number of crosses are set at random. Following this, the buildings are created where buildings’ locations, types, and frequency are set at random. After that, the other scene elements like benches, trash containers and bags, trees, and other elements are created. To further improve the realism and diversity of the generated scenes, we introduce a new variable called Anomaly Rate; higher values will cause more artifacts to appear in the scene such as more street lights being off at night, more being on at daytime, and some trash bags being on roads.
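The effect of the Anomaly Rate on street lights can be pictured with a small sampling routine. This is only an illustrative guess at the mechanism: it assumes each light independently deviates from its expected state with probability equal to the anomaly rate. The function name, its parameters, and the per-light independence are all assumptions, not details of Silver.

```python
import random


def street_light_states(num_lights, is_night, anomaly_rate, rng):
    """Return on/off flags for street lights.

    Lights are expected to be on at night and off by day; each one deviates
    from that expectation with probability `anomaly_rate` (assumed model).
    """
    expected = is_night
    return [expected != (rng.random() < anomaly_rate) for _ in range(num_lights)]
```

With a rate of 0 every light behaves as expected, while higher rates leave more lights off at night or on during the day.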

Dynamic Elements

Once the static part of the scene is completed, the dynamic part of the scene is initiated. Initially, the character generator retrieves the locations of buildings and benches, and instantiates characters based on the required character density. The Microsoft Rocketbox Avatar Library [10] is used to define the character avatar, and the animations are selected based on character pose (standing or sitting). Character animations were adopted from Mixamo. The cars are created in a similar way, except that the number of cars and their models are selected at random. Additionally, the car shader attributes Smoothness, Metallic, and BaseColour are all randomized at run-time to give a different visual appearance even to the same car model. After that, the plates of the cars are selected at random from a large set collected manually from the web. The main processes are summarized in Figure 4. In parallel, the first-person videos are recorded using the Cinemachine camera behaviour from Unity, attached to an AI navigation agent. We use Cinemachine since it offers a wide range of behaviours that enrich the diversity of the generated synthetic data in terms of the camera view angle and transition.

Camera Shakiness

To introduce shakiness to the recording camera, we create noise by using a predefined noise profile asset. The amplitude and frequency of the noise are randomly sampled from a uniform distribution. The noise is applied to change the translation and rotation of the recording agent camera. Figure 5 demonstrates examples of the generated scenes.
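The uniform sampling of noise amplitude and frequency can be sketched outside the engine as follows. The sinusoidal waveform, the default ranges, and the name `shake_offsets` are placeholders chosen for illustration; the actual noise profile asset and its value ranges are not specified in the text.

```python
import math
import random


def shake_offsets(num_frames, fps, rng, amp_range=(0.5, 2.0), freq_range=(1.0, 8.0)):
    """Per-frame camera offset from uniformly sampled noise settings.

    One amplitude and one frequency (Hz) are drawn from uniform
    distributions; a sinusoid stands in for the engine's noise profile.
    """
    amplitude = rng.uniform(*amp_range)
    frequency = rng.uniform(*freq_range)
    phase = rng.uniform(0.0, 2.0 * math.pi)
    return [amplitude * math.sin(2.0 * math.pi * frequency * t / fps + phase)
            for t in range(num_frames)]
```

The same routine could be called once per translation or rotation axis so that each axis receives its own amplitude and frequency.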

Figure 6: VSAC105Real versus other datasets. On top, each dashed box shows frames from other datasets. Our proposed dataset, VSAC105Real, is at the bottom. It includes more diverse and challenging attributes as compared to the other datasets.
Dataset Name | #Videos | Average #Frames | Total #Frames
DeepStab [32]
Stabfr [38]
Selfie Video [34]
LiuSigg2013 [20]
VSAC105Real
Table 1: Dataset statistics. Comparison among the available video stabilization datasets and VSAC105Real dataset.

5 Real and Synthetic Datasets

5.1 Real Data Collection

The available video stabilization benchmarks, such as DeepStab [32], Stabfr [38], Selfie Video [34], and LiuSigg2013 [20], exclusively contain videos recorded under normal weather conditions and with sufficient illumination. To assess the performance of state-of-the-art video stabilization methods under foggy, rainy, snowy, and night-time conditions, we created the VSAC105Real dataset.

Our dataset is composed of videos collected from YouTube using search queries such as “Fog”, “Rain”, “Snow”, “Night”, “Adverse”, and “Severe”. We manually inspected all the videos and selected the ones with shaky camera movement. We then trimmed the videos so that each clip is temporally continuous and exhibits the queried attribute. The VSAC105Real dataset comprises videos spanning normal, rainy, foggy, snowy, and night-time attributes. The weather attributes were selected to study the effect of severe weather conditions on video stabilization quality. Similarly, the night-time attribute was chosen to understand the effect of low illumination on video stabilization. Table 1 compares VSAC105Real with other video stabilization datasets. The VSAC105Real dataset has the advantage in terms of the average number of frames. Moreover, it includes a diverse set of challenging attributes, with videos evenly distributed across the classes, i.e., videos per class. A visual comparison between VSAC105Real and other video stabilization datasets is depicted in Figure 6.

5.2 Synthetic Data

Using Silver, we generate two different synthetic training datasets: VSNC35Synth under normal weather conditions and VSAC65Synth under both normal and adverse weather conditions.

VSNC35Synth dataset

It is used in all training experiments, unless otherwise specified. It includes videos at fps and average number of frames per video; it covers only videos in normal weather conditions. The average number of frames was set to to match the available real video stabilization dataset average number of frames.

VSAC65Synth dataset

It is used in one of the ablation experiments (Table 4, More Data column). It consists of videos spanning normal, rainy, foggy, and snowy weather conditions at daytime and night-time. It has the same fps and average number of frames as the VSNC35Synth dataset.

6 Experiments

6.1 Experimental Setup

Implementation Details

We trained our method using only the synthetic data provided by our simulator, i.e., VSNC35Synth. We used the Adam optimizer [16] with a learning rate with , , and . Additionally, the hypothetical objects time limit duration was second. We trained the and translation prediction model () and rotation  and scale prediction model () for and epochs, respectively, using batches of size . After epochs, we decrease the learning rate of to . Our architecture is fully implemented in PyTorch, and our whole training procedure takes about hours on a Tesla V100 GPU. For the smoothing step, we used a window length, i.e., number of coefficients, equal to with st order polynomial as parameters to the Savitzky-Golay filter. We experimentally selected these values since they give slightly better results than others.
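The smoothing step can be illustrated with a small, dependency-free sketch of a Savitzky-Golay filter. It assumes a first-order polynomial and uses a hypothetical window length of 5 on a toy one-dimensional camera path; the function name and the data are illustrative and are not taken from our implementation. In practice a library routine such as `scipy.signal.savgol_filter` would normally be used instead.

```python
def savgol_first_order(values, window_length):
    """First-order Savitzky-Golay smoothing.

    For every sample, a line is fitted by least squares to the samples in a
    window and evaluated at that sample. Near the borders the window is
    shifted inward so that it always contains `window_length` samples.
    """
    n = len(values)
    if window_length % 2 == 0 or window_length < 3 or window_length > n:
        raise ValueError("window_length must be odd, >= 3, and <= len(values)")
    half = window_length // 2
    smoothed = []
    for i in range(n):
        start = min(max(i - half, 0), n - window_length)
        xs = range(start, start + window_length)
        ys = values[start:start + window_length]
        mean_x = sum(xs) / window_length
        mean_y = sum(ys) / window_length
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        smoothed.append(mean_y + slope * (i - mean_x))
    return smoothed


# Hypothetical shaky horizontal camera path: a linear pan plus alternating jitter.
path = [0.5 * t + (1.0 if t % 2 else -1.0) for t in range(20)]
smooth_path = savgol_first_order(path, window_length=5)
```

A purely linear path passes through this filter unchanged, so an intentional constant-velocity pan is preserved while the frame-to-frame jitter is attenuated.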

Video Stabilization Methods

We evaluate five state-of-the-art video stabilization algorithms on two datasets, VSAC105Real and Selfie Video [34]. The baselines span non-learning-based [11], supervised [32], and unsupervised [6] video stabilization methods. Furthermore, we compare our method to the one proposed by Yu et al. [35] because, like ours, it relies heavily on optical flow. Similarly, we consider the work of Liu et al. [22] because, like ours, it uses CNNs for video stabilization.

Metric Method Weather Condition Average
Fog Night Normal Rain Snow
Stability Avg. Score  FuSta [22]
Grundmann et al. [11]
StabNet [32]
DIFRINT [6]
Yu et al. [35]
Ours
Distortion Score  FuSta [22]
Grundmann et al. [11]
StabNet [32]
DIFRINT [6]
Yu et al. [35]
Ours
Cropping Ratio  FuSta [22]
Grundmann et al. [11]
StabNet [32]
DIFRINT [6]
Yu et al. [35]
Ours
Success Rate  FuSta [22]
Grundmann et al. [11]
StabNet [32]
DIFRINT [6]
Yu et al. [35]
Ours
Higher is better Better closer to
Table 2: Comparison across different weather conditions in the VSAC105Real dataset. Our method presents the best average values in comparison to the other competitors for all metrics. Bold indicates the best and underline second best.

Evaluation Metrics

To evaluate our approach, we use three metrics commonly used to evaluate video stabilization algorithms [22, 6, 32, 35], and we additionally report a success rate: i) Stability Score. It assesses the smoothness of the stabilized video; the higher the value, the better. It is computed as the average of the Stability Average Translation and Stability Average Rotation Scores. To compute this score, we estimate the homography matrix between and to obtain the translation and rotation arrays. We then calculate their Fast Fourier Transform (FFT). Finally, we obtain the score as the ratio between the nd through th frequency components and all frequency components. Note that the th frequency component is neglected; ii) Distortion Score. It measures the global distortion caused by a given video stabilization method. It fits a homography matrix between the original and stabilized videos. Then, it finds the anisotropic scaling among these frames; the closer to , the better; iii) Cropping Ratio. It describes the ratio of the remaining frame’s area after stabilization to the original one; iv) Success Rate. It is the ratio of videos that were successfully processed and yielded a distortion score lower than or equal to one.
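The frequency-domain part of the stability score, the cropping ratio, and the success rate can be sketched as follows. This is a simplified illustration rather than the evaluation code used for the tables: the number of low-frequency components (`num_low`) is a hypothetical placeholder, the homography-based extraction of the translation and rotation arrays is replaced by toy signals, and all function names are illustrative.

```python
import cmath
import math


def dft_power(signal):
    """Power spectrum |X_k|^2 of a real 1-D signal via a direct DFT."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(signal))) ** 2
        for k in range(n)
    ]


def stability_score(signal, num_low=5):
    """Share of spectral energy in the lowest `num_low` non-DC components.

    `num_low` is a hypothetical placeholder. A signal with no non-DC energy
    is treated as perfectly stable (score 1.0).
    """
    power = dft_power(signal)[1:]         # neglect the DC component
    power = power[: len(power) // 2]      # keep the non-redundant half
    total = sum(power)
    return sum(power[:num_low]) / total if total > 0 else 1.0


def cropping_ratio(kept_area, original_area):
    """Remaining frame area after stabilization over the original area."""
    return kept_area / original_area


def success_rate(distortion_scores):
    """One entry per video; None marks a video that could not be processed."""
    ok = [d for d in distortion_scores if d is not None and d <= 1.0]
    return len(ok) / len(distortion_scores)


n = 64
slow_pan = [math.sin(2 * math.pi * t / n) for t in range(n)]       # smooth motion
jitter = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]    # shaky motion

translation_score = stability_score(slow_pan)
rotation_score = stability_score(jitter)
stability_avg = 0.5 * (translation_score + rotation_score)
```

In this toy setting the smooth signal scores close to one and the shaky signal close to zero, which matches the intent of the metric: energy concentrated in low frequencies indicates a stable camera path.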

Method Stability Avg. Score Distortion Score Cropping Ratio Success Rate
FuSta [22]
Grundmann et al. [11]
StabNet [32]
DIFRINT [6]
Yu et al. [35]
Ours
Higher is better Better closer to
Table 3: Comparison in the Selfie Video dataset [34]. Our method achieves the best distortion score and comparable results to other baselines. Bold indicates the best, underline second best, and italic the third best result.

6.2 Results

Table 2 shows the results of comparing our method to several video stabilization approaches. As can be seen, our method achieved the best average values among all the baselines in terms of stability average, distortion, cropping ratio, and success rate. Even though our method did not surpass the baselines in every class individually, it still achieved competitive results. Every baseline performs badly in at least one class, while our method is more robust across classes, hence achieving the best overall results on VSAC105Real.

Preserving the content while compensating for camera shakiness is another important feature of our algorithm. Our method achieved the best average cropping ratio among the compared state-of-the-art methods (Table 2). The superiority of our method can be linked to the accurate affine transformation matrix estimation and the smoothing stage. Moreover, our method achieved the highest success rate compared to the competitors, as shown in Table 2. It is worth noting that all baselines failed to stabilize most of the shaky videos under foggy weather conditions. The reason for this outcome is that participating media, like fog, work as a low-pass filter that removes the high-quality features that most video stabilization algorithms depend on to estimate the camera trajectory. We highlight that even though our model was not trained on any samples under foggy weather conditions, it was capable of learning useful features from both raw images and optical flow.

Furthermore, we evaluate our model on the Selfie Video dataset [34], which contains videos under normal weather conditions and standard illumination. Our model achieves comparable results to other methods while maintaining the best distortion score, as shown in Table 3. It should be noted that our model was trained from scratch on synthetic data only.

6.3 Ablation Study

Single Network SIFT More Data No Optical Flow Directed Smoothing Complete Model
Stability Avg. Score
Distortion Score
Cropping Ratio
Success Rate
Higher is better Better closer to
Table 4: Ablation study. Performance for different design choices (best in bold).

We analyzed different design options and showed the effectiveness of each component of our proposed pipeline. The results are reported in Table 4. First, since we use two networks in our pipeline, we trained using a single CNN with four losses: horizontal translation loss, vertical translation loss, rotation loss, and scale loss. As a result, the network was unable to converge well. One problem could be that the translation losses were larger than others. However, even after applying weights for the losses to make them comparable to each other, the network could not learn well (column Single Network in Table 4).

To emphasize the advantages of using our learning-based model for affine transformation matrix estimation over applying SIFT, we use SIFT to find the affine transformation matrix while keeping the smoothing part of our model intact. As expected, SIFT did not perform well (column SIFT in Table 4). Standard feature extractors like SIFT struggle to extract reliable and robust features under adverse conditions. Rain and snow particles, low illumination at night, and foggy weather make finding and matching features rather hard. This leads to inaccurate affine transformations and thus to low-quality stabilized videos.

To further evaluate the advantages of using our smoothing algorithm over directed smoothing, as done in [11], we apply directed smoothing to the predicted camera trajectory while keeping our learning-based model for the affine transformation matrix estimation. As expected, the model does not perform as well as with our proposed smoothing algorithm (column Directed Smoothing in Table 4) because ours considers more sophisticated camera paths and is not limited to constant, linear, and parabolic motions like [11].

To show the need for optical flow in the affine transformation learning, we train using only grayscale images rather than both grayscale and optical flow modalities. As anticipated, not utilizing optical flow information decreased the video stabilization quality (column No Optical Flow in Table 4).

Finally, to demonstrate the effect of our synthetic training data on the quality of video stabilization, we trained our model on more data, including normal weather, adverse weather, and night-time videos. The training was performed from scratch using the VSAC65Synth dataset, and no real data were used. The results indicated no significant improvement over the model trained on the VSNC35Synth dataset (Table 4, More Data column). A few synthetic videos with accurate ground truth were therefore sufficient for our model to learn the affine transformation, and increasing the number and diversity of the training videos did not boost the overall performance.

Method RE LT LL SF LF Average
ORB [28] + RANSAC [9]
ORB [28] + MAGSAC [2]
ORB [28] + LMEDS
SIFT [23] + RANSAC [9]
SIFT [23] + MAGSAC [2]
SIFT [23] + LMEDS
Supervised [7]
Unsupervised [37] 3.84
Ours 4.55 5.58 5.68 5.17 6.14
Lower is better
Table 5: Affine matrix estimation. Comparison among different methods for affine matrix estimation on CA-Unsupervised dataset [37] using  distance. Bold indicates the best result.

To further investigate the accuracy of our estimated affine transformation matrix, we carry out a special set of experiments, shown in Table 5. We compare our learning-based affine transformation estimation model to three types of estimation approaches: i) traditional ones, including ORB and SIFT with RANSAC, MAGSAC, and LMEDS for outlier rejection; ii) supervised; and iii) unsupervised approaches. We note that traditional approaches can be used to estimate the affine transformation directly, but the supervised and unsupervised methods are designed to estimate the homography matrix. Thus, for a fair comparison, we extract the affine transformation from the estimated homography. We utilize the dataset of [37], which contains pairs of images where each image pair includes 6 human-annotated pairs of matching points. The dataset covers regular-texture (RE), low-texture (LT), low-light (LL), small-foreground (SF), and large-foreground (LF) scenes. We use distance to measure the error between the warped and ground-truth points, similar to [37, 33]. Table 5 demonstrates that our method can estimate the affine transformation better under both standard and challenging conditions. Figure 7 shows a qualitative comparison among our model, traditional (e.g., SIFT and ORB), supervised [7], and unsupervised [37] methods. While other methods fail under such challenging conditions, our method performs well because it learned how to extract resilient features using our specially designed synthetic data.
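The evaluation protocol above can be illustrated with a short sketch. It makes two assumptions that are not specified in the text: the affine part is taken from a homography by normalizing with its bottom-right entry and discarding the perspective row, and the point error is measured with the Euclidean distance. The function names and the toy matrices are hypothetical.

```python
import math


def affine_from_homography(H):
    """Normalize by H[2][2] and keep the top two rows.

    This drops the perspective terms; it is one simple way to obtain an
    affine approximation and is an assumption of this sketch.
    """
    scale = H[2][2]
    return [[H[r][c] / scale for c in range(3)] for r in range(2)]


def warp_point(A, point):
    """Apply a 2x3 affine matrix to a 2-D point."""
    x, y = point
    return (A[0][0] * x + A[0][1] * y + A[0][2],
            A[1][0] * x + A[1][1] * y + A[1][2])


def mean_point_error(A, src_points, dst_points):
    """Mean Euclidean distance between warped source and ground-truth points."""
    distances = []
    for p, q in zip(src_points, dst_points):
        wx, wy = warp_point(A, p)
        distances.append(math.hypot(wx - q[0], wy - q[1]))
    return sum(distances) / len(distances)


# Toy homography that is a pure translation by (5, -2) after normalization.
H = [[2.0, 0.0, 10.0],
     [0.0, 2.0, -4.0],
     [0.0, 0.0, 2.0]]
A = affine_from_homography(H)

src = [(0.0, 0.0), (1.0, 1.0)]
exact = [(5.0, -2.0), (6.0, -1.0)]     # ground truth matching the warp exactly
shifted = [(8.0, 2.0), (9.0, 3.0)]     # ground truth offset by (3, 4)

error_exact = mean_point_error(A, src, exact)
error_shifted = mean_point_error(A, src, shifted)
```

Averaging the per-point distances over the annotated correspondences of an image pair gives one error value per pair, which can then be averaged per category.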

Figure 7: Qualitative comparison for affine transformation. Comparison between unsupervised [37], supervised [7], SIFT [23], ORB [28] and ours for affine transformation estimation. Our method can handle challenging conditions such as low textures where other methods perform poorly.

7 Conclusion and Limitations

We showed that most state-of-the-art video stabilization methods do not perform well under adverse weather conditions. We proposed a novel synthetic-aware video stabilization algorithm that requires only synthetic data for training. Our experimental results demonstrated that our method surpasses all baselines on average across weather conditions. We also provided one real dataset (VSAC105Real) for video stabilization under adverse conditions and two synthetic datasets (VSNC35Synth and VSAC65Synth) for training purposes. Our ablation studies demonstrated that current affine transformation matrix estimation methods fail under challenging conditions. Repetitive textures caused by rain and snow particles, the smoothing effect of fog, and the lack of sufficient features at night all pose challenges to current video stabilization algorithms; nevertheless, our method performed well under these conditions.

Although our approach achieved state-of-the-art results, it has a few limitations. Increasing the number of synthetic training samples did not boost the performance of our video stabilization method. The reason could be that our model had already learned how to estimate the affine transformation matrix, so further training videos were not advantageous. At the same time, our affine estimation model did not perform well in seemingly straightforward scenarios such as large-foreground images. Additionally, it achieved only comparable results on the Selfie Video dataset, which is likely because our generated synthetic dataset did not include any selfie videos. Even so, our video stabilization algorithm obtained satisfactory results. Hence, we have demonstrated that synthetic data can be very useful not only as an additional data source, as is usually expected, but also as an essential element of novel synthetic-aware computer vision algorithms.

References

  • [1] M. K. Ali, S. Yu, and T. H. Kim (2020) Learning deep video stabilization without optical flow. arXiv preprint arXiv:2011.09697.
  • [2] D. Barath, J. Matas, and J. Noskova (2019) MAGSAC: marginalizing sample consensus. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10197–10205.
  • [3] H. Bay, T. Tuytelaars, and L. Van Gool (2006) SURF: Speeded up robust features. In European conference on computer vision, pp. 404–417.
  • [4] A. Bradley, J. Klivington, J. Triscari, and R. van der Merwe (2021) Cinematic-L1 Video Stabilization with a Log-Homography Model. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1041–1049.
  • [5] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A Naturalistic Open Source Movie for Optical Flow Evaluation. In European conference on computer vision.
  • [6] J. Choi and I. S. Kweon (2020) Deep iterative frame interpolation for full-frame video stabilization. ACM Transactions on Graphics (TOG) 39 (1), pp. 1–9.
  • [7] D. DeTone, T. Malisiewicz, and A. Rabinovich (2016) Deep image homography estimation. arXiv preprint arXiv:1606.03798.
  • [8] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: An open urban driving simulator. In Conference on robot learning, pp. 1–16.
  • [9] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
  • [10] M. Gonzalez-Franco, Ofek, et al. (2020) The Rocketbox Library and the Utility of Freely Available Rigged Avatars. Frontiers in virtual reality 1 (article 561558).
  • [11] M. Grundmann, V. Kwatra, and I. Essa (2011) Auto-directed video stabilization with robust l1 optimal camera paths. In CVPR 2011, pp. 225–232.
  • [12] Z. Hao, A. Mallya, S. Belongie, and M. Liu (2021) GANcraft: Unsupervised 3D Neural Rendering of Minecraft Worlds. arXiv preprint arXiv:2104.07659.
  • [13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] A. Kerim, U. Celikcan, E. Erdem, and A. Erdem (2021) Using synthetic data for person tracking under adverse weather conditions. Image and Vision Computing 111, pp. 104187.
  • [15] A. Kerim, L. Soriano Marcolino, and R. Jiang (2021) Silver: novel rendering engine for data hungry computer vision models. In 2nd International Workshop on Data Quality Assessment for Machine Learning.
  • [16] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [17] K. Lee, Y. Chuang, B. Chen, and M. Ouhyoung (2009) Video stabilization using robust feature trajectories. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1397–1404.
  • [18] S. Li, L. Yuan, J. Sun, and L. Quan (2015) Dual-feature warping-based motion model estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4283–4291.
  • [19] S. Liu, Y. Wang, L. Yuan, J. Bu, P. Tan, and J. Sun (2012) Video stabilization with a depth camera. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 89–95.
  • [20] S. Liu, L. Yuan, P. Tan, and J. Sun (2013) Bundled camera paths for video stabilization. ACM Transactions on Graphics (TOG) 32 (4), pp. 1–10.
  • [21] Y. Liu, F. Xue, and H. Huang (2021) UrbanScene3D: a large scale urban scene dataset and simulator. arXiv preprint arXiv:2107.04286.
  • [22] Y. Liu, W. Lai, M. Yang, Y. Chuang, and J. Huang (2021) Hybrid neural fusion for full-frame video stabilization. arXiv preprint arXiv:2102.06205.
  • [23] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110.
  • [24] Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y. Yao, S. Li, T. Fang, and L. Quan (2020) ASLFeat: learning local features of accurate shape and localization. Computer Vision and Pattern Recognition (CVPR).
  • [25] F. Reda, R. Pottorff, J. Barker, and B. Catanzaro (2017) Flownet2-pytorch: Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. GitHub. Note: https://github.com/NVIDIA/flownet2-pytorch
  • [26] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humenberger (2019) R2D2: repeatable and reliable detector and descriptor. In NeurIPS.
  • [27] S. R. Richter, V. Vineet, S. Roth, and V. Koltun (2016) Playing for Data: Ground Truth from Computer Games. In European conference on computer vision.
  • [28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski (2011) ORB: An efficient alternative to SIFT or SURF. In 2011 International conference on computer vision, pp. 2564–2571.
  • [29] A. Savitzky and M. J. Golay (1964) Smoothing and differentiation of data by simplified least squares procedures. Analytical chemistry 36 (8), pp. 1627–1639.
  • [30] A. Shafaei, J. J. Little, and M. Schmidt (2016) Play and Learn: Using Video Games to Train Computer Vision Models. arXiv:1608.01745.
  • [31] A. Tsirikoglou (2022) Synthetic data for visual machine learning: a data-centric approach. Ph.D. Thesis, Linköping University Electronic Press.
  • [32] M. Wang, G. Yang, J. Lin, S. Zhang, A. Shamir, S. Lu, and S. Hu (2018) Deep online video stabilization with multi-grid warping transformation learning. IEEE Transactions on Image Processing 28 (5), pp. 2283–2292.
  • [33] N. Ye, C. Wang, H. Fan, and S. Liu (2021) Motion basis learning for unsupervised deep homography estimation with subspace projection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13117–13125.
  • [34] J. Yu and R. Ramamoorthi (2018) Selfie video stabilization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 551–566.
  • [35] J. Yu and R. Ramamoorthi (2020) Learning video stabilization using optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8159–8167.
  • [36] J. Zhang, D. Sun, Z. Luo, A. Yao, L. Zhou, T. Shen, Y. Chen, L. Quan, and H. Liao (2019) Learning two-view correspondences and geometry using order-aware network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5845–5854.
  • [37] J. Zhang, C. Wang, S. Liu, L. Jia, N. Ye, J. Wang, J. Zhou, and J. Sun (2020) Content-aware unsupervised deep homography estimation. In European Conference on Computer Vision, pp. 653–669.
  • [38] L. Zhang, Q. Zheng, H. Liu, and H. Huang (2018) Full-reference stability assessment of digital video stabilization based on riemannian metric. IEEE Transactions on Image Processing 27 (12), pp. 6051–6063.