PyTorch Image Quality:
Metrics for Image Quality Assessment

Sergey Kastryulin
Computational Imaging Laboratory
Skolkovo Institute of Science and Technology
Moscow, Russia
sergey.kastryulin@skoltech.ru
Jamil Zakirov
Independent Researcher
Skolkovo Institute of Science and Technology
Moscow, Russia
jamil.zakirov@skoltech.ru
Denis Prokopenko
Biomedical Engineering Department
School of Biomedical Engineering and Imaging Sciences
King’s College London, London, UK
d.prokopenko@outlook.com
Dmitry V. Dylov
Computational Imaging Laboratory
Skolkovo Institute of Science and Technology
Moscow, Russia
d.dylov@skoltech.ru

Abstract

Image Quality Assessment (IQA) metrics are widely used to quantitatively estimate the extent of image degradation following some forming, restoring, transforming, or enhancing algorithms. We present PyTorch Image Quality (PIQ), a usability-centric library that contains the most popular modern IQA algorithms, guaranteed to be correctly implemented according to their original propositions and thoroughly verified. In this paper, we detail the principles behind the foundation of the library, describe the evaluation strategy that makes it reliable, provide the benchmarks that showcase the performance–time trade-offs, and underline the benefits of GPU acceleration given that the library runs on the PyTorch backend. PyTorch Image Quality is open-source software: https://github.com/photosynthesis-team/piq/.

Keywords: Image Quality Metrics Computer Vision

1 Introduction

Figure 1: PIQ open source library implements 38+ metrics in three main categories: Full-Reference (FR), No-Reference (NR), and Distribution-Based (DB). Our implementations enable quantitative estimation of the quality of images from various domains, including medical scans. Right column shows the counters of the currently implemented metrics. Note that DB metrics can be used with different feature extractors, further diversifying the pool of available metrics.

With the ever-rising interest in the generation, recovery, and enhancement of images, a number of computational Image Quality Assessment (IQA) methods have become available. These methods aim to produce a score that correlates as closely as possible with the human perception of Image Quality (IQ). Given the large number of assorted IQA options and the frequent mismatch between different implementations of the same metric, the demand for a reliable and community-validated resource has become evident. Specifically, such a resource should embrace a plethora of modern image metrics, eliminating the need to search for, install, or re-implement the algorithms.

Although popular IQA libraries provide access to some Image Quality Metrics (IQMs) in the form of user-friendly packages, they have a set of limitations. Blind Image Quality Toolbox [81] and Image Quality Assessment Toolbox [91] are verified MATLAB implementations of no-reference (NR) and full-reference (FR) IQMs. Unfortunately, the growing popularity of Python makes these solutions less relevant for hands-on research efforts, where the majority of image and vision computing experts rely on Python for its well-tested open source repositories. Kornia [71], PIQA [72], and IQA-PyTorch [10] do provide Python interfaces for the metrics and the IQA-optimization [21], while also focusing on their use as loss functions. However, these libraries implement a rather limited number of algorithms (number of IQMs implemented: Kornia, 2; PIQA, 11; IQA-PyTorch, 25).

This manuscript introduces the PyTorch Image Quality (PIQ) open source library, which provides an extended collection of implementations of IQA metrics and the corresponding loss functions, using PyTorch [66] to enable fast and efficient computations on the graphics processing unit (GPU). This work is the result of three years of collecting and comparing the best image quality metrics in one place, with independent implementation, optimization, and testing. At the time of writing this manuscript, our GitHub project receives over 4000 monthly views, has accumulated 640 stars, and has been forked by over 57 open-source projects. The tool has been actively used in the research community, including in the development of new neural network architectures [59] and in a large-scale medical image quality study [45].

2 Design principles

We created PyTorch Image Quality to maximise the benefit for the research community working on a plethora of image-to-image translation problems in a variety of visual data domains (Fig. 1). To do that, we build the library based on the following design principles:

Be user-friendly. Unfortunately, the complexity inherent to the field of Computer Vision is multiplied by that of the sophisticated algorithms designed to boost the performance of the IQA approaches. We hide this internal complexity behind easy-to-use APIs that follow the principle of the least astonishment [80] to provide a seamless user experience.

Be reliable. Today, publicly available implementations may produce inconsistent results, even for the most well-known and widely used IQMs (e.g. SSIM [85]). The PyTorch Image Quality library focuses on thorough testing to ensure consistency with the formal metric definitions and the original implementations proposed by their authors (if these implementations exist).

Be pragmatic. A majority of modern (and some classical) IQMs are computationally inefficient by design, which hinders their ultimate performance. Therefore, PyTorch Image Quality purposely adds an optimization only when its extra complexity is justified by a compelling gain in performance. Inspired by the same principle in PyTorch, we state that trading 10% of speed for a model that is significantly easier to use is acceptable, whereas trading 100% is not [66].

3 Usability–centric approach

Currently, the PyTorch Image Quality library contains implementations of the following 38 metrics: 21 Full-Reference (FR) IQMs (PSNR, SSIM [85], MS-SSIM [88], IW-SSIM [87], VIF [76], GMSD [92], MS-GMSD and MS-GMSDc [98], FSIM and FSIMc [101], SR-SIM and SR-SIMc [102], VSI [100], MDSI [64], HaarPSI [70], Content and Style Perceptual Scores [44], LPIPS [103], DISTS [19], PieAPP [68], DSS [3]), 2 No-Reference (NR) IQMs (BRISQUE [62], Total Variation), and 15 Distribution-Based (DB) IQMs (KID [6], FID [38], GS [47], Inception Score (IS) [73], MSID [24], all implemented with three different feature extractors: Inception Net [25], VGG16, and VGG19 [79]). The general taxonomy of IQA metrics, their detailed description, and a pertinent discussion can be found in Appendix A.

Remark. Throughout the work, we use the term “metrics” to describe the IQA algorithms. Technically, this term is not mathematically correct, because a metric is a function for which the identity of indiscernibles, the symmetry, and the triangle inequality must hold [choudhary1993metric_spaces]. For the majority of implemented algorithms, one or several of these attributes do not hold. However, the term is still used here, which reflects the commonplace convention in the community.

Metrics as loss functions

Previous studies of IQMs showed their efficiency in reflecting the human perception of visual quality [2, 1, 18, 45]. That brings a temptation to use the best-performing differentiable IQMs as the loss functions for a direct optimization of the image processing models [20]. To enable that possibility, all PIQ metrics are well-integrated with the PyTorch [66] backend, enabling the automatic computing of the gradients of the differentiable models.

Besides, the use of PyTorch enables the GPU acceleration, yielding a faster computation of the metrics and the losses (see Fig. 3 for more details). Moreover, our implementation strategy allows a seamless integration with the most common deep learning pipelines in PyTorch. For instance, the flexible interface enables our metric implementations to be used as additional layers of a deep neural network.
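The idea of using a differentiable metric as a loss can be illustrated with a minimal sketch. The code below is written in plain PyTorch and does not use the PIQ API; the `psnr` helper, the learning rate, and the toy data are assumptions made for this illustration only. It optimises a noisy image directly by minimising the negative PSNR, relying on automatic differentiation for the gradients.

```python
import torch


def psnr(x: torch.Tensor, y: torch.Tensor, data_range: float = 1.0) -> torch.Tensor:
    """Mean PSNR (in dB) over a batch of images shaped (N, C, H, W)."""
    mse = torch.mean((x - y) ** 2, dim=(1, 2, 3))
    eps = 1e-8  # avoids a division by zero for identical images
    return (10.0 * torch.log10(data_range ** 2 / (mse + eps))).mean()


torch.manual_seed(0)
target = torch.rand(4, 3, 32, 32)
# A noisy copy of the target stands in for the output of a restoration model.
pred = (target + 0.1 * torch.randn_like(target)).clamp(0, 1).requires_grad_(True)

optimizer = torch.optim.SGD([pred], lr=0.1)
psnr_before = psnr(pred.detach(), target).item()
for _ in range(20):
    optimizer.zero_grad()
    loss = -psnr(pred, target)  # higher PSNR is better, so minimise its negative
    loss.backward()             # gradients come from PyTorch autograd
    optimizer.step()
psnr_after = psnr(pred.detach(), target).item()
print(f"PSNR before: {psnr_before:.2f} dB, after: {psnr_after:.2f} dB")
```

In a real pipeline the optimised tensor would be replaced by the parameters of a model, and the hand-written PSNR by any differentiable metric exposed as a loss.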

Integrated feature extractors

DB IQMs are computed on features obtained from images with feature extractors, which are typically represented by pre-trained convolutional neural networks. Most methods are evaluated using a single feature extractor (typically, Inception Net [25]). A recent study on IQMs for Magnetic Resonance Imaging [45] showed that the choice of feature extractor plays a critical role, heavily influencing the performance of DB metrics. These results may inspire our users to experiment with various feature extractor options. To support that, we added the possibility for users to provide their own feature extractors or to choose one of the models integrated into the library (Inception Net [25], VGG16, and VGG19 [79]).

Chromatic versions of luminance based metrics

The majority of methods are designed to be applied to images in RGB color space. However, some of them (e.g. FSIM [101] and SR-SIM [102]) are designed for grayscale images or for the luminance component of color images. Because the chrominance information also affects how the human visual system (HVS) interprets images, better performance can be expected if the chrominance information is incorporated for color IQA. For that, we follow the common approach of first converting the RGB images into the YIQ color space. The Y channel is used for the computations of the initial grayscale variants, while the I and the Q components are added to obtain the chromatic versions of the metrics (e.g. FSIMc [101] and SR-SIMc [102] respectively).
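The color space conversion step can be sketched as follows. The matrix holds the commonly quoted NTSC RGB-to-YIQ coefficients, which is an assumption here since the text does not list the exact values used in the library; the function name `rgb_to_yiq` is likewise only illustrative.

```python
import torch

# Commonly quoted NTSC RGB -> YIQ coefficients (assumed, not taken from the paper).
# Rows produce the Y (luminance), I and Q (chrominance) channels respectively.
RGB_TO_YIQ = torch.tensor([
    [0.299, 0.587, 0.114],
    [0.5959, -0.2746, -0.3213],
    [0.2115, -0.5227, 0.3112],
])


def rgb_to_yiq(x: torch.Tensor) -> torch.Tensor:
    """Convert a batch of RGB images shaped (N, 3, H, W) to YIQ."""
    weights = RGB_TO_YIQ.to(dtype=x.dtype, device=x.device)
    # Apply the 3x3 matrix to the channel dimension of every pixel.
    return torch.einsum("oc,nchw->nohw", weights, x)


gray = torch.full((1, 3, 2, 2), 0.5)  # a mid-gray image has no chrominance
yiq = rgb_to_yiq(gray)
luminance, chrominance = yiq[:, :1], yiq[:, 1:]
```

For a gray input the Y channel keeps the original intensity while I and Q vanish, which is why grayscale variants of the metrics can operate on Y alone.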

4 Evaluation

New metrics and measures for IQA that claim to be better than their predecessors emerge every day. To prove that, authors show IQMs’ performance on specifically designed IQA datasets consisting of pairs of distorted and reference images accompanied by similarity scores estimated by human assessors. The availability of such datasets makes it possible to compute correlations between human scores and metrics’ estimates of quality, and thus to find the algorithms that best reflect the judgement of the HVS.

Database	Year	Reference	Distorted	Subjective Score
IVC [50]	2005	10	235	MOS (1 $\sim$ 5)
LIVE IQA [77, 75]	2006	29	779	DMOS (0 $\sim$ 100)
A57 [8, 7]	2007	3	54	DMOS (0 $\sim$ 1)
Toyama/MICT [39]	2008	14	168	MOS (1 $\sim$ 5)
TID2008 [67]	2008	25	1,700	MOS (0 $\sim$ 9)
CSIQ [49, 22]	2009	30	866	DMOS (0 $\sim$ 1)
IVC-LAR [82]	2009	8	120	MOS (1 $\sim$ 5)
WIQ [23]	2009	7	80	DMOS (0 $\sim$ 100)
IRSQ [60]	2012	57	171	MOS (0 $\sim$ 5)
VCLFER [96, 97]	2012	23	552	MOS (0 $\sim$ 100)
LIVE MD [43, 42]	2013	15	405	DMOS (0 $\sim$ 100)
TID2013 [26]	2013	25	3,000	MOS (0 $\sim$ 9)
MDID2013 [37]	2014	12	324	DMOS (0.3 $\sim$ 0.6)
CID2013 [84]	2013	8	480	MOS (0 $\sim$ 9)
CIDIQ [58]	2014	23	690	MOS (0 $\sim$ 9)
SIQAD [93]	2015	20	980	DMOS (0 $\sim$ 100)
LIVE in the wild (CLIVE) [29, 28]	2015	-	1,162	MOS (1 $\sim$ 5)
MD-IVL [13, 14]	2017	10	750	MOS (0 $\sim$ 100)
MDID [83]	2017	20	1,600	MOS (0 $\sim$ 9)
KonIQ-10k [40]	2018	10,073	10,073	MOS (1 $\sim$ 100)
KADID-10k [55]	2019	81	10,125	DMOS (1 $\sim$ 5)
KADIS-700k [56]	2019	140,000	700,000	DMOS (1 $\sim$ 5)
PaQ-2-PiQ [94]	2019	-	40,000	MOS (0 $\sim$ 100)
SPAQ [27]	2020	11,125	-	MOS (0 $\sim$ 100)
PIPAL [33, 34, 32]	2020	250	29,000	MOS (Elo rating system)

Table 1: Overview of existing IQA datasets

We use these results to evaluate the correctness of our implementations. For that, implemented metrics are computed on selected IQA datasets and obtained values are compared with the ones reported in corresponding research papers.

	TID2013	KADID-10k	PIPAL
	PIQ / Reference	PIQ / Reference	PIQ / Reference
PSNR	$0.69$ / $0.69$ [26]	$0.68$ / $-$	$0.41$ / $0.41$ [33]
SSIM [85]	$0.72$ / $0.64$ [26]	$0.72$ / $0.72$ [55]	$0.50$ / $0.53$ [33]
MS-SSIM [88]	$0.80$ / $0.79$ [26]	$0.80$ / $0.80$ [55]	$0.55$ / $0.46$ [33]
IW-SSIM [87]	$0.78$ / $0.78$ [2]	$0.85$ / $0.85$ [55]	$0.60$ / $-$
VIFp [76]	$0.61$ / $0.61$ [26]	$0.65$ / $0.65$ [55]	$0.50$ / $-$
GMSD [92]	$0.80$ / $0.80$ [98]	$0.85$ / $0.85$ [55]	$0.58$ / $-$
MS-GMSD [98]	$0.81$ / $0.81$ [98]	$0.85$ / $-$	$0.59$ / $-$
MS-GMSDc [98]	$0.89$ / $0.89$ [98]	$0.87$ / $-$	$0.59$ / $-$
FSIM [101]	$0.80$ / $0.80$ [26]	$0.83$ / $0.83$ [55]	$0.59$ / $0.60$ [33]
FSIMc [101]	$0.85$ / $0.85$ [26]	$0.85$ / $0.85$ [55]	$0.59$ / $-$
SR-SIM [102]	$0.81$ / $0.81$ [2]	$0.84$ / $0.84$ [55]	$0.54$ / $-$
SR-SIMc [102]	$0.87$ / $-$	$0.87$ / $-$	$0.57$ / $-$
VSI [100]	$0.90$ / $0.90$ [2]	$0.88$ / $0.86$ [55]	$0.54$ / $-$
MDSI [64]	$0.89$ / $0.89$ [64]	$0.89$ / $0.89$ [55]	$0.59$ / $-$
HaarPSI [70]	$0.87$ / $0.87$ [70]	$0.89$ / $0.89$ [55]	$0.59$ / $-$
Content_VGG16 [44]	$0.71$ / $-$	$0.72$ / $-$	$0.45$ / $-$
Style_VGG16 [44]	$0.54$ / $-$	$0.65$ / $-$	$0.34$ / $-$
LPIPS_VGG16 [103]	$0.67$ / $0.67$ [19]	$0.72$ / $-$	$0.57$ / $0.58$ [33]
DISTS [19]	$0.81$ / $0.83$ [19]	$0.88$ / $-$	$0.62$ / $0.66$ [33]
PieAPP [68]	$0.84$ / $0.88$ [19]	$0.87$ / $-$	$0.70$ / $0.71$ [33]
DSS [3]	$0.79$ / $0.79$ [2]	$0.86$ / $0.86$ [55]	$0.63$ / $-$
No-reference metrics
BRISQUE [62]	$0.37$ / $0.84$ [2]	$0.33$ / $0.53$ [55]	$0.21$ / $-$
Distribution-based metrics
KID_InceptionV3 [6]	$0.42$ / $-$	$0.66$ / $-$	$0.12$ / $-$
FID_InceptionV3 [38]	$0.67$ / $-$	$0.66$ / $-$	$0.18$ / $-$
GS_InceptionV3 [47]	$0.37$ / $-$	$0.37$ / $-$	$0.02$ / $-$
IS_InceptionV3 [73]	$0.26$ / $-$	$0.25$ / $-$	$0.09$ / $-$
MSID_InceptionV3 [24]	$0.21$ / $-$	$0.32$ / $-$	$0.02$ / $-$

Note 1: Typically, the difference between the correlation values obtained with a new and a reference implementation is used to verify the accuracy of the former. Zero difference is an indication of correctness. However, we find it peculiar that for some metrics (e.g., SSIM, MS-SSIM, VSI) the same implementation exactly matches only one of the two reference values.
Note 2: All three considered datasets are widely used in the IQA research community to assess metrics on their ability to estimate IQ in the same domain, namely general natural images. However, we observe a significant drop of SRCC values for all metrics on the larger PIPAL dataset compared to the smaller TID2013 and KADID-10k datasets. While a domain shift caused by the larger variety of distortions introduced in the newer PIPAL dataset could explain the observation, more investigation is required.

Table 2: Comparison of correlation values (in terms of SRCC) reported in literature with PIQ implementations.

IQA datasets

IQA datasets are typically designed to evaluate existing IQMs on their ability to answer previously unconsidered questions, such as IQ from two different viewing distances (CIDIQ dataset [58]), IQ on images perturbed with multiple types of distortions (MDID [83], MD-IVL [13] and VLC [95] datasets), and evaluation of GAN-based image restoration algorithms (PIPAL [33] and NTIRE 2021 [34] challenges and datasets). Other commonly used datasets are LIVE [42], TID2013 [26] and KADID-10k [55]. Some works even propose methods to construct a general-case IQA dataset with extremely diverse image characteristics [54].

Subjective annotations

The majority of the listed datasets provide image pairs accompanied by subjective quality scores. Typically, subjective scores are estimated by human assessors and aggregated in the form of Mean Opinion Scores (MOS) or Differential Mean Opinion Scores (DMOS). In some cases, raw scoring data is first converted to z-scores, then averaged and re-scaled to the range from 0 to 100 to account for different scoring across respondents, as proposed in [74].
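The z-score aggregation described above can be sketched in a few lines. The function name, the use of the unbiased standard deviation, and the min-max re-scaling to [0, 100] are assumptions made for this illustration; the exact procedure of [74] may differ in its details (for example, in outlier rejection).

```python
import torch


def mos_from_raw_scores(raw: torch.Tensor) -> torch.Tensor:
    """Aggregate raw ratings shaped (num_assessors, num_images) into 0..100 scores."""
    # Normalise every assessor separately to remove individual bias and scale.
    mean = raw.mean(dim=1, keepdim=True)
    std = raw.std(dim=1, keepdim=True)
    z_scores = (raw - mean) / std
    # Average the z-scores over assessors, then re-scale linearly to [0, 100].
    z_mean = z_scores.mean(dim=0)
    return 100.0 * (z_mean - z_mean.min()) / (z_mean.max() - z_mean.min())


raw = torch.tensor([
    [1.0, 2.0, 3.0, 5.0],  # a strict assessor
    [3.0, 4.0, 5.0, 7.0],  # a lenient assessor with the same relative opinions
])
scores = mos_from_raw_scores(raw)
```

In this toy example both assessors agree once their personal offsets are removed, so the aggregated scores depend only on the shared relative judgement.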

Datasets selection

A significant number of IQA databases have come out over the last 15 years. There is currently no gold standard dataset. Table 1 shows that IQA datasets use a variety of subjective testing methodologies, number of images, and number of distortions.

Typically, usage of datasets with a small number of images and distortions results in high variance between evaluation results. Considering that, our main criteria for selection were the size of the dataset and distortions variety.

We selected the TID2013 [26] and KADID-10k [55] databases as two of the most popular in the research community, and PIPAL [33] as the one with a higher variety of introduced distortions. After that, we selected an evaluation criterion (type of correlation) that can be computed on the selected datasets and compared with the reference values.

Evaluation criteria

Among numerous evaluation criteria, PLCC, SRCC, and KRCC are the most commonly used in large-scale studies of IQ metrics.

Pearson linear correlation coefficient (PLCC) requires the produced scores to be linear with respect to the subjective ratings. Previous studies [1, 45] showed a non-linear relation between IQM scores and human assessors’ scores. A common solution is to apply a non-linear regression by adopting the five-parameter modified logistic function [77]. Even though this approach solves the non-linearity problem, we do not use it due to the poor reproducibility of model fitting results among studies.
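For reference, the five-parameter logistic mapping mentioned above is commonly written in the following form, where $x$ is the raw metric score and $\beta_{1}, \dots, \beta_{5}$ are the fitted parameters. The notation is an assumption of this note and is not taken from the text.

```latex
f(x) = \beta_{1} \left( \frac{1}{2} - \frac{1}{1 + \exp\left( \beta_{2} (x - \beta_{3}) \right)} \right) + \beta_{4} x + \beta_{5}
```

The fitted values depend on the optimiser and its initialisation, which is one plausible source of the reproducibility issue noted above.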

Spearman’s rank-order correlation coefficient (SRCC) and Kendall rank correlation coefficient (KRCC) measure monotonicity between two measured quantities. Despite different calculation methods, KRCC is found to be highly consistent with the SRCC [2]. Considering that, we use only the most popular SRCC score for further evaluations.

$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i = 1}^{n} d_{i}^{2}}{n (n^{2} - 1)}, \tag{1}$$

where $d_{i}$ is the difference between the $i$-th image’s ranks in the objective and the subjective ratings and $n$ is the number of observations.
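Equation (1) can be checked on a small example. The sketch below is a direct transcription of the formula in PyTorch and assumes that there are no tied values, in which case the rank-difference form is exact; the function names and the toy scores are hypothetical.

```python
import torch


def ranks(values: torch.Tensor) -> torch.Tensor:
    """1-based ranks of a 1-D tensor; assumes there are no tied values."""
    order = torch.argsort(values)
    result = torch.empty_like(order)
    result[order] = torch.arange(1, values.numel() + 1, device=values.device)
    return result


def srcc(objective: torch.Tensor, subjective: torch.Tensor) -> float:
    """Spearman's rank-order correlation computed via the rank differences d_i."""
    n = objective.numel()
    d = (ranks(objective) - ranks(subjective)).double()
    return float(1.0 - 6.0 * torch.sum(d ** 2) / (n * (n ** 2 - 1)))


metric_scores = torch.tensor([0.91, 0.42, 0.77, 0.30, 0.65])  # objective ratings
human_scores = torch.tensor([4.5, 2.0, 3.9, 2.4, 3.1])        # subjective ratings
value = srcc(metric_scores, human_scores)
```

Here only two images swap places between the two rankings, so the sum of squared rank differences is 2 and the coefficient equals $1 - 12 / 120 = 0.9$.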

Figure 2: Relationship between the computation time of a metric on CPU and their performance in terms of SRCC score on Natural Images from KADID-10k dataset [55] (top row) and MRI Images [45] (bottom row) for all metrics (left column) and zoomed-in region indicated in red (right column). Metrics with the best time-quality relation are located in the top-left corner.

Figure 3: Relationship between metrics’ computation time on GPU and their performance in terms of SRCC score on Natural Images from KADID-10k dataset [55] (top row) and MRI Images [45] (bottom row) for all metrics (left column) and zoomed-in region indicated in red (right column). Metrics with the best time-quality relation are located in the top-left corner.

Implementation details of DB IQMs

DB IQMs are not originally designed for pair-wise comparison of images. Instead, they are intended to be used to compare distributions of two image sets. However, we investigate DB metrics for pair-wise comparison of images following the computation strategy proposed in [45]. To account for the original setting, we represent each image in a pair as a set of overlapping patches of size $96 \times 96$ with stride $32$, which allows us to reformulate the pair-wise comparison of images as a comparison of patch distributions. After we extract features from the two sets of patches, we proceed with the original flow of each metric’s computation.
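The patch decomposition can be sketched with `torch.Tensor.unfold`. The function name is hypothetical, and the handling of image borders (incomplete patches are dropped here) is an assumption, since the text only specifies the patch size and the stride.

```python
import torch


def extract_patches(image: torch.Tensor, patch_size: int = 96, stride: int = 32) -> torch.Tensor:
    """Split an image shaped (C, H, W) into overlapping square patches.

    Returns a tensor shaped (num_patches, C, patch_size, patch_size).
    Patches that would cross the image border are dropped (an assumption).
    """
    channels = image.shape[0]
    # Slide a window over height, then over width: (C, nH, nW, patch, patch).
    windows = image.unfold(1, patch_size, stride).unfold(2, patch_size, stride)
    return windows.permute(1, 2, 0, 3, 4).reshape(-1, channels, patch_size, patch_size)


image = torch.rand(3, 160, 224)
patches = extract_patches(image)  # 3 window rows x 5 window columns = 15 patches
```

Each of the two images in a pair would be split in this way, and the resulting patch sets would then be passed through a feature extractor before the DB metric is computed.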

Confirmation of implementations correctness

All implementations in the PyTorch Image Quality library are verified to be consistent with the original implementations proposed by the authors of each metric on selected IQA datasets. Refer to Table 2 for a detailed comparison.

Typically, IQMs are evaluated on a single dataset. Our verification results allow comparing metric performance across different image sources to show that even correct implementations (that match correlation values on initial datasets) may not match values reported on different datasets. It is also worth mentioning that the performance of some feature extraction-based metrics (e.g. BRISQUE) cannot be fully reproduced on IQA datasets even though the code matches the official implementation of the metric provided by its authors.

5 Performance – complexity trade-off

Figures 2 and 3 compare the performance of the implemented metrics in terms of SRCC and computation time on CPU and GPU for Natural Images from the KADID-10k [55] dataset and MRI Images [45]. Benchmarks were performed on a dedicated instance of NVIDIA DGX Station with an Intel Xeon E5-2698 v4 CPU and four NVIDIA V100 32 GB GPUs. A single graphics card was used in all experiments for convenience and simplicity of comparison. Even though no processes other than the system utilities of the GNU/Linux 5.4.0-91-generic x86-64 operating system were running during the benchmarks, we recommend focusing on the relative performance of the metrics rather than the exact values of their computation time.
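A timing measurement of this kind could be set up roughly as below. This is a generic sketch and not the benchmarking code behind Figures 2 and 3; the helper names, the warm-up call, the number of repeats, and the toy `mse` metric are all assumptions. The explicit synchronisation matters on GPU because CUDA kernels are launched asynchronously.

```python
import time

import torch


def time_metric(metric, x: torch.Tensor, y: torch.Tensor, repeats: int = 10) -> float:
    """Average wall-clock time in seconds of a single metric(x, y) call."""
    metric(x, y)  # warm-up call, excluded from the measurement
    if x.is_cuda:
        torch.cuda.synchronize()  # wait for pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        metric(x, y)
    if x.is_cuda:
        torch.cuda.synchronize()  # make sure all launched kernels have finished
    return (time.perf_counter() - start) / repeats


def mse(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """A toy stand-in for an image quality metric."""
    return torch.mean((x - y) ** 2)


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(8, 3, 256, 256, device=device)
y = torch.rand_like(x)
seconds = time_metric(mse, x, y)
print(f"{device}: {seconds * 1e3:.3f} ms per call")
```

As noted above, absolute numbers from such a measurement depend on the hardware and system load, so only comparisons between metrics on the same machine are meaningful.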

Several key observations can be made based on the results of the benchmarks. DB IQMs tend to group in the lower-right corner, steadily showing poor performance both in terms of CPU and GPU computation time and SRCC values (except for FID_VGG16 on MRI data). Even though GPU acceleration significantly speeds up their computation, the computation time gap remains considerable.

The best trade-off between quality and performance is achieved by metrics located in the upper-left corner. In all experiments, a group of algorithms clusters there, forming a category of methods attractive for practical application. While the composition of this group varies slightly, several metrics consistently perform well regardless of the domain and computing device in question, such as MDSI, VSI, HaarPSI, DSS, GMSD and several others. Feature-based FR IQMs such as DISTS and PieAPP show high SRCC values on Natural Images but take a long time to compute on CPU. GPU acceleration plays the key role for these metrics, moving them into the upper-left group of top performers. The widely used PSNR and SSIM are fast to compute both on CPU and GPU, but they compromise IQA quality on both considered domains.

6 Conclusion and future work

The main contribution of our work is the PyTorch Image Quality (PIQ) Assessment toolbox [46] with a diverse set of measures, verified according to their formal definitions and the original authors’ implementations. The PIQ package facilitates the performance evaluation of any computer vision solution where an image-to-image task is performed. In addition, we provide the comparison of common Image Quality Metrics on the image datasets from the general and the medical domains, assessing the SRCC and the computation time.

The future development of the PyTorch Image Quality library will be aimed at supporting the latest trends and advances in the objective Image Quality Assessment and at the improvement of the usability and the scalability of the implemented algorithms.

7 Acknowledgments

We greatly appreciate all contributions from the members of the PIQ community, including questions, discussions, design suggestions, and technical implementations.

References

[1] A. M. et al. (2020) Comparison of Objective Image Quality Metrics to Expert Radiologists’ Scoring of Diagnostic Quality of MR Images. IEEE Transactions on Medical Imaging 39 (4), pp. 1064–1072. External Links: Document Cited by: §3, §4.
[2] S. Athar and Z. Wang (2019) A comprehensive performance evaluation of image quality assessment algorithms. IEEE Access 7, pp. 140030–140070. Cited by: §3, §4, Table 2.
[3] A. Balanov, A. Schwartz, Y. Moshe, and N. Peleg (2015) Image quality assessment based on dct subband similarity. In 2015 IEEE ICIP, pp. 2105–2109. External Links: Document Cited by: §3, Table 2.
[4] A. Balanov, A. Schwartz, and Y. Moshe (2016) Reduced-reference image quality assessment based on dct subband similarity. In 2016 Eighth International Conference on Quality of Multimedia Experience (QoMEX), pp. 1–6. Cited by: Appendix A.
[5] S. Barratt and R. Sharma (2018) A note on the inception score. arXiv preprint arXiv:1801.01973. Cited by: Appendix A.
[6] M. Binkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying MMD gans. In ICLR 2018, External Links: Link Cited by: Appendix A, §3, Table 2.
[7] D. M. Chandler and S. S. Hemami (2007) VSNR: a wavelet-based visual signal-to-noise ratio for natural images. IEEE transactions on image processing 16 (9), pp. 2284–2298. Cited by: Table 1.
[8] D. Chandler and S. Hemami (2007) A57 database. Cited by: Table 1.
[9] C. Chang and C. Lin (2011) LIBSVM: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST) 2 (3), pp. 1–27. Cited by: Appendix A.
[10] C. Chen (2021) IQA PyTorch. Note: https://github.com/chaofengc/IQA-PyTorch [Online; accessed 15-June-2022] Cited by: §1.
[11] G. Chen, C. Yang, L. Po, and S. Xie (2006) Edge-based structural similarity for image quality assessment. In 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Vol. 2, pp. II–II. Cited by: Appendix A.
[12] C. Chou and Y. Li (1995) A perceptually tuned subband image coder based on the measure of just-noticeable-distortion profile. IEEE Transactions on circuits and systems for video technology 5 (6), pp. 467–476. Cited by: Appendix A.
[13] S. Corchs and F. Gasparini (2017) A multidistortion database for image quality. In International Workshop on Computational Color Imaging, pp. 95–104. Cited by: §4, Table 1.
[14] S. Corchs and F. Gasparini (2017) Multiply distorted database md-ivl.. External Links: Link Cited by: Table 1.
[15] S. J. Daly (1992) Visible differences predictor: an algorithm for the assessment of image fidelity. In Human Vision, Visual Processing, and Digital Display III, Vol. 1666, pp. 2–15. Cited by: Appendix A.
[16] N. Damera-Venkata, T. D. Kite, W. S. Geisler, B. L. Evans, and A. C. Bovik (2000) Image quality assessment based on a degradation model. IEEE transactions on image processing 9 (4), pp. 636–650. Cited by: Appendix A.
[17] V. De Silva and G. E. Carlsson (2004) Topological estimation using witness complexes.. SPBG 4, pp. 157–166. Cited by: Appendix A.
[18] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli (2021) Comparison of full-reference image quality models for optimization of image processing systems. Int. J. Comput. Vis. 129 (4), pp. 1258–1281. External Links: Link, Document Cited by: Appendix A, §3.
[19] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli (2020) Image quality assessment: unifying structure and texture similarity. arXiv:2004.07728. Cited by: Appendix A, Appendix A, §3, Table 2.
[20] K. Ding, K. Ma, S. Wang, and E. P. Simoncelli (2021) Comparison of full-reference image quality models for optimization of image processing systems. International Journal of Computer Vision 129 (4), pp. 1258–1281. Cited by: §3.
[21] K. Ding (2020) IQA Optimization. Note: https://github.com/dingkeyan93/IQA-optimization [Online; accessed 15-June-2022] Cited by: §1.
[22] E.C.Larson and D.M.Chandler (2010) Computational and subjective image quality (csiq) database. External Links: Link Cited by: Table 1.
[23] U. Engelke, H.-J. Zepernick, and M. Kusuma (2010) Wireless imaging quality database. Note: https://computervisiononline.com/dataset/1105138665 [Online; accessed 03 March 2022] Cited by: Table 1.
[24] A. T. et al. (2020) The shape of data: intrinsic distance for data distributions. In ICLR 2020: Proceedings of the International Conference on Learning Representations, Cited by: Appendix A, §3, Table 2.
[25] C. S. et al. (2015) Going deeper with convolutions. In Proceedings of the IEEE CVPR, pp. 1–9. Cited by: §3, §3.
[26] N. P. et al. (2015) Image database tid2013: peculiarities, results and perspectives. Sig proces: Image com 30, pp. 57–77. Cited by: Appendix A, Appendix A, Figure 5, Figure 6, Appendix B, §4, §4, Table 1, Table 2.
[27] Y. Fang, H. Zhu, Y. Zeng, K. Ma, and Z. Wang (2020) Perceptual quality assessment of smartphone photography. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3677–3686. Cited by: Table 1.
[28] D. Ghadiyaram and A. C. Bovik (2015) LIVE In the Wild Image Quality Challenge Database. Note: http://live.ece.utexas.edu/research/ChallengeDB/index.html [Online; accessed 03 March 2022] Cited by: Table 1.
[29] D. Ghadiyaram and A. C. Bovik (2016) Massive online crowdsourced study of subjective and objective picture quality. IEEE Trans. Image Process. 25 (1), pp. 372–387. External Links: Link, Document Cited by: Table 1.
[30] J. Gu, H. Cai, H. Chen, X. Ye, R. Jimmy, and C. Dong (2020) Pipal: a large-scale image quality assessment dataset for perceptual image restoration. In European Conference on Computer Vision, pp. 633–651. Cited by: Appendix A.
[31] J. Gu, H. Cai, H. Chen, X. Ye, J. Ren, and C. Dong (2020) Image quality assessment for perceptual image restoration: a new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002. Cited by: Appendix A, Appendix A.
[32] J. Gu, H. Cai, H. Chen, X. Ye, J. Ren, and C. Dong (2020) Image quality assessment for perceptual image restoration: a new dataset, benchmark and metric. arXiv preprint arXiv:2011.15002. Cited by: Table 1.
[33] J. Gu, H. Cai, H. Chen, X. Ye, J. Ren, and C. Dong (2020) PIPAL: a large-scale image quality assessment dataset for perceptual image restoration. In European Conference on Computer Vision (ECCV) 2020, Cham, pp. 633–651. Cited by: Figure 5, Figure 6, Appendix B, §4, §4, Table 1, Table 2.
[34] J. Gu, H. Cai, C. Dong, J. S. Ren, Y. Qiao, S. Gu, R. Timofte, M. Cheon, S. Yoon, B. Kang, J. Lee, Q. Zhang, H. Guo, Y. Bin, Y. Hou, H. Luo, J. Guo, Z. Wang, H. Wang, W. Yang, Q. Bai, S. Shi, W. Xia, M. Cao, J. Wang, Y. Chen, Y. Yang, Y. Li, T. Zhang, L. Feng, Y. Liao, J. Li, W. Thong, J. C. Pereira, A. Leonardis, S. McDonagh, K. Xu, L. Yang, H. Cai, P. Sun, S. M. Ayyoubzadeh, A. Royat, S. A. Fezza, D. Hammou, W. Hamidouche, S. Ahn, G. Yoon, K. Tsubota, H. Akutsu, and K. Aizawa (2021) NTIRE 2021 Challenge on Perceptual Image Quality Assessment. arXiv. External Links: 2105.03072, ISBN 0.77440.7468, Link Cited by: §4, Table 1.
[35] K. Gu, G. Zhai, X. Yang, W. Zhang, and M. Liu (2013) Subjective and objective quality assessment for images with contrast change. In 2013 IEEE International Conference on Image Processing, pp. 383–387. Cited by: Appendix A.
[36] K. Gu, G. Zhai, X. Yang, and W. Zhang (2013) A new reduced-reference image quality assessment using structural degradation model. In 2013 IEEE international symposium on circuits and systems (ISCAS), pp. 1095–1098. Cited by: Appendix A.
[37] K. Gu, G. Zhai, X. Yang, and W. Zhang (2014) Hybrid no-reference quality metric for singly and multiply distorted images. IEEE Transactions on Broadcasting 60 (3), pp. 555–567. Cited by: Table 1.
[38] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Adv Neural Inform Process Syst, pp. 6626–6637. Cited by: Appendix A, §3, Table 2.
[39] Y. Horita, K. Shibata, Y. Kawayoke, and Z. P. Sazzad (2011) MICT image quality evaluation database. [Online], http://mict.eng.u-toyama.ac.jp/mictdb.html. Cited by: Table 1.
[40] V. Hosu, H. Lin, T. Sziranyi, and D. Saupe (2020) KonIQ-10k: an ecologically valid database for deep learning of blind image quality assessment. IEEE Transactions on Image Processing 29, pp. 4041–4056. Cited by: Table 1.
[41] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer (2016) SqueezeNet: alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360. Cited by: Appendix A.
[42] D. Jayaraman, A. Mittal, K. Moorthy, and A. C. Bovik (2012) LIVE multiply distorted image quality database.. External Links: Link Cited by: §4, Table 1.
[43] D. Jayaraman, A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) Objective quality assessment of multiply distorted images. In 2012 Conference record of the forty sixth asilomar conference on signals, systems and computers (ASILOMAR), pp. 1693–1697. Cited by: Table 1.
[44] J. Johnson, A. Alahi, and L. Fei-Fei (2016) Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711. Cited by: Appendix A, §3, Table 2.
[45] S. Kastryulin, J. Zakirov, N. Pezzotti, and D. V. Dylov (2022) Image quality assessment for magnetic resonance imaging. arXiv preprint arXiv:2203.07809. Cited by: Appendix B, §1, §3, §3, Figure 2, Figure 3, §4, §4, §5.
[46] S. Kastryulin, D. Zakirov, and D. Prokopenko (2019) PyTorch Image Quality: metrics and measure for image quality assessment. Note: Open source software available from https://github.com/photosynthesis-team/piq External Links: Link Cited by: §6.
[47] V. Khrulkov and I. V. Oseledets (2018) Geometry score: A method for comparing generative adversarial networks. In ICML, Proceedings of Machine Learning Research, Vol. 80, pp. 2626–2634. External Links: Link Cited by: Appendix A, §3, Table 2.
[48] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems 25. Cited by: Appendix A, Appendix A.
[49] E. C. Larson and D. M. Chandler (2010) Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of electronic imaging 19 (1), pp. 011006. Cited by: Appendix A, Table 1.
[50] P. Le Callet and F. Autrusseau (2005) Subjective quality assessment irccyn/ivc database. IEEE transactions on image processing. Cited by: Table 1.
[51] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: Appendix A.
[52] Y. Liang, J. Wang, X. Wan, Y. Gong, and N. Zheng (2016) Image quality assessment using similar scene as reference. In European Conference on Computer Vision, pp. 3–18. Cited by: Appendix A.
[53] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: Appendix A.
[54] H. Lin, V. Hosu, and D. Saupe (2018-03) KonIQ-10k: Towards an ecologically valid and large-scale IQA database. IEEE Transactions on Image Processing 29, pp. 4041–4056. External Links: Document, 1803.08489, Link Cited by: §4.
[55] H. Lin, V. Hosu, and D. Saupe (2019) KADID-10k: a large-scale artificially distorted iqa database. In 2019 Tenth Intern Conf on Quality of Multimedia Experience (QoMEX), pp. 1–3. Cited by: Appendix A, Appendix B, Figure 2, Figure 3, §4, §4, Table 1, Table 2, §5.
[56] H. Lin, V. Hosu, and D. Saupe (2020) DeepFL-iqa: weak supervision for deep iqa feature learning. arXiv preprint arXiv:2001.08113. Cited by: Table 1.
[57] A. Liu, W. Lin, and M. Narwaria (2012) Image quality assessment based on gradient similarity. IEEE Transactions on Image Processing 21 (4), pp. 1500–1512. Cited by: Appendix A.
[58] X. Liu, M. Pedersen, and J. Y. Hardeberg (2014) CID: IQ - A new image quality database. In Image and Signal Processing - 6th International Conference, ICISP 2014, Cherbourg, France, June 30 - July 2, 2014. Proceedings, A. Elmoataz, O. Lezoray, F. Nouboud, and D. Mammass (Eds.), Lecture Notes in Computer Science, Vol. 8509, pp. 193–202. External Links: Link, Document Cited by: §4, Table 1.
[59] W. T. Lunardi, M. A. Lopez, and J. Giacalone (2022) ARCADE: adversarially regularized convolutional autoencoder for network anomaly detection. arXiv preprint arXiv:2205.01432. Cited by: §1.
[60] L. Ma, W. Lin, C. Deng, and K. N. Ngan (2012) Image retargeting quality assessment: A study of subjective scores and objective metrics. IEEE J. Sel. Top. Signal Process. 6 (6), pp. 626–639. External Links: Link, Document Cited by: Table 1.
[61] J. Mannos and D. Sakrison (1974) The effects of a visual fidelity criterion of the encoding of images. IEEE transactions on Information Theory 20 (4), pp. 525–536. Cited by: Appendix A.
[62] A. Mittal, A. K. Moorthy, and A. C. Bovik (2012) No-reference image quality assessment in the spatial domain. IEEE Transactions on Image Processing 21 (12), pp. 4695–4708. External Links: Link, Document Cited by: §3, Table 2.
[63] I. Motoyoshi, S. Nishida, L. Sharan, and E. H. Adelson (2007) Image statistics and the perception of surface qualities. Nature 447 (7141), pp. 206–209. Cited by: Appendix A.
[64] H. Z. Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet (2016) Mean deviation similarity index: efficient and reliable full-reference image quality evaluator. IEEE Access 4, pp. 5579–5590. External Links: Link, Document Cited by: §3, Table 2.
[65] H. Z. Nafchi, A. Shahkolaei, R. Hedjam, and M. Cheriet (2016) Mean deviation similarity index: efficient and reliable full-reference image quality evaluator. Ieee Access 4, pp. 5579–5590. Cited by: Appendix A.
[66] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. External Links: Link Cited by: §1, §2, §3.
[67] N. Ponomarenko, V. Lukin, A. Zelensky, K. Egiazarian, M. Carli, and F. Battisti (2009) TID2008-a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern Radioelectronics 10 (4), pp. 30–45. Cited by: Table 1.
[68] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen (2018) Pieapp: perceptual image-error assessment through pairwise preference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1808–1817. Cited by: Appendix A, §3, Table 2.
[69] R. Reisenhofer, S. Bosse, G. Kutyniok, and T. Wiegand (2018) A haar wavelet-based perceptual similarity index for image quality assessment. Signal Processing: Image Communication 61, pp. 33–43. Cited by: Appendix A.
[70] R. Reisenhofer, S. Bosse, G. Kutyniok, and T. Wiegand (2018) A haar wavelet-based perceptual similarity index for image quality assessment. Signal Process. Image Commun. 61, pp. 33–43. External Links: Link, Document Cited by: §3, Table 2.
[71] E. Riba, D. Mishkin, D. Ponsa, E. Rublee, and G. Bradski (2020) Kornia: an open source differentiable computer vision library for pytorch. In Winter Conference on Applications of Computer Vision, External Links: Link Cited by: §1.
[72] F. Rozet (2020) PyTorch Image Quality Assessment. Note: https://github.com/francois-rozet/piqa [Online; accessed 15-June-2022] Cited by: §1.
[73] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Adv Neural Inform Process Syst, pp. 2234–2242. Cited by: Appendix A, §3, Table 2.
[74] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on Image Processing 15 (11), pp. 3440–3451. External Links: Document, ISSN 10577149 Cited by: §4.
[75] H. R. Sheikh, Z. Wang, L. Cormack, and A. C. Bovik (2006) LIVE image quality assessment database release 2. External Links: Link Cited by: Table 1.
[76] H. R. Sheikh and A. C. Bovik (2005) A visual information fidelity approach to video quality assessment. In The First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Vol. 7, pp. 2. Cited by: Appendix A, §3, Table 2.
[77] H. R. Sheikh, M. F. Sabir, and A. C. Bovik (2006) A statistical evaluation of recent full reference image quality assessment algorithms. IEEE Transactions on image processing 15 (11), pp. 3440–3451. Cited by: Appendix A, §4, Table 1.
[78] E. P. Simoncelli, W. T. Freeman, E. H. Adelson, and D. J. Heeger (1992) Shiftable multiscale transforms. IEEE transactions on Information Theory 38 (2), pp. 587–607. Cited by: Appendix A.
[79] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, External Links: Link Cited by: Appendix A, Appendix A, §3, §3.
[80] J. P. Smith (2015) What makes a good software library?. Note: https://www.thereformedprogrammer.net/what-makes-a-good-software-library [Online; accessed 15-June-2022] Cited by: §2.
[81] D. Söllinger (2017) Blind Image Quality Toolbox. Note: https://github.com/dsoellinger/blind_image_quality_toolbox [Online; accessed 15-June-2022] Cited by: §1.
[82] C. Strauss, F. Pasteau, F. Autrusseau, M. Babel, L. Bédat, and O. Déforges (2009) Subjective and objective quality evaluation of lar coded art images. In Proceedings of the 2009 IEEE International Conference on Multimedia and Expo, ICME 2009, June 28 - July 2, 2009, New York City, NY, USA, pp. 674–677. External Links: Link, Document Cited by: Table 1.
[83] W. Sun, F. Zhou, and Q. Liao (2017) MDID: a multiply distorted image database for image quality assessment. Pattern Recognition 61, pp. 153–168. Cited by: §4, Table 1.
[84] T. Virtanen, M. Nuutinen, M. Vaahteranoksa, P. Oittinen, and J. Häkkinen (2015) CID2013: A database for evaluating no-reference image quality assessment algorithms. IEEE Trans. Image Process. 24 (1), pp. 390–402. External Links: Link, Document Cited by: Table 1.
[85] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE TIP 13 (4), pp. 600–612. Cited by: Appendix A, §2, §3, Table 2.
[86] Z. Wang and A. C. Bovik (2006) Modern image quality assessment. Synthesis Lectures on Image, Video, and Multimedia Processing 2 (1), pp. 1–156. Cited by: Appendix A, Appendix A, Appendix A.
[87] Z. Wang and Q. Li (2010) Information content weighting for perceptual image quality assessment. IEEE TIP 20 (5), pp. 1185–1198. Cited by: Appendix A, Appendix A, §3, Table 2.
[88] Z. Wang, E. P. Simoncelli, and A. C. Bovik (2003) Multiscale structural similarity for image quality assessment. In The 37th Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398–1402. Cited by: Appendix A, §3, Table 2.
[89] Z. Wang and A. C. Bovik (2009) Mean squared error: love it or leave it? a new look at signal fidelity measures. IEEE signal processing magazine 26 (1), pp. 98–117. Cited by: Appendix A, Appendix A, Appendix A.
[90] J. Wu, W. Lin, G. Shi, L. Li, and Y. Fang (2016) Orientation selectivity based visual pattern for reduced-reference image quality assessment. Information Sciences 351, pp. 18–29. Cited by: Appendix A.
[91] Q. Xing (2021) Image Quality Assessment Toolbox. Note: https://github.com/ryanxingql/image-quality-assessment-toolbox [Online; accessed 11-April-2021] Cited by: §1.
[92] W. Xue, L. Zhang, X. Mou, and A. C. Bovik (2013) Gradient magnitude similarity deviation: a highly efficient perceptual image quality index. IEEE TIP 23 (2), pp. 684–695. Cited by: Appendix A, §3, Table 2.
[93] H. Yang, Y. Fang, and W. Lin (2015) Perceptual quality assessment of screen content images. IEEE Transactions on Image Processing 24 (11), pp. 4408–4421. Cited by: Table 1.
[94] Z. Ying, H. Niu, P. Gupta, D. Mahajan, D. Ghadiyaram, and A. Bovik (2019) From patches to pictures (paq-2-piq): mapping the perceptual space of picture quality. arXiv. External Links: Document, Link Cited by: Table 1.
[95] A. Zaric, N. Tatalovic, N. Brajkovic, H. Hlevnjak, M. Loncaric, E. Dumic, and S. Grgic (2011) VCL@fer image quality assessment database. In Proceedings ELMAR-2011, pp. 105–110. External Links: Document Cited by: §4.
[96] A. Zarić, N. Tatalović, N. Brajković, H. Hlevnjak, M. Lončarić, E. Dumić, and S. Grgić (2012) VCL@ fer image quality assessment database. AUTOMATIKA: časopis za automatiku, mjerenje, elektroniku, računarstvo i komunikacije 53 (4), pp. 344–354. Cited by: Table 1.
[97] A. Zarić (2012) VCLFER image quality assessment database. External Links: Link Cited by: Table 1.
[98] B. Zhang, P. V. Sander, and A. Bermak (2017) Gradient magnitude similarity deviation on multiple scales for color image quality assessment. In 2017 IEEE ICASSP, pp. 1253–1257. External Links: Document Cited by: §3, Table 2.
[99] B. Zhang, P. V. Sander, and A. Bermak (2017) Gradient magnitude similarity deviation on multiple scales for color image quality assessment. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1253–1257. Cited by: Appendix A.
[100] L. Zhang, Y. Shen, and H. Li (2014) VSI: a visual saliency-induced index for perceptual image quality assessment. IEEE Transactions on Image processing 23 (10), pp. 4270–4281. Cited by: Appendix A, §3, Table 2.
[101] L. Zhang, L. Zhang, X. Mou, and D. Zhang (2011) FSIM: a feature similarity index for image quality assessment. IEEE TIP 20 (8), pp. 2378–2386. Cited by: Appendix A, §3, §3, Table 2.
[102] L. Zhang and H. Li (2012) SR-sim: a fast and high performance iqa index based on spectral residual. In 2012 19th IEEE international conference on image processing, pp. 1473–1476. Cited by: §3, §3, Table 2.
[103] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang (2018) The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference CVPR, pp. 586–595. Cited by: Appendix A, §3, Table 2.
[104] D. Zoran and Y. Weiss (2009) Scale invariance and noise in natural images. In 2009 IEEE 12th International Conference on Computer Vision, pp. 2209–2216. Cited by: Appendix A.

Appendix A Appendix. Quality Metrics

Figure 4: General taxonomy of IQA metrics. Metrics in italics are available in the PIQ package.

Historically, the first works on perceptual full-reference IQA appeared almost half a century ago, with the pioneering work of Mannos and Sakrison [61] focusing on a class of visual fidelity criteria in the context of image encoding. Over the past few decades, a number of alternative models mimicking certain functionalities of the Human Visual System (HVS) have been proposed.

Such approaches cannot model the real HVS, which is a complex and highly nonlinear system: most models rely on simplifications and strong assumptions (e.g., linearity or quasi-linearity for visual stimuli) and exhibit shortcomings regarding the definition of visual quality, quantification of suprathreshold distortions, and generalization to natural images [86]. IQMs can be grouped into categories according to their main computation mechanism, as shown in Figure 4.

In our work, we choose to evaluate representative methods from those categories. A short description of the design philosophies is given below.

Error Based Methods

Point-by-point comparison between pixels or convolution responses (e.g., wavelets, CNNs) is the simplest way of measuring perceptual quality.

MSE [85], the Mean Squared Error ($ℓ_{2}$-norm), and the closely related PSNR [86], the Peak Signal-to-Noise Ratio, are the most frequently used quality metrics and a de facto standard way of measuring image quality. MSE is easy to use, has a clear physical meaning (energy of image distortions), satisfies Parseval's theorem, and can be used for algorithm optimization leading to closed-form solutions [86, 89]. PSNR is defined as the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of the signal representation [86].

MSE and PSNR have been repeatedly shown to correlate poorly with human judgements in controlled experiments [26, 55]. The main reason for this is four strong underlying assumptions about visual quality [89]: 1. Independence of spatial relationships between samples, 2. Independence of relationship between signal and error, 3. Independence of sign of error samples, 4. Equal importance of all signal samples and errors. None of these assumptions holds in practice, and none accurately describes the human visual system.
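As a minimal sketch of these two definitions (not the PIQ implementation), MSE and PSNR can be written in a few lines of PyTorch. The function names, the (N, C, H, W) tensor layout, and the `data_range` argument are assumptions made for illustration.

```python
import torch


def mse_sketch(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-image mean squared error for batches shaped (N, C, H, W)."""
    return ((x - y) ** 2).mean(dim=(1, 2, 3))


def psnr_sketch(x: torch.Tensor, y: torch.Tensor, data_range: float = 1.0) -> torch.Tensor:
    """Per-image PSNR in dB: ratio of peak signal power to noise power (the MSE).

    `data_range` is the maximum possible pixel value (1.0 for images in [0, 1]).
    Identical images give MSE = 0 and therefore an infinite PSNR.
    """
    return 10.0 * torch.log10(data_range ** 2 / mse_sketch(x, y))


# Usage: a constant error of 0.1 on a [0, 1] image gives MSE = 0.01.
reference = torch.zeros(1, 3, 8, 8)
distorted = torch.full((1, 3, 8, 8), 0.1)
print(mse_sketch(reference, distorted), psnr_sketch(reference, distorted))
```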

A number of subsequent papers addressed the weaknesses of PSNR and modified it to better suit IQA. WSNR [16], the Weighted Signal-to-Noise Ratio, used a contrast sensitivity function to approximate the HVS and assign different weights to signal and noise components, leading to a linear quality measure. NQM [16], the Noise Quality Measure, used a nonlinear quasi-local processing model of the HVS to accomplish quality assessment.

IW-PSNR [87], the Information Weighted PSNR, continues the idea of applying additional weights to account for the importance of different image areas. It uses information-theoretic principles to weight regions of visual content that are perceptually more important than others, either due to the visual attention property of the HVS or due to the influence of distortions.

MAD [49], the Most Apparent Distortion measure, explicitly models adaptive strategies of the human visual system. For high-quality images with only near-threshold distortions a detection-based strategy is employed, and an appearance-based strategy is activated if the distortions are clearly visible. The results are then combined into a single score by a weighting scheme in which the importance of each strategy depends on the distortion strength.

Following the success in image classification [48] and recent developments in deep learning [53, 51], convolutional neural networks (CNNs) became a popular basis for constructing IQA methods. A double-path CNN was introduced in the pioneering work of Liang et al. [52], which proposed to crop images into small patches and propagate them through dedicated network branches to obtain patch scores. The final image quality score was obtained by averaging the predicted patch-wise values. The model was trained by regression to MOS values from the TID2013 [26] database.

Another popular approach, proven to be useful for many image processing tasks, is the use of generic features obtained from pre-trained networks, such as AlexNet [48], VGG [79], SqueezeNet [41] and others. Perceptual Loss and Style Score were introduced by Johnson et al. [44]. Perceptual Loss used a single-path CNN trained on ImageNet ILSVRC 2012 [48] to extract deep feature representations of reference and distorted images. Those features were then compared by taking the MSE, averaging the error over spatial and feature dimensions. Style Score extracts and compares texture information from the images by computing Gram matrices between features from a CNN and taking the MSE between them.

In LPIPS [103], the Learned Perceptual Image Patch Similarity, Zhang et al. pointed out that methods based on deep features significantly outperform traditional algorithms on a wide range of distortions. LPIPS computes the distance between feature representations on multiple levels, similar to Perceptual Loss, building on the premise that different layers represent different structures of the image. To combine the features properly, they are first unit-normalized to have similar scales and then summed with weights learned on the BAPPS [103] dataset to minimize the error between the model and human preferences over two images.

PieAPP [68], the Perceptual Image-Error Assessment through Pairwise Preference metric, uses a pairwise-learning framework to predict the preference of one distorted image over the other. PieAPP extracts features with a feature-extraction (FE) network and then computes the similarity using a score-computation (SC) network. Both FE and SC are trained from scratch, targeting human pairwise preferences between images.
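The two comparisons described above can be sketched on already-extracted feature maps. This is an illustration under stated assumptions rather than the implementation of [44] or of PIQ: the function names are hypothetical, the Gram matrix normalization by C·H·W is one common convention, and in practice the feature tensors would come from layers of a pre-trained CNN.

```python
import torch


def content_distance(feat_x: torch.Tensor, feat_y: torch.Tensor) -> torch.Tensor:
    """MSE between feature maps (N, C, H, W), averaged over channels and space."""
    return ((feat_x - feat_y) ** 2).mean(dim=(1, 2, 3))


def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlations of a feature map, shape (N, C, C).

    Normalizing by C * H * W is an assumed convention, not taken from the source.
    """
    n, c, h, w = feat.shape
    flat = feat.reshape(n, c, h * w)
    return flat @ flat.transpose(1, 2) / (c * h * w)


def style_distance(feat_x: torch.Tensor, feat_y: torch.Tensor) -> torch.Tensor:
    """MSE between the Gram matrices of two feature maps."""
    return ((gram_matrix(feat_x) - gram_matrix(feat_y)) ** 2).mean(dim=(1, 2))


# Usage with random stand-in features (a real metric would use CNN activations).
features = torch.randn(2, 4, 8, 8)
mirrored = torch.flip(features, dims=[3])
print(content_distance(features, mirrored), style_distance(features, mirrored))
```

Because the Gram matrix sums over spatial positions, mirroring a feature map changes the content distance but leaves the style distance at zero up to rounding, which is why this statistic captures texture rather than layout.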
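The normalize-then-weight step can be sketched as follows. This is only an illustration of the idea: the function name is hypothetical, the uniform default weights are a placeholder for the weights that LPIPS learns on BAPPS, and the feature lists are assumed to come from several layers of a pre-trained network.

```python
import torch


def normalized_feature_distance(feats_x, feats_y, weights=None, eps: float = 1e-10):
    """Sum over layers of weighted squared differences of unit-normalized features.

    feats_x, feats_y: lists of tensors shaped (N, C_l, H_l, W_l), one per layer.
    weights: optional list of per-channel weights shaped (C_l,); uniform weights
             are used as a placeholder when none are given.
    """
    total = 0.0
    for layer, (fx, fy) in enumerate(zip(feats_x, feats_y)):
        # Unit-normalize along the channel dimension so layers have similar scales.
        fx = fx / (fx.norm(dim=1, keepdim=True) + eps)
        fy = fy / (fy.norm(dim=1, keepdim=True) + eps)
        sq_diff = (fx - fy) ** 2
        if weights is None:
            w = torch.ones(fx.shape[1], dtype=fx.dtype, device=fx.device)
        else:
            w = weights[layer]
        # Weighted sum over channels, then average over spatial positions.
        total = total + (sq_diff * w.view(1, -1, 1, 1)).sum(dim=1).mean(dim=(1, 2))
    return total
```

With unit normalization, rescaling the activations of one image by a positive constant leaves the distance essentially unchanged, so layers with large activations do not dominate the sum.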

SWDN [31], the Space Warping Difference Network, is specifically designed to handle geometric distortions. When comparing two image features, a new Space Warping Difference layer takes into account not only pixels at corresponding positions, as in all pixel-wise methods, but also a small range around them.

Motivated by the above-mentioned results, a number of other FR-IQA algorithms have been proposed relying on different deep features, pooling strategies and pre-trained CNNs [30, 19, 31].

Structural Similarity Based Methods

IQMs that use certain properties of the HVS can be divided into two groups: top-down and bottom-up. In the previous section we described bottom-up approaches to IQA design, which use properties of the human visual system to modify the error-based MSE and PSNR so that they better simulate different components of the HVS, for example, by adding adaptation to luminance, contrast sensitivity, and contrast masking [15, 12].

In contrast, approaches following top-down IQA design try to mimic the functionality of the HVS as a whole and do not model its individual components. Structural similarity is a perception-based model that considers image degradation as a perceived change in structural information, while also including luminance and contrast masking terms. These methods are based on the idea that pixels have strong inter-dependencies, especially when they are spatially close.

SSIM [86], the Structural Similarity Index Measure, is the most popular top-down approach which has become a de-facto standard in the field of perceptual image processing (along with PSNR) and has inspired subsequent IQA models based on feature similarity [89].

The main idea behind SSIM is to split image distortions into structural and non-structural ones and to focus on the former, as the latter are less noticeable to the HVS. The comparison is done between luminance (average pixel intensity), contrast (standard deviation of the local image regions) and structure (cross-correlation values between two local image regions) components, which are later combined into a quality map and pooled into a single score by averaging the local quality scores. The main disadvantage of SSIM is that it takes into account only a single image scale, and thus cannot adapt to different sets of viewing conditions.
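A compact single-scale sketch for grayscale batches illustrates how the three components are computed from local statistics and combined. It is not the PIQ implementation: the function names are hypothetical, and the 11x11 Gaussian window with sigma 1.5 and the constants k1 = 0.01, k2 = 0.03 are the conventional defaults, assumed here rather than taken from the text above.

```python
import torch
import torch.nn.functional as F


def gaussian_window(size: int = 11, sigma: float = 1.5) -> torch.Tensor:
    """Normalized 2D Gaussian kernel shaped (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return (g[:, None] * g[None, :])[None, None]


def ssim_sketch(x, y, data_range: float = 1.0, k1: float = 0.01, k2: float = 0.03):
    """Single-scale SSIM for grayscale batches shaped (N, 1, H, W)."""
    window = gaussian_window().to(x)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    # Local means (luminance), variances (contrast) and covariance (structure).
    mu_x, mu_y = F.conv2d(x, window), F.conv2d(y, window)
    var_x = F.conv2d(x * x, window) - mu_x ** 2
    var_y = F.conv2d(y * y, window) - mu_y ** 2
    cov_xy = F.conv2d(x * y, window) - mu_x * mu_y
    # Local quality map, then average pooling into one score per image.
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
    return ssim_map.mean(dim=(1, 2, 3))
```

The closed form used for `ssim_map` is the usual simplification in which the contrast and structure terms are merged into a single factor.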

MS-SSIM [88], the Multi-scale SSIM, measures SSIM on 5 different scales, computing contrast and structure on all levels, while measuring luminance only at the final scale. The resulting scores are then combined through a weighted product using weights adjusted on a dataset of human mean opinion scores.
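For reference, the weighted product is commonly written as follows, where $l_{M}$ is the luminance comparison at the coarsest scale, $c_{j}$ and $s_{j}$ are the contrast and structure comparisons at scale $j$, $M = 5$ is the number of scales, and $\alpha_{M}$, $\beta_{j}$, $\gamma_{j}$ are the per-scale exponents (the symbols are introduced here for illustration):

$$\text{MS-SSIM}(x, y) = \left[l_{M}(x, y)\right]^{\alpha_{M}} \prod_{j=1}^{M} \left[c_{j}(x, y)\right]^{\beta_{j}} \left[s_{j}(x, y)\right]^{\gamma_{j}}$$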

IW-SSIM [87], the Information content Weighted SSIM, is an extension of MS-SSIM that adds content weights computed from information-theoretic principles.

After the success of SSIM, many new works searched for effective image features that can be used to describe contrast, structural information and textures.

ESSIM [11], the Edge-based SSIM, uses first-order difference operators to compare information between image edges and claims that edges are the most important structural information for the HVS.

GSM [57], the Gradient Similarity Measure, computes image gradients in horizontal and vertical directions to use those features as an input to the structural similarity computation.

GMSD [92], the Gradient Magnitude Similarity Deviation, is focused on computational efficiency of quality predictions. Based on the idea that global variation of image local quality degradation can reflect its overall quality, authors proposed to compute the standard deviation of the pixel-wise gradient similarity map as an IQA index. This method is, however, problematic because an image with a large but constant local distortion yields a standard deviation of zero, indicating the best predicted quality.
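The deviation pooling described above can be sketched as follows. This is an illustration under assumptions, not the reference or PIQ implementation: the function names are hypothetical, Prewitt filters are used for the gradients, the stability constant `c` is an assumed value for images in [0, 1], and the initial downsampling step of the original method is omitted.

```python
import torch
import torch.nn.functional as F


def gradient_magnitude(x: torch.Tensor) -> torch.Tensor:
    """Prewitt gradient magnitude for grayscale batches shaped (N, 1, H, W)."""
    kx = torch.tensor([[1.0, 0.0, -1.0]] * 3, dtype=x.dtype, device=x.device) / 3
    kernels = torch.stack([kx, kx.t()])[:, None]  # (2, 1, 3, 3): horizontal, vertical
    grads = F.conv2d(x, kernels, padding=1)
    return torch.sqrt((grads ** 2).sum(dim=1, keepdim=True) + 1e-12)


def gmsd_sketch(x: torch.Tensor, y: torch.Tensor, c: float = 170 / 255 ** 2) -> torch.Tensor:
    """Standard deviation of the pixel-wise gradient magnitude similarity map.

    Lower is better; identical images give 0. `c` is an assumed constant.
    """
    gm_x, gm_y = gradient_magnitude(x), gradient_magnitude(y)
    gms_map = (2 * gm_x * gm_y + c) / (gm_x ** 2 + gm_y ** 2 + c)
    return gms_map.std(dim=(1, 2, 3))
```

Since the score is the spread of the similarity map rather than its mean, any map that is constant across the image yields zero regardless of its value, which is the limitation noted above.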

MS-GMSD [99], the Multi-scale GMSD, accounts for the variations in viewing conditions by measuring GMSD on 5 different scales, similarly to MS-SSIM. The chromatic version, named MS-GMSDc, additionally measures chrominance dissimilarity on the last scale, motivated by the lower HVS sensitivity to colour distortions.

FSIM [101], the Feature Similarity Index Measure, is built on the assumption that low-level features obtained in the early stage of HVS information processing are used for understanding the image content. Two core features used in similarity computations are phase congruency, which is a contrast-invariant dimensionless measure of the local structure, and gradient magnitude maps obtained by the Scharr filter. During pooling, the phase congruency component serves as an adaptive local weighting factor to derive an overall visual quality score. The color-sensitive version of FSIM is called FSIMc [101]. It first converts RGB images into the YIQ color space, and then computes FSIM on the luminance channel and chrominance similarity maps on the I and Q channels.

VSI [100], the Visual Saliency Induced quality index, states that the change of salience in degraded areas is the major predictor of image quality. The visual saliency index is computed using phase congruency and two simple priors (color temperature and center priors). Gradient magnitude maps are used as an additional feature, and the pooling strategy mimics that of FSIM. VSI shows a good correlation with human judgments on localized distortions, such as compression artefacts or local patch substitutions.

Recently proposed HaarPSI [69], the Haar perceptual similarity index, decomposes both distorted and reference images into Haar wavelets and computes the structural similarity between magnitudes of high-frequency coefficients. The last level of Haar wavelet is used to weight the importance of different regions.

MDSI [65], the Mean Deviation Similarity Index, combines gradient similarity, chrominance similarity and deviation pooling (used in GMSD) into a single IQA model. MDSI contains 7 configurable parameters that are jointly optimized on a mean opinion scores dataset to provide the best correlation results.

DISTS [19], the Deep Image Structure and Texture Similarity measure, is a deep-learning based IQA algorithm that follows LPIPS design principles. Image representations are extracted from the pre-trained convolutional neural network (VGG16 [79]) and combined using SSIM-like structure and texture similarity measurements. It is sensitive to structural distortions but at the same time robust to texture resampling and modest geometric transformations [18].

DSS, the DCT Subbands Similarity, aims to measure changes in structural information using sub-bands in the discrete cosine transform (DCT) domain. DSS extracts features in block-based DCT subbands and measures the distances between corresponding DCT sub-bands. The final quality index is computed by pooling those distances together with weights. The main motivation behind quality assessment in the DCT domain is the observation that the statistics of DCT coefficients change with the degree and type of image distortion.

Natural Scene Statistics based

These methods attempt to measure some approximation of the mutual information between the perceived reference and distorted images as an indication of perceptual image quality. Statistical modeling of the image source, the distortion process, and the HVS is critical in algorithm development.

Visual Information Fidelity (VIF) [76] uses a Gaussian scale mixture to statistically model the wavelet coefficients of a steerable pyramid decomposition of an image [78]. After that, it predicts the distorted image quality by quantifying the amount of preserved information from the reference image. Its spatial-domain version, named VIFp, computes fidelity on raw pixels.

Distribution Statistics based

This class of methods originated from the domain of generative modeling, where evaluation by a direct comparison is not possible and only the distance between model distributions can be measured.

Fréchet Inception Distance (FID) [38] models the embeddings of the two data distributions as continuous multivariate Gaussians. The mean and covariance are estimated for both, and the Fréchet distance between these two Gaussians (a.k.a. Wasserstein-2 distance) is then used to quantify the quality of the generated sample.

Kernel Inception Distance (KID) [6], sometimes referred to as Maximum Mean Discrepancy (MMD), computes the dissimilarity between two probability distributions $P$ and $Q$ by measuring the squared MMD for some fixed characteristic kernel function (e.g., Gaussian kernel).
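The distance between the two fitted Gaussians has a closed form, $\lVert \mu_{1} - \mu_{2} \rVert^{2} + \mathrm{Tr}\big(\Sigma_{1} + \Sigma_{2} - 2(\Sigma_{1}\Sigma_{2})^{1/2}\big)$, which the sketch below evaluates on two sets of feature vectors. The function name is hypothetical and the feature extractor is left out; the trace of the matrix square root is computed from the eigenvalues of $\Sigma_{1}\Sigma_{2}$, one of several possible numerical choices and not necessarily the one used in PIQ.

```python
import torch


def frechet_distance(feats_x: torch.Tensor, feats_y: torch.Tensor) -> torch.Tensor:
    """Frechet distance between Gaussians fitted to features shaped (N, D)."""
    x, y = feats_x.double(), feats_y.double()
    mu_x, mu_y = x.mean(dim=0), y.mean(dim=0)
    # torch.cov expects variables in rows, so the (N, D) matrices are transposed.
    cov_x, cov_y = torch.cov(x.T), torch.cov(y.T)
    # Tr((cov_x cov_y)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_x @ cov_y, which are real and non-negative up to rounding.
    eigvals = torch.linalg.eigvals(cov_x @ cov_y)
    trace_sqrt = torch.sqrt(eigvals).real.sum()
    mean_term = ((mu_x - mu_y) ** 2).sum()
    return mean_term + torch.trace(cov_x) + torch.trace(cov_y) - 2 * trace_sqrt
```

Shifting every feature vector by a constant changes only the mean term, so for features of dimension D and a shift of 3 in every coordinate the distance is 9D.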
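A minimal sketch of an unbiased squared-MMD estimate with the Gaussian kernel mentioned above is given below. The function names and the bandwidth `sigma` are assumptions made for illustration, the inputs are assumed to be feature vectors already extracted by some network, and the kernel choice follows the example in the text rather than any particular implementation.

```python
import torch


def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise Gaussian kernel values between rows of a (M, D) and b (N, D)."""
    sq_dist = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dist / (2 * sigma ** 2))


def mmd2_unbiased(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Unbiased estimate of squared MMD between samples x (M, D) and y (N, D)."""
    m, n = x.shape[0], y.shape[0]
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    # Within-set terms exclude the diagonal (each sample compared with itself).
    sum_xx = k_xx.sum() - k_xx.diagonal().sum()
    sum_yy = k_yy.sum() - k_yy.diagonal().sum()
    return sum_xx / (m * (m - 1)) + sum_yy / (n * (n - 1)) - 2 * k_xy.mean()
```

For two samples drawn from the same distribution the estimate fluctuates around zero (and may be slightly negative), while well-separated distributions give a clearly positive value.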

Multi-Scale Intrinsic Distance (MSID) [24] develops an intrinsic and multi-scale method for characterizing and comparing data manifolds, using a lower-bound of the spectral Gromov-Wasserstein inter-manifold distance, which compares all data moments.

Geometry Score (GS) [47] constructs a performance measure by comparing geometrical properties of the underlying data manifolds. In particular, a topological approximation for the computation of connected components ("holes") in homology using a witness complex [17] is introduced.

No-Reference Opinion Aware Methods

No-Reference (NR) IQA methods evaluate the quality of a distorted image in the absence of any reference information [77]. Thus they are also referred to as blind IQA methods.

Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) operates on locally normalized luminance values in the spatial domain, called Mean Subtracted Contrast Normalized (MSCN) coefficients. Various features are extracted from the MSCN coefficients and their pairwise products, which are then used to estimate Generalized Gaussian Distribution (GGD) and Asymmetric GGD parameters. A support vector regression (SVR) model is then trained to map the features to the quality score.

Inception score (IS) [73] uses the pre-trained InceptionNet model for feature extraction and captures two properties of generated samples: diversity with respect to class labels and high classifiability.
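The first stage, computing MSCN coefficients, can be sketched as a local mean subtraction followed by division by the local standard deviation. The function names, the 7x7 Gaussian window with sigma 7/6, and the stabilizing constant `c` are assumptions for images in [0, 1]; the GGD fitting and the regression stage are not shown.

```python
import torch
import torch.nn.functional as F


def gaussian_window(size: int = 7, sigma: float = 7 / 6) -> torch.Tensor:
    """Normalized 2D Gaussian kernel shaped (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return (g[:, None] * g[None, :])[None, None]


def mscn_coefficients(x: torch.Tensor, kernel_size: int = 7, c: float = 1 / 255) -> torch.Tensor:
    """MSCN coefficients for grayscale batches shaped (N, 1, H, W).

    Zero padding keeps the output the same size as the input, at the cost of
    less reliable values near the image borders.
    """
    window = gaussian_window(kernel_size).to(x)
    pad = kernel_size // 2
    local_mean = F.conv2d(x, window, padding=pad)
    local_var = F.conv2d(x * x, window, padding=pad) - local_mean ** 2
    local_std = torch.sqrt(local_var.clamp(min=0))
    return (x - local_mean) / (local_std + c)
```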
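Both properties are captured by the average KL divergence between the per-image class distribution and the marginal class distribution, exponentiated to give the score. The sketch below starts from a matrix of class probabilities and is an assumption-laden illustration: the function name and `eps` are hypothetical, and the classifier that would produce the probabilities is omitted.

```python
import torch


def inception_score_sketch(probs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Score from class probabilities shaped (N, K), each row summing to 1.

    Computes exp(mean over images of KL(p(y|x) || p(y))), where p(y) is the
    average of the rows. `eps` guards the logarithm against zero probabilities.
    """
    marginal = probs.mean(dim=0, keepdim=True)
    kl = (probs * (torch.log(probs + eps) - torch.log(marginal + eps))).sum(dim=1)
    return torch.exp(kl.mean())


# Usage: confident predictions spread evenly over 4 classes give a score near 4,
# while identical uniform predictions give a score of 1.
confident = torch.eye(4).repeat(5, 1)
uniform = torch.full((20, 4), 0.25)
print(inception_score_sketch(confident), inception_score_sketch(uniform))
```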

Improved Inception score (IS’) [5] proposed a different way of feature aggregation, which improved both calculation and interpretability of the Inception Score.

Reduced Reference Methods

Reduced-reference (RR) image quality assessment methods estimate the quality of distorted images using only partial information about the reference images when full-reference approaches are infeasible. Partial information represented by extracted features should be relevant to the human judgement of image quality and sensitive to a variety of image distortions. Consequently, the challenge of a reduced-reference metric is to find an optimal trade-off between the amount of information represented by extracted features and their correlation with the HVS.

RR-DSS [4] is a reduced-reference image quality measure based on DCT sub-band similarity. It adapts the full-reference DSS quality assessment measure, which assumes that the HVS is adapted to extracting structural information. RR-DSS uses only a few of the lowest-frequency sub-bands for quality assessment; to maintain good IQA results, it should use at least 3 to 10 sub-bands. RR-DSS uses spatial down-sampling to reduce the amount of information about the reference image. As typical distortions are usually spread over the image, uniform down-sampling of the local variances preserves important features of the distortions. However, down-sampling makes it impossible to compute the cross-correlation for the DC sub-band. Therefore, the local similarity score for the DC sub-band is computed in the same way as for the AC sub-bands. RR-DSS showed high correlation with subjective results on average and outperformed most other metrics examined in [4], both RR and FR, including SSIM and MS-SSIM. In addition, the method has a simple implementation and incurs low computational complexity, while the trade-off between side information and image quality estimation accuracy can be adjusted according to the task.

SDM [36], a reduced-reference IQA using Structural Degradation Model, consists of two main stages: acquiring structural degradation information for distorted and original images and consecutive aggregation into a single score. Structural degradation information (SD) can be defined as structural similarity indexes between mean features and variance features obtained using kernels with various variation values computed for different parts of the images. The SD information is used to derive distances between distorted and reference images, which are linearly combined using regression model parameters optimised by training. In particular, one can use SVM [9] for regression as a model and get an S-SDM score. The SDM approach relies on various spatial responses providing low computational complexity and fast execution. However, optimisation of model parameters leads to dependency on the dataset used for training, which might limit the generalisation properties of the metric between different image domains.

RIQMC [35], a reduced-reference image quality metric for contrast-changed images, aims to evaluate image quality according to contrast features of a picture, such as the skewness and kurtosis of the image histogram, which correlate with HVS perception [63, 104]. RIQMC is based on the first four order statistics, which are evaluated for the distorted image, and the entropy values of the distorted and reference images. The final value of RIQMC is a linear combination of the entropy and the first four order statistics. Even though the RIQMC metric needs only the entropy of the reference image, it outperforms full-reference metrics such as PSNR, SSIM, MS-SSIM, IW-SSIM, and MAD on the CID2013 dataset and a TID2008 subset according to PLCC and SROCC [35]. However, the performance of RIQMC may vary for general image distortions, as the method was proposed exclusively for contrast-changed images.
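The ingredients of such a score can be sketched as follows. This is a simplified illustration rather than the formulation from [35]: the order statistics are interpreted here as the mean, variance, skewness, and kurtosis of the pixel intensities, the entropy term is taken as the difference between reference and distorted entropies, and the weights and function names are hypothetical placeholders, not fitted values.

```python
import math
from collections import Counter

def entropy(pixels):
    """Shannon entropy (in bits) of the intensity histogram of a flat pixel list."""
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(pixels).values())

def first_four_moments(pixels):
    """Mean, variance, skewness, and kurtosis of the pixel intensities."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    if var == 0:
        return mean, 0.0, 0.0, 0.0  # a flat image has no shape statistics
    std = math.sqrt(var)
    skew = sum(((p - mean) / std) ** 3 for p in pixels) / n
    kurt = sum(((p - mean) / std) ** 4 for p in pixels) / n
    return mean, var, skew, kurt

def riqmc_like_score(ref_entropy, distorted, weights=(1.0, 0.0, 0.0, 0.0, 0.0)):
    """Linear combination of an entropy term and four moments (weights are hypothetical)."""
    features = [ref_entropy - entropy(distorted)]
    features.extend(first_four_moments(distorted))
    return sum(w * f for w, f in zip(weights, features))
```

Only ref_entropy, a single number, has to be transmitted from the reference side, which is what makes the setting reduced-reference.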

OSVP [90], an orientation selectivity-based visual pattern IQA, is a reduced-reference measure inspired by the orientation selectivity (OS) mechanism that the human visual system uses to extract visual content. When visual content is observed, the input signal interacts with the visual cortex depending on the spatial arrangement in local perceptual fields, generating OS visual patterns. To represent the OS mechanism, gradient directions are extracted per pixel, and the spatial relationship between each pixel and its local neighbours is encoded as an OS-based visual pattern (OSVP). Next, the content of an image is mapped into a histogram according to these spatial relationship patterns. The final score is calculated from the changes between the histograms of the reference and distorted images. Inspired by neuroscience findings on the orientation selectivity mechanism, OSVP showed performance consistent with HVS perception on five publicly available databases while using limited reference data.
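The pipeline described above (gradient directions, neighbour comparison, pattern histogram, histogram change) can be sketched in a few lines. This is a simplified illustration, not the method of [90]: the four-neighbour layout, the angular threshold of pi/6, the L1 histogram distance, and all function names are assumptions, and images are assumed to be 2D lists of grey levels of size at least 3 x 3.

```python
import math

def gradient_directions(img):
    """Per-pixel gradient direction (radians) from differences with clamped borders."""
    h, w = len(img), len(img[0])
    dirs = []
    for y in range(h):
        row = []
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            row.append(math.atan2(gy, gx))
        dirs.append(row)
    return dirs

def pattern_histogram(img, threshold=math.pi / 6):
    """Normalised histogram of 4-bit patterns; a bit is set when a neighbour
    has a similar gradient direction (threshold is a hypothetical choice)."""
    dirs = gradient_directions(img)
    h, w = len(img), len(img[0])
    hist = [0.0] * 16
    offsets = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # up, right, down, left
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            code = 0
            for bit, (dy, dx) in enumerate(offsets):
                diff = abs(dirs[y][x] - dirs[y + dy][x + dx])
                diff = min(diff, 2 * math.pi - diff)  # wrap the angular difference
                if diff < threshold:
                    code |= 1 << bit
            hist[code] += 1.0
    total = sum(hist)
    return [v / total for v in hist] if total else hist

def osvp_like_score(ref, dist):
    """L1 distance between pattern histograms; 0 means identical pattern statistics."""
    ref_hist, dist_hist = pattern_histogram(ref), pattern_histogram(dist)
    return sum(abs(a - b) for a, b in zip(ref_hist, dist_hist))
```

Under these assumptions, only the 16-bin reference histogram needs to be available at the receiver side, rather than the reference image itself.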

Appendix B Additional Benchmarks

The main part of the work considered the performance–complexity trade-off for the KADID10k [55] and MRI reconstruction [45] datasets. Here we present additional benchmarks for two commonly used datasets, TID2013 [26] and PIPAL [33], on CPU (Fig. 5) and GPU (Fig. 6).
