nnOOD: A Framework for Benchmarking Self-supervised Anomaly Localisation Methods

Matthew Baugh (ORCID 0000-0001-6252-7658) 1, Jeremy Tan (ORCID 0000-0002-9769-068X) 1, Athanasios Vlontzos (ORCID 0000-0002-7672-2574) 1, Johanna P. Müller (ORCID 0000-0001-8636-7986) 2, Bernhard Kainz (ORCID 0000-0002-7813-5023) 1,2
1 Imperial College London, SW7 2AZ, London, UK
2 Friedrich–Alexander University Erlangen–Nürnberg, DE
matthew.baugh17@imperial.ac.uk

Abstract

The wide variety of in-distribution and out-of-distribution data in medical imaging makes universal anomaly detection a challenging task. Recently a number of self-supervised methods have been developed that train end-to-end models on healthy data augmented with synthetic anomalies. However, it is difficult to compare these methods as it is not clear whether gains in performance come from the task itself or from the training pipeline around it. It is also difficult to assess whether a task generalises well for universal anomaly detection, as tasks are often tested on only a limited range of anomalies. To assist with this we have developed nnOOD, a framework that adapts nnU-Net to allow for comparison of self-supervised anomaly localisation methods. By isolating the synthetic, self-supervised task from the rest of the training process we perform a more faithful comparison of the tasks, whilst also making the workflow for evaluating over a given dataset quick and easy. Using this we have implemented the current state-of-the-art tasks and evaluated them on a challenging X-ray dataset.

Keywords: Anomaly Localisation, Self-supervised Learning

1 Introduction

Out-of-distribution detection, i.e. learning a normative distribution from a single class and classifying anomalous test samples without supervised training, is a notoriously challenging task. Many methods struggle in scenarios where humans can easily detect an outlier. Examples include failures to detect potholes under different lighting conditions [4], to spot a jet ski in a road scene [7], or to determine whether a picture of a cat is a dog [9]. The task becomes even more difficult in medical imaging, where abnormalities frequently go unseen by expert observers, often because of inattentional blindness [8]. This issue worsens in high-pressure environments such as emergency and trauma care. There, the main cause of misdiagnosis is the misinterpretation of radiographs, with miss rates of up to 80% [18]. In these situations an automated tool that acts as a second reader, alerting clinicians to any unusual features, would be useful.

Many publications in the field of medical anomaly detection limit their experiments to datasets with a narrow range of abnormalities [3, 15, 20, 24], raising questions regarding their ability to generalise to anomalies seen in other medical applications or modalities. Recently there has been a trend of training end-to-end anomaly detection models using synthetic anomalies to alter healthy data. These methods exhibit good performance on both manufacturing [21] and medical tasks [24, 25], with the majority of the top submissions to the recent MICCAI 2021 Medical Out-of-Distribution challenge (MOOD) [29] being trained in this manner. However, as these tasks are synthetic by nature, it is even more important that they are thoroughly tested, as there is a risk of the task being overly tuned to the target evaluation dataset.

Contribution: In this paper we present the nnOOD framework. It builds on top of the nnU-Net framework [12], maintaining the core principle of adapting the architecture to the provided dataset, but gearing it toward self-supervised anomaly localisation. This makes validating a synthetic anomaly task on a given dataset as simple as organising a dataset for nnU-Net. We also provide a common interface for self-supervised tasks, simplifying the process of creating and applying a new method. We hope that making this paradigm of anomaly detection more accessible can encourage more interest in this research area. Ultimately, we hope that nnOOD provides a standardised framework for comparing these methods fairly. Another goal is to gain insight into what makes a synthetic task useful and elucidate why a certain task fails or succeeds in different situations. Source code, including a guide on applying the framework to a new dataset, is available at https://github.com/matt-baugh/nnOOD.

Related Work: A common way to perform anomaly localisation is by measuring the difference between an image and its reconstruction. The assumption is that a model trained on normal data will not be able to reproduce anomalous regions, leading to larger deviations in those regions. This method lends itself most easily to autoencoder-based architectures [3, 15], but has also been applied using generative adversarial networks [11, 20]. However, reconstruction loss often fails as models are not always able to reconstruct healthy regions in potentially unhealthy samples [2], and anomalies with extreme textures but a normal intensity distribution are difficult to identify [16]. Sample-level anomaly detection can be done as an auxiliary task using a classification network, although performance can vary greatly between training epochs [28]. Methods using normalizing flows [10, 27] are currently the state-of-the-art benchmarks on the MVTec [6] dataset. Unfortunately, these strategies require a pre-trained model, which is often not available for medical imaging tasks, and high performance on computer vision tasks does not guarantee similar performance on medical imaging tasks [5].

Foreign Patch Interpolation (FPI) [24] was the first method to train end-to-end models for out-of-distribution localisation using synthetic anomalies. FPI creates subtle anomalies by interpolating a patch of the current sample with a patch extracted from the same location of a different sample. The model is then trained to predict the pixel-wise interpolation factor, causing the model to learn a score correlated with how anomalous the pixel was. This had good performance on both brain MRI and abdominal CT data, winning the MICCAI 2020 MOOD challenge [29] in both the sample-wise and pixel-wise categories. Poisson image interpolation (PII) was introduced to mitigate the discontinuities at the edges of FPI’s patches [25], allowing for a more seamless blend between the patch and its surroundings. However, there was still limited variation in the synthetic anomalies, due to the interpolated patches being extracted from the same location in the secondary image. Independently, CutPaste [13] used a similar sort of augmentation by copying a patch from one place and pasting it at a random different location within the same image. The patches are sometimes altered through rotation or jitter in the pixel values. Rather than training end-to-end, [13] trained a one-class classifier, using a Gaussian density estimator on the output to enable outlier detection, and applying GradCAM [22] for anomaly localisation. Natural synthetic anomalies (NSA) [21] combines the aforementioned image augmentations, further increasing the variety of synthetic samples by resizing the patches and randomising the number of patches introduced in an image. Arguing that in previous methods the difference between the distributions of the blended patches can cause the same label to be applied to vastly different levels of abnormality, they opted to use a scaled logistic function applied to the mean absolute intensity difference across each channel.
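To make the FPI idea above concrete, the following is a minimal sketch for a single-channel 2D image. It is an illustration rather than the reference implementation: the function name, the rectangular patch shape, the patch-size fractions and the range of the interpolation factor are all assumptions chosen for the example.

```python
import numpy as np


def foreign_patch_interpolation(x, x_other, rng, min_frac=0.1, max_frac=0.4):
    """FPI-style synthetic anomaly for a 2D image (illustrative sketch).

    A rectangular patch of `x` is interpolated with the patch at the SAME
    location in `x_other`. The pixel-wise label is the interpolation factor
    inside the patch and 0 elsewhere. Size fractions are assumed values.
    """
    h, w = x.shape
    # Random patch size, as a fraction of the image size (assumption).
    ph = int(rng.integers(int(min_frac * h), int(max_frac * h) + 1))
    pw = int(rng.integers(int(min_frac * w), int(max_frac * w) + 1))
    # Random patch location that lies fully within the image.
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))
    # Random interpolation factor (range is an assumption).
    alpha = float(rng.uniform(0.05, 0.95))

    x_aug = x.astype(np.float32).copy()
    label = np.zeros_like(x_aug)
    region = (slice(top, top + ph), slice(left, left + pw))
    x_aug[region] = (1.0 - alpha) * x[region] + alpha * x_other[region]
    label[region] = alpha
    return x_aug, label


rng = np.random.default_rng(0)
x = np.zeros((64, 64), dtype=np.float32)        # stand-in "current" sample
x_other = np.ones((64, 64), dtype=np.float32)   # stand-in "foreign" sample
x_aug, label = foreign_patch_interpolation(x, x_other, rng)
```

A model trained on `(x_aug, label)` pairs regresses the interpolation factor per pixel, which is the quantity the text describes as correlating with how anomalous a pixel is.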

The results of the MICCAI 2021 MOOD challenge [29] displayed the success of these techniques, with the majority of the presented works being models trained end-to-end with synthetic anomalies. Despite their achievements, many of those methods faced practical challenges in engineering the synthetic task. For example, many 3D methods used image resizing which leads to loss of information and obscures small anomalies. In addition, some methods apply non-overlapping patches to the larger abdominal data, resulting in prediction artefacts around the edges.

We see the structure of nnU-Net [12] as a natural solution to these issues. The nnU-Net framework uses a set of heuristic rules to dynamically adapt a U-Net [19] to a given biomedical image segmentation dataset. Combining these architectural decisions with a solid pipeline of adaptive preprocessing, extensive data augmentation, model ensembling and aggregating tiled predictions, nnU-Net consistently performs well across a wide range of tasks. The ease with which the framework can be applied has made semantic segmentation more accessible, even to those without machine learning expertise.

2 Method

In this section we discuss adaptations to the well-known nnU-Net pipeline that make it suitable for anomaly detection. Our method, summarised graphically in Figure 1, differs significantly from others such as nnDetection [1] because we aim for pixel-wise predictions, which allows for a greater overlap with the original nnU-Net pipeline.

Figure 1: Overview of the nnOOD framework. The green components are entirely new to nnOOD, orange components differ significantly from their nnU-Net counterparts and grey components have only minor changes.

One of the primary challenges of applying the nnU-Net framework to anomaly detection is overfitting to the synthetic task. This is expected, as nnU-Net is intended for use on the same task at training and test time. Therefore, it is designed to perform its training task as accurately as possible, using heavy data augmentation to avoid overfitting to the specific training data. One way we reduce overfitting to a synthetic task is in model selection. Originally, nnU-Net uses a five-fold cross-validation process across three different network configurations: a 2D U-Net, a 3D U-Net, and, if the data is sufficiently large, a 3D U-Net cascade. The final model is then chosen as the model, or ensemble of two models, which achieves the highest mean foreground Dice score on the training set cross-validation. In our framework we do not assume we have access to a validation set of real anomalies, and we do not want to select a model based on performance on synthetic data. Instead, we select a model which matches the dimensionality of the data, i.e. using n-dimensional models for n-dimensional data, a choice that performed well in nnU-Net’s experiments. We also omit the cascade setting because the dynamic nature of self-supervised tasks is less amenable to cascaded training.

Another area that requires restrictions is the training regime. The original nnU-Net training schedule is much longer than the training duration seen in many self-supervised methods. For instance, the training schedule in Schlüter et al. [21] terminates well before the earliest stopping point of nnU-Net, after less than a third of the number of iterations. To mitigate this we add early stopping based on a moving average of the average precision score (AP) on the synthetic data validation set. We found that a threshold of 0.875 allows the models to learn useful features without overfitting to the fine-grained details of the synthetic tasks. If a model fails to reach this threshold we utilise the original nnU-Net early stopping based on the loss plateauing as a backup. We train each method to an equal level of validation performance (on its own task) so that we always learn each task to the same extent, regardless of its difficulty. By contrast, using a set number of epochs for every task would likely lead to overtraining for simple tasks, while undertraining on more complex ones.
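The AP-based stopping rule can be sketched as below. This is a simplified illustration under stated assumptions: the class name is hypothetical, the moving average is a plain windowed mean with an arbitrary window length, and the loss-plateau fallback inherited from nnU-Net is not shown. Only the 0.875 threshold comes from the text.

```python
from collections import deque


class APEarlyStopping:
    """Stop once a moving average of validation AP reaches a threshold.

    Hypothetical helper: the window length and the use of a simple
    windowed mean are assumptions, not details taken from the paper.
    """

    def __init__(self, threshold=0.875, window=5):
        self.threshold = threshold
        self.history = deque(maxlen=window)

    def update(self, val_ap):
        """Record this epoch's synthetic-validation AP; return True to stop."""
        self.history.append(val_ap)
        window_full = len(self.history) == self.history.maxlen
        moving_avg = sum(self.history) / len(self.history)
        return window_full and moving_avg >= self.threshold


stopper = APEarlyStopping(threshold=0.875, window=3)
decisions = [stopper.update(ap) for ap in [0.5, 0.8, 0.9, 0.9, 0.9]]
```

Averaging over several epochs means a single unusually good validation epoch does not end training prematurely.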

Although data augmentation is one of the key factors for nnU-Net’s success, we cannot naïvely apply it, as many of the augmentations carry the risk of moving the sample out of the distribution of the normal data. By learning invariance to an augmentation, the representation is trained to ignore it, preventing it from being identified as anomalous. For example, a model trained with random rotations and translations would see misaligned data as normal. Instead we opt to allow the user to define which augmentations are “safe” as part of the dataset description. This allows us to utilise as much data augmentation as possible within valid parameters for the given dataset.

The appearance of normal data can vary vastly depending on the location within the image. This leads to an inconsistent training signal for the model when attempting to learn to identify normality within incomplete patches. This is particularly true when applying synthetic tasks such as CutPaste, where the patch is anomalous specifically because it appears at a different location within the image. To allow the model to perceive the spatial context when evaluating a patch we incorporate a positional encoding, similar to CoordConv [14]. An additional channel is concatenated per spatial dimension of the data, with values ranging from -1 to 1, representing the coordinate value of that pixel for that dimension. This has been shown to assist convergence in other anomaly detection tasks [23].
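A minimal sketch of this positional encoding, assuming channel-first arrays of shape (C, *spatial), is given below. The function name is hypothetical. In a patch-based pipeline the coordinates would presumably be computed on the full image and cropped together with the patch, so that they encode position within the whole image rather than within the patch.

```python
import numpy as np


def add_coordinate_channels(image):
    """Append one coordinate channel per spatial dimension.

    `image` has shape (C, *spatial). Each added channel runs linearly
    from -1 to 1 along its own axis and is constant along the others.
    """
    spatial_shape = image.shape[1:]
    axes = [np.linspace(-1.0, 1.0, num=size, dtype=np.float32)
            for size in spatial_shape]
    # indexing="ij" keeps the axis order aligned with the array layout.
    coords = np.meshgrid(*axes, indexing="ij")
    return np.concatenate([image.astype(np.float32), np.stack(coords)], axis=0)


img = np.zeros((1, 4, 5), dtype=np.float32)   # one-channel 2D example
encoded = add_coordinate_channels(img)        # shape (3, 4, 5)
```

The same function applies unchanged to 3D volumes, where three coordinate channels are appended.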

When selecting a patch to train on, nnU-Net chooses uniformly from all patches that lie fully within the image bounds. This leads to the model rarely learning from regions towards the edges of the images, as only a small proportion of valid locations include them. This is an issue at inference time, where many patches are taken from the boundaries of the image and are then more likely to be considered anomalous because such regions were rarely observed during training. To rectify this, we randomly select from the inference patch locations at training time. As the synthetic anomalies are often quite small relative to the size of the image, we also oversample the anomalies during training. For 30% of each batch we choose the random patch location such that the centre of at least one anomaly is present within the patch.
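The sampling scheme can be sketched as follows. The grid construction is loosely modelled on a sliding-window tiling with 50% overlap; that overlap, the function names and the fallback when no patch contains an anomaly centre are assumptions made for illustration. The code also assumes the image is at least as large as the patch along every axis.

```python
import numpy as np


def inference_patch_starts(image_shape, patch_shape, step_frac=0.5):
    """Start coordinates of a sliding-window grid that covers the image."""
    starts_per_axis = []
    for size, patch in zip(image_shape, patch_shape):
        max_start = size - patch
        step = max(1, int(patch * step_frac))
        n_steps = int(np.ceil(max_start / step)) + 1
        axis_starts = np.round(np.linspace(0, max_start, n_steps)).astype(int)
        starts_per_axis.append(np.unique(axis_starts))
    grid = np.meshgrid(*starts_per_axis, indexing="ij")
    return np.stack([g.ravel() for g in grid], axis=1)   # shape (N, ndim)


def sample_patch_start(starts, patch_shape, anomaly_centres, rng, oversample=False):
    """Pick one grid location; optionally force it to contain an anomaly centre."""
    if oversample and len(anomaly_centres) > 0:
        centres = np.asarray(anomaly_centres)             # shape (M, ndim)
        lower = starts[:, None]                           # shape (N, 1, ndim)
        upper = lower + np.asarray(patch_shape)
        inside = (centres[None] >= lower) & (centres[None] < upper)
        contains = inside.all(axis=2).any(axis=1)         # shape (N,)
        if contains.any():                                # else fall back to all
            starts = starts[contains]
    return starts[rng.integers(len(starts))]


rng = np.random.default_rng(0)
starts = inference_patch_starts((100, 100), (40, 40))
# For a batch, roughly the first 30% of samples would set oversample=True.
chosen = sample_patch_start(starts, (40, 40), [(90, 90)], rng, oversample=True)
```

Because training locations are drawn from the same grid used at test time, border patches are seen as often during training as they are at inference.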

When integrating the synthetic task into the framework we want to be as flexible as possible, to avoid pigeonholing future tasks into the structure of the current ones. Formally, we define a synthetic anomaly task as:

$$(\tilde{x}, \tilde{y}) = f_{\theta}(x_i, x_j, m_i, m_j), \qquad \theta = c(X, P) \tag{1}$$

where $f_{\theta}$ is the anomaly generating function, producing an augmented sample $\tilde{x}$ with pixel-wise label $\tilde{y}$ from two samples of the distribution of normal data, $x_i$ and $x_j$. $m_i$ and $m_j$ denote the foreground masks for the corresponding samples, which are provided if present. These are created by applying a simple Sobel operator on the image, calculating the magnitude of the gradient and using region growing from a number of locations around the image corners. We chose to use the image gradient to determine the background as different modalities have different background intensities, but most will be uniform with a low gradient magnitude. Hence we only apply this if the dataset description indicates that the dataset has a consistent, uniform background. An example of this would be in brain MRI, where the background occupies a large portion of the image, but we would not attempt to generate foreground masks for X-ray data, such as ChestX-ray14, as they are normally already cropped to the region of interest. The parameters of the task, $\theta$, are determined by a calibration function $c$ when applied to the dataset of normal data $X$ and the current experiment plan $P$. The calibration function is necessary because although some tasks have constant parameters, others (such as NSA [21]) use different parameters depending on the dataset. We place no further restrictions on the implementation of these functions, allowing users to utilise or ignore the provided arguments as they see fit to produce a useful synthetic anomaly task.

To provide more flexibility, we allow the user to define the loss function used when training with their task, which is given the raw network logits and the synthetic label as input. They can similarly define the inference function, which is applied to the network logits at test time to produce pixel-wise anomaly scores.
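One way to picture the resulting task interface is the sketch below. It is not the API of the nnOOD repository: the class and method names are hypothetical, and the toy task exists only to show how a calibration step, an anomaly generating function, a loss and an inference function fit together.

```python
from abc import ABC, abstractmethod

import numpy as np


def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


class SyntheticAnomalyTask(ABC):
    """Hypothetical interface mirroring the task definition in the text."""

    def calibrate(self, dataset, plan):
        """Set task parameters from the normal data and experiment plan.

        Tasks with constant parameters can keep this default no-op.
        """

    @abstractmethod
    def generate(self, x_i, x_j, m_i=None, m_j=None):
        """Return (augmented sample, pixel-wise label) from two normal samples."""

    @abstractmethod
    def loss(self, logits, label):
        """Training loss given raw network logits and the synthetic label."""

    @abstractmethod
    def inference(self, logits):
        """Map raw logits to pixel-wise anomaly scores at test time."""


class ToyBlendTask(SyntheticAnomalyTask):
    """Toy example only: blends x_j into the foreground of x_i."""

    def __init__(self):
        self.alpha = 0.5

    def calibrate(self, dataset, plan):
        # "alpha" is an invented plan entry, used purely for illustration.
        self.alpha = float(plan.get("alpha", 0.5))

    def generate(self, x_i, x_j, m_i=None, m_j=None):
        mask = np.ones(x_i.shape, dtype=bool) if m_i is None else m_i.astype(bool)
        label = np.where(mask, self.alpha, 0.0)
        return (1.0 - label) * x_i + label * x_j, label

    def loss(self, logits, label):
        # Binary cross-entropy against the soft label.
        p, eps = _sigmoid(logits), 1e-7
        return float(-np.mean(label * np.log(p + eps)
                              + (1.0 - label) * np.log(1.0 - p + eps)))

    def inference(self, logits):
        return _sigmoid(logits)


task = ToyBlendTask()
task.calibrate(dataset=[], plan={"alpha": 0.25})
foreground = np.zeros((4, 4), dtype=bool)
foreground[1:3, 1:3] = True
x_aug, y = task.generate(np.zeros((4, 4)), np.ones((4, 4)), m_i=foreground)
```

Keeping the loss and inference functions on the task object lets each method pair its labels with a matching objective, while the surrounding training pipeline stays identical across tasks.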

Finally, if the synthetic task happens to follow the structure of patch-based methods such as FPI and CutPaste, we provide a compartmentalised framework to build such tasks. This divides the task into the initial creation of the patch shape, a sequence of patch transformations, which can be spatial or alter the pixel values, the blending of the patch into the destination image and the labelling of the resulting image. This makes it much easier to tweak existing tasks and isolate the factors that contribute the most to performance.

3 Experiments and Results

Synthetic Tasks: As an initial baseline for future methods we have implemented the FPI [24], CutPaste [13], PII [25] and NSA [21] tasks using our framework. Due to their simplicity, FPI and PII can be directly generalised to arbitrary dimensions and applied to any sort of input; however, the specialised nature of the other two methods requires some changes. For CutPaste this is primarily due to its use of colour jitter, which applies brightness, contrast, saturation and hue transformations. We omit the saturation and hue operations as the concepts do not translate to other modalities. For contrast, we move the pixel intensity values towards the mean across each channel (as opposed to the weighted average used for colour images) and for brightness we take the global minimum of the dataset as the zero value for scaling. Adapting NSA was a more involved task due to the number of hyperparameters that were originally chosen for each dataset by visual inspection of the generated anomalies. At a high level, we base the maximum number of anomalies, the bounds on the size of each dimension, and the minimum object area included in the extracted patch on the average foreground dimensions (treating the entire image as foreground if no background is present). NSA also converts absolute intensity differences into labels that conform to a logistic function. To automate this, we create 100 anomalies and calculate shape and scale parameters such that anomalous regions translate to labels with a lower bound of 0.1 that saturate at the 40th percentile. We experimented with using both source and mixed gradients, denoted as NSA and NSAMixed respectively.

Data: We evaluate these tasks on ChestX-ray14 [26], a public chest X-ray dataset covering 14 common thorax disease categories as well as healthy samples. We use the same training distribution as [25]: posteroanterior (PA) views of healthy adult patients, divided by gender to create two healthy training datasets, with 17,852 male and 14,720 female samples respectively. For the test dataset we perform the same filtering on the unhealthy data, but further restrict it to samples that provide pathology bounding boxes. This leaves us with 245 male and 217 female test samples. As there are no pixel-wise annotations provided we treat the bounding boxes as anomaly masks. For preprocessing each sample is normalised to have zero mean and unit standard deviation. Note that we did not need to resize the images due to the patch-based nature of our framework.

Results: Table 1 shows our comparison of the different self-supervised tasks, with Fig. 2 displaying example test images and their predicted anomaly maps. These scores demonstrate the challenging nature of the dataset. Inexact ground truth bounding boxes and class imbalance make it difficult to achieve high pixel-level average precision (calculated using scikit-learn [17]). For reference, a random classifier (0.5 AUROC) would achieve 0.074 and 0.063 AP for the male and female datasets respectively.
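The random-classifier reference values follow from the fact that, for uninformative scores, the expected AP is approximately the fraction of positive pixels. The sketch below checks this numerically on simulated data. It uses a small NumPy implementation of AP rather than the scikit-learn function so that it has no extra dependencies, and the prevalence of 0.074 is simply the male-dataset reference value quoted above.

```python
import numpy as np


def average_precision(y_true, y_score):
    """AP as the mean precision at the rank of each positive (no score ties)."""
    order = np.argsort(-y_score, kind="stable")
    hits = y_true[order].astype(bool)
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].mean())


rng = np.random.default_rng(0)
prevalence = 0.074                      # fraction of anomalous pixels (from text)
n_pixels = 200_000                      # simulated pixel count (assumption)
y_true = rng.random(n_pixels) < prevalence
y_score = rng.random(n_pixels)          # scores unrelated to the labels
random_ap = average_precision(y_true, y_score)
```

With these settings `random_ap` lands close to the simulated positive fraction, which is why AP values in Table 1 should be read relative to the class imbalance of each dataset rather than on an absolute scale.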

| Dataset   | Metric | FPI   | CutPaste | PII   | NSA   | NSAMixed |
|-----------|--------|-------|----------|-------|-------|----------|
| Male PA   | AUROC  | 0.515 | 0.484    | 0.554 | 0.718 | 0.714    |
| Male PA   | AP     | 0.075 | 0.071    | 0.084 | 0.162 | 0.167    |
| Female PA | AUROC  | 0.490 | 0.446    | 0.615 | 0.699 | 0.698    |
| Female PA | AP     | 0.064 | 0.060    | 0.086 | 0.139 | 0.133    |

Table 1: Pixel-wise metrics comparing models trained with different anomaly tasks using the nnOOD framework. AUROC: Area Under the Receiver Operating Characteristic curve; AP: Average Precision score.

The FPI and CutPaste tasks do not seem to help the models identify medical anomalies. This is most likely because sharp, image-aligned discontinuities are unlikely to appear in real pathologies. Both of these methods generally predict low scores across the images (Fig. 2), resulting in performance similar to that of a random classifier. On the other hand, tasks which seamlessly blend their synthetic anomalies into the target image (PII, NSA, NSAMixed) help more. Although these approaches may not reach supervised performance, they are able to learn useful features without any exposure to real anomalies.

Interestingly, the choice of synthetic task substantially changed the time taken to reach the AP threshold. The difference in average training times reflects how easily the anomalies can be seen. CutPaste uses very obvious anomalies (27.3 epochs). NSA and NSAMixed blend their patches more seamlessly (88.5 and 119.7 epochs). FPI's anomalies are more subtle because the extracted patch is not moved (272.5 epochs), and PII’s addition of Poisson image blending to that formula increases the subtlety even further (312.7 epochs).

Figure 2: Examples of test predictions on each X-ray dataset: (a) male posteroanterior dataset, (b) female posteroanterior dataset. The disease labels are keywords extracted from the sample’s radiologist report [26].

4 Conclusion

In this paper we present a framework that makes self-supervised anomaly localisation more accessible and facilitates evaluation on a unified platform. By automating the training configuration independently from the synthetic task, we are able to compare the true ability of each method under more controlled settings and free from unequal hyperparameter tuning. Using this framework, we compare the current state-of-the-art methods and show that there is still much room for improvement. We hope that nnOOD will enable further investigation of self-supervised, synthetic anomaly localisation methods across a wider variety of modalities. Our modular design also serves as a foundation for continued research in paradigms other than patch blending, such as using deformations.

In our experiments, we focused on anomaly localisation at the pixel level. Although sample-level detection is often reported, these scores sometimes inflate performance. We believe that pixel-level evaluation better reflects the usefulness of these methods in clinical practice. For example, an accurate sample-level score with poor localisation may actually mislead a clinician to pursue a tangential diagnosis. This is particularly concerning in anomaly detection, where scores do not correspond to any specific disease classes. We hope that nnOOD will help facilitate future developments in anomaly detection and hold them to a higher standard, so that the field as a whole can move closer to real, beneficial tools.

Acknowledgements: This work was supported by the UKRI London Medical Imaging and Artificial Intelligence Centre for Value Based Healthcare.

References

  • [1] M. Baumgartner, P. F. Jäger, F. Isensee, and K. H. Maier-Hein (2021) nnDetection: a self-configuring method for medical object detection. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Cham, pp. 530–539. ISBN 978-3-030-87240-3.
  • [2] C. Baur, S. Denner, B. Wiestler, N. Navab, and S. Albarqouni (2021) Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Medical Image Analysis 69, pp. 101952. ISSN 1361-8415.
  • [3] C. Baur, B. Wiestler, S. Albarqouni, and N. Navab (2020) Scale-space autoencoders for unsupervised anomaly segmentation in brain MRI. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2020, A. L. Martel, P. Abolmaesumi, D. Stoyanov, D. Mateus, M. A. Zuluaga, S. K. Zhou, D. Racoceanu, and L. Joskowicz (Eds.), Cham, pp. 552–561. ISBN 978-3-030-59719-1.
  • [4] H. Bello-Salau, A. J. Onumanyi, A. T. Salawudeen, M. B. Mu’azu, and A. M. Oyinbo (2019) An examination of different vision based approaches for road anomaly detection. In 2019 2nd International Conference of the IEEE Nigeria Computer Chapter (NigeriaComputConf), pp. 1–6.
  • [5] C. Berger, M. Paschali, B. Glocker, and K. Kamnitsas (2021) Confidence-based out-of-distribution detection: a comparative study and analysis. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis, C. H. Sudre, R. Licandro, C. Baumgartner, A. Melbourne, A. Dalca, J. Hutter, R. Tanno, E. Abaci Turk, K. Van Leemput, J. Torrents Barrena, W. M. Wells, and C. Macgowan (Eds.), Cham, pp. 122–132. ISBN 978-3-030-87735-4.
  • [6] P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger (2021) The MVTec anomaly detection dataset: a comprehensive real-world dataset for unsupervised anomaly detection. International Journal of Computer Vision 129 (4), pp. 1038–1059. ISSN 1573-1405.
  • [7] R. Chan, K. Lis, S. Uhlemeyer, H. Blum, S. Honari, R. Siegwart, M. Salzmann, P. Fua, and M. Rottmann (2021) SegmentMeIfYouCan: a benchmark for anomaly segmentation. arXiv preprint arXiv:2104.14812.
  • [8] T. Drew, M. L. Võ, and J. M. Wolfe (2013) The invisible gorilla strikes again: sustained inattentional blindness in expert observers. Psychological Science 24 (9), pp. 1848–1853.
  • [9] I. Golan and R. El-Yaniv (2018) Deep anomaly detection using geometric transformations. Advances in Neural Information Processing Systems 31.
  • [10] D. Gudovskiy, S. Ishizaka, and K. Kozuka (2022) CFLOW-AD: real-time unsupervised anomaly detection with localization via conditional normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 98–107.
  • [11] C. Han, L. Rundo, K. Murao, T. Noguchi, Y. Shimahara, Z. Á. Milacski, S. Koshino, E. Sala, H. Nakayama, and S. Satoh (2021) MADGAN: unsupervised medical anomaly detection GAN using multiple adjacent brain MRI slice reconstruction. BMC Bioinformatics 22 (2), pp. 1–20.
  • [12] F. Isensee, P. F. Jaeger, S. A. A. Kohl, J. Petersen, and K. H. Maier-Hein (2021) nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods 18 (2), pp. 203–211. ISSN 1548-7105.
  • [13] C. Li, K. Sohn, J. Yoon, and T. Pfister (2021) CutPaste: self-supervised learning for anomaly detection and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9664–9674.
  • [14] R. Liu, J. Lehman, P. Molino, F. Petroski Such, E. Frank, A. Sergeev, and J. Yosinski (2018) An intriguing failing of convolutional neural networks and the CoordConv solution. Advances in Neural Information Processing Systems 31.
  • [15] S. N. Marimont and G. Tarroni (2021) Anomaly detection through latent space restoration using vector quantized variational autoencoders. In 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pp. 1764–1767.
  • [16] F. Meissen, B. Wiestler, G. Kaissis, and D. Rueckert (2022) On the pitfalls of using the residual as anomaly score. In Medical Imaging with Deep Learning.
  • [17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  • [18] A. Pinto, A. Reginelli, F. Pinto, G. Lo Re, F. Midiri, C. Muzj, L. Romano, and L. Brunese (2016) Errors in imaging patients in the emergency setting. The British Journal of Radiology 89, pp. 20150914.
  • [19] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241. ISBN 978-3-319-24574-4.
  • [20] T. Schlegl, P. Seeböck, S. M. Waldstein, G. Langs, and U. Schmidt-Erfurth (2019) f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis 54, pp. 30–44. ISSN 1361-8415.
  • [21] H. M. Schlüter, J. Tan, B. Hou, and B. Kainz (2021) Self-supervised out-of-distribution detection and localization with natural synthetic anomalies (NSA). arXiv preprint arXiv:2109.15222.
  • [22] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • [23] J. Song, K. Kong, Y. Park, S. Kim, and S. Kang (2021) Anomaly segmentation network using self-supervised learning. In AAAI 2022 Workshop on AI for Design and Manufacturing (ADAM).
  • [24] J. Tan, B. Hou, J. Batten, H. Qiu, and B. Kainz (2020) Detecting outliers with foreign patch interpolation. arXiv preprint arXiv:2011.04197.
  • [25] J. Tan, B. Hou, T. Day, J. Simpson, D. Rueckert, and B. Kainz (2021) Detecting outliers with Poisson image interpolation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, M. de Bruijne, P. C. Cattin, S. Cotin, N. Padoy, S. Speidel, Y. Zheng, and C. Essert (Eds.), Cham, pp. 581–591. ISBN 978-3-030-87240-3.
  • [26] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. Summers (2017) ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3462–3471.
  • [27] J. Yu, Y. Zheng, X. Wang, W. Li, Y. Wu, R. Zhao, and L. Wu (2021) FastFlow: unsupervised anomaly detection and localization via 2D normalizing flows. arXiv preprint arXiv:2111.07677.
  • [28] O. Zhang, J. Delbrouck, and D. L. Rubin (2021) Out of distribution detection for medical images. In Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, and Perinatal Imaging, Placental and Preterm Image Analysis, C. H. Sudre, R. Licandro, C. Baumgartner, A. Melbourne, A. Dalca, J. Hutter, R. Tanno, E. Abaci Turk, K. Van Leemput, J. Torrents Barrena, W. M. Wells, and C. Macgowan (Eds.), Cham, pp. 102–111. ISBN 978-3-030-87735-4.
  • [29] D. Zimmerer, J. Petersen, G. Köhler, P. Jäger, P. Full, T. Roß, T. Adler, A. Reinke, L. Maier-Hein, and K. Maier-Hein (2021) Medical out-of-distribution analysis challenge 2021. Zenodo.