Augraphy: A Data Augmentation Library for Document Images

Samay Maini, Alexander Groleau, Kok Wei Chee, Stefan Larson, Jonathan Boarman

Sparkfish LLC, Addison, TX, USA; Vanderbilt University, Nashville, TN, USA
Corresponding email: smaini@sparkfish.com and jboarman@sparkfish.com
Abstract

This paper introduces Augraphy (https://github.com/sparkfish/augraphy), a Python package geared toward realistic data augmentation strategies for document images. Augraphy uses many different augmentation strategies to produce augmented versions of clean document images that appear as if they have been distorted by standard office operations, such as printing, scanning, and faxing through old or dirty machines, degradation of ink over time, and handwritten markings. Augraphy can be used both (1) as a data augmentation tool for producing diverse training data for tasks such as document de-noising, and (2) for generating challenging test data for evaluating model robustness on document image modeling tasks. This paper provides an overview of Augraphy and presents three example robustness testing use-cases of Augraphy.


1 Introduction

The modern world presents many tasks that require automated and intelligent solutions for handling unstructured data. Often, this data is in the form of documents, and these documents may appear noisy, especially if they have been captured from the physical world via printing, scanning, or photocopying processes. Such real-world phenomena may introduce many types of distortions: for instance, folds, wrinkles, or tears in a page can cause color changes and shadows in a scanned document image; low or high printer ink settings may cause some regions of a document to be lighter or darker; and human annotations like highlighting or pencil marks can add noise to the page.

Many tasks involving machine learning are impacted by document noise. High-level tasks like document classification and information extraction must often be able to perform on noisily scanned document images. For instance, the RVL-CDIP document classification corpus [harley2015icdar-rvlcdip] consists of scanned document images, many of which have substantial amounts of scanner-induced noise, as does the FUNSD form understanding benchmark [jaume2019-funsd]. Other intermediate-level tasks like optical character recognition (OCR) and page layout analysis may perform optimally if noise in a document image is minimized [character-recognition-systems, ogorman-document-image-analysis, Rotman2022-hh]. Further, the lower-level task of document de-noising tackles the document noise problem more directly by attempting to remove noise from a document image [noisyoffice, blind-denoising-iccv-2021, kulkarni-2020, patch-based-document-denoising, Mustafa_2018-wan].

Figure 1: Our new Augraphy augmentation tool can be used to convert clean document images (left side of figure) into noisy versions (right side).

Such tasks benefit from copious amounts of training data, and one way of generating large amounts of training data with noise-like artefacts is to use data augmentation. For this reason we introduce Augraphy, an open-source data augmentation tool for generating versions of document images that contain realistic noise artefacts commonly introduced via scanning, photocopying, and other office room procedures. Augraphy differs from most image data augmentation tools by specifically targeting the types of alterations and degradations seen in document images.

Figure 2: Examples of clean document images (left side of each thumbnail) with corresponding Augraphy-generated noisy version (right side of each thumbnail).

Augraphy offers 24 individual augmentation methods out-of-the-box across three “phases” of augmentations. These individual phase augmentations can be composed together, along with a step in which different paper backgrounds can be added to the augmented image. The resulting document images are realistic, noisy versions of clean documents, as evidenced in Figure 1 and in Figure 2. Data produced by Augraphy can also be used for robustness testing, and we use Augraphy to create data for testing several machine learning model types, such as OCR, document de-noising, and face detection, finding that Augraphy is effective at producing challenging test data.

2 Related Work

This section discusses prior work related to document data augmentation, robustness testing, and document de-noising.

2.1 Data Augmentation

A wide variety of data augmentation tools and pipelines exist for machine learning tasks ranging from natural language processing (e.g., [feng-etal-2021-survey, fadaee-etal-2017-data, wei-zou-2019-eda]) and audio and speech processing (e.g., [ko15_interspeech, audiogmenter, audio-framework]) to computer vision and image processing. In image processing and computer vision, data augmentation tools and pipelines include Augly [Papakipos2022-gq-augly], Augmentor [augmentor], Albumentations [albumentations], DocCreator [doccreator], and imgaug [imgaug]. Augmentation strategies from these image-centric libraries are typically general purpose, and include image transformations like rotations, warps, and color modifications. Table 1 compares Augraphy with other image-based data augmentation libraries and tools. As can be seen, these other data augmentation libraries do not specifically provide support for imitating the corruptions commonly seen in document analysis corpora.

Library Name                       Document-Centric   Python   License
Augmentor [augmentor]              No                 Yes      MIT
Albumentations [albumentations]    No                 Yes      MIT
imgaug [imgaug]                    No                 Yes      MIT
Augly [Papakipos2022-gq-augly]     No                 Yes      MIT
Pytorch [pytorch]                  No                 Yes      BSD-style
DocCreator [doccreator]            Yes                No       LGPL-3.0
Augraphy (ours)                    Yes                Yes      MIT
Table 1: Comparison of various data augmentation libraries for images.

A notable exception to this is the DocCreator image synthesizing tool [doccreator], which is targeted towards creating synthetic images that mimic common corruptions seen in document collections. DocCreator differs from Augraphy in several ways, however. First, DocCreator’s augmentations are meant to imitate those seen in historical (e.g., ancient or medieval) documents, while Augraphy is meant to replicate noise caused by office room procedures. Second, DocCreator is written in the C++ programming language and is a what-you-see-is-what-you-get (WYSIWYG) tool, and does not have a scripting or API interface to enable use in a broader machine learning pipeline. Augraphy, in contrast, is written in Python, can be easily integrated into machine learning model development and evaluation pipelines, and can be used alongside other Python packages.

2.2 Robustness Testing

The introduction of noise-like corruptions and other modifications to image data can be used as a way of estimating and evaluating model robustness. This prior work includes the use of image blurring, contrast and brightness changes, color alterations, partial occlusions, geometric transformations, pixel-level noise (e.g., salt-and-pepper noise, impulse noise, etc.), and compression artefacts (e.g., JPEG) to evaluate image classification and object detection models (e.g., [image-quality-impact, imagenet-c, pathology-recommendations, Hosseini2017-kn-google-api, face-recognition-impact, pathology-schomig, Vasiljevic2016-al-bluring-impact]). Our paper also uses robustness testing as a way to showcase the effectiveness of Augraphy, but rather than general image modifications like those described above, we use document-centric modifications.

import augraphy
import cv2

# Build the default pipeline and apply it to an image loaded with OpenCV.
pipeline = augraphy.default_augraphy_pipeline()
img = cv2.imread("image.png")
data = pipeline.augment(img)
augmented = data["output"]  # the augmented image
Listing 1: Transforming an image with Augraphy.
Figure 3: Example Augraphy pipeline to compose several image augmentations together with a specific paper background.

3 Augraphy

Augraphy is a lightweight Python package. It is registered on the Python Package Index (PyPI) and can be installed using

pip install augraphy

Augraphy requires only a few other commonly-used Python scientific computing or image handling packages in order to run, such as NumPy [numpy] and Pillow. Augraphy has been tested on Windows, Linux, and Mac computing environments. Listing 1 shows how easy it is to get Augraphy up and running to create a straightforward augmentation pipeline and apply it to an image. Examples of output generated by Augraphy are shown in Figure 2.

Figure 4: A subset of various Augraphy augmentations on sample images.
Figure 5: Side-by-side comparison of clean image (left) with Augraphy-augmented noisy version (right).
Figure 6: Comparison using de-noising model trained on NoisyOffice and tested on NoisyOffice (top row) versus tested on Augraphy-augmented noisy images (middle and bottom rows). Noisy versions are shown in the left column (a), and model de-noised versions are at right (b).
Ink Phase         Paper Phase           Post Phase
BleedThrough      BrightnessTexturize   BadPhotoCopy
DirtyDrum         ColorPaper            BindingsAndFasteners
DirtyRollers      Gamma                 BookBinding
Dithering         Geometric             Folding
Faxify            LightingGradient      JPEG
InkBleed          PageBorder            NoiseTexturize
Letterpress       SubtleNoise           Watermark
LowInkLine
Markup
PencilScribbles
Table 2: Individual Augraphy augmentations for each augmentation phase.

3.1 Breaking Down the Augraphy Pipeline

There are three unique phases in the Augraphy pipeline: the Ink Phase, Paper Phase, and Post Phase. These three phases are highlighted in Figure 3, which shows how the three phases are composed together. Table 2 lists the various individual augmentations from each phase, and Figure 4 displays example output for several of the augmentations from various phases.

Augraphy’s augmentation pipeline starts with an image of a clean document. The pipeline begins by extracting the text and graphics from the source into an “ink” layer. (Ink is synonymous with printer toner within Augraphy.) The augmentation pipeline then distorts and degrades the ink layer through a series of augmentations. There are 10 different ink augmentations to choose from, and any combination of augmentations can work simultaneously.

In the Paper Phase, a “paper factory” provides either a white page or a randomly-selected paper texture base. Like the ink layer, the paper can also be processed through a series of augmentations, with seven effects to choose from, to further provide random realistic paper textures.

After both the ink and paper phases are completed, processing continues by applying the ink, with its desired effects, to the paper. This merged document image is then augmented further through the Post Phase with distortions, such as adding folds, watermarks, or other physical deformations that rely on simultaneous interactions of paper and ink layers. Figure 5 displays a clean document image next to an Augraphy-augmented image that utilized several Ink and Paper phase augmentations. The result is a realistic-looking noisy document image.
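
The three-phase flow described in this section can be illustrated with a toy sketch. It is not Augraphy's implementation: the function names (`extract_ink`, `make_paper`, `run_phase`, `merge_layers`), the threshold value, and the two toy effects are all assumptions chosen for illustration, and images are plain nested lists of grayscale values (0 is black, 255 is white).

```python
import random

WHITE = 255


def extract_ink(image, threshold=128):
    """Keep dark pixels as ink; everything else becomes white (hypothetical rule)."""
    return [[px if px < threshold else WHITE for px in row] for row in image]


def make_paper(height, width, shade=235):
    """Stand-in for a paper factory: a uniform off-white page."""
    return [[shade] * width for _ in range(height)]


def fade_ink(layer, amount=60):
    """Toy ink effect: lighten ink pixels, as if toner were running low."""
    return [[min(WHITE, px + amount) if px < WHITE else px for px in row]
            for row in layer]


def speckle(layer, rng, rate=0.05):
    """Toy post effect: halve the value of each pixel with probability `rate`."""
    return [[px // 2 if rng.random() < rate else px for px in row]
            for row in layer]


def run_phase(layer, effects):
    """Apply each effect of a phase in order."""
    for effect in effects:
        layer = effect(layer)
    return layer


def merge_layers(ink, paper):
    """Print ink onto paper: the darker value wins at every pixel."""
    return [[min(i, p) for i, p in zip(ink_row, paper_row)]
            for ink_row, paper_row in zip(ink, paper)]


def toy_pipeline(image, seed=0):
    rng = random.Random(seed)
    ink = run_phase(extract_ink(image), [fade_ink])                # ink phase
    paper = run_phase(make_paper(len(image), len(image[0])), [])   # paper phase (no effects)
    merged = merge_layers(ink, paper)                              # apply ink to paper
    return run_phase(merged, [lambda layer: speckle(layer, rng)])  # post phase


clean = [[255, 0, 255], [255, 0, 255], [255, 255, 255]]
noisy = toy_pipeline(clean)
```

Each phase is modeled as an ordered list of callables, which loosely mirrors how the ink and paper layers are processed independently before the merged image reaches the post phase.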

Figure 7: Example images modified by Augraphy where Microsoft Azure’s face detection model did not detect faces.

4 Applications

This section highlights several application settings where Augraphy can be used. While Augraphy is a tool that can be used for training data augmentation, it can also be used as an effective tool for altering data for robustness testing. Here, we focus on the latter scenario.

4.1 OCR Robustness Testing with Augraphy

Our first application demonstrating the effectiveness of Augraphy is robustness testing of optical character recognition (OCR). Here, we use Augraphy to add noise to clean document images to test OCR performance in noisy settings. We use the widely used open-source Tesseract engine (https://github.com/tesseract-ocr/tesseract). We first compiled 15 noise-free document images from a corpus of born-digital documents, and then used Tesseract to generate OCR predictions on these noise-free documents. We treated these OCR predictions as the ground-truth labels for each document. Next, we generated noisy versions of the 15 documents by running them through an Augraphy pipeline, and again used Tesseract to generate OCR predictions on these noisy documents. We compared the word accuracy rate of the noisy OCR results against the ground-truth noise-free OCR results, and found that the noisy OCR results were on average 52% less accurate, with drops of up to 84%. This example use-case demonstrates the effectiveness of using Augraphy to create challenging test data for evaluating OCR systems.
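
The comparison above can be sketched as follows. The paper does not spell out how word accuracy is computed, so this is one plausible definition offered as an assumption: the fraction of reference words that fall inside in-order matching blocks found by Python's `difflib.SequenceMatcher`. The function name `word_accuracy` and the two sample strings are hypothetical and are not data from the experiment.

```python
from difflib import SequenceMatcher


def word_accuracy(reference_text, predicted_text):
    """Fraction of reference words covered by difflib's in-order matching blocks."""
    ref_words = reference_text.split()
    pred_words = predicted_text.split()
    if not ref_words:
        return 1.0 if not pred_words else 0.0
    matcher = SequenceMatcher(a=ref_words, b=pred_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref_words)


# Hypothetical OCR outputs for one document (not data from the paper).
clean_ocr = "the quick brown fox jumps"  # OCR of the noise-free image, used as reference
noisy_ocr = "the qu1ck brown fox jumps"  # OCR of the augmented image

accuracy = word_accuracy(clean_ocr, noisy_ocr)
drop = 1.0 - accuracy  # how much less accurate the noisy result is
print(f"word accuracy: {accuracy:.2f}, drop: {drop:.2f}")
```

Averaging `drop` over all documents would give a figure comparable in spirit to the average reported above, under the stated assumption about the metric.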

4.2 De-noising Robustness Testing with Augraphy

This section highlights the effectiveness of Augraphy by creating a new evaluation set for the task of document de-noising. Document de-noising is the task of removing noisy artifacts from a document image, and one recent dataset that has emerged for this task is the NoisyOffice dataset [noisyoffice], which itself generated noisy versions of clean documents by applying several basic augmentations. However, both the original documents and the augmentations in NoisyOffice are quite limited, so it is natural to wonder if a model trained on NoisyOffice data can generalize to more diverse data inputs for the de-noising task.

Figure 6 shows example test inputs (left) to a convolutional auto-encoder that was trained on the NoisyOffice dataset. The model’s outputs are shown on the right side of Figure 6. It can be seen that the model does well on the NoisyOffice input (top row), but underperforms on data that was augmented by Augraphy (bottom two rows), showing that Augraphy’s augmentations are effective at producing challenging testing data for analyzing the robustness of de-noising models.

4.3 Face Detection Robustness Testing with Augraphy

In this section, we use Augraphy to investigate the robustness of face detection models on images that have been altered by Augraphy. In this way, we move beyond analyzing robustness of text-related tasks, but the images we analyze in this section can nonetheless appear in documents like newspapers, which are often scanned from the physical world by noisy scanners. Hence, it is important for image-focused models like image classifiers and object detection models to be robust to scanner-induced noise.

We begin by sampling 75 images from the FDDB face detection benchmark [fddbTech]. Then, we use Augraphy to generate 10 altered versions of each image and manually remove any augmented image where the face(s) is not reasonably visible. We then test two face detection models on the noisy and noise-free images: Microsoft Azure’s proprietary face detection model (https://azure.microsoft.com/en-us/services/cognitive-services/face) and the freely available UltraFace model (https://github.com/onnx/models/tree/main/vision/body_analysis/ultraface). Table 3 shows face detection accuracy of both models on the noisy test set, and example images where no faces were detected by the Azure model are shown in Figure 7. We see that both models struggle on the Augraphy-augmented data, with Microsoft Azure’s model dropping to 57.1%, and UltraFace detecting only 4.5% of faces that it found in the noise-free data.

Model Accuracy
MS Azure Face Detection 57.1%
UltraFace 4.5%
Table 3: Face detection performance on noisy images.
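
A figure such as "faces detected on noisy images relative to faces found in the noise-free data" could be computed along the following lines. This is a minimal sketch under assumptions: the paper does not give the exact formula, the function name `relative_detection_rate` is hypothetical, and the per-image face counts are made-up values rather than results from the experiment.

```python
def relative_detection_rate(clean_counts, noisy_counts):
    """Share of faces found on clean images that are still found on noisy ones."""
    total_clean = sum(clean_counts)
    if total_clean == 0:
        raise ValueError("no faces were detected on the clean images")
    # Cap each noisy count so spurious extra detections are not rewarded.
    retained = sum(min(c, n) for c, n in zip(clean_counts, noisy_counts))
    return retained / total_clean


# Hypothetical per-image face counts (not results from the paper).
clean_counts = [2, 1, 2, 1]
noisy_counts = [1, 0, 2, 0]
rate = relative_detection_rate(clean_counts, noisy_counts)
print(f"relative detection rate: {rate:.1%}")
```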

5 Conclusion

This paper introduces Augraphy, a new data augmentation package for image-based document analysis tasks. Augraphy is unique among image-based augmentation tools and pipelines as it is a Python-based, easy-to-use library that focuses exclusively on augmentations tailored to mimicking real-life document noise caused by scanners and noisy printing processes. Augraphy offers 24 individual augmentation methods across three phases that can be composed together to create realistic, noisy document images. Our new tool can be used for data augmentation for both training and testing purposes, and this paper demonstrated Augraphy’s utility by generating challenging test data for three different document and image-based machine learning tasks. Augraphy fills a gap in the document-based machine learning space by providing a tool for augmenting data with realistic noise artifacts commonly seen in document analysis.

References