Augmentation of documents for neural network training

Why augment documents

When training deep learning models, data augmentation is often useful. This is especially true for OCR training on documents, where one may need a very large corpus that looks as if it came from real-world offices, with folds, gray areas, fax effects, ink bleeds and scribbles. Augraphy is a library designed to produce these effects.

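The function below takes a folder, searches it recursively for all the PDF documents it contains, and creates a large number of augmented images from them. You should pass clean digital PDFs, such as exports from Word or PDFs of books. Each augmented page is saved in four orientations (0, 90, 180 and 270 degrees), one sub-folder per angle.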

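from pathlib import Path
from random import shuffle

import numpy as np
from augraphy import AugraphyPipeline
from PIL import Image
from tqdm import tqdm

# mupdf2image and create_bank_document_pipeline are helpers defined elsewhere
# (not shown here).


def create_dataset(
    base_pdf_folder: str,
    output_path: str,
    pipeline: AugraphyPipeline,
    shuffle_list: bool = True,
    pages_per_pdf: int = 50,
    dpi: int = 150,
):
    """
    Make sure all the PDFs are upright! The rotated copies are only
    meaningful if every input page starts at 0 degrees.
    """
    angles = [0, 180, 90, 270]
    output_path = Path(output_path)
    # One sub-folder per rotation angle; create them before saving anything.
    for angle in angles:
        (output_path / str(angle)).mkdir(parents=True, exist_ok=True)

    pdf_list = list(Path(base_pdf_folder).rglob("*.pdf"))
    if shuffle_list:
        shuffle(pdf_list)  # in place

    pbar = tqdm(pdf_list, total=len(pdf_list))
    for f in pbar:
        pbar.set_postfix_str(f.name)
        try:
            pages = mupdf2image(f, n_pages=pages_per_pdf, dpi=dpi, grayscale=True)
            for page_num, page in enumerate(pages):
                augmented = pipeline.augment(np.array(page))["output"]
                pi = Image.fromarray(augmented).convert("L")
                for angle in angles:
                    output_name = output_path / str(angle) / f"{f.stem}_{page_num}.jpg"
                    # Negative angle rotates clockwise; expand keeps the whole page.
                    pi.rotate(-angle, expand=True, fillcolor=255).save(output_name)
        except Exception as ex:
            # Report PDFs that fail to render or augment, then move on.
            print(f"{ex} - {f.name}")
            continue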

create_dataset(
    "/Users/carlo/Dropbox/Books/",
    "/Users/carlo/data/textorientation/jpg/aug/",
    pipeline=create_bank_document_pipeline(),
    shuffle_list=True,
    pages_per_pdf=5,
    dpi=150
)
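
The rotation step inside create_dataset can be checked in isolation. The following is a minimal sketch, not part of the original code: rotated_variants is a hypothetical helper, and the 4x2 synthetic page stands in for a rendered PDF page. It shows that rotate(-angle, expand=True) turns the page clockwise and swaps width and height for 90 and 270 degrees, which is presumably why the angle is used as the folder name.

```python
from PIL import Image

ANGLES = (0, 90, 180, 270)  # same angles as in create_dataset


def rotated_variants(page: Image.Image) -> dict:
    """Return one clockwise-rotated copy of `page` per angle (hypothetical helper)."""
    # Pillow's rotate() turns counter-clockwise for positive angles, so the
    # negated angle gives a clockwise rotation. expand=True grows the canvas
    # so that 90 and 270 degree turns are not cropped.
    return {a: page.rotate(-a, expand=True, fillcolor=255) for a in ANGLES}


# Synthetic 4x2 white "page" with a single black pixel in the top-left corner.
page = Image.new("L", (4, 2), color=255)
page.putpixel((0, 0), 0)

variants = rotated_variants(page)
for angle, img in variants.items():
    # Print the angle and the (width, height) of the rotated copy.
    print(angle, img.size)
```

Following the marked corner through the four copies is a quick way to confirm the direction of rotation before generating a large dataset.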

The pages are selected at random within each PDF; pages_per_pdf sets how many are drawn.
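
The mupdf2image helper is not shown here, so how it picks pages is unknown. As a rough sketch of one way such random selection could work, the hypothetical pick_pages function below draws distinct page indices without replacement and caps the count at the document length. The function name, its parameters and the optional seed are assumptions made for illustration.

```python
import random
from typing import List, Optional


def pick_pages(page_count: int, n_pages: int, seed: Optional[int] = None) -> List[int]:
    """Pick up to n_pages distinct page indices, returned in document order.

    Hypothetical helper: an assumption about how random page selection
    might be done, not the actual mupdf2image implementation.
    """
    rng = random.Random(seed)  # a seed makes the selection reproducible
    k = min(n_pages, page_count)  # short documents contribute all their pages
    return sorted(rng.sample(range(page_count), k))


# Draw 5 page indices from a 300-page book.
selected = pick_pages(page_count=300, n_pages=5, seed=42)
```

Sampling without replacement avoids rendering the same page twice, and a fixed seed lets the same dataset be regenerated later.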