Why augment documents
When training deep learning models, data augmentation is often useful. For OCR training in particular, you may need a very large corpus of documents that look as if they came from real-world offices, with folds, gray areas, fax effects, ink bleeds and scribbles. Augraphy is a library designed to do that.
This function takes a folder, searches it recursively for PDF documents (pass clean digital PDFs, such as exports from Word or PDFs of books) and creates a large number of images.
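from pathlib import Path
from random import shuffle

import numpy as np
from augraphy import AugraphyPipeline
from PIL import Image
from tqdm import tqdm

# mupdf2image is a helper assumed to be defined elsewhere; it is not shown here.


def create_dataset(
    base_pdf_folder: str,
    output_path: str,
    pipeline: AugraphyPipeline,
    shuffle_list: bool = True,
    pages_per_pdf: int = 50,
    dpi: int = 150,
):
    """
    Make sure all the PDFs are upright!
    """
    output_path = Path(output_path)
    angles = [0, 180, 90, 270]
    # Create one subfolder per rotation angle up front, so save() does not fail.
    for angle in angles:
        (output_path / str(angle)).mkdir(parents=True, exist_ok=True)
    pdf_list = list(Path(base_pdf_folder).rglob("*.pdf"))
    if shuffle_list:
        shuffle(pdf_list)  # in place
    pbar = tqdm(pdf_list, total=len(pdf_list))
    for f in pbar:
        pbar.set_postfix_str(f.name)
        try:
            for page_num, page in enumerate(mupdf2image(f, n_pages=pages_per_pdf, dpi=dpi, grayscale=True)):
                augmented = pipeline.augment(np.array(page))["output"]
                pi = Image.fromarray(augmented).convert("L")
                # Save the augmented page once per rotation angle.
                for angle in angles:
                    output_name = output_path / str(angle) / f.name.replace(".pdf", f"_{page_num}.jpg")
                    pi.rotate(-angle, expand=True, fillcolor=(255,)).save(output_name)
        except Exception as ex:
            # Report PDFs that fail to render or augment, then move on.
            print(f"{ex} - {f.name}")
            continue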
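The function saves every augmented page four times, once per rotation, in a subfolder named after the angle, presumably so that the folder name can serve as an orientation label. Here is a minimal sketch of that naming scheme using only pathlib. The helper name `rotated_output_paths` and the example file names are hypothetical and not part of the original code.

```python
from pathlib import Path

ANGLES = (0, 180, 90, 270)  # same order as in create_dataset


def rotated_output_paths(output_path, pdf_name, page_num, angles=ANGLES):
    """Return a dict mapping each angle to the JPEG path for one page.

    Hypothetical helper: it only builds paths and does not touch the disk.
    """
    stem = Path(pdf_name).stem  # "report.pdf" -> "report"
    return {
        angle: Path(output_path) / str(angle) / f"{stem}_{page_num}.jpg"
        for angle in angles
    }


paths = rotated_output_paths("aug", "report.pdf", 3)
for angle, path in paths.items():
    print(angle, path.as_posix())
```

Keeping the angle in the folder name rather than in the file name means an image-folder style loader can read the label straight from the directory.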
create_dataset(
    "/Users/carlo/Dropbox/Books/",
    "/Users/carlo/data/textorientation/jpg/aug/",
    pipeline=create_bank_document_pipeline(),
    shuffle_list=True,
    pages_per_pdf=5,
    dpi=150
)
Pages are selected at random from each PDF.
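The `mupdf2image` helper is not shown, so how it picks pages is unknown. As an illustration only, random page selection could look like the sketch below. The function `pick_random_pages` and its `seed` parameter are assumptions made for this example.

```python
import random


def pick_random_pages(total_pages, n_pages, seed=None):
    """Pick up to n_pages distinct page indices at random, in ascending order.

    Hypothetical sketch: if the PDF has fewer pages than requested,
    every page is returned.
    """
    rng = random.Random(seed)  # local generator; global random state is untouched
    k = min(n_pages, total_pages)
    return sorted(rng.sample(range(total_pages), k))


# Five distinct pages from a 200-page book
print(pick_random_pages(200, 5, seed=0))
# A 3-page document with a budget of 50 pages yields all three pages
print(pick_random_pages(3, 50))
```

Sampling without replacement prevents the same page from appearing twice in the dataset, and capping the count at the document length avoids errors on short PDFs.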