Why augment documents
When training deep learning models, data augmentation is often useful. For OCR training in particular, you may need a very large corpus of documents that look as if they came from real-world offices, with folds, gray areas, fax effects, ink bleeds and scribbles. Augraphy is a library designed to produce these effects.
This function takes a folder, recursively finds all the PDF documents it contains (pass clean digital PDFs, such as exports from Word or PDFs of books) and creates a large number of augmented images, saving each page in four orientations.
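from pathlib import Path
from random import shuffle

import numpy as np
from augraphy import AugraphyPipeline
from PIL import Image
from tqdm import tqdm

# Note: mupdf2image (used below) is not defined in this post;
# it is assumed to yield page images rendered from a PDF.


def create_dataset(
    base_pdf_folder: str,
    output_path: str,
    pipeline: AugraphyPipeline,
    shuffle_list: bool = True,
    pages_per_pdf: int = 50,
    dpi: int = 150,
):
    """
    Make sure all the PDFs are upright!
    """
    angles = [0, 180, 90, 270]
    output_path = Path(output_path)
    # Create one sub-folder per rotation angle up front, otherwise save() fails
    for angle in angles:
        (output_path / str(angle)).mkdir(parents=True, exist_ok=True)

    pdf_list = list(Path(base_pdf_folder).rglob("*.pdf"))
    if shuffle_list:
        shuffle(pdf_list)  # in place

    pbar = tqdm(pdf_list, total=len(pdf_list))
    for f in pbar:
        pbar.set_postfix_str(f.name)
        try:
            pages = mupdf2image(f, n_pages=pages_per_pdf, dpi=dpi, grayscale=True)
            for page_num, page in enumerate(pages):
                augmented = pipeline.augment(np.array(page))["output"]
                pi = Image.fromarray(augmented).convert("L")
                # Save the augmented page once per orientation
                for angle in angles:
                    output_name = output_path / str(angle) / f"{f.stem}_{page_num}.jpg"
                    pi.rotate(-angle, expand=True, fillcolor=(255,)).save(output_name)
        except Exception as ex:
            # Report PDFs that fail to render or augment, then move on
            print(f"{ex} - {f.name}")
            continue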
create_dataset(
    "/Users/carlo/Dropbox/Books/",
    "/Users/carlo/data/textorientation/jpg/aug/",
    pipeline=create_bank_document_pipeline(),
    shuffle_list=True,
    pages_per_pdf=5,
    dpi=150,
)
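Since every image is written to a sub-folder named after its rotation angle, the angle can be read back from the path when the dataset is loaded. Here is a minimal sketch of that idea, using only the standard library. The helper name `list_labeled_images` is hypothetical and not part of Augraphy or of the code above.

```python
from pathlib import Path
from typing import List, Tuple

# Rotation angles used as sub-folder names by create_dataset
ANGLES = (0, 90, 180, 270)


def list_labeled_images(root) -> List[Tuple[Path, int]]:
    """Return (image_path, angle) pairs, taking the angle from the parent folder name.

    Hypothetical helper: it only assumes the <root>/<angle>/<name>.jpg layout.
    """
    root = Path(root)
    samples = []
    for angle in ANGLES:
        folder = root / str(angle)
        if not folder.is_dir():
            continue  # tolerate missing angle folders
        for img in sorted(folder.glob("*.jpg")):
            samples.append((img, angle))
    return samples
```

Calling `list_labeled_images` on the output folder passed to `create_dataset` would give a list that can be fed to whatever data loader you use.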
Pages are selected at random from each PDF.
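The page-selection code itself is not shown in this post. One plausible way to do it is to sample page indices without replacement, capped at the number of pages in the document. The function `pick_random_pages` and its `seed` parameter are assumptions made for illustration only.

```python
import random
from typing import List, Optional


def pick_random_pages(total_pages: int, n_pages: int, seed: Optional[int] = None) -> List[int]:
    """Pick up to n_pages distinct zero-based page indices from a document.

    Hypothetical sketch: the actual selection logic used in this post is not shown.
    """
    rng = random.Random(seed)  # a local generator keeps the global random state untouched
    k = min(n_pages, total_pages)  # short documents simply contribute all their pages
    return sorted(rng.sample(range(total_pages), k))
```

With this approach a 3-page PDF and `n_pages=50` yields all three pages, while a 500-page book contributes a different random subset on each run unless a seed is fixed.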
Let's talk!
I'm Carlo Nicolini. I am interested in the reliability of AI reasoning systems (interpretability, inference-time methods, probabilistic language programming) and in quantitative portfolio optimization (I am a maintainer of skfolio). If you're working on something in these areas and think we might collaborate, chat or discuss, I'm happy to talk about it!
The best way to reach me is via DM on LinkedIn.