Early printed books, particularly incunabula, are invaluable archives of the beginnings of modern educational systems. However, their complex layouts, antique typefaces, and page degradation caused by bleed-through and ink fading pose significant challenges for automatic transcription. In this work, we present a modular pipeline that addresses these problems by combining modern layout analysis and language modeling techniques. The pipeline begins with historical layout-aware text segmentation using Kraken, a neural network-based tool tailored for early typographic structures. Initial optical character recognition (OCR) is then performed with Kraken’s recognition engine, followed by post-correction using a fine-tuned ByT5 transformer model trained on manually aligned line-level data. By learning to map noisy OCR outputs to verified transcriptions, the model substantially improves recognition quality. The pipeline also integrates a preprocessing stage based on our previous work on bleed-through removal using robust statistical filters, including non-local means, Gaussian mixtures, biweight estimation, and Gaussian blur. This step enhances the legibility of degraded pages prior to OCR. The entire solution is open, modular, and scalable, supporting long-term preservation and improved accessibility of cultural heritage materials. Experimental results on 15th-century incunabula show a reduction in the Character Error Rate (CER) from around 38% to around 15% and an increase in the Bilingual Evaluation Understudy (BLEU) score from 22 to 44, confirming the effectiveness of our approach. This work demonstrates the potential of integrating transformer-based correction with layout-aware segmentation to enhance OCR accuracy in digital humanities applications.
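The Character Error Rate quoted above is conventionally the character-level edit distance between the OCR output and the verified transcription, divided by the length of the verified transcription. The following is a minimal sketch of that standard definition, not the authors' evaluation code: the function names `levenshtein`, `cer`, and `relative_reduction` and the sample strings are hypothetical, and the paper may normalize text differently before scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate of an OCR hypothesis against a verified reference line."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed by a correction step."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical line pair, for illustration only (not taken from the paper's data).
    noisy_ocr = "Jn principio creauit deus celum et terrarn"
    verified = "In principio creauit deus celum et terram"
    print(f"CER of sample line: {cer(noisy_ocr, verified):.3f}")

    # Approximate CER figures reported in the abstract: about 38% before, about 15% after.
    print(f"Relative CER reduction: {relative_reduction(0.38, 0.15):.1%}")
```

Under this definition, the reported drop from around 38% to around 15% corresponds to removing roughly 60% of the character errors present in the initial OCR output.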
Momtaz et al. studied this question.
www.synapsesocial.com/papers/68c19f9154b1d3bfb60dad80 — DOI: https://doi.org/10.3390/electronics14153083
Yahya Momtaz
Lorenza Laccetti
G. Russo
Electronics
University of Naples Federico II