Post OCR
Post-OCR correction aims to improve the accuracy of text extracted from scanned documents by optical character recognition (OCR) systems, which often produce errors, especially on historical documents, complex layouts, and noisy images. Current research relies on language models, particularly transformer-based architectures such as T5 and BERT, often augmented with techniques such as glyph embeddings and attention mechanisms, to correct these errors. The field is important for digitizing historical archives, widening access to cultural heritage, and enabling downstream natural language processing tasks. Ongoing work targets robust methods for a range of languages and document types, and includes building benchmark datasets and generating synthetic training data to address data scarcity.
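As a rough illustration of the synthetic data generation mentioned above, the sketch below corrupts clean text with OCR-style character substitutions and scores the result with character error rate (CER), a common metric for this task. The confusion table, the function names, and the sample sentence are all assumptions made for illustration; they are not taken from any of the listed papers, and real systems typically learn confusion statistics from aligned OCR output rather than using a hand-written table.

```python
import random

# Hypothetical confusion table of visually similar substitutions (illustrative only).
CONFUSIONS = {"e": "c", "l": "1", "o": "0", "m": "rn", "s": "5"}


def add_ocr_noise(text, rate, seed=0):
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


clean = "the old mill stood alone"
noisy = add_ocr_noise(clean, rate=0.3)
print(noisy, cer(clean, noisy))
```

Pairs of `(noisy, clean)` strings produced this way could serve as training examples for a sequence-to-sequence corrector, and the same `cer` function can compare raw OCR output against corrected output on held-out text.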
Papers
Optimizing the Neural Network Training for OCR Error Correction of Historical Hebrew Texts
Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet
Toward a Period-Specific Optimized Neural Network for OCR Error Correction of Historical Hebrew Texts
Omri Suissa, Maayan Zhitomirsky-Geffet, Avshalom Elmalech