Historical Newspaper

Historical newspapers are increasingly used as a source of research data, driven by digitization efforts and advances in natural language processing (NLP). Current research focuses on three areas: improving text extraction through better optical character recognition (OCR); analyzing textual content for insights into historical events and societal biases, using quantitative discourse analysis and question-answering models; and handling the distinctive linguistic and structural characteristics of historical documents, for example through rule-based and machine learning approaches to layout analysis. Together, these efforts expand the accessibility and analytical potential of historical newspaper archives for both humanities and computational research.

Papers

March 26, 2024

ChroniclingAmericaQA: A Large-scale Question Answering Dataset based on Historical American Newspaper Pages
Bhawna Piryani, Jamshid Mozafari, Adam Jatowt
Large Language Model, Data Set, Question Answer Pair, Machine Reading Comprehension, Long Form Question, Historical Newspaper

February 4, 2024

A Quantitative Discourse Analysis of Asian Workers in the US Historical Newspapers
Jaihyun Park, Ryan Cordell
Computational Linguistics, Historical Text, Co Worker, Discourse Analysis, Historical Newspaper, Different Discourse

May 18, 2023

Multilingual Event Extraction from Historical Newspaper Adverts
Nadav Borenstein, Natalia da Silva Perez, Isabelle Augenstein
NLP Model, Event Extraction, Historical Text, Multilingual Event, Historical Newspaper, Historical Language

June 1, 2022

Optical character recognition quality affects perceived usefulness of historical newspaper clippings
Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen, Tuula Pääkkönen, Juha Rautiainen
Character Recognition, Relevance Ranking, Automatic Usefulness Prediction, Historical Newspaper, Optical Character Recognition Quality

March 4, 2022

OCR quality affects perceived usefulness of historical newspaper clippings -- a user study
Kimmo Kettunen, Heikki Keskustalo, Sanna Kumpulainen, Tuula Pääkkönen, Juha Rautiainen
User Study, Automatic Usefulness Prediction, OCR Information, Historical Newspaper, Optical Character Recognition Quality

February 16, 2022

Processing the structure of documents: Logical Layout Analysis of historical newspapers in French
Nicolas Gutehrlé, Iana Atanassova
Inner Structure, Document Relevance, Gradient Boosting, Rule Based, Rule Learning, Historical Newspaper
