Writer identification refers to the process of determining or attributing the authorship of a document to a specific individual through the analysis of various elements such as writing style, linguistic characteristics, and other textual features. This is a relevant task in heterogeneous fields such as cybersecurity, forensics, or linguistics and becomes particularly challenging when considering historical documents. In fact, the latter might present deterioration due to time, often lack signatures, and could be authored by multiple people. Complicating matters further, scribes were trained to mimic handwriting meticulously when copying manuscripts, making author identification of such documents even more difficult. In this context, this paper introduces a curated collection of Latin documents from the Genesis and Gospel of Matthew specifically gathered for the purpose of exploring the writer identification task. In particular, the dataset comprises over 400 pages, written by nine distinct persons. The primary objective is to explore the efficacy of state-of-the-art deep learning architectures in accurately ascribing historical texts to their rightful authors. To this end, this paper conducts extensive experiments, utilizing varying training set sizes and employing diverse pre-processing techniques to assess the performance and capabilities of these renowned models on the writer identification task while also providing the community with a baseline on the introduced collection.

Writer Identification in Historical Handwritten Documents: A Latin Dataset and a Benchmark

Fagioli A.
;
Colombi E.;Foresti G. L.
2024-01-01

Abstract

Writer identification refers to the process of determining or attributing the authorship of a document to a specific individual through the analysis of various elements such as writing style, linguistic characteristics, and other textual features. This is a relevant task in heterogeneous fields such as cybersecurity, forensics, or linguistics and becomes particularly challenging when considering historical documents. In fact, the latter might present deterioration due to time, often lack signatures, and could be authored by multiple people. Complicating matters further, scribes were trained to mimic handwriting meticulously when copying manuscripts, making author identification of such documents even more difficult. In this context, this paper introduces a curated collection of Latin documents from the Genesis and Gospel of Matthew specifically gathered for the purpose of exploring the writer identification task. In particular, the dataset comprises over 400 pages, written by nine distinct persons. The primary objective is to explore the efficacy of state-of-the-art deep learning architectures in accurately ascribing historical texts to their rightful authors. To this end, this paper conducts extensive experiments, utilizing varying training set sizes and employing diverse pre-processing techniques to assess the performance and capabilities of these renowned models on the writer identification task while also providing the community with a baseline on the introduced collection.
2024
9783031510250
9783031510267
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11390/1272690
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? ND
social impact