U-DIADS-TL: a novel dataset for text line segmentation in historical manuscripts

Zottin, S.; De Nardin, A.; Piciarelli, C.; Foresti, G. L.

doi:10.1007/s10032-026-00585-7

Text line segmentation in historical documents remains a significant challenge because of manuscript degradation, complex layouts, and diverse handwriting styles. The development of robust computational methods is hindered by the scarcity of high-quality ground truth annotations, which require expert knowledge and are time-intensive to produce. Few-shot learning has emerged as a promising solution because it enables model training with minimal annotated data, yet its application to historical document analysis is still largely unexplored. To address this gap, we introduce U-DIADS-TL (Uniud - Document Image Analysis DataSet - Text Line), a dataset specifically designed for text line segmentation in ancient manuscripts. U-DIADS-TL provides noise-free annotations with non-overlapping text elements and accommodates diverse document structures, including multi-column layouts. To encourage few-shot learning approaches, we provide only three training images, so that researchers can develop segmentation models that generalize from limited supervision. The dataset is intended to bridge deep learning and historical document analysis, supporting the development of efficient, adaptable segmentation models for real-world applications.