Application of Computer Vision to the Automated Extraction of Metadata from Natural History Specimen Labels: A Case Study on Herbarium Specimens
Weiwei Liu; Felice Andrea Pellegrino; Adriano Peron; Stefano Martellos
2026-01-01
Abstract
Metadata extraction from natural history collection labels is a pivotal task for the online publication of digitized specimens. However, these collections are estimated to host more than 2 billion specimens worldwide, including ca. 400 million herbarium specimens, and manual metadata extraction at this scale is extremely time-consuming. Automated data extraction from digital images of specimens and their labels is therefore a promising application of state-of-the-art computer vision techniques. Extracting information from herbarium specimen labels normally involves three main steps: text segmentation, multilingual and handwriting recognition, and data parsing. The primary bottleneck in this workflow lies in the limitations of Optical Character Recognition (OCR) systems. This study explores how the general knowledge embedded in multimodal Transformer models can be transferred to the specific task of herbarium specimen label digitization. The final goal is to develop an easy-to-use, end-to-end solution that mitigates the limitations of classic OCR approaches while offering greater flexibility to adapt to different label formats. Donut-base, a pre-trained visual document understanding (VDU) transformer, was selected as the base model for fine-tuning. A dataset from the University of Pisa served as a test bed. The initial attempt achieved an accuracy of 85%, measured using the Tree Edit Distance (TED), demonstrating the feasibility of fine-tuning for this task. Cases with low accuracy were also investigated to identify the limitations of the approach. In particular, specimens with multiple labels, especially those combining handwritten and typewritten text, proved to be the most challenging. Strategies aimed at addressing these weaknesses are discussed.
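To make the end-to-end approach concrete, the following is a minimal inference sketch (not the authors' actual pipeline) that runs a Donut checkpoint through the Hugging Face transformers library. The image path specimen_label.jpg and the task prompt token <s_herbarium> are illustrative assumptions; a fine-tuned Donut model defines its own task prompt and output schema at training time.

```python
# Minimal sketch: inference with a (fine-tuned) Donut checkpoint via
# Hugging Face transformers. Paths and the task prompt are assumptions.
import re

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

# Hypothetical specimen image; Donut takes the raw image, no OCR step.
image = Image.open("specimen_label.jpg").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# Donut decodes conditioned on a task-specific start prompt (assumed here).
task_prompt = "<s_herbarium>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

with torch.no_grad():
    outputs = model.generate(
        pixel_values,
        decoder_input_ids=decoder_input_ids,
        max_length=model.config.decoder.max_position_embeddings,
        pad_token_id=processor.tokenizer.pad_token_id,
        eos_token_id=processor.tokenizer.eos_token_id,
    )

# Strip special tokens and the task prompt, then parse the tag-structured
# output back into a nested dict of label metadata.
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "")
sequence = sequence.replace(processor.tokenizer.pad_token, "")
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```

Fine-tuning for herbarium labels then amounts to training this same encoder-decoder to emit, for each specimen image, the tag-structured sequence that token2json parses back into a metadata dictionary, so no separate OCR or parsing stage is needed.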
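The 85% figure reported above is a TED-based accuracy. As a point of reference, below is a minimal sketch of the normalized Tree Edit Distance accuracy in the convention popularized by the Donut paper, acc = max(0, 1 - TED(pred, gt) / TED(empty, gt)), using the zss package; the metadata field names in the toy example are hypothetical.

```python
# Sketch of normalized TED accuracy between predicted and ground-truth
# metadata, both given as (possibly nested) dicts. Assumes the zss package.
from zss import Node, simple_distance


def json_to_tree(name, value):
    """Recursively turn a (possibly nested) dict into a zss tree."""
    node = Node(str(name))
    if isinstance(value, dict):
        for key, child in value.items():
            node.addkid(json_to_tree(key, child))
    else:
        node.addkid(Node(str(value)))  # leaf value becomes a child node
    return node


def ted_accuracy(pred, gt):
    pred_tree = json_to_tree("root", pred)
    gt_tree = json_to_tree("root", gt)
    empty_tree = Node("root")
    distance = simple_distance(pred_tree, gt_tree)
    # Normalize by the distance from an empty tree, and clip at zero.
    normalizer = simple_distance(empty_tree, gt_tree)
    return max(0.0, 1.0 - distance / normalizer)


# Toy example with hypothetical field names: one wrong locality value.
gt = {"scientificName": "Quercus ilex L.", "locality": "Pisa", "year": "1932"}
pred = {"scientificName": "Quercus ilex L.", "locality": "Pise", "year": "1932"}
print(f"TED accuracy: {ted_accuracy(pred, gt):.2f}")
```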