A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models

Unstructured clinical narratives are a major source of phenotypic evidence for rare-disease diagnosis and genomic variant interpretation. However, their free-text nature, often multilingual, heterogeneous in format, and inconsistent in terminology, makes automated phenotype extraction and interoperability with downstream genomic pipelines difficult. This creates a practical bottleneck for scalable and reproducible phenotype curation in medical genetics, where manual review is time-consuming and prone to variability. To address this problem, we propose a robust, open-source, and fully local pipeline for automatically extracting and standardizing patient phenotypes from medical reports while preserving data privacy. The pipeline integrates: (i) OCR-based digitization and an LLM-based translation module to produce an English version of the report; (ii) a GPT-oss–based phenotype extractor using structured, few-shot prompting to identify phenotypes relevant to the index patient; and (iii) a fuzzy standardization stage that combines lexical similarity with embedding-based semantic matching to map extracted phenotypes to Human Phenotype Ontology (HPO) concepts. Our multi-stage design improves robustness to real-world documentation issues, including multilingual acronyms, variable report structure, spelling errors, and synonym variability, and it ensures privacy compliance by keeping all computation on local infrastructure. We demonstrate the pipeline end-to-end on a representative clinical report, showing that it extracts patient-relevant phenotypes and produces HPO-aligned, machine-readable outputs suitable for downstream genomic analyses. This work provides a practical foundation for privacy-preserving, scalable phenotype curation in clinical genetics and supports future integration and evaluation on larger clinical datasets.
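Stage (iii) of the abstract, which blends lexical similarity with embedding-based semantic matching, can be illustrated with a minimal sketch. The abstract does not name the similarity measures, embedding model, weighting, or threshold, so everything below is an assumption made for illustration: `lexical_similarity` uses Python's `difflib`, `toy_embed` is a character-trigram stand-in for a real sentence-embedding model, and `alpha`, `threshold`, and `map_to_hpo` are hypothetical names. The four HPO entries are a tiny hand-picked subset of the ontology.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Tiny illustrative subset of HPO; a real system would load the full ontology.
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0000252": "Microcephaly",
    "HP:0001263": "Global developmental delay",
    "HP:0001252": "Hypotonia",
}


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; tolerant of spelling errors."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_embed(text: str, n: int = 3) -> Counter:
    """ASSUMPTION: stand-in for a learned embedding (bag of character trigrams)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def map_to_hpo(phrase: str, alpha: float = 0.5, threshold: float = 0.6):
    """Return (hpo_id, label, score) for the best match, or None below threshold.

    alpha weights the lexical score against the embedding score; both alpha
    and threshold are hypothetical values, not taken from the paper.
    """
    query_vec = toy_embed(phrase)
    best_id, best_score = None, 0.0
    for hpo_id, label in HPO_TERMS.items():
        score = (alpha * lexical_similarity(phrase, label)
                 + (1 - alpha) * cosine(query_vec, toy_embed(label)))
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, HPO_TERMS[best_id], round(best_score, 3)


if __name__ == "__main__":
    for extracted in ["microcefaly", "seizures", "xyz"]:
        print(extracted, "->", map_to_hpo(extracted))
```

In this sketch a misspelled phrase such as "microcefaly" still resolves to the Microcephaly concept, while an unrelated string falls below the threshold and is left unmapped. Replacing `toy_embed` with a real embedding model is what would let the same scoring scheme handle true synonyms that share few characters with the HPO label.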

(2026). A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models. Retrieved from https://hdl.handle.net/10446/322945


Bombarda, Andrea;Saletta, Martina;Bellini, Matteo;Goisis, Lucrezia;Iascone, Maria;Cazzaniga, Paolo;Savo, Domenico

2026-01-01



Publication date: 2026
All authors: Bombarda, Andrea; Saletta, Martina; Bellini, Matteo; Goisis, Lucrezia; Iascone, Maria; Cazzaniga, Paolo; Savo, Domenico Fabio
In collections: 1.4.01 Contributi in atti di convegno - Conference presentations

File(s) attached to this record:

File	File size	Format
HEALTHINF_2026___Phenotypes_Extraction (5).pdf (open access; version: publisher's version; license: Creative Commons)	200.58 kB	Adobe PDF



Use this identifier to cite or link to this document: https://hdl.handle.net/10446/322945
