(2026). A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models. Retrieved from https://hdl.handle.net/10446/322945

A Pipeline for Automated Phenotype Extraction from Medical Reports Using Large Language Models

Bombarda, Andrea; Cazzaniga, Paolo; Savo, Domenico
2026-01-01

Abstract

Unstructured clinical narratives are a major source of phenotypic evidence for rare-disease diagnosis and genomic variant interpretation. However, these narratives are free text, often multilingual, heterogeneous in format, and inconsistent in terminology, which makes automated phenotype extraction and interoperability with downstream genomic pipelines difficult. The result is a practical bottleneck for scalable and reproducible phenotype curation in medical genetics, where manual review is time-consuming and prone to variability. To address this problem, we propose a robust, open-source, and fully local pipeline that automatically extracts and standardizes patient phenotypes from medical reports while preserving data privacy. The pipeline integrates: (i) OCR-based digitization and an LLM-based translation module to produce an English version of the report; (ii) a GPT-oss–based phenotype extractor that uses structured, few-shot prompting to identify phenotypes relevant to the index patient; and (iii) a fuzzy standardization stage that combines lexical similarity with embedding-based semantic matching to map extracted phenotypes to Human Phenotype Ontology (HPO) concepts. This multi-stage design improves robustness to real-world documentation issues, including multilingual acronyms, variable report structure, spelling errors, and synonym variability, and it supports privacy compliance by keeping all computation on local infrastructure. We demonstrate the pipeline end-to-end on a representative clinical report, showing that it extracts patient-relevant phenotypes and produces HPO-aligned, machine-readable outputs suitable for downstream genomic analyses. This work provides a practical foundation for privacy-preserving, scalable phenotype curation in clinical genetics and supports future integration and evaluation on larger clinical datasets.
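As an illustration of stage (iii), the following minimal Python sketch shows one way a lexical score and a semantic score might be blended to map an extracted phenotype to an HPO concept. It is not the authors' implementation: the function names, the weight and threshold values, and the three-term HPO subset are all hypothetical, and a character-trigram cosine stands in for the embedding model, which the abstract does not specify.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Tiny illustrative subset of HPO; a real run would load the full ontology.
HPO_TERMS = {
    "HP:0001250": "Seizure",
    "HP:0001263": "Global developmental delay",
    "HP:0000252": "Microcephaly",
}


def lexical_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def toy_embedding(text: str, n: int = 3) -> Counter:
    """Stand-in for a learned embedding (assumption): character n-gram counts."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def map_to_hpo(phenotype: str, weight: float = 0.5, threshold: float = 0.6):
    """Return (hpo_id, score) for the best match, or (None, score) below threshold.

    `weight` and `threshold` are hypothetical tuning parameters.
    """
    query = toy_embedding(phenotype)
    best_id, best_score = None, 0.0
    for hpo_id, label in HPO_TERMS.items():
        score = (weight * lexical_similarity(phenotype, label)
                 + (1 - weight) * cosine(query, toy_embedding(label)))
        if score > best_score:
            best_id, best_score = hpo_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score


if __name__ == "__main__":
    for term in ["microcephally", "global developmental dely", "xyz"]:
        print(term, "->", map_to_hpo(term))
```

In this sketch a misspelled term such as "microcephally" still resolves to its HPO identifier, while an unrelated string falls below the threshold and is left unmapped. A trigram cosine only captures surface overlap, so genuine synonym handling would need a real embedding model in place of `toy_embedding`.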
2026
Bombarda, Andrea; Saletta, Martina; Bellini, Matteo; Goisis, Lucrezia; Iascone, Maria; Cazzaniga, Paolo; Savo, Domenico Fabio
File(s) attached to the record:
File: HEALTHINF_2026___Phenotypes_Extraction (5).pdf
Access: open access
Version: publisher's version
License: Creative Commons
File size: 200.58 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/10446/322945