A multimodal EHR-based phenotyping framework integrating consensus clustering and transformer-based clinical NLP: application to autoimmune gastritis

Pala, Daniele; Sirtoli, Chiara
2026-05-26

Abstract

Objective: To develop and evaluate a multimodal electronic health record (EHR)-based phenotyping pipeline integrating structured and unstructured clinical data to identify disease subgroups and characterize longitudinal trajectories in a real-world setting.

Materials and methods: We conducted a retrospective multicenter study including 1,598 patients with autoimmune gastritis. Structured demographic and clinical variables were combined with longitudinal endoscopic and histological data extracted from routine care. A consensus clustering strategy integrating partitioning (K-medoids) and hierarchical approaches was applied to identify robust patient subgroups. Free-text endoscopic reports were processed using a fine-tuned transformer-based natural language processing (NLP) model to automatically extract structured phenotypic features. To address irregular follow-up intervals, time-normalized progression indices were developed to capture both severity and temporal dynamics of disease evolution.

Results: After preprocessing, 607 patients were included in the analysis. The consensus clustering approach identified three clinically distinct subgroups. The NLP model demonstrated high performance in extracting endoscopic features (accuracy 90.2%, balanced accuracy 89.3%). Application of the proposed progression indices revealed significant differences in longitudinal patterns of mucosal damage across clusters (p < 0.01).

Conclusion: This study demonstrates the feasibility of integrating clustering techniques and transformer-based clinical NLP within a unified EHR phenotyping pipeline. The proposed approach supports scalable secondary use of structured and narrative clinical data for subgroup discovery and trajectory modeling in chronic diseases.
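The abstract does not define the time-normalized progression indices, so the following is only an illustrative sketch of the general idea: dividing a change in severity by the elapsed time so that patients with irregular follow-up intervals become comparable. The function names, the ordinal severity score, and the two index definitions (annualized change and time-averaged severity) are assumptions made for illustration, not the authors' formulas.

```python
from datetime import date

# Hypothetical input: one patient's visits as (visit_date, severity_score) pairs.
# The severity score is an assumed ordinal grade, not the paper's actual variable.


def progression_rate(visits):
    """Change in severity per year between the first and last visit.

    Returns None when fewer than two visits exist or no time has elapsed.
    """
    ordered = sorted(visits, key=lambda v: v[0])
    if len(ordered) < 2:
        return None
    (t0, s0), (t1, s1) = ordered[0], ordered[-1]
    years = (t1 - t0).days / 365.25
    if years <= 0:
        return None
    return (s1 - s0) / years


def time_weighted_severity(visits):
    """Time-averaged severity over follow-up.

    Integrates the score with the trapezoidal rule between consecutive
    visits and divides by the total follow-up length, so long gaps
    between visits are weighted by their duration.
    """
    ordered = sorted(visits, key=lambda v: v[0])
    if len(ordered) < 2:
        return None
    total_days = (ordered[-1][0] - ordered[0][0]).days
    if total_days <= 0:
        return None
    area = 0.0
    for (ta, sa), (tb, sb) in zip(ordered, ordered[1:]):
        area += (sa + sb) / 2 * (tb - ta).days
    return area / total_days


# Example patient with irregularly spaced visits (made-up values).
example_visits = [
    (date(2020, 1, 1), 1),
    (date(2020, 7, 1), 1),
    (date(2022, 1, 1), 3),
]
rate = progression_rate(example_visits)
mean_severity = time_weighted_severity(example_visits)
```

Under these assumptions, the first index captures temporal dynamics (how fast the score changes) and the second captures overall severity burden; per-cluster distributions of such values could then be compared with a standard statistical test.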
article
26 May 2026
Pala, Daniele; Lenti, Marco Vincenzo; Santacroce, Giovanni; Bergomi, Laura; Curgu, Chiara; Buonocore, Tommaso; Sirtoli, Chiara; Parimbelli, Enea; Lanz...
(2026). A multimodal EHR-based phenotyping framework integrating consensus clustering and transformer-based clinical NLP: application to autoimmune gastritis [journal article]. In International Journal of Medical Informatics. Retrieved from https://hdl.handle.net/10446/329265
Attached file:
1-s2.0-S1386505626002510-main.pdf
Access: open access
Version: publisher's version
License: Creative Commons
File size: 2.97 MB
Format: Adobe PDF
Use this identifier to cite or link to this document: https://hdl.handle.net/10446/329265