Bridging clinical narratives and structured phenotypes with large language models and sentence transformers

Jihao Cai; Guozhuang Li; Yongxin Yang; Kexin Xu; Sen Zhao; Timothy Hospedales; Lina Zhao; Jianle Yang; Zhihong Wu; Terry Jianguo Zhang; Zefu Chen; Nan Wu

doi:10.1016/j.jgg.2026.02.009

Journal of Genetics and Genomics, 2026, Accepted Manuscript



Jihao Cai^a,b,c,
Guozhuang Li^a,b,c,
Yongxin Yang^d,
Kexin Xu^a,b,c,
Sen Zhao^a,b,c,
Timothy Hospedales^e,
Lina Zhao^a,b,c,
Jianle Yang^a,b,c,
Zhihong Wu^a,b,c,
Terry Jianguo Zhang^a,b,c,
Zefu Chen^a,b,c,f,
Nan Wu^a,b,c

a. Department of Orthopaedic Surgery, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China;
b. Beijing Key Laboratory of Big Data Innovation and Application for Skeletal Health Medical Care, Beijing, 100730, China;
c. Key Laboratory of Big Data for Spinal Deformities, Chinese Academy of Medical Sciences, Beijing, 100730, China;
d. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, E1 4NS, United Kingdom;
e. Institute of Perception, Action and Behaviour, School of Informatics, The University of Edinburgh, Edinburgh, EH3 9DR, United Kingdom;
f. Division of Spine Surgery, Department of Orthopaedics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China

Funds:

National High Level Hospital Clinical Research Funding (2025-PUMCH-C-003 to T.J.Z., 2022-PUMCH-C-033 to N.W., 2025-PUMCH-A-115 to L.Z.)

CAMS Innovation Fund for Medical Sciences (CIFMS, 2023-I2M-C&T-A-003 to T.J.Z., 2024-I2M-TS-002 to N.W., 2025-I2M-XHJC-002 to N.W., 2025-I2M-XHXX-020 to L.Z.)

Research Fund of Nanfang Hospital, Southern Medical University (2023B002 to Z.C.)

Natural Science Foundation of China (82572698 to T.J.Z., 82402889 to Z.C., 82402760 to L.Z.)

Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2019PT320025 to N.W.)

Guangdong Basic and Applied Basic Research Foundation (2023A1515110749 to Z.C.)

We thank Ran Fan for providing computational resources. This study is part of the Deciphering Disorders Involving Scoliosis and COmorbidities (DISCO) Study Group. This study was funded in part by the National Key Research and Development Program of China (2023YFC2507700 to T.J.Z. and N.W.).

Received Date: 2025-10-10
Accepted Date: 2026-02-10
Rev Recd Date: 2026-02-09

Available Online: 2026-02-28

Abstract

Structured phenotypes are important for Mendelian disorder diagnosis, gene–phenotype association studies, and standardized phenotypic data sharing. Although electronic health records contain abundant phenotypic information, much of it is unstructured. Early automated phenotyping methods are rule-based, limiting their ability to capture semantic variability and contextual information. Recent deep learning approaches, including BERT-based models and large language models (LLMs), improve semantic understanding but still face key limitations. BERT-based methods are constrained by limited context windows, requiring text chunking and aggregation for long clinical narratives, while LLMs that directly generate Human Phenotype Ontology (HPO) identifiers may produce non-existent identifiers. To address these challenges, we propose LEAP (LLM-Enhanced Automated Phenotyping), a two-stage framework that integrates an LLM for free-text phenotype extraction with a sentence-transformer model fine-tuned on a large-scale dataset of 5,330,557 instances for HPO mapping. This design handles long inputs while ensuring valid and deterministic HPO identifier outputs. On a real-world EHR test set, LEAP achieves relative improvements of 19.68%–412.68% in precision and 44.14%–298.77% in F1 score compared with existing tools, while maintaining robust performance on external benchmarks. LEAP can be integrated with gene prioritization tools to provide standardized phenotype inputs for downstream analyses. LEAP is available at phenogemini.org/extract.
Keywords:

- Large language model
- Sentence transformers
- Human phenotype ontology
- Electronic health records
- Mendelian disorders
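
The two-stage design described in the abstract can be illustrated with a toy sketch. This is not the LEAP implementation: the comma-splitting `extract_phrases` function is a hypothetical stand-in for the LLM extraction stage, the character-trigram `embed` function is a stand-in for the fine-tuned sentence transformer, and the three-term `HPO_TERMS` table is a tiny excerpt chosen for illustration. What the sketch does show is why retrieval over a fixed term table can only return identifiers that exist, and why the output is deterministic for a given input.

```python
import math
from collections import Counter

# Tiny illustrative excerpt of the ontology (assumption: a real system
# would load the full HPO term list, including synonyms).
HPO_TERMS = {
    "HP:0002650": "Scoliosis",
    "HP:0001250": "Seizure",
    "HP:0004322": "Short stature",
}


def extract_phrases(note: str) -> list[str]:
    """Stage 1 stand-in (hypothetical): an LLM would return free-text
    phenotype mentions; here the note is simply split on commas."""
    return [part.strip() for part in note.split(",") if part.strip()]


def embed(text: str) -> Counter:
    """Stage 2 stand-in (hypothetical): character-trigram counts used in
    place of a learned sentence embedding."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def map_to_hpo(phrase: str) -> tuple[str, float]:
    """Return the most similar term's identifier and its similarity.
    The result is always a key of HPO_TERMS, so it cannot be invented."""
    query = embed(phrase)
    scores = {hpo_id: cosine(query, embed(label))
              for hpo_id, label in HPO_TERMS.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


def annotate(note: str) -> list[tuple[str, str]]:
    """Run both stages: extract phrases, then map each to an identifier."""
    return [(phrase, map_to_hpo(phrase)[0]) for phrase in extract_phrases(note)]


if __name__ == "__main__":
    for phrase, hpo_id in annotate("scoliosis, seizures, short stature"):
        print(f"{phrase} -> {hpo_id} ({HPO_TERMS[hpo_id]})")
```

Because the second stage picks the nearest entry rather than generating text, a production system would likely also apply a similarity threshold to reject phrases that match no term well. That threshold is omitted here for brevity.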