Bridging clinical narratives and structured phenotypes with large language models and sentence transformers

doi:10.1016/j.jgg.2026.02.009

Jihao Cai^a,b,c,
Guozhuang Li^a,b,c,
Yongxin Yang^d,
Kexin Xu^a,b,c,
Sen Zhao^a,b,c,
Timothy Hospedales^e,
Lina Zhao^a,b,c,
Jianle Yang^a,b,c,
Zhihong Wu^a,b,c,
Terry Jianguo Zhang^a,b,c,
Zefu Chen^a,b,c,f,
Nan Wu^a,b,c

a. Department of Orthopaedic Surgery, State Key Laboratory of Complex Severe and Rare Diseases, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100730, China;
b. Beijing Key Laboratory of Big Data Innovation and Application for Skeletal Health Medical Care, Beijing, 100730, China;
c. Key Laboratory of Big Data for Spinal Deformities, Chinese Academy of Medical Sciences, Beijing, 100730, China;
d. School of Electronic Engineering and Computer Science, Queen Mary University of London, London, E1 4NS, United Kingdom;
e. Institute of Perception, Action and Behaviour, School of Informatics, The University of Edinburgh, Edinburgh, EH3 9DR, United Kingdom;
f. Division of Spine Surgery, Department of Orthopaedics, Nanfang Hospital, Southern Medical University, Guangzhou, 510515, China

Funding:

National High Level Hospital Clinical Research Funding (2025-PUMCH-C-003 to T.J.Z., 2022-PUMCH-C-033 to N.W., 2025-PUMCH-A-115 to L.Z.)

CAMS Innovation Fund for Medical Sciences (CIFMS, 2023-I2M-C&T-A-003 to T.J.Z., 2024-I2M-TS-002 to N.W., 2025-I2M-XHJC-002 to N.W., 2025-I2M-XHXX-020 to L.Z.)

Research Fund of Nanfang Hospital, Southern Medical University (2023B002 to Z.C.)

Natural Science Foundation of China (82572698 to T.J.Z., 82402889 to Z.C., 82402760 to L.Z.)

Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2019PT320025 to N.W.)

Guangdong Basic and Applied Basic Research Foundation (2023A1515110749 to Z.C.)

We thank Ran Fan for providing computation resources. This study is part of the Deciphering Disorders Involving Scoliosis and COmorbidities (DISCO) Study Group. This study was funded in part by the National Key Research and Development Program of China (2023YFC2507700 to T.J.Z. and N.W.).

Corresponding authors:
Zefu Chen, E-mail: jeffchenmed@163.com

Nan Wu, E-mail: dr.wunan@pumch.cn

Publication history
- Received: 2025-10-10
- Revised: 2026-02-09
- Accepted: 2026-02-10
- Published online: 2026-02-28

Abstract: Structured phenotypes are important for Mendelian disorder diagnosis, gene–phenotype association studies, and standardized phenotypic data sharing. Although electronic health records contain abundant phenotypic information, much of it is unstructured. Early automated phenotyping methods are rule-based, limiting their ability to capture semantic variability and contextual information. Recent deep learning approaches, including BERT-based models and large language models (LLMs), improve semantic understanding but still face key limitations. BERT-based methods are constrained by limited context windows, requiring text chunking and aggregation for long clinical narratives, while LLMs that directly generate Human Phenotype Ontology (HPO) identifiers may produce non-existent identifiers. To address these challenges, we propose LEAP (LLM-Enhanced Automated Phenotyping), a two-stage framework that integrates an LLM for free-text phenotype extraction with a sentence-transformer model fine-tuned on a large-scale dataset of 5,330,557 instances for HPO mapping. This design handles long inputs while ensuring valid and deterministic HPO identifier outputs. On a real-world EHR test set, LEAP achieves relative improvements of 19.68%–412.68% in precision and 44.14%–298.77% in F1 score compared with existing tools, while maintaining robust performance on external benchmarks. LEAP can be integrated with gene prioritization tools to provide standardized phenotype inputs for downstream analyses. LEAP is available at phenogemini.org/extract.
Keywords: Large language model; Sentence transformers; Human phenotype ontology; Electronic health records; Mendelian disorders
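
The second stage described in the abstract (mapping free-text phenotype phrases to HPO terms by embedding similarity) can be illustrated with a minimal sketch. This is not the LEAP implementation: the character-trigram encoder below is only a dependency-free stand-in for the fine-tuned sentence transformer, the first-stage LLM extraction is assumed to have already produced the phrases, and `HPO_TERMS` (a four-term subset), `map_to_hpo`, and `min_score` are hypothetical names and values chosen for illustration. The point it shows is that nearest-neighbour lookup over a fixed term list can only return identifiers that exist in that list, and returns the same one every time.

```python
import math
from collections import Counter

# Tiny illustrative subset of HPO; a real system would load the full ontology.
HPO_TERMS = {
    "HP:0002650": "Scoliosis",
    "HP:0001250": "Seizure",
    "HP:0004322": "Short stature",
    "HP:0000252": "Microcephaly",
}


def embed(text: str) -> Counter:
    """Stand-in encoder: character-trigram counts (NOT a sentence transformer)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Term embeddings are computed once, up front.
TERM_VECTORS = {hpo_id: embed(label) for hpo_id, label in HPO_TERMS.items()}


def map_to_hpo(phrase: str, min_score: float = 0.3):
    """Return (hpo_id, score) for the closest term, or None below min_score.

    min_score is an assumed cut-off for this toy encoder, not a published value.
    """
    query = embed(phrase)
    best_id, best_score = max(
        ((hpo_id, cosine(query, vec)) for hpo_id, vec in TERM_VECTORS.items()),
        key=lambda pair: (pair[1], pair[0]),  # ties broken by ID for determinism
    )
    return (best_id, best_score) if best_score >= min_score else None


# Phrases as a first-stage extractor might return them (hard-coded here).
extracted = ["scoliosis of the thoracic spine", "recurrent seizures", "short stature"]
mapped = [map_to_hpo(phrase) for phrase in extracted]
```

With this toy encoder the three phrases resolve to HP:0002650, HP:0001250, and HP:0004322, while a phrase that shares no trigrams with any term label yields `None` rather than an invented identifier. Swapping `embed` for a real sentence-embedding model would leave the lookup logic unchanged.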
