Bridging clinical narratives and structured phenotypes with large language models and sentence transformers

Jihao Cai, Guozhuang Li, Yongxin Yang, Kexin Xu, Sen Zhao, Timothy Hospedales, Lina Zhao, Jianle Yang, Zhihong Wu, Terry Jianguo Zhang, Zefu Chen, Nan Wu

Citation: Jihao Cai, Guozhuang Li, Yongxin Yang, Kexin Xu, Sen Zhao, Timothy Hospedales, Lina Zhao, Jianle Yang, Zhihong Wu, Terry Jianguo Zhang, Zefu Chen, Nan Wu. Bridging clinical narratives and structured phenotypes with large language models and sentence transformers[J]. Journal of Genetics and Genomics. doi: 10.1016/j.jgg.2026.02.009

doi: 10.1016/j.jgg.2026.02.009
Funding:

National High Level Hospital Clinical Research Funding (2025-PUMCH-C-003 to T.J.Z., 2022-PUMCH-C-033 to N.W., 2025-PUMCH-A-115 to L.Z.)

CAMS Innovation Fund for Medical Sciences (CIFMS, 2023-I2M-C&T-A-003 to T.J.Z., 2024-I2M-TS-002 to N.W., 2025-I2M-XHJC-002 to N.W., 2025-I2M-XHXX-020 to L.Z.)

Research Fund of Nanfang Hospital, Southern Medical University (2023B002 to Z.C.)

Natural Science Foundation of China (82572698 to T.J.Z., 82402889 to Z.C., 82402760 to L.Z.)

Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2019PT320025 to N.W.)

Guangdong Basic and Applied Basic Research Foundation (2023A1515110749 to Z.C.)

We thank Ran Fan for providing computation resources. This study is part of the Deciphering Disorders Involving Scoliosis and COmorbidities (DISCO) Study Group. This study was funded in part by the National Key Research and Development Program of China (2023YFC2507700 to T.J.Z. and N.W.).

Details
    Corresponding authors:

    Zefu Chen, E-mail: jeffchenmed@163.com

    Nan Wu, E-mail: dr.wunan@pumch.cn

  • Abstract: Structured phenotypes are important for Mendelian disorder diagnosis, gene–phenotype association studies, and standardized phenotypic data sharing. Although electronic health records (EHRs) contain abundant phenotypic information, much of it is unstructured. Early automated phenotyping methods are rule-based, which limits their ability to capture semantic variability and contextual information. Recent deep learning approaches, including BERT-based models and large language models (LLMs), improve semantic understanding but still face key limitations. BERT-based methods are constrained by limited context windows and therefore require text chunking and aggregation for long clinical narratives, while LLMs that directly generate Human Phenotype Ontology (HPO) identifiers may produce non-existent identifiers. To address these challenges, we propose LEAP (LLM-Enhanced Automated Phenotyping), a two-stage framework that integrates an LLM for free-text phenotype extraction with a sentence-transformer model fine-tuned on a large-scale dataset of 5,330,557 instances for HPO mapping. This design handles long inputs while ensuring valid and deterministic HPO identifier outputs. On a real-world EHR test set, LEAP achieves relative improvements of 19.68%–412.68% in precision and 44.14%–298.77% in F1 score compared with existing tools, while maintaining robust performance on external benchmarks. LEAP can be integrated with gene prioritization tools to provide standardized phenotype inputs for downstream analyses. LEAP is available at phenogemini.org/extract.
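The two-stage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `extract_phenotypes` is a hypothetical keyword lookup standing in for the LLM extraction stage, `embed` uses character-trigram counts as a stand-in for the fine-tuned sentence-transformer, and `HPO_TERMS` is a four-term subset of HPO chosen for illustration. The point it shows is structural: because stage 2 only ranks entries of a fixed term index, every returned identifier exists in that index, and the same input always yields the same output.

```python
import math
from collections import Counter

# Illustrative four-term subset of HPO; a real index would cover the full ontology.
HPO_TERMS = {
    "HP:0002650": "Scoliosis",
    "HP:0001250": "Seizure",
    "HP:0004322": "Short stature",
    "HP:0001263": "Global developmental delay",
}


def extract_phenotypes(note: str) -> list[str]:
    """Stage 1 stand-in (assumption): a keyword lookup in place of an LLM
    that would return free-text phenotype mentions from the whole note."""
    cues = ["scoliosis", "seizures", "short stature", "developmental delay"]
    lowered = note.lower()
    return [cue for cue in cues if cue in lowered]


def embed(text: str) -> Counter:
    """Stage 2 stand-in (assumption): character-trigram counts in place of
    a sentence-transformer embedding."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def map_to_hpo(mention: str) -> tuple[str, float]:
    """Return the closest HPO identifier in the index and its similarity."""
    query = embed(mention)
    scores = {hpo_id: cosine(query, embed(label)) for hpo_id, label in HPO_TERMS.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


def phenotype_note(note: str, min_score: float = 0.3) -> list[str]:
    """Run both stages; min_score is an arbitrary illustrative threshold."""
    results: list[str] = []
    for mention in extract_phenotypes(note):
        hpo_id, score = map_to_hpo(mention)
        if score >= min_score and hpo_id not in results:
            results.append(hpo_id)
    return results


if __name__ == "__main__":
    note = "12-year-old girl with progressive scoliosis and short stature; history of seizures."
    print(phenotype_note(note))
```

Separating extraction from mapping in this way means the generative step never has to emit an identifier itself; it only produces text, and the lookup step decides which ontology entry that text is closest to.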
Publication history
  • Received:  2025-10-10
  • Revised:  2026-02-09
  • Accepted:  2026-02-10
  • Published online:  2026-02-28
