Bridging clinical narratives and structured phenotypes with large language models and sentence transformers

doi: 10.1016/j.jgg.2026.02.009
Funds:

National High Level Hospital Clinical Research Funding (2025-PUMCH-C-003 to T.J.Z., 2022-PUMCH-C-033 to N.W., 2025-PUMCH-A-115 to L.Z.)

CAMS Innovation Fund for Medical Sciences (CIFMS, 2023-I2M-C&T-A-003 to T.J.Z., 2024-I2M-TS-002 to N.W., 2025-I2M-XHJC-002 to N.W., 2025-I2M-XHXX-020 to L.Z.)

Research Fund of Nanfang Hospital, Southern Medical University (2023B002 to Z.C.).

Natural Science Foundation of China (82572698 to T.J.Z., 82402889 to Z.C., 82402760 to L.Z.)

Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2019PT320025 to N.W.)

Guangdong Basic and Applied Basic Research Foundation (2023A1515110749 to Z.C.)

We thank Ran Fan for providing computation resources. This study is part of the Deciphering Disorders Involving Scoliosis and COmorbidities (DISCO) Study Group. This study was funded in part by the National Key Research and Development Program of China (2023YFC2507700 to T.J.Z. and N.W.)

  • Received Date: 2025-10-10
  • Revised Date: 2026-02-09
  • Accepted Date: 2026-02-10
  • Available Online: 2026-02-28
  • Structured phenotypes are important for Mendelian disorder diagnosis, gene–phenotype association studies, and standardized phenotypic data sharing. Although electronic health records contain abundant phenotypic information, much of it is unstructured. Early automated phenotyping methods are rule-based, limiting their ability to capture semantic variability and contextual information. Recent deep learning approaches, including BERT-based models and large language models (LLMs), improve semantic understanding but still face key limitations. BERT-based methods are constrained by limited context windows, requiring text chunking and aggregation for long clinical narratives, while LLMs that directly generate Human Phenotype Ontology (HPO) identifiers may produce non-existent identifiers. To address these challenges, we propose LEAP (LLM-Enhanced Automated Phenotyping), a two-stage framework that integrates an LLM for free-text phenotype extraction with a sentence-transformer model fine-tuned on a large-scale dataset of 5,330,557 instances for HPO mapping. This design handles long inputs while ensuring valid and deterministic HPO identifier outputs. On a real-world EHR test set, LEAP achieves relative improvements of 19.68%–412.68% in precision and 44.14%–298.77% in F1 score compared with existing tools, while maintaining robust performance on external benchmarks. LEAP can be integrated with gene prioritization tools to provide standardized phenotype inputs for downstream analyses. LEAP is available at phenogemini.org/extract.
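
The second stage described in the abstract, mapping extracted free-text phenotypes onto HPO terms by embedding similarity, can be illustrated with a minimal sketch. This is not the LEAP implementation: the character-trigram `embed` function is a deliberately simple stand-in for the fine-tuned sentence transformer, `HPO_TERMS` is a four-term toy subset of the ontology, and the function names are hypothetical. It assumes the first stage (LLM extraction) has already produced a short phenotype phrase. The point it shows is why retrieval over a fixed term list cannot return a non-existent identifier and always returns the same answer for the same input.

```python
import math
from collections import Counter

# Toy subset of the Human Phenotype Ontology (assumption: a real system
# would index every HPO term and its synonyms).
HPO_TERMS = {
    "HP:0002650": "Scoliosis",
    "HP:0001250": "Seizure",
    "HP:0004322": "Short stature",
    "HP:0000252": "Microcephaly",
}


def embed(text, n=3):
    """Stand-in embedding: counts of padded, lowercased character n-grams."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Term vectors are computed once, up front.
TERM_VECTORS = {hpo_id: embed(label) for hpo_id, label in HPO_TERMS.items()}


def map_to_hpo(phrase):
    """Return (hpo_id, similarity) for the closest term in the fixed index.

    The result is always a key of HPO_TERMS, so it is a valid identifier by
    construction. Ties are broken by sorted identifier order, which keeps
    the output deterministic.
    """
    query = embed(phrase)
    best_id = max(sorted(TERM_VECTORS), key=lambda i: cosine(query, TERM_VECTORS[i]))
    return best_id, cosine(query, TERM_VECTORS[best_id])


if __name__ == "__main__":
    for phrase in ["severe scoliosis", "short stature since childhood"]:
        print(phrase, "->", map_to_hpo(phrase))
```

In practice a similarity threshold would likely be needed so that phrases with no close match are dropped rather than forced onto the nearest term; the sketch omits that for brevity.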