Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions

Under review at Journal of Biomedical Semantics, 2024

Recommended citation: Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions. K Sasse, S Vadlakonda, R Kennedy, J Osborne - arXiv preprint arXiv:2410.07951, 2024. https://arxiv.org/pdf/2410.07951


Machine learning methods for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN) struggle with infrequently occurring concepts, because such concepts have few mentions in training corpora and sparse Knowledge Graph descriptions. Synthetic training data generated by a fine-tuned LLaMa-2 13B Chat LLM significantly improved DEN performance, particularly on Out of Distribution (OOD) data: accuracy rose by 3-9 points overall and by 20-55 points OOD, while DER showed only modest improvements. The study highlights the potential of LLM-generated synthetic mentions for enhancing DEN but finds limited benefit for DER. All software and datasets are publicly available.
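To make the distinction between overall and OOD accuracy concrete, the following is a minimal sketch of how the two figures might be computed for a normalizer. It assumes that an OOD test mention is one whose gold concept never appears in the training corpus; that definition, the `Example` record, the function names, and the placeholder identifiers (`C1`, `C2`, ...) are all illustrative assumptions, not taken from the paper or its released software.

```python
from typing import Dict, Iterable, NamedTuple, Set


class Example(NamedTuple):
    """One normalized mention (hypothetical record layout)."""
    mention: str       # surface text of the disease mention
    gold_id: str       # gold concept identifier
    predicted_id: str  # identifier returned by the normalizer


def accuracy(examples: Iterable[Example]) -> float:
    """Fraction of mentions whose predicted concept matches the gold concept."""
    examples = list(examples)
    if not examples:
        return float("nan")  # undefined when there is nothing to score
    correct = sum(e.gold_id == e.predicted_id for e in examples)
    return correct / len(examples)


def split_accuracy(test: Iterable[Example],
                   train_concept_ids: Set[str]) -> Dict[str, float]:
    """Report accuracy on all test mentions and on the OOD subset.

    Assumption: a mention is OOD when its gold concept is absent
    from the set of concepts seen during training.
    """
    test = list(test)
    ood = [e for e in test if e.gold_id not in train_concept_ids]
    return {"overall": accuracy(test), "ood": accuracy(ood)}


# Toy data with placeholder identifiers, for illustration only.
train_ids = {"C1", "C2"}
test_set = [
    Example("mention a", "C1", "C1"),  # in distribution, correct
    Example("mention b", "C2", "C2"),  # in distribution, correct
    Example("mention c", "C3", "C3"),  # OOD, correct
    Example("mention d", "C4", "C1"),  # OOD, wrong
]
scores = split_accuracy(test_set, train_ids)
print(scores)
```

On this toy set the OOD subset scores lower than the full test set, which is the kind of gap that synthetic mentions of rare concepts are meant to narrow.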


Kuleen Sasse (kulin sæs)
