Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions
Under review at Journal of Biomedical Semantics, 2024
Recommended citation: Disease Entity Recognition and Normalization is Improved with Large Language Model Derived Synthetic Normalized Mentions. K Sasse, S Vadlakonda, R Kennedy, J Osborne - arXiv preprint arXiv:2410.07951, 2024. https://arxiv.org/pdf/2410.07951
Machine learning methods for Disease Entity Recognition (DER) and Disease Entity Normalization (DEN) struggle with infrequently occurring concepts because training corpora contain few mentions of them and Knowledge Graph descriptions are sparse. Synthetic training data generated by a fine-tuned LLaMa-2 13B Chat LLM significantly improved DEN performance, particularly on Out of Distribution (OOD) data, with accuracy gains of 3-9 points overall and 20-55 points OOD; DER showed only modest improvements. The study highlights the potential of LLM-generated synthetic mentions for enhancing DEN while revealing limited benefits for DER. All software and datasets are publicly available.