Biomedical Corpus
Biomedical corpora are large collections of text and data from the biomedical literature used to train and evaluate natural language processing (NLP) models. Current research focuses on developing and improving these models, particularly large language models (LLMs) and transformer-based architectures, for tasks such as entity recognition, relation extraction, question answering, and text generation within the biomedical domain. This work aims to improve the accuracy and efficiency of information extraction from biomedical texts, ultimately facilitating advances in drug discovery, disease understanding, and personalized medicine. Challenges remain in addressing biases, ensuring factual accuracy, and handling multilingual and low-resource language data within these corpora. A brief illustration of the entity recognition task appears below.
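As a concrete illustration of one task mentioned above, the following minimal Python sketch runs biomedical named entity recognition with a transformer model via the Hugging Face transformers pipeline. The model identifier is only an illustrative assumption; any token-classification checkpoint fine-tuned on a biomedical corpus could be substituted.

# Minimal sketch: biomedical named entity recognition with a transformer model.
# The checkpoint name below is an assumption for illustration, not a prescribed choice.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",  # assumed example checkpoint
    aggregation_strategy="simple",      # merge sub-word tokens into whole entity spans
)

text = "Metformin is commonly prescribed for type 2 diabetes mellitus."
for entity in ner(text):
    # Each result carries the entity text, its predicted label, and a confidence score.
    print(f"{entity['word']:<30} {entity['entity_group']:<20} {entity['score']:.2f}")

Such pipelines are typically evaluated against annotated biomedical corpora, which supply the gold-standard entity spans needed to measure precision and recall.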
Papers
Taiyi: A Bilingual Fine-Tuned Large Language Model for Diverse Biomedical Tasks
Ling Luo, Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin
KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model
Lei Geng, Xu Yan, Ziqiang Cao, Juntao Li, Wenjie Li, Sujian Li, Xinjie Zhou, Yang Yang, Jun Zhang