Training Corpus
Training corpora are the large-scale datasets used to train large language models (LLMs). Current research focuses on improving their quality, diversity, and suitability for specific tasks, chiefly through data selection and curation methods, including techniques that leverage data influence scores and that address issues such as data contamination and bias. Careful corpus construction is crucial for building high-performing, reliable LLMs, with impact across fields ranging from scientific research to medical applications.
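The curation steps mentioned above can be illustrated with a minimal sketch. The function names and thresholds below are hypothetical, and the quality filter is a crude length heuristic standing in for the richer scoring (e.g., influence-based selection) used in practice; the decontamination check is a simple n-gram overlap test against a held-out benchmark.

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents by content hash (a common first curation step)."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=5):
    """Keep documents above a crude length threshold (a stand-in for real quality scores)."""
    return [d for d in docs if len(d.split()) >= min_words]

def ngrams(text, n=3):
    """Set of lowercase word n-grams in a document."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def decontaminate(docs, benchmark_ngrams, n=3):
    """Drop documents sharing any n-gram with benchmark text (simple contamination check)."""
    return [d for d in docs if not (ngrams(d, n) & benchmark_ngrams)]
```

A pipeline would typically chain these: `decontaminate(quality_filter(dedupe(raw_docs)), bench)`; production systems replace each stage with fuzzier, more scalable variants (MinHash deduplication, classifier-based quality scores).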
Papers
Large Language Models are Few-Shot Health Learners
Xin Liu, Daniel McDuff, Geza Kovacs, Isaac Galatzer-Levy, Jacob Sunshine, Jiening Zhan, Ming-Zher Poh, Shun Liao, Paolo Di Achille, Shwetak Patel

Injecting Knowledge into Biomedical Pre-trained Models via Polymorphism and Synonymous Substitution
Hongbo Zhang, Xiang Wan, Benyou Wang