Pre-Training Corpus
Pre-training corpora are the massive text datasets on which large language models (LLMs) are initially trained, and their composition strongly shapes model capabilities. Current research focuses on improving corpus quality through automated methods such as neural web scraping and model-driven data refinement, which aim to reduce bias, harmful content, and data contamination while keeping curation efficient. These efforts are central to building more reliable and robust LLMs, addressing concerns about data quality and ethics, and ultimately improving the performance and trustworthiness of downstream applications.
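As a rough illustration of the kind of data-quality tooling this line of work relies on, the sketch below shows a simple word n-gram overlap check for flagging benchmark contamination in a pre-training corpus. The function names, the 13-gram window, and whitespace tokenization are illustrative assumptions here, not the method of any paper listed below.

```python
from typing import Iterable, Iterator, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """Return the set of lowercased word n-grams in a document (assumed whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(document: str, benchmark_ngrams: Set[Tuple[str, ...]], n: int = 13) -> bool:
    """Flag a pre-training document that shares any n-gram with held-out benchmark text."""
    return bool(ngrams(document, n) & benchmark_ngrams)

def filter_corpus(corpus: Iterable[str], benchmark_texts: Iterable[str], n: int = 13) -> Iterator[str]:
    """Build the benchmark n-gram index once, then stream through the corpus dropping overlapping documents."""
    benchmark_ngrams: Set[Tuple[str, ...]] = set()
    for text in benchmark_texts:
        benchmark_ngrams |= ngrams(text, n)
    for doc in corpus:
        if not is_contaminated(doc, benchmark_ngrams, n):
            yield doc
```

In practice, pipelines of this kind typically combine such exact-match decontamination with fuzzy deduplication and learned quality or toxicity classifiers; the overlap check above is only one small component.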
Papers
PELMS: Pre-training for Effective Low-Shot Multi-Document Summarization
Joseph J. Peper, Wenzhao Qiu, Lu Wang
Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources
Yipei Xu, Dakuan Lu, Jiaqing Liang, Xintao Wang, Yipeng Geng, Yingsi Xin, Hengkui Wu, Ken Chen, Ruiji Zhang, Yanghua Xiao
Leveraging Code to Improve In-context Learning for Semantic Parsing
Ben Bogin, Shivanshu Gupta, Peter Clark, Ashish Sabharwal