Web Crawled

Web-crawled data forms the foundation for training many large language models (LLMs), but its variable quality and the potential for it to contain biases, copyrighted material, and personal information raise significant concerns. Current research focuses on evaluating how web-crawled data quality affects LLM performance and on developing methods to mitigate the risks of data leakage and noise, including novel training techniques such as error norm truncation. These efforts are crucial for improving the reliability and trustworthiness of LLMs and for ensuring the responsible development of AI systems.
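
The core idea behind error norm truncation is to treat the distance between the model's predicted token distribution and the one-hot target as a per-token noise estimate, and to drop high-error tokens from the training loss. The sketch below is a minimal, illustrative PyTorch implementation of that idea; the function name, threshold value, and masking strategy are assumptions for demonstration, not the exact recipe from any particular paper.

```python
import torch
import torch.nn.functional as F

def error_norm_truncated_loss(logits, targets, threshold=1.0, ignore_index=-100):
    """Cross-entropy loss that skips tokens whose error norm exceeds a threshold.

    A minimal sketch of the error-norm-truncation idea: the L2 distance between
    the predicted distribution and the one-hot target serves as a per-token
    noise estimate, and high-error tokens are masked out of the loss.

    logits:  (batch, seq_len, vocab)
    targets: (batch, seq_len) token ids, with ignore_index marking padding
    """
    vocab = logits.size(-1)
    probs = logits.softmax(dim=-1)                                   # (B, T, V)
    one_hot = F.one_hot(targets.clamp(min=0), vocab).to(probs.dtype)
    error_norm = (probs - one_hot).norm(p=2, dim=-1)                 # (B, T)

    # Keep tokens whose error norm is below the (assumed) truncation threshold
    keep = (error_norm < threshold) & (targets != ignore_index)

    token_loss = F.cross_entropy(
        logits.reshape(-1, vocab), targets.reshape(-1),
        ignore_index=ignore_index, reduction="none",
    ).reshape_as(targets)

    # Average the loss over the retained (presumed clean) tokens only
    return (token_loss * keep).sum() / keep.sum().clamp(min=1)
```

In practice, a scheme like this would replace the standard cross-entropy loss in the training loop, with the threshold tuned (or annealed) so that only a small fraction of likely-noisy tokens is discarded.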

Papers