Web-Crawled Data
Web-crawled data forms the foundation for training many large language models (LLMs), but its uneven quality and its potential to contain biases, copyrighted material, and personal information are significant concerns. Current research focuses on evaluating how web-crawled data quality affects LLM performance and on developing methods to mitigate data leakage and noise, including novel training techniques such as error norm truncation. These efforts are crucial for improving the reliability and trustworthiness of LLMs and for ensuring the responsible development of AI systems.
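To make the error norm truncation idea concrete, here is a minimal PyTorch sketch, assuming the technique works by masking out the loss on tokens whose predicted distribution lies too far (in L2 norm) from the one-hot target, treating those tokens as likely noise. The function name, the fixed threshold of 1.2, and the shapes are illustrative assumptions, not a reference implementation; the published method tunes the cutoff rather than hard-coding it.

```python
import torch
import torch.nn.functional as F

def error_norm_truncation_loss(logits, targets, threshold=1.2):
    """Cross-entropy loss that skips tokens with a high error norm.

    The error norm of a token is the L2 distance between the model's
    predicted distribution and the one-hot target. Tokens whose error
    norm exceeds `threshold` (an illustrative value; the real method
    tunes this cutoff) are excluded from the loss as presumed noise.
    """
    probs = F.softmax(logits, dim=-1)                     # (batch, seq, vocab)
    one_hot = F.one_hot(targets, probs.size(-1)).float()  # (batch, seq, vocab)
    error_norm = (probs - one_hot).norm(p=2, dim=-1)      # (batch, seq)

    # Per-token cross-entropy: cross_entropy expects (batch, vocab, seq).
    token_loss = F.cross_entropy(
        logits.transpose(1, 2), targets, reduction="none"
    )                                                     # (batch, seq)

    keep = (error_norm <= threshold).float()              # 0 = truncated token
    return (token_loss * keep).sum() / keep.sum().clamp(min=1.0)

# Toy usage: batch of 2 sequences, length 5, vocabulary of 100.
logits = torch.randn(2, 5, 100)
targets = torch.randint(0, 100, (2, 5))
loss = error_norm_truncation_loss(logits, targets)
```

Because both the predicted and target distributions sum to one, the error norm is bounded by sqrt(2), so any threshold below that actually filters tokens; masking rather than reweighting keeps the gradient of clean tokens unchanged.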