Web-Crawled Data
Web-crawled data, encompassing vast amounts of text and multimedia from the internet, is increasingly used to train and fine-tune large language models (LLMs) and other machine learning systems. Current research focuses on improving data quality by filtering out noise, aligning web data with higher-quality sources, and leveraging techniques such as contrastive learning and LLM-generated parallel data to enhance model performance. This work is significant because it addresses the challenges of data scarcity and cost in training powerful models, enabling advances in applications including natural language processing, machine translation, and multimodal learning.
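To make the noise-filtering step concrete, the sketch below shows a minimal heuristic quality filter for web-crawled text, of the kind commonly applied before training. The specific thresholds and heuristics (minimum length, symbol ratio, duplicated-line ratio) are illustrative assumptions, not taken from any particular paper discussed here.

```python
def quality_filter(doc: str,
                   min_words: int = 20,
                   max_symbol_ratio: float = 0.3,
                   max_dup_line_ratio: float = 0.5) -> bool:
    """Return True if a crawled document passes simple quality heuristics."""
    words = doc.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Ratio of non-alphanumeric, non-space characters (markup/encoding noise).
    symbols = sum(1 for ch in doc if not ch.isalnum() and not ch.isspace())
    if symbols / max(len(doc), 1) > max_symbol_ratio:
        return False
    # Fraction of duplicated lines (navigation menus, repeated footers).
    lines = [ln.strip() for ln in doc.splitlines() if ln.strip()]
    if lines:
        dup_ratio = 1 - len(set(lines)) / len(lines)
        if dup_ratio > max_dup_line_ratio:
            return False
    return True

docs = [
    "word " * 30,                        # clean, long enough
    "Home | About | Login",              # short navigation boilerplate
    "\n".join(["Subscribe now!"] * 10),  # highly repetitive
]
kept = [d for d in docs if quality_filter(d)]
print(len(kept))  # only the clean document survives
```

Production pipelines layer many more signals on top of heuristics like these, such as language identification, perplexity scores from a reference model, and near-duplicate detection across documents.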