Web Scale
Web-scale data analysis uses massive internet-derived datasets to train and improve machine learning models, particularly large language models (LLMs). Current research emphasizes efficient data handling, including dataset pruning and semantic deduplication, to reduce computational cost while maintaining or improving model performance. Applications include search ranking, speech recognition and translation, and specialized LLMs for domains such as finance. The same line of work also addresses dataset bias and data integrity, including resistance to poisoning attacks.
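As a rough illustration of the semantic deduplication idea mentioned above, the sketch below drops items whose embedding is nearly identical (by cosine similarity) to one already kept. It is a minimal toy under stated assumptions, not the method of any paper listed here: the function names, the 0.95 threshold, and the two-dimensional embeddings are all hypothetical, and the greedy all-pairs comparison would be far too slow at real web scale, where embeddings are usually grouped first so that only nearby items are compared.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_dedup(embeddings, threshold=0.95):
    """Return indices of items to keep.

    Greedy pass: an item is kept only if its similarity to every
    previously kept item is below `threshold` (a hypothetical value).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (made up): items 0/1 and items 2/3 are near-duplicates.
toy = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999]]
kept = semantic_dedup(toy)
print(kept)
```

On this toy input, the second and fourth items are discarded as near-duplicates of the first and third. In practice the embeddings would come from a pretrained encoder, and the threshold would be tuned against downstream model performance.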
Papers
Effective pruning of web-scale datasets based on complexity of concept clusters
Amro Abbas, Evgenia Rusak, Kushal Tirumala, Wieland Brendel, Kamalika Chaudhuri, Ari S. Morcos
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli, Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho