Web Corpus
Web corpora, massive collections of text and data harvested from the internet, are crucial for training large language models (LLMs). Current research focuses on improving corpus quality by addressing noise, bias, redundancy, and the inclusion of sensitive information, often using techniques such as ensemble methods and data deduplication or diversification to enhance model performance. This work is vital for developing more accurate, reliable, and ethically responsible LLMs, with applications ranging from standard natural language processing tasks to cross-lingual model adaptation. Building larger, higher-quality, and more diverse corpora, along with improved data processing techniques, remains a key focus.
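Of these techniques, deduplication is the most mechanical. As a minimal sketch, not drawn from any of the papers below, the snippet removes exact duplicates by hashing normalized text; the function names (normalize, deduplicate) are illustrative, and production pipelines typically add fuzzy matching such as MinHash-LSH on top of this exact pass.

    import hashlib

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variants hash identically.
        return " ".join(text.lower().split())

    def deduplicate(docs):
        # Keep the first occurrence of each distinct (normalized) document.
        seen = set()
        unique = []
        for doc in docs:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(doc)
        return unique

    corpus = [
        "Large language models need clean training data.",
        "Large  language  models need clean training data.",  # whitespace variant
        "Web corpora are noisy and redundant.",
    ]
    print(deduplicate(corpus))  # the whitespace variant is dropped

Hashing the normalized text rather than storing it keeps memory proportional to the number of distinct documents, which matters at web-corpus scale; catching near-duplicates (boilerplate, mirrored pages) requires the fuzzier methods noted above.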
Papers
Continual Pre-Training for Cross-Lingual LLM Adaptation: Enhancing Japanese Language Capabilities
Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Hiroki Iida, Masanari Ohi, Kakeru Hattori, Hirai Shota, Sakae Mizuki, Rio Yokota, Naoaki Okazaki
Building a Large Japanese Web Corpus for Large Language Models
Naoaki Okazaki, Kakeru Hattori, Hirai Shota, Hiroki Iida, Masanari Ohi, Kazuki Fujii, Taishi Nakamura, Mengsay Loem, Rio Yokota, Sakae Mizuki