Large Corpus
Large corpora are massive collections of text and other data used to train language models and other AI systems. Current research focuses on training efficiently and effectively on diverse, heterogeneous corpora, using techniques such as decoupled embeddings and data augmentation to mitigate problems such as the "curse of multilinguality" and domain-specific biases. This work supports more robust, accurate, and versatile systems across languages and domains, with applications ranging from question answering to knowledge graph construction.
Papers
August 28, 2023
August 25, 2023
August 15, 2023
August 10, 2023
August 7, 2023
August 4, 2023
July 31, 2023
July 29, 2023
July 27, 2023
July 17, 2023
July 14, 2023
July 13, 2023
July 12, 2023
July 11, 2023
July 6, 2023
July 3, 2023
June 27, 2023
June 24, 2023