Carolina Corpus
The Carolina Corpus is a large, openly accessible collection of contemporary Brazilian Portuguese text, designed to advance linguistic research and improve natural language processing (NLP) models for this under-resourced language. Current research uses the corpus to train and evaluate large language models (LLMs), often based on Transformer architectures such as RoBERTa, and compares performance across model sizes and training-data curation strategies. The resource matters because it helps narrow the gap between Brazilian Portuguese and better-resourced languages in NLP, enabling more accurate tools for Brazilian Portuguese and supporting broader linguistic investigation.
Papers
February 29, 2024
March 28, 2023
January 29, 2023
November 29, 2022