Text Datasets
Text datasets are crucial for training and evaluating machine learning models, particularly in natural language processing. Current research focuses on improving dataset quality through data augmentation, incentives for diversity, and more sophisticated annotation techniques, often using large language models (LLMs) to generate, clean, and analyze data. These efforts aim to reduce bias, imbalance, and lack of diversity in existing datasets, so that models trained on them are more robust, more reliable, and applicable across a wider range of domains. Developing and refining text datasets is therefore essential both for advancing the field and for deploying AI systems responsibly.
Papers
Machine Learning Classification of Peaceful Countries: A Comparative Analysis and Dataset Optimization
K. Lian (1), L. S. Liebovitch (1), M. Wild (1), H. West (1), P. T. Coleman (1), F. Chen (2), E. Kimani (2), K. Sieck (2) ((1) Columbia University, (2) Toyota Research Institute)
ERASMO: Leveraging Large Language Models for Enhanced Clustering Segmentation
Fillipe dos Santos Silva, Gabriel Kenzo Kakimoto, Julio Cesar dos Reis, Marcelo S. Reis