Paper ID: 2411.06175

Clustering Algorithms and RAG Enhancing Semi-Supervised Text Classification with Large LLMs

Shan Zhong, Jiahao Zeng, Yongxin Yu, Bohong Lin

This paper introduces an innovative semi-supervised learning approach for text classification, addressing the challenge of abundant data but limited labeled examples. Our methodology integrates few-shot learning with retrieval-augmented generation (RAG) and conventional statistical clustering, enabling effective learning from a minimal number of labeled instances while generating high-quality labeled data. To the best of our knowledge, we are the first to incorporate RAG alongside clustering in text data generation. Our experiments on the Reuters and Web of Science datasets demonstrate state-of-the-art performance, with few-shot augmented data alone producing results nearly equivalent to those achieved with fully labeled datasets. Notably, accuracies of 95.41% and 82.43% were achieved for complex text document classification tasks, where the number of categories can exceed 100.
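
As a rough illustration of how retrieval and clustering might fit together in a pipeline of this kind, the sketch below picks a few mutually dissimilar documents to hand-label and then retrieves the most similar labeled examples to assemble a few-shot prompt. It is not the paper's implementation: the function names (`embed`, `pick_representatives`, `retrieve`, `build_prompt`) are hypothetical, the bag-of-words vectors stand in for a real embedding model, and greedy farthest-point selection stands in for a proper clustering algorithm such as k-means. The LLM call itself is omitted.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def pick_representatives(docs, k):
    """Greedy farthest-point selection (assumption: a simple substitute for
    clustering). Returns indices of up to k mutually dissimilar documents,
    which would then be labeled by hand."""
    vecs = [embed(d) for d in docs]
    chosen = [0]
    while len(chosen) < min(k, len(docs)):
        candidates = (i for i in range(len(docs)) if i not in chosen)
        # Take the document whose closest already-chosen neighbor is least similar.
        best = min(
            candidates,
            key=lambda i: max(cosine(vecs[i], vecs[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def retrieve(query, labeled, k):
    """Return the k labeled (text, label) pairs most similar to the query."""
    q = embed(query)
    ranked = sorted(labeled, key=lambda ex: cosine(q, embed(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, labeled, k=2):
    """Assemble a few-shot prompt from retrieved examples; the model call is omitted."""
    shots = [f"Text: {t}\nLabel: {l}" for t, l in retrieve(query, labeled, k)]
    shots.append(f"Text: {query}\nLabel:")
    return "\n\n".join(shots)


docs = [
    "oil prices rise on supply fears",
    "crude oil output cut lifts prices",
    "wheat harvest hit by drought",
    "central bank raises interest rates",
]
seed_ids = pick_representatives(docs, 3)
# Labels here are invented for the demo, not taken from any dataset.
labeled = [(docs[i], lab) for i, lab in zip(seed_ids, ["energy", "grain", "finance"])]
print(build_prompt("crude oil prices fall", labeled, k=1))
```

In a fuller pipeline, the completed prompt would be sent to an LLM and its predicted label attached to the unlabeled document, yielding additional training data for a downstream classifier.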

Submitted: Nov 9, 2024
