Categorical Data

Categorical data represents qualitative information as categories or labels. It poses distinct challenges for data analysis and machine learning because categories carry no inherent numeric values and often no natural order. Current research focuses on encoding techniques that go beyond traditional one-hot encoding, including embeddings produced by large language models (LLMs) and novel bit vector representations that aim to capture meaningful relationships between categories. It also covers clustering algorithms (e.g., k-modes and its variants) and classification models (e.g., Naive Bayes, support vector machines, and tree-based methods) tailored to categorical data. These advances aim to improve accuracy and interpretability in fields where categorical variables are prevalent, ranging from industrial process modeling and healthcare to the social sciences and cybersecurity.
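
As a rough illustration of two ideas named above, the following sketch shows plain one-hot encoding and the simple matching dissimilarity on which k-modes clustering is commonly based. The function names and the sample values are hypothetical, chosen only for this example. It is a minimal sketch in standard-library Python, not a reference implementation of any method from the papers.

```python
def one_hot(values):
    """Encode each value as a 0/1 indicator vector, one slot per distinct category."""
    categories = sorted(set(values))  # fixed, reproducible column order
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical sample data, for illustration only.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

record_a = ("red", "small", "round")
record_b = ("red", "large", "round")
distance = matching_dissimilarity(record_a, record_b)
```

Under one-hot encoding every pair of distinct categories is equally far apart, which is one reason the encodings mentioned above try to capture relationships between categories instead.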

Papers