Categorical Feature

Categorical features, representing discrete data values like colors or product categories, are ubiquitous in real-world datasets but pose unique challenges for machine learning models that typically require numerical input. Current research focuses on effective encoding techniques, comparing methods like one-hot encoding, target encoding, and binary encoding, often within the context of neural networks (including Transformers) and gradient boosting decision trees. These efforts aim to improve model accuracy and fairness while addressing issues like high cardinality and sparsity, ultimately impacting the performance and interpretability of various applications, from spam detection to actuarial modeling and cybersecurity.

Papers