Tabular Data
Tabular data is ubiquitous across application domains and poses distinct challenges for machine learning because it combines mixed data types (for example, numerical and categorical columns) within a fixed row-and-column structure. Current research aims to improve model performance through self-supervised learning (e.g., JEPA), generative models (e.g., GANs, VAEs, and diffusion models) for data augmentation and synthesis, and large language models (LLMs) for feature extraction and data generation. These efforts target the limitations of established methods such as gradient-boosted decision trees, with the goal of better accuracy, efficiency, and robustness in applications ranging from medical diagnosis to anomaly detection and scientific simulation.
Papers
Model Uncertainty based Active Learning on Tabular Data using Boosted Trees
Sharath M Shankaranarayana
Assessment of Differentially Private Synthetic Data for Utility and Fairness in End-to-End Machine Learning Pipelines for Tabular Data
Mayana Pereira, Meghana Kshirsagar, Sumit Mukherjee, Rahul Dodhia, Juan Lavista Ferres, Rafael de Sousa
NameGuess: Column Name Expansion for Tabular Data
Jiani Zhang, Zhengyuan Shen, Balasubramaniam Srinivasan, Shen Wang, Huzefa Rangwala, George Karypis
A Distributed Approach to Meteorological Predictions: Addressing Data Imbalance in Precipitation Prediction Models through Federated Learning and GANs
Elaheh Jafarigol, Theodore Trafalis
TabuLa: Harnessing Language Models for Tabular Data Synthesis
Zilong Zhao, Robert Birke, Lydia Chen