Data Science Code Generation

Data science code generation is the automatic creation of executable code from natural language descriptions of data analysis tasks, with the aim of accelerating the data science workflow. Current research emphasizes improving the accuracy and reliability of code generated by large language models (LLMs), in particular reducing hallucinated or incorrect code through techniques such as iterative self-correction and instruction fine-tuning guided by input-output specifications. The field matters because automating tedious coding tasks could substantially increase data scientists' productivity and enable faster exploration of data.
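
As a rough illustration of iterative self-correction against an input-output specification, the sketch below runs a candidate function on one example, and on failure passes the error message back to the generator for another attempt. It is a minimal sketch under stated assumptions, not the method of any particular paper: `stub_generator` is a hypothetical stand-in for an LLM call, and the names `run_candidate`, `self_correct`, and `column_mean` are invented for this example.

```python
from typing import Any, Callable, Optional, Tuple


def run_candidate(code: str, func_name: str,
                  example_input: Any, expected_output: Any) -> Optional[str]:
    """Run candidate code on one input-output example.

    Returns None if the example passes, otherwise an error message
    that can be fed back to the generator.
    """
    namespace: dict = {}
    try:
        # Assumption: the code is trusted. Real systems should sandbox this.
        exec(code, namespace)
        result = namespace[func_name](example_input)
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    if result != expected_output:
        return f"expected {expected_output!r}, got {result!r}"
    return None


def self_correct(generate: Callable[[str, Optional[str]], str], task: str,
                 func_name: str, example_input: Any, expected_output: Any,
                 max_rounds: int = 3) -> Tuple[str, int]:
    """Ask the generator for code until the example passes or rounds run out."""
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        code = generate(task, feedback)
        feedback = run_candidate(code, func_name, example_input, expected_output)
        if feedback is None:
            return code, round_no
    raise RuntimeError(f"no passing candidate after {max_rounds} rounds: {feedback}")


def stub_generator(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: buggy first draft, fixed revision."""
    if feedback is None:
        # Deliberate off-by-one bug in the denominator.
        return "def column_mean(values):\n    return sum(values) / (len(values) - 1)\n"
    return "def column_mean(values):\n    return sum(values) / len(values)\n"


code, rounds = self_correct(
    stub_generator,
    task="Compute the mean of a list of numbers",
    func_name="column_mean",
    example_input=[2, 4, 6],
    expected_output=4.0,
)
print(f"accepted after {rounds} round(s):\n{code}")
```

Here the first draft fails the example, so its error message becomes the feedback for the second attempt, which passes. A realistic setup would replace the stub with a model call that includes the task and feedback in its prompt, check several examples rather than one, and execute candidates in an isolated environment.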

Papers