Code Datasets
Code datasets are collections of source code used to train and evaluate machine learning models on code-related tasks such as code generation, vulnerability detection, and program repair. Current research focuses on building larger, more diverse datasets that span multiple programming languages and include metadata such as code comments and usage context, often using transformer-based large language models (LLMs), such as GPT variants, for analysis and generation. These datasets underpin AI-assisted software development: they enable more robust and efficient tools for programmers and support automated vulnerability detection to improve software security.
Papers
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen
An Approach to Detect Abnormal Submissions for CodeWorkout Dataset
Alex Hicks, Yang Shi, Arun-Balajiee Lekshmi-Narayanan, Wei Yan, Samiha Marwan