Synthetic Data Vault

Synthetic Data Vaults (SDVs) are systems that generate realistic synthetic datasets for privacy-preserving data sharing and model training. Current research focuses on improving the quality and diversity of synthetic data, particularly sequential data, using techniques such as conditional probabilistic autoregressive models and comparing them against generative adversarial networks. High-quality synthetic data matters because it enables the development and evaluation of machine learning models in sensitive domains such as software security, where real data is often restricted. It also facilitates code understanding and generation tasks through large language model training.

Papers

January 3, 2024

Using AI/ML to Find and Remediate Enterprise Secrets in Code & Document Sharing Platforms
Gregor Kerr, David Algorry, Senad Ibraimoski, Peter Maciver, Sean Moran
Machine Learning Artificial Intelligence Real World Code Programming Community Synthetic Data Vault

May 9, 2023

The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Dung Nguyen Manh, Nam Le Hai, Anh T. V. Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, Nghi D. Q. Bui
Large Language Model Code Generation Faithful Generation Multilingual Dataset Code Pair CodeSearchNet Corpus Synthetic Data Vault

July 28, 2022

Sequential Models in the Synthetic Data Vault
Kevin Zhang, Neha Patki, Kalyan Veeramachaneni
Synthetic Data Sequential Model Synthetic Data Vault
