Data Synthesis

Data synthesis focuses on generating artificial datasets that mimic the statistical properties and structure of real-world data, primarily to address data scarcity, privacy concerns, and the need for diverse training data in machine learning. Current research emphasizes the synthesis of complex data types, including relational databases and time series, often employing generative models such as diffusion models and large language models (LLMs) to achieve high fidelity and utility. These techniques are proving valuable in a range of applications, from improving the performance of LLMs and vision systems to enhancing medical image analysis and enabling privacy-preserving data sharing. The field is also developing evaluation metrics and methods to ensure the quality and reliability of synthetic data.

50 papers

Papers - Page 2

January 18, 2025

Learn-by-interact: A Data-Centric Framework for Self-Adaptive Agents in Realistic Environments
User Interaction Log Agent System Data Synthesis Real World Environment Large Language Model Data Centric

January 13, 2025

CDS: Data Synthesis Method Guided by Cognitive Diagnosis Theory
Data Synthesis Cognitive Diagnosis Model Failure

December 30, 2024

HunyuanProver: A Scalable Data Synthesis Framework and Guided Tree Search for Automated Theorem Proving
Search Query Tree Search Theorem Proving Data Synthesis Interactive Theorem Language Model

December 29, 2024

Generative Models for Financial Time Series Data: Enhancing Signal-to-Noise Ratio and Addressing Data Scarcity in A-Share Market
Signal to Noise Ratio Stock Market Length Sequence Data Synthesis Financial Time Series Generative Model Data Scarcity

December 22, 2024

Multi-Agent Sampling: Scaling Inference Compute for Data Synthesis with Tree Search-Based Agentic Collaboration
Single Agent Scalable Inference Inference Framework Multi Agent Collaboration Model Collaboration Data Synthesis

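December 19, 2024

DS²-ABSA: Dual-Stream Data Synthesis with Label Refinement for Few-Shot Aspect-Based Sentiment Analysis

How to Synthesize Text Data without Model Collapse?

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

December 12, 2024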

A Graph-Based Synthetic Data Pipeline for Scaling High-Quality Reasoning Instructions
Synthetic Data Generation Pipeline Human Instruction Reasoning Datasets Data Synthesis Synthetic Graph

December 9, 2024

AIDE: Task-Specific Fine Tuning with Attribute Guided Multi-Hop Data Expansion
Data Synthesis Prompt Expansion Large Language Model Training Data Task Specific

December 2, 2024

TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition
Text to Image Generation Scene Text Recognition Data Synthesis Text Transmission

November 28, 2024

Unleashing the Power of Data Synthesis in Visual Localization
Synthetic Data Real Power Data Synthesis Pose Regression 3D Gaussian Splat Visual Localization

November 23, 2024

Learn2Synth: Learning Optimal Data Synthesis using Hypergradients for Brain Image Segmentation
Data Synthesis Synthetic Image Segmentation Network Domain Randomization Synthetic Data

November 13, 2024

CorrSynth -- A Correlated Sampling Method for Diverse Dataset Generation from LLMs
Data Synthesis Shot Prompting Diverse Task Large Language Model Classifier Free Guidance

November 4, 2024

Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis
Generative Time Series Synthetic Data Generative Model State of the Art Generative Landscape Image Data Synthesis

October 27, 2024

Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation
Synthetic Data Generation Training Data Large Language Model Teacher Model Instruction Tuned Model High Quality Instruction Data Abstract Interpretation Data Synthesis

October 25, 2024

EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data
Multi Granularity Training Data User Interface Synthetic Data Extreme Edge Large Vision Language Model Manual Annotation Data Synthesis

October 24, 2024

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch
Medical LLM Reasoning Capability Strong Scaling Scratch Project Data Synthesis

October 22, 2024

Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration
Model Generated Data Synthesis Task Specific Large Language Model

October 16, 2024

A Survey on Data Synthesis and Augmentation for Large Language Models
Data Generation Soft Augmentation Synthetic Data Data Synthesis Timely Survey
