Paper ID: 2403.01471

Preserving correlations: A statistical method for generating synthetic data

Nicklas Jävergård, Rainey Lyons, Adrian Muntean, Jonas Forsman

We propose a method to generate statistically representative synthetic data. The main goal is to be able to maintain in the synthetic dataset the correlations of the features present in the original one, while offering a comfortable privacy level that can be eventually tailored on specific customer demands. We describe in detail our algorithm used both for the analysis of the original dataset and for the generation of the synthetic data points. The approach is tested using a large energy-related dataset. We obtain good results both qualitatively (e.g. via vizualizing correlation maps) and quantitatively (in terms of suitable $\ell^1$-type error norms used as evaluation metrics). The proposed methodology is general in the sense that it does not rely on the used test dataset. We expect it to be applicable in a much broader context than indicated here.

Submitted: Mar 3, 2024