Paper ID: 2309.15130
Understanding the Structure of QM7b and QM9 Quantum Mechanical Datasets Using Unsupervised Learning
Julio J. Valdés, Alain B. Tchagang
This paper explores the internal structure of two quantum mechanics datasets (QM7b, QM9), each comprising several thousand organic molecules described in terms of electronic properties. Understanding the structure and characteristics of this kind of data is important when predicting atomic composition from the properties in inverse molecular design. Intrinsic dimension analysis, clustering, and outlier detection methods were used in the study. These methods revealed that for both datasets the intrinsic dimensionality is several times smaller than the descriptive dimensions. The QM7b data is composed of well-defined clusters related to atomic composition. The QM9 data consists of an outer region predominantly composed of outliers, and an inner core region that concentrates clustered, inlier objects. A significant relationship exists between the number of atoms in a molecule and its outlier/inlier nature. Despite the structural differences, the predictability of variables of interest for inverse molecular design is high. This is exemplified with models estimating the number of atoms in the molecule both from the original properties and from lower-dimensional embedding spaces.
Submitted: Sep 25, 2023
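The abstract states that the intrinsic dimensionality of both datasets is several times smaller than the number of descriptive dimensions, but it does not name the estimator used. As an illustration of the general idea only, the following minimal sketch applies the TwoNN maximum-likelihood estimator (a swapped-in technique, not necessarily the authors' method) to synthetic data: a 2-dimensional plane embedded in a 5-dimensional space. The function names `two_nn_intrinsic_dimension` and `make_embedded_plane` and the synthetic data are hypothetical and are not taken from the paper.

```python
import math
import random


def two_nn_intrinsic_dimension(points):
    """TwoNN maximum-likelihood estimate of intrinsic dimension.

    For each point, take the ratio mu = r2 / r1 of the distances to its
    second and first nearest neighbours. Under the TwoNN model, mu follows
    a Pareto law with shape d, so the MLE is d = N / sum(log(mu)).
    Assumption: this estimator is illustrative, not the paper's method.
    """
    log_ratios = []
    for i, p in enumerate(points):
        r1 = r2 = math.inf
        for j, q in enumerate(points):
            if i == j:
                continue
            dist = math.dist(p, q)
            if dist < r1:
                r1, r2 = dist, r1  # new nearest; old nearest becomes second
            elif dist < r2:
                r2 = dist
        # Skip exact duplicates (r1 == 0), which make the ratio undefined.
        if r1 > 0 and math.isfinite(r2):
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def make_embedded_plane(n, seed=0):
    """Hypothetical toy data: a 2-D plane linearly embedded in 5-D."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        points.append((u, v, u + v, u - v, 2.0 * u))
    return points


if __name__ == "__main__":
    data = make_embedded_plane(400)
    estimate = two_nn_intrinsic_dimension(data)
    print(f"descriptive dimension: {len(data[0])}")
    print(f"estimated intrinsic dimension: {estimate:.2f}")
```

On this toy data the estimate should come out close to 2 even though each point has 5 coordinates, which mirrors the kind of gap between intrinsic and descriptive dimension that the abstract reports. The brute-force neighbour search is quadratic in the number of points, so a dataset the size of QM9 would need a spatial index or an approximate nearest-neighbour method.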