Paper ID: 2211.09263

Informative Initialization and Kernel Selection Improves t-SNE for Biological Sequences

Prakash Chourasia, Sarwan Ali, Murray Patterson

The t-distributed stochastic neighbor embedding (t- SNE) is a method for interpreting high dimensional (HD) data by mapping each point to a low dimensional (LD) space (usually two-dimensional). It seeks to retain the structure of the data. An important component of the t-SNE algorithm is the initialization procedure, which begins with the random initialization of an LD vector. Points in this initial vector are then updated to minimize the loss function (the KL divergence) iteratively using gradient descent. This leads comparable points to attract one another while pushing dissimilar points apart. We believe that, by default, these algorithms should employ some form of informative initialization. Another essential component of the t-SNE is using a kernel matrix, a similarity matrix comprising the pairwise distances among the sequences. For t-SNE-based visualization, the Gaussian kernel is employed by default in the literature. However, we show that kernel selection can also play a crucial role in the performance of t-SNE. In this work, we assess the performance of t-SNE with various alternative initialization methods and kernels, using four different sets, out of which three are biological sequences (nucleotide, protein, etc.) datasets obtained from various sources, such as the well-known GISAID database for sequences of the SARS- CoV-2 virus. We perform subjective and objective assessments of these alternatives. We use the resulting t-SNE plots and k- ary neighborhood agreement (k-ANA) to evaluate and compare the proposed methods with the baselines. We show that by using different techniques, such as informed initialization and kernel matrix selection, that t-SNE performs significantly better. Moreover, we show that t-SNE also takes fewer iterations to converge faster with more intelligent initialization.

Submitted: Nov 16, 2022