Visualizing high-dimensional data is a demanding task since we are restricted to our three-dimensional world. A common approach to tackle this problem is to first apply a dimensionality reduction algorithm. This maps \(n\) data points \(\fvec{x}_i \in \mathbb{R}^d\) in the feature space to \(n\) projection points \(\fvec{y}_i \in \mathbb{R}^r\) in the projection space. If we choose \(r \in \{1,2,3\}\), we reach a point where we can successfully visualize the data. However, this mapping does not come for free, since it is simply not possible to represent high-dimensional data in a low-dimensional space without losing at least some information. Hence, different algorithms focus on different aspects. \(t\)-Distributed Stochastic Neighbor Embedding (\(t\)-SNE) [video introduction] is one such algorithm; it tries to preserve local neighbour relationships at the cost of distance and density information.

The general idea is to define probabilities for both the data points and the projections which reflect the local neighbourhood. For the data points, conditional probabilities

\begin{equation} \label{eq:ProbCondFeature} p_{j|i} = \frac {e^{-\left\| \fvec{x}_i - \fvec{x}_j \right\|^2 / 2\sigma_i^2}} {\sum_{k \neq i} e^{-\left\| \fvec{x}_i - \fvec{x}_k \right\|^2 / 2\sigma_i^2}} \quad \text{with} \quad p_{i|i} = 0. \end{equation}

are used and symmetrized

\begin{equation*} p_{ij} = \frac{p_{j|i} + p_{i|j}}{2 n}. \end{equation*}
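The two formulas above translate directly to NumPy. The following sketch assumes the bandwidths \(\sigma_i\) are already given (their role is discussed next); the function names are my own:

```python
import numpy as np

def conditional_probabilities(X, sigmas):
    """p_{j|i}: Gaussian kernel around each x_i, normalized per row."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.exp(-sq_dists / (2 * sigmas[:, None] ** 2))
    np.fill_diagonal(P_cond, 0.0)                 # p_{i|i} = 0
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # normalize over k != i
    return P_cond

def symmetrize(P_cond):
    """p_{ij} = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2 * n)
```

After symmetrization, all \(p_{ij}\) together sum to 1, i.e. \(P\) is a single probability distribution over all point pairs.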

The parameter \(\sigma_i\) controls how many neighbours are considered in the probability distribution \(p_{j|i}\) for each point. This is demonstrated in the following animation for a two-dimensional example dataset.

Figure 1: Impact of the parameter \(\sigma_i\) on the resulting probability distribution \(p_{j|9}\) (\eqref{eq:ProbCondFeature}) for the arbitrarily chosen data point \(\fvec{x}_9\). The top figure shows the corresponding (unscaled) Gaussian \(e^{-\left\| \fvec{x}_9 - \fvec{x} \right\|^2 / 2\sigma_9^2}\) centred at the data point \(\fvec{x}_9\). The colours of the points are obtained via \(k\)-means clustering (with \(k=3\)) and are used to indicate the natural clusters in the data. The same colours are also used in the projections later.
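In practice, \(\sigma_i\) is usually not set by hand: for each point, a binary search adjusts it until the distribution \(p_{j|i}\) has a user-specified perplexity \(2^{H(p_{\cdot|i})}\), where \(H\) denotes the Shannon entropy in bits. This is how a perplexity value like the one used for Figure 2 below comes about. The search bounds and tolerances in this sketch are my own choices:

```python
import numpy as np

def sigma_for_perplexity(X, i, target_perplexity=2.0, tol=1e-5, max_iter=50):
    """Binary search for sigma_i so that 2^H(p_{.|i}) matches the target."""
    sq_dists = np.sum((X - X[i]) ** 2, axis=1)
    lo, hi = 1e-10, 1e10          # assumed search interval for sigma
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists / (2 * sigma ** 2))
        p[i] = 0.0                # p_{i|i} = 0
        p /= p.sum()
        # Shannon entropy in bits; perplexity = 2^H
        H = -np.sum(p[p > 0] * np.log2(p[p > 0]))
        if abs(2 ** H - target_perplexity) < tol:
            break
        if 2 ** H > target_perplexity:
            hi = sigma            # distribution too broad -> shrink sigma
        else:
            lo = sigma            # too narrow -> widen sigma
    return sigma
```

The perplexity can be interpreted as a smooth measure of how many neighbours effectively receive probability mass.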

We also define probabilities for the projection points

\begin{equation} \label{eq:ProbProj} q_{ij} = \frac{1}{n} \frac {\left( 1 + \left\| \fvec{y}_i - \fvec{y}_j \right\|^2 \right)^{-1}} {\sum_{k \neq i} \left( 1 + \left\| \fvec{y}_i - \fvec{y}_k \right\|^2 \right)^{-1}} \quad \text{with} \quad q_{ii} = 0. \end{equation}
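A sketch of \eqref{eq:ProbProj} in NumPy, mirroring the row-wise normalization and the \(1/n\) factor used above (function name is mine):

```python
import numpy as np

def projection_probabilities(Y):
    """q_{ij}: Student-t kernel on the projections, row-normalized with 1/n."""
    n = Y.shape[0]
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)     # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(W, 0.0)       # q_{ii} = 0
    return W / W.sum(axis=1, keepdims=True) / n
```

As with \(P\), all entries of \(Q\) together sum to 1.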

Our goal is to find projection points with corresponding probabilities \(p_{ij} = q_{ij}\). Even though exact equality is unlikely, we can still try to make the two probability distributions \(P\) and \(Q\) as close as possible. Note that the probabilities \(p_{ij}\) are fixed, so we already know what the matrix \(P\) looks like.

Figure 2: Probability matrix \(P\) for our data points \(\fvec{x}_i\) corresponding to a perplexity of 2.

What is left is to adjust our projection points \(\fvec{y}_i\) so that \(Q\) becomes more like \(P\). To achieve this goal, \(t\)-SNE defines a cost function based on the Kullback–Leibler divergence

\begin{equation} \label{eq:KLDivergence} D_Q(P) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \cdot \log\left( \frac{p_{ij}}{q_{ij}} \right) \end{equation}

and calculates the gradient for this function

\begin{equation} \label{eq:Gradient} \frac{\partial D_Q(P)}{\partial \fvec{y}_i} = 4 \sum_{j=1}^{n} (p_{ij} - q_{ij}) (\fvec{y}_i - \fvec{y}_j) \left( 1 + \left\| \fvec{y}_i - \fvec{y}_j \right\|^2 \right)^{-1}. \end{equation}

This allows defining learning rules for randomly initialized projection points \(\fvec{y}_i(0)\)

\begin{align*} \fvec{y}_i(t + 1) &= \fvec{y}_i(t) - \eta \cdot \frac{\partial D_Q(P)}{\partial \fvec{y}_i} \\ &= \fvec{y}_i(t) - \eta \cdot 4 \sum_{j=1}^{n} (p_{ij} - q_{ij}(t)) (\fvec{y}_i(t) - \fvec{y}_j(t)) \left( 1 + \left\| \fvec{y}_i(t) - \fvec{y}_j(t) \right\|^2 \right)^{-1}. \end{align*}

\(\eta > 0\) is a learning rate controlling the speed of convergence of the algorithm (but setting this value too high may cause problems). How the complete algorithm iterates to a solution for the example dataset is shown in the following animation.

Figure 3: The \(t\)-SNE algorithm in action. The top part shows a matrix plot of the probability matrix \(Q\) for the current iteration. In the bottom part, the projections \(\fvec{y}_i\) are shown. With increasing iterations, they move apart from each other due to the attractive and repulsive forces, reaching a state where we can see the original clusters (this works at least in this case here). A learning rate of \(\eta = 0.25\) was used. Note that in each iteration every point gets updated.
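The iteration shown in the animation can be sketched as plain gradient descent on the update rule above. This is a minimal, unoptimized implementation (no momentum or early exaggeration, which the original \(t\)-SNE paper additionally uses; function name and defaults are mine):

```python
import numpy as np

def tsne_iterate(P, Y0, eta=0.25, n_iter=1000):
    """Gradient descent on the KL divergence for fixed P and initial Y0."""
    Y = Y0.copy()
    n = Y.shape[0]
    for _ in range(n_iter):
        # Recompute Q for the current projections
        sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        W = 1.0 / (1.0 + sq_dists)
        np.fill_diagonal(W, 0.0)
        Q = W / W.sum(axis=1, keepdims=True) / n
        # grad_i = 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}
        PQ = (P - Q) * W
        grad = 4.0 * (PQ.sum(axis=1, keepdims=True) * Y - PQ @ Y)
        Y -= eta * grad
    return Y
```

The gradient for all points is evaluated in one matrix expression: \(\sum_j \mathit{PQ}_{ij} (\fvec{y}_i - \fvec{y}_j)\) splits into a row-sum term scaling \(\fvec{y}_i\) and the matrix product \(\mathit{PQ}\, Y\).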

Plotting the cost function \eqref{eq:KLDivergence} as a function of the iteration time \(t\) reveals how the algorithm approaches a local minimum. However, it has not yet reached one, since only the first 1000 iterations of the algorithm are shown.

Figure 4: Plot of the error \(D_Q(P)\) (\eqref{eq:KLDivergence}) over the iteration time \(t\).
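The error curve can be reproduced by evaluating \eqref{eq:KLDivergence} after each update. A minimal helper (using the natural logarithm, which matches the gradient \eqref{eq:Gradient}; terms with \(p_{ij} = 0\), such as the diagonal, are skipped since they contribute nothing):

```python
import numpy as np

def kl_divergence(P, Q):
    """D_Q(P) = sum_ij p_ij * log(p_ij / q_ij), skipping entries with p_ij = 0."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))
```

The divergence is zero exactly when \(Q\) matches \(P\) on all pairs with non-zero probability, and positive otherwise.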
