Visualizing high-dimensional data is a demanding task since we are restricted to our three-dimensional world. A common approach to tackle this problem is to first apply a dimensionality reduction algorithm. This maps the $$n$$ data points $$\fvec{x}_i \in \mathbb{R}^d$$ in the feature space to $$n$$ projection points $$\fvec{y}_i \in \mathbb{R}^r$$ in the projection space. If we choose $$r \in \{1,2,3\}$$, we reach a point where we can successfully visualize the data. However, this mapping does not come for free since it is simply not possible to represent high-dimensional data in a low-dimensional space without losing at least some information. Hence, different algorithms focus on different aspects. $$t$$-Distributed Stochastic Neighbor Embedding ($$t$$-SNE) [video introduction] is such an algorithm; it tries to preserve local neighbour relationships at the cost of distance or density information.

The general idea is to use probabilities for both the data points and the projections which reflect the local neighbourhood. For the data points, the conditional probabilities

\begin{equation} \label{eq:ProbCondFeature} p_{j|i} = \frac {e^{-\left\| \fvec{x}_i - \fvec{x}_j \right\|^2 / 2\sigma_i^2}} {\sum_{k \neq i} e^{-\left\| \fvec{x}_i - \fvec{x}_k \right\|^2 / 2\sigma_i^2}} \quad \text{with} \quad p_{i|i} = 0 \end{equation}

are used and symmetrized

\begin{equation*} p_{ij} = \frac{p_{j|i} + p_{i|j}}{2 n}. \end{equation*}
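The two definitions above translate directly to a few lines of NumPy. The following is only a minimal sketch (the function names are my own choice), computing \eqref{eq:ProbCondFeature} for given bandwidths $$\sigma_i$$ and then the symmetrized $$p_{ij}$$:

```python
import numpy as np

def cond_probabilities(X, sigmas):
    """Conditional probabilities p_{j|i} from Gaussian kernels with per-point sigma_i."""
    # Pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    P_cond = np.exp(-sq_dists / (2.0 * sigmas[:, None] ** 2))
    np.fill_diagonal(P_cond, 0.0)                 # p_{i|i} = 0
    P_cond /= P_cond.sum(axis=1, keepdims=True)   # normalize each row over k != i
    return P_cond

def symmetrize(P_cond):
    """Symmetrized joint probabilities p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * n)
```

Note that after symmetrization all $$p_{ij}$$ together sum to 1, i.e. $$P$$ is a proper joint probability distribution over all point pairs.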

The parameter $$\sigma_i$$ controls how many neighbours are considered in the probability distribution $$p_{j|i}$$ for each point. This is demonstrated in the following animation for a two-dimensional example dataset.
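In practice, $$\sigma_i$$ is usually not set by hand but derived from a user-specified perplexity value (the quantity $$2^{H(P_i)}$$, where $$H(P_i)$$ is the Shannon entropy of the distribution $$p_{j|i}$$), which can roughly be interpreted as the effective number of neighbours. A common approach is a binary search per point; the sketch below assumes this setup (function name, search bounds and tolerances are my own choice):

```python
import numpy as np

def sigma_for_perplexity(sq_dists_i, target_perplexity, tol=1e-5, max_iter=50):
    """Binary search for sigma_i such that the perplexity 2^H of p_{j|i}
    matches the target. sq_dists_i holds the squared distances from x_i
    to all *other* points (i itself excluded)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-sq_dists_i / (2.0 * sigma ** 2))
        p /= p.sum()
        entropy = -np.sum(p * np.log2(p + 1e-12))  # Shannon entropy in bits
        perplexity = 2.0 ** entropy
        if abs(perplexity - target_perplexity) < tol:
            break
        if perplexity > target_perplexity:
            hi = sigma  # distribution too spread out -> decrease sigma
        else:
            lo = sigma  # distribution too peaked -> increase sigma
    return sigma
```

The search works because the perplexity increases monotonically with $$\sigma_i$$: a larger bandwidth spreads the probability mass over more neighbours.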

We also define probabilities for the projection points

\begin{equation} \label{eq:ProbProj} q_{ij} = \frac{1}{n} \frac {\left( 1 + \left\| \fvec{y}_i - \fvec{y}_j \right\|^2 \right)^{-1}} {\sum_{k \neq i} \left( 1 + \left\| \fvec{y}_i - \fvec{y}_k \right\|^2 \right)^{-1}} \quad \text{with} \quad q_{ii} = 0. \end{equation}
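In contrast to the Gaussian kernel used for the data points, \eqref{eq:ProbProj} uses a Student-$$t$$ kernel with one degree of freedom (hence the name of the method). Computing these probabilities might look as follows (again only a sketch with a self-chosen function name):

```python
import numpy as np

def proj_probabilities(Y):
    """Probabilities q_ij for the projection points using a Student-t kernel."""
    n = Y.shape[0]
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)        # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(W, 0.0)          # q_{ii} = 0
    # Row-wise normalization over k != i, scaled by 1/n as in the definition
    return W / W.sum(axis=1, keepdims=True) / n
```

As with $$P$$, the entries of $$Q$$ sum to 1 over all point pairs, so the two matrices are directly comparable as probability distributions.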

Our goal is to find projection points with corresponding probabilities $$p_{ij} = q_{ij}$$. Even though it is unlikely that we achieve exact equality, we can still try to make the two probability distributions $$P$$ and $$Q$$ as close as possible. Note that the probabilities $$p_{ij}$$ are fixed, so we already know what the matrix $$P$$ looks like.

Figure 2: Probability matrix $$P$$ for our data points $$\fvec{x}_i$$ corresponding to a perplexity of 2.

What is left is to change our projection points $$\fvec{y}_i$$ so that $$Q$$ becomes more like $$P$$. To achieve this goal, $$t$$-SNE defines a cost function based on the Kullback–Leibler divergence

\begin{equation} \label{eq:KLDivergence} D_Q(P) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij} \cdot \log\left( \frac{p_{ij}}{q_{ij}} \right) \end{equation}
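Evaluating this cost is straightforward; the sketch below uses the natural logarithm, which is consistent with the factor 4 in the gradient below, and skips all terms with $$p_{ij} = 0$$ (by the usual convention $$0 \cdot \log 0 = 0$$):

```python
import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    """Cost D_Q(P) = sum_ij p_ij * log(p_ij / q_ij).

    Terms with p_ij = 0 contribute nothing and are masked out; eps guards
    against division by zero where q_ij is (numerically) zero."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps)))
```

Since the divergence weights each log-ratio by $$p_{ij}$$, mapping close neighbours (large $$p_{ij}$$) to distant projections (small $$q_{ij}$$) is penalized heavily, which is exactly why $$t$$-SNE preserves local neighbourhoods.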

and calculates the gradient for this function

\begin{equation} \label{eq:Gradient} \frac{\partial D_Q(P)}{\partial \fvec{y}_i} = 4 \sum_{j=1}^{n} (p_{ij} - q_{ij}) (\fvec{y}_i - \fvec{y}_j) \left( 1 + \left\| \fvec{y}_i - \fvec{y}_j \right\|^2 \right)^{-1}. \end{equation}

This allows defining learning rules for randomly initialized projection points $$\fvec{y}_i(0)$$

\begin{align*} \fvec{y}_i(t + 1) &= \fvec{y}_i(t) - \eta \cdot \frac{\partial D_Q(P)}{\partial \fvec{y}_i} \\ &= \fvec{y}_i(t) - \eta \cdot 4 \sum_{j=1}^{n} (p_{ij} - q_{ij}(t)) (\fvec{y}_i(t) - \fvec{y}_j(t)) \left( 1 + \left\| \fvec{y}_i(t) - \fvec{y}_j(t) \right\|^2 \right)^{-1}. \end{align*}
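One update step of this rule can be sketched as follows. This is a plain gradient descent step without the momentum and early-exaggeration tricks of the full $$t$$-SNE algorithm; the function name is my own choice:

```python
import numpy as np

def descent_step(P, Y, eta=0.1):
    """One update y_i(t+1) = y_i(t) - eta * dD_Q(P)/dy_i for all points at once."""
    n = Y.shape[0]
    diff = Y[:, None, :] - Y[None, :, :]                 # y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))  # (1 + ||y_i - y_j||^2)^(-1)
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum(axis=1, keepdims=True) / n  # q_ij with q_ii = 0
    # Gradient: 4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^(-1)
    grad = 4.0 * np.sum(((P - Q) * inv_dist)[:, :, None] * diff, axis=1)
    return Y - eta * grad
```

Iterating this step from a random initialization $$\fvec{y}_i(0)$$ (commonly drawn from a Gaussian with a small standard deviation) gradually decreases the cost $$D_Q(P)$$.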

$$\eta > 0$$ is a learning rate controlling the speed of convergence of the algorithm (but setting this value too high may cause problems). How the complete algorithm iterates to a solution for the example dataset is shown in the following animation.

Plotting the cost function \eqref{eq:KLDivergence} as a function of the iteration time $$t$$ reveals how the algorithm approaches a local minimum. However, it has not yet reached it since only 1000 iterations of the algorithm are shown.

Figure 4: Plot of the error $$D_Q(P)$$ (\eqref{eq:KLDivergence}) over the iteration time $$t$$.