Suppose you use a neural network for a classification problem and want the neurons in the output layer to return a valid discrete probability distribution. If you set the number of output neurons $$n$$ equal to the number of classes of your classification problem, you get a nice interpretation: the output $$y_i$$ of each neuron is the probability that the corresponding input belongs to the class $$\omega_i$$. If the network is confident in its classification, you will see a strong peak in the probability distribution. For a noisy input that the network cannot make sense of (or has not learned yet), the resulting distribution will be broader.

But how do you transform the weighted input $$u_i = \left\langle \fvec{w}_i, \fvec{x} \right\rangle + b$$ (weight $$\fvec{w}_i$$, input $$\fvec{x}$$ and bias $$b$$) into a valid probability distribution? A common approach is to apply the softmax function to the weighted input

\begin{equation} \label{eq:Softmax} y_i = \frac{e^{c \cdot u_i}}{\sum_{j=1}^{n} e^{c \cdot u_j}} \end{equation}

with the parameter $$c \in \mathbb{R}$$. Using this transformation we ensure that the resulting output vector $$\fvec{y} = (y_1, y_2, \ldots, y_n)$$ satisfies

\begin{equation*} \sum_{i=1}^{n} y_i = 1 \quad \text{and} \quad \forall i : y_i \geq 0 \end{equation*}

and is therefore indeed a valid probability distribution. A common choice is to set $$c=1$$, but it is useful to analyse the result for different values of this parameter. You can do so in the following animation, which is based on an arbitrary example vector

\begin{equation*} \fvec{u} = (-0.2, 1, 2, 0, 0, 2.1, 1.4, 0.8). \end{equation*}
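The effect of the parameter $$c$$ can also be checked numerically. The following is a minimal sketch in plain Python, not the code behind the original animation: the function name `softmax`, the subtraction of the maximum (a common trick to avoid overflow, which cancels in the fraction and does not change the result) and the chosen values of $$c$$ are assumptions made for illustration.

```python
import math


def softmax(u, c=1.0):
    """Softmax of the weighted inputs u with scaling parameter c."""
    z = [c * u_i for u_i in u]
    # Subtracting the maximum cancels in the fraction, so the result is
    # unchanged, but it keeps math.exp from overflowing for large c.
    z_max = max(z)
    exps = [math.exp(z_i - z_max) for z_i in z]
    total = sum(exps)
    return [e / total for e in exps]


# Example vector from the text.
u = [-0.2, 1, 2, 0, 0, 2.1, 1.4, 0.8]

# Illustrative values of c (an assumption, not taken from the original).
for c in (0.0, 1.0, 10.0):
    y = softmax(u, c)
    rounded = [round(y_i, 3) for y_i in y]
    print(f"c = {c:4.1f}  sum = {sum(y):.3f}  y = {rounded}")
```

For $$c = 0$$ every exponent is zero, so the output is the uniform distribution with $$y_i = 1/n$$. As $$c$$ grows, the probability mass concentrates more and more on the largest entry of $$\fvec{u}$$ (here the value $$2.1$$), while a negative $$c$$ would favour the smallest entry instead.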
