Suppose you use a neural network for a classification problem and want the neurons in the output layer to return a valid discrete probability distribution. If you set the number of output neurons \(n\) equal to the number of classes of your classification problem, you get the nice interpretation that the result of each neuron \(y_i\) gives you the probability that the corresponding input belongs to the class \(\omega_i\). If the network is confident in its classification, you will see a strong peak in the probability distribution. On the other hand, for a noisy input that the network cannot make sense of (or that it has not yet learned), the resulting distribution will be much broader.

But how do you transform the weighted inputs \(u_i = \left\langle \fvec{w}_i, \fvec{x} \right\rangle + b_i\) (weights \(\fvec{w}_i\), input \(\fvec{x}\) and biases \(b_i\)) into a valid probability distribution? A common approach is to apply the softmax function to the weighted inputs

\begin{equation} \label{eq:Softmax} y_i = \frac{e^{c \cdot u_i}}{\sum_{j=1}^{n} e^{c \cdot u_j}} \end{equation}

with the parameter \(c \in \mathbb{R}\). Using this transformation we ensure that the resulting output vector \(\fvec{y} = (y_1, y_2, \ldots, y_n)\) satisfies

\begin{equation*} \sum_{i=1}^{n} y_i = 1 \quad \text{and} \quad \forall i : y_i \geq 0 \end{equation*}

and is therefore indeed a valid probability distribution. A common choice is \(c=1\), but it is instructive to analyse the result for different values of this parameter. You can do so in the following animation, which is based on the example vector

\begin{equation*} \fvec{u} = (-0.2, 1, 2, 0, 0, 2.1, 1.4, 0.8). \end{equation*}
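The transformation of \eqref{eq:Softmax} is straightforward to sketch in code. The following snippet (a minimal NumPy sketch; the helper name `softmax` is our own) applies the function to the example vector \(\fvec{u}\) and checks the two properties stated above. Subtracting the maximum of \(c \cdot \fvec{u}\) before exponentiating does not change the result (the common factor cancels in the fraction) but avoids numerical overflow for large inputs.

```python
import numpy as np

def softmax(u, c=1.0):
    """Softmax of Eq. (1) with the scaling parameter c."""
    z = c * np.asarray(u, dtype=float)
    z -= z.max()  # Shift for numerical stability (cancels in the fraction)
    e = np.exp(z)
    return e / e.sum()

u = np.array([-0.2, 1, 2, 0, 0, 2.1, 1.4, 0.8])
y = softmax(u)

print(y.sum())         # Sums to 1 (up to floating-point precision)
print((y >= 0).all())  # All components are non-negative
```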

Figure 1: Result of the softmax function (\eqref{eq:Softmax}) for different values of the parameter \(c\). Pay particular attention to the cases \(c < 0\), \(c = 0\) and \(c > 0\).
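The three cases from the animation can also be checked numerically (again a sketch using the same hypothetical `softmax` helper): for \(c = 0\) all exponents are equal so the distribution is uniform, for large \(c > 0\) the mass concentrates on the largest component, and for \(c < 0\) the ordering flips so the smallest component receives the most mass.

```python
import numpy as np

def softmax(u, c=1.0):
    z = c * np.asarray(u, dtype=float)
    z -= z.max()  # Numerical stability
    e = np.exp(z)
    return e / e.sum()

u = np.array([-0.2, 1, 2, 0, 0, 2.1, 1.4, 0.8])

# c = 0: every exponent is e^0 = 1, so each class gets probability 1/n
print(softmax(u, c=0))

# Large c > 0: the distribution approaches a one-hot peak at the
# largest component (here u_6 = 2.1, i.e. zero-based index 5)
print(softmax(u, c=50).argmax())

# c < 0: the ranking is inverted; the smallest component
# (u_1 = -0.2, index 0) dominates
print(softmax(u, c=-50).argmax())
```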

