This showcase presents simulation results for a deep neural network consisting of 21 layers. Based on randomly generated data, the distributions of the network activations and gradients are analysed for different activation functions. This reveals how the flow of activations from the first to the last layer and the flow of gradients from the last to the first layer behave for each activation function.

The simulation works as follows: random numbers from a normal distribution $$\mathcal{N}(\mu, \sigma) = \mathcal{N}(0, 1)$$ serve as input $$X$$ to the network. Then, one forward pass through the network is calculated. This is basically a loop of matrix operations

\begin{equation*} Y_i = f(Y_{i-1} \cdot W_i) \end{equation*}

with the input $$Y_{i-1}$$ from the previous layer (with $$Y_0 = X$$), the weight matrix $$W_i$$ of the current layer and the transfer function $$f(x)$$. Each neuron of the current layer receives $$n_{\text{in}}$$ incoming connections, and $$W_i$$ is initialized with random numbers from a $$\mathcal{N}(0, 1/\sqrt{n_{\text{in}}})$$ distribution, i.e. with a standard deviation of $$1/\sqrt{n_{\text{in}}}$$. Note that, for simplicity, no bias is included. After the forward pass, one backward pass is calculated. Here, the main interest lies in the values of the gradients, since they show whether the network learns. The target values (e.g. labels in a real-world application) are also faked with random numbers. This is, of course, not a realistic example, but it should reflect the main behaviour of the activation functions. The idea for the simulation is borrowed from the post "Training very deep networks with Batchnorm" (the effect of batch normalization is similar to what the SELU activation function is capable of).
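The forward and backward pass described above can be sketched in NumPy as follows. This is only a minimal illustration and not the code behind the plots: the function name `forward_backward`, the layer width, the number of samples, the choice of $$\tanh$$ as transfer function, the mean squared error loss and the reading of "21 layers" as 21 weight matrices are all assumptions made for the example.

```python
import numpy as np

def forward_backward(n_layers=21, n_units=100, n_samples=500,
                     f=np.tanh, f_prime=lambda y: 1.0 - y ** 2, seed=0):
    """One forward and one backward pass through a bias-free network.

    f_prime takes the layer OUTPUT y = f(z) and returns f'(z); for tanh
    this is 1 - y^2. All default values are assumptions for this sketch.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(0.0, 1.0, size=(n_samples, n_units))  # input ~ N(0, 1)
    T = rng.normal(0.0, 1.0, size=(n_samples, n_units))  # faked targets

    # Forward pass: Y_i = f(Y_{i-1} W_i) with Y_0 = X
    weights, activations = [], [X]
    for _ in range(n_layers):
        n_in = activations[-1].shape[1]
        W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_units))
        weights.append(W)
        activations.append(f(activations[-1] @ W))

    # Backward pass for the assumed loss L = sum((Y - T)^2) / (2 * n_samples)
    grads = [None] * n_layers
    delta = (activations[-1] - T) / n_samples * f_prime(activations[-1])
    for i in reversed(range(n_layers)):
        grads[i] = activations[i].T @ delta  # dL/dW for layer i
        if i > 0:
            # Propagate the error to the pre-activations of layer i - 1
            delta = (delta @ weights[i].T) * f_prime(activations[i])
    return activations, grads
```

Calling `forward_backward()` returns one activation matrix per layer (including the input) and one gradient matrix per weight matrix. Other activation functions can be tried by passing a different pair `f`, `f_prime`.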

To calculate the statistical measures, all activations from the entire input set are collected in one list of values per layer. That is, for every input $$\fvec{x}$$ (a row of $$X$$), all activations $$\smash{y_i^{(l)}}$$ of the neurons $$i$$ in the $$l$$-th layer are appended to the same list (here, $$i$$ indexes the neurons of a layer and not the layers as in the equation above). Then, the statistics are calculated from this list. The same procedure is applied to every layer.

The first plot shows the range $$[\mu - \sigma; \mu + \sigma]$$ (one standard deviation range from the mean) of network activations per layer. It should reveal the main range of the activations throughout the layer hierarchy.
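As a small sketch of this procedure, the following helper flattens the values of each layer into one list and computes the mean, the standard deviation and the range $$[\mu - \sigma; \mu + \sigma]$$. The name `layer_statistics` and the input format (one array per layer with one row per input) are assumptions for illustration.

```python
import numpy as np

def layer_statistics(values_per_layer):
    """Return one row (mu, sigma, mu - sigma, mu + sigma) per layer."""
    stats = []
    for values in values_per_layer:
        # Merge the values of all inputs and all neurons into one flat list
        flat = np.asarray(values, dtype=float).ravel()
        mu, sigma = flat.mean(), flat.std()
        stats.append((mu, sigma, mu - sigma, mu + sigma))
    return np.array(stats)
```

The same helper could be applied to the gradient arrays of each layer, since it only relies on the values being grouped per layer.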

The next plot shows how the gradients flow back from the last to the first layer. It is a measure of how much each layer of the network learns in one epoch. The statistics are calculated in a similar way as for the network activations.

Finally, the last plot takes a closer look at the distributions of the network activations. In each layer, a histogram of the activations is calculated, and the histograms from all layers are combined in one surface plot. With this, we can see how the distribution of activations changes as we go deeper into the hierarchy.
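One possible way to prepare the data for such a surface plot is sketched below. It assumes that all layers share the same bin edges so that the histograms are comparable; the helper name `layer_histograms`, the number of bins and the density normalization are choices made for this example only.

```python
import numpy as np

def layer_histograms(activations_per_layer, n_bins=50):
    """Histogram the activations of every layer on common bin edges.

    Returns the bin centers (n_bins,) and the heights (n_layers, n_bins).
    """
    flat = [np.asarray(a, dtype=float).ravel() for a in activations_per_layer]
    # Common value range over all layers
    lo = min(v.min() for v in flat)
    hi = max(v.max() for v in flat)
    edges = np.linspace(lo, hi, n_bins + 1)
    # One normalized histogram per layer, stacked row by row
    heights = np.stack([np.histogram(v, bins=edges, density=True)[0]
                        for v in flat])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, heights
```

The returned `heights` array has one row per layer, so together with `centers` and the layer indices it can be handed to a surface plotting routine (for example via `np.meshgrid` and matplotlib's `plot_surface`).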
