These are my notes from the book Grokking Deep Learning by Andrew Trask. Feel free to check my first post on this book to get my overall thoughts and recommendations on how to approach this series. The rest of my notes for this book can be found here.
Time to simplify
Impractical to think about everything all the time
Use mental tools. This chapter focuses on the construction of efficient concepts in your mind.
What have we learned so far?
- Idea of machine learning in general
- How individual linear nodes (neurons) learn
- Horizontal groups of neurons (layers)
- Vertical groups (stacks of layers)
- Learning is just reducing error to 0
- Use calculus to discover how to change each weight in network to move error in direction of 0.
All of the previous lessons lead to a single idea:
- Neural networks search for (and create) correlation between input and output datasets.
To prevent us from being overwhelmed by the complexity of deep learning, let’s hold on to this idea of correlation rather than all the previous smaller ideas. Call it correlation summarization
Correlation summarization
Correlation summarization - At the 10,000 foot level:
“Neural networks seek to find direct and indirect correlation between an input layer and an output layer, which are determined by the input and output datasets, respectively."
Local correlation summarization - What if only 2 layers?
“Any given set of weights optimizes to learn how to correlate its input layer with what the output layer says it should be."
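To make the local idea concrete, here is a minimal sketch, assuming a single linear layer trained with plain gradient descent (the layer_0, goal, and lr values below are made up for illustration):

```python
import numpy as np

# Minimal sketch of local correlation: one set of weights learns to map
# its input to what the output layer says it should be (the goal).
np.random.seed(0)
layer_0 = np.array([[1.0, 0.5, -0.2]])    # 1x3 input vector
goal = np.array([[0.8]])                  # what the output should be
weights = np.random.rand(3, 1)            # 3x1 weight matrix
lr = 0.1                                  # learning rate

for _ in range(20):
    pred = layer_0.dot(weights)            # forward pass
    delta = pred - goal                    # how far off the output is
    weights -= lr * layer_0.T.dot(delta)   # nudge weights toward correlation
```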
Global correlation summarization - Backpropagation
“What an earlier layer says it should be can be determined by taking what a later layer says it should be and multiplying it by the weights in between them. This way, later layers can tell earlier layers what kind of signal they need to ultimately find correlation with the output. This cross-communication is called backpropagation."
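Here is a minimal sketch of that backward multiply, reusing the layer and weight names from the book’s network; the delta value is made up for illustration:

```python
import numpy as np

# Backpropagation in one step: the later layer's delta, multiplied by the
# weights in between (transposed), tells the earlier layer what it should be.
np.random.seed(0)
weights_1_2 = np.random.rand(4, 1)       # weights between layer 1 and layer 2
layer_2_delta = np.array([[0.5]])        # how wrong the later layer is (1x1)

layer_1_delta = layer_2_delta.dot(weights_1_2.T)   # 1x4 signal for layer 1
```

In the full training loop this backward signal also gets masked by the relu derivative before updating the earlier weights, but the multiply-backward step above is the core of the cross-communication.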
Previously overcomplicated visualization
Rather than looking at each layer in detail, Trask now wants us to focus on the concept of correlation summarization. In the previous visualization:
- Weights are matrices (weights_0_1, weights_1_2)
- Layers are vectors (layer_0, layer_1, layer_2)
The simplified visualization
Think of neural networks as LEGO bricks, with each brick representing a vector or matrix. In the image below, the strips are vectors, and blocks are matrices.
Simplifying even further
The dimensions of each weight matrix are determined by the layers around it: its number of rows matches the dimensionality of the layer before it, and its number of columns matches the dimensionality of the layer after it.
Going back to this example, we can infer that weights_0_1 is a 3x4 matrix because layer_0 has three dimensions and layer_1 has four dimensions.
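As a quick check, here is a sketch with those shapes spelled out in NumPy (the layer_0 values are placeholders):

```python
import numpy as np

# Layer sizes pin down the weight-matrix shapes: rows match the layer before,
# columns match the layer after.
np.random.seed(0)
layer_0 = np.array([[1.0, 0.0, 1.0]])   # 1x3: three input dimensions
weights_0_1 = np.random.rand(3, 4)      # 3x4: layer_0 (3) -> layer_1 (4)
weights_1_2 = np.random.rand(4, 1)      # 4x1: layer_1 (4) -> layer_2 (1)

layer_1 = layer_0.dot(weights_0_1)      # shape (1, 4)
layer_2 = layer_1.dot(weights_1_2)      # shape (1, 1)
```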
This neural network will adjust its weights to find correlation between layer_0 and layer_2. Different configurations of weights and layers have a strong impact on how successfully the network finds that correlation.
The specific configuration of weights and layers in a network is called its architecture.
Visualizing using letters instead of pictures
We can use simple algebra to explain what’s happening in these pictures. Let’s label each component of the network like this:
Let’s label layers with lowercase l, and weights with uppercase W. Then put a little subscript next to it so we know which layer or weight it’s referring to.
Linking the variables
Now we can combine the letters to indicate functions and operations.
Vector-matrix multiplication
- Take the layer 0 vector and perform vector-matrix multiplication with weight matrix 0: $$l_0 W_0$$
- Same thing, but using layer vector 1 and weight matrix 1: $$l_1 W_1$$
Create assignments, and apply functions (a runnable sketch follows this list):
- Create layer 1 by taking the layer 0 vector, performing vector-matrix multiplication with weight matrix 0, and then applying the relu function to the output: $$l_1 = \text{relu}(l_0 W_0)$$
- Create layer 2 by taking the layer 1 vector and performing vector-matrix multiplication with weight matrix 1: $$l_2 = l_1 W_1$$
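Here is a runnable sketch of those two assignments, assuming the 3-4-1 shapes from earlier and a simple relu helper (the values of l0 are placeholders):

```python
import numpy as np

def relu(x):
    # zero out negative values, pass positive values through unchanged
    return (x > 0) * x

np.random.seed(1)
l0 = np.array([[1.0, 0.0, 1.0]])   # layer 0 vector (1x3)
W0 = np.random.rand(3, 4)          # weight matrix 0 (3x4)
W1 = np.random.rand(4, 1)          # weight matrix 1 (4x1)

l1 = relu(l0.dot(W0))              # l1 = relu(l0 W0)
l2 = l1.dot(W1)                    # l2 = l1 W1
```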
All together in 1 line!
$$l_2 = \text{relu}(l_0 W_0) W_1$$
Everything side by side
Here’s the visualization, algebra formula, and Python code all together for forward propagation!
Python
layer_2 = relu(layer_0.dot(weights_0_1)).dot(weights_1_2)
Algebra
$$l_2 = \text{relu}(l_0 W_0) W_1$$
Visualization
Key takeaway
“Good neural architectures channel signal so that correlation is easy to discover. Great architectures also filter noise to prevent overfitting."