Intro to regularization and batching

Published March 16, 2019

These are my notes from the book Grokking Deep Learning by Andrew Trask. Feel free to check my first post on this book to get my overall thoughts and recommendations on how to approach this series. The rest of my notes for this book can be found here.

Regularization and batching

This chapter will focus on making the network home in on the signal and ignore the noise. Key concepts:

  • Overfitting
  • Dropout
  • Batch gradient descent

An example of overfitting

Let’s train a three-layer network on the MNIST dataset. We’ll use this dataset to produce an example of overfitting, and then apply regularization to combat it.

import sys
import numpy as np
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
images, labels = (x_train[0:1000].reshape(1000,28*28) / 255, y_train[0:1000])

one_hot_labels = np.zeros((len(labels), 10))
for i,l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
labels
array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
test_images = x_test.reshape(len(x_test), 28*28) / 255
test_labels = np.zeros((len(y_test), 10))
for i,l in enumerate(y_test):
    test_labels[i][l] = 1
np.random.seed(1)
relu = lambda x:(x>=0) * x # returns x if x >= 0, returns 0 otherwise
relu2deriv = lambda x: x>=0 # returns 1 for input >= 0, returns 0 otherwise
alpha, iterations, hidden_size, pixels_per_image, num_labels = (0.005, 350, 40, 784, 10)

weights_0_1 = 0.2*np.random.random((pixels_per_image,hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size,num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    
    for i in range(len(images)):
        layer_0 = images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                                        np.argmax(labels[i:i+1]))

        layer_2_delta = (labels[i:i+1] - layer_2)
        layer_1_delta = layer_2_delta.dot(weights_1_2.T)\
                                    * relu2deriv(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    sys.stdout.write("\r I:"+str(j)+ \
                     " Train-Err:" + str(error/float(len(images)))[0:5] +\
                     " Train-Acc:" + str(correct_cnt/float(len(images))))
 I:349 Train-Err:0.108 Train-Acc:1.099

The above neural network perfectly learned to predict all 1,000 training images.

The training accuracy above shows 100%. Unfortunately, this is likely due to overfitting. If we run the code against test_images and test_labels, we can see how this network performs on images it has never seen before.

if(j % 10 == 0 or j == iterations-1):
    error, correct_cnt = (0.0, 0)

    for i in range(len(test_images)):

        layer_0 = test_images[i:i+1]
        layer_1 = relu(np.dot(layer_0,weights_0_1))
        layer_2 = np.dot(layer_1,weights_1_2)

        error += np.sum((test_labels[i:i+1] - layer_2) ** 2)
        correct_cnt += int(np.argmax(layer_2) == \
                                        np.argmax(test_labels[i:i+1]))
    sys.stdout.write(" Test-Err:" + str(error/float(len(test_images)))[0:5] +\
                     " Test-Acc:" + str(correct_cnt/float(len(test_images))) + "\n")
    print()
 Test-Err:0.653 Test-Acc:0.7073

The network predicted the test images with only about 70% accuracy. This test accuracy is important because it simulates how well the network will perform in the real world. So why did the network do so well on the training set, but so terribly on the test set?

Memorization vs. generalization

Neural networks are only useful if they generalize. If the network overfits (meaning it has been trained to the point where it exactly matches the training data), then it is basically memorizing the pre-labeled images. That makes it somewhat pointless, because we already know the labels of those images. We want the neural network to be general enough that it can predict images it has not seen before.

Overfitting in neural networks

Neural networks can get worse if you train them too much.

Fork mold example:

  • Say we are creating a mold for a dinner fork as a tool to determine whether a particular utensil is a fork.
  • If an object fits in the mold, then we say it’s a fork
  • Start with clay, and a bucket of 3-pronged forks, spoons, and knives
  • Press each fork into the same place to create an outline
  • Let the mold dry. None of the knives or spoons fit. Only 3-pronged forks fit.
  • What happens if you try a 4-pronged fork?
  • It won’t fit… even though it’s a fork. The mold only has 3 prongs.
  • The mold has been overfit to 3-pronged forks!

What causes networks to overfit?

In the fork example, what if we only pushed in 1 or 2 forks? Assuming the clay was very thick, it wouldn’t have much detail, just the general shape of a fork. This shape might be compatible with both 3- and 4-pronged forks.

The mold got worse on the test set as more forks were imprinted, because it learned detailed information that was too specific to the forks being used (the training set). In this case, that detail was the number of prongs. In images, this kind of detail is generally referred to as noise. How do we get a neural network to train only on the signal (the shape of the fork) and not the noise (the prongs)?

Simplest regularization: Early Stopping

Stop training the network when it starts getting worse! Early stopping is the cheapest form of regularization.

Regularization is a way of getting models to generalize to new data points instead of just memorizing the training data. It helps neural networks learn the signal and ignore the noise.

The only real way to know when to stop training is to run the model on a validation set. Don’t use the test set, because the network may end up overfitting to the test set.
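
As a rough sketch (not from the book), here is what early stopping on a held-out validation split might look like for the network above. The 900/100 split, the patience of 5 epochs, and the variable names are my own assumptions:

# Carve a small validation split out of the 1,000 training images
train_images, train_labels = images[:900], labels[:900]
val_images, val_labels = images[900:], labels[900:]

np.random.seed(1)
weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1

best_val_err, patience, bad_epochs = float("inf"), 5, 0
for j in range(iterations):
    # one training pass over the training split, same as the loop above
    for i in range(len(train_images)):
        layer_0 = train_images[i:i+1]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)
        layer_2_delta = train_labels[i:i+1] - layer_2
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

    # measure error on the validation split after each epoch
    layer_1 = relu(np.dot(val_images, weights_0_1))
    layer_2 = np.dot(layer_1, weights_1_2)
    val_err = np.sum((val_labels - layer_2) ** 2) / len(val_images)

    if val_err < best_val_err:
        best_val_err, bad_epochs = val_err, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation error stopped improving: stop training here

In practice you would also usually keep a copy of the weights from the best epoch and restore them after stopping.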

Industry standard regularization: Dropout

During training, randomly set neurons in the network to 0. This causes the network to train exclusively using random subsections of the network.

The smaller the network, the less it’s able to overfit. Going back to the clay example: imagine clay made of very fine-grained sand versus larger rocks. The larger rocks can’t express the same amount of detail as the fine-grained sand. Larger networks are like fine-grained sand: they have more room, or capacity, to capture detail.

Randomly turning off nodes makes a big network behave like a small one, but the network as a whole still maintains its expressive power!
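
Here is a minimal sketch (my own, not copied from the notes above) of how a dropout mask could be applied inside the per-example training loop from earlier. The 50% drop rate and the *2 rescaling of the surviving nodes, which keeps the expected input to layer_2 roughly unchanged, are assumptions:

for i in range(len(images)):
    layer_0 = images[i:i+1]
    layer_1 = relu(np.dot(layer_0, weights_0_1))

    # 0/1 mask that zeroes roughly half of layer_1's nodes for this example
    dropout_mask = np.random.randint(2, size=layer_1.shape)
    layer_1 *= dropout_mask * 2  # rescale the surviving nodes

    layer_2 = np.dot(layer_1, weights_1_2)

    layer_2_delta = labels[i:i+1] - layer_2
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)
    layer_1_delta *= dropout_mask  # dropped nodes receive no weight update

    weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
    weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

Dropout is only applied during training; at test time the full network is used.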

Why dropout works

If you train 100 randomly initialized neural networks, they will each latch onto different noise but similar signal. When they make mistakes, they will make different mistakes. Their noise tends to cancel out, revealing only what they all learned in common: the signal.

  • It’s likely that large unregularized networks will overfit to noise, but it’s unlikely they will overfit to the same noise.
  • Neural networks start by learning the biggest, most broadly sweeping features before learning much about the noise.

Batch gradient descent

A method for increasing the speed of training and the rate of convergence.

Rather than training on one example at a time and updating the weights after each example, we train on 100 examples at a time and average the weight updates across all 100 examples.

Individual training examples are very noisy in terms of the weight updates they generate.

Running in batches is much faster. Each np.dot function is now performing 100 vector dot products at a time.

Batch gradient descent also allows for higher learning rates (alpha), because averaging over a batch smooths out the noise of individual examples, so the network can take bigger steps.
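
Below is a sketch of how the earlier training loop might be rewritten to use mini-batches of 100 images. Re-initializing the weights and dividing layer_2_delta by batch_size (so the update is an average over the batch) are my own choices for this sketch:

batch_size = 100

np.random.seed(1)
weights_0_1 = 0.2*np.random.random((pixels_per_image, hidden_size)) - 0.1
weights_1_2 = 0.2*np.random.random((hidden_size, num_labels)) - 0.1

for j in range(iterations):
    error, correct_cnt = (0.0, 0)
    for i in range(int(len(images) / batch_size)):
        batch_start, batch_end = i * batch_size, (i + 1) * batch_size

        # forward pass on 100 images at once: each np.dot now works on a whole batch
        layer_0 = images[batch_start:batch_end]
        layer_1 = relu(np.dot(layer_0, weights_0_1))
        layer_2 = np.dot(layer_1, weights_1_2)

        error += np.sum((labels[batch_start:batch_end] - layer_2) ** 2)
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) ==
                               np.argmax(labels[batch_start+k:batch_start+k+1]))

        # dividing by batch_size averages the weight update over the whole batch
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / batch_size
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) * relu2deriv(layer_1)

        weights_1_2 += alpha * layer_1.T.dot(layer_2_delta)
        weights_0_1 += alpha * layer_0.T.dot(layer_1_delta)

Because the averaged update is less noisy, alpha can usually be raised above the per-example value used earlier.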