Coursera: Machine Learning; Exercise 4 Tutorial
This document was adapted from the section titled Tutorial for Ex.4 Forward and Backpropagation (Spring 2014 session) from the Coursera Programming Exercise 4 document on 2018-05-27.
I've reformatted the document and made a few minor edits for legibility and consistency.
I do not know who the original author was; if you know, please email [email protected], or just leave a comment in this GitHub gist. Thanks! 🙇
Step 1: Expand the `y` output values into a matrix of single values (see ex4.pdf, Page 5). This is most easily done using an `eye()` matrix of size `num_labels`, with vectorized indexing by `y`, as in `eye(num_labels)(y,:)`. A typical variable name would be `y_matrix`.
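As a minimal sketch (assuming `y` is an m x 1 vector of labels in 1..`num_labels`):

```matlab
% Expand y into y_matrix: row i is the one-hot encoding of the label y(i).
eye_matrix = eye(num_labels);
y_matrix = eye_matrix(y, :);   % m x num_labels
```

Splitting it into two lines as above also works in MATLAB, where indexing a function call's result directly (as in `eye(num_labels)(y,:)`) is not allowed.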
Step 2: Perform the forward propagation (a vectorized sketch follows the list below):

- `a1` equals the `X` input matrix with a column of 1's added (bias units)
- `z2` equals the product of `a1` and Θ1 (transposed)
- `a2` is the result of passing `z2` through `g()`
- `a2` then has a column of 1's added (bias units)
- `z3` equals the product of `a2` and Θ2 (transposed)
- `a3` is the result of passing `z3` through `g()`
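Here is one possible vectorized form, assuming `X`, `Theta1`, `Theta2`, `m` (the number of training examples), and a `sigmoid()` function (the `g()` above) are in scope; the sizes in the comments match the table at the end of this tutorial:

```matlab
% Sketch of vectorized forward propagation over all m examples.
a1 = [ones(m, 1) X];    % add bias column: m x 401
z2 = a1 * Theta1';      % m x 25
a2 = sigmoid(z2);       % m x 25
a2 = [ones(m, 1) a2];   % add bias column: m x 26
z3 = a2 * Theta2';      % m x 10
a3 = sigmoid(z3);       % m x 10
```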
Step 3: Compute the unregularized cost according to ex4.pdf (top of Page 5).

ℹ️ Note: I had a hard time understanding this equation, mainly because I had the misconception that y_k^(i) is a vector; it is actually just a single number.
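For reference, the unregularized cost equation (K is `num_labels`; the inner index k picks out one scalar entry per example):

```latex
J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ -y_k^{(i)} \log\!\left( (h_\Theta(x^{(i)}))_k \right)
         - \left(1 - y_k^{(i)}\right) \log\!\left( 1 - (h_\Theta(x^{(i)}))_k \right) \right]
```

Here each y_k^(i) is a single 0/1 entry of `y_matrix`, and (h_Θ(x^(i)))_k is the corresponding entry of `a3`.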
- Use `a3`, `y_matrix`, and `m` (the number of training examples).
- Cost should be a scalar value. If you get a vector of cost values, you can sum that vector to get the cost.
- Remember to use element-wise multiplication with the `log()` function.
- Now you can run `ex4.m` to check that the unregularized cost is correct, then you can submit Part 1 to the grader.
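One possible vectorized form (a sketch, assuming `a3` and `y_matrix` are both m x num_labels):

```matlab
% Double-sum over examples and labels collapses the cost to a scalar.
J = (1 / m) * sum(sum(-y_matrix .* log(a3) - (1 - y_matrix) .* log(1 - a3)));
```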
Step 4: Compute the regularized component of the cost according to ex4.pdf, Page 6, using Θ1 and Θ2 (ignoring the columns of bias units), along with λ and m. The easiest method to do this is to compute the regularization terms separately, then add them to the unregularized cost from Step 3.

You can run `ex4.m` to check the regularized cost, then you can submit Part 2 to the grader.
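A sketch of that approach (assuming a scalar `lambda`, plus `Theta1`, `Theta2`, and the unregularized `J` from Step 3):

```matlab
% Sum of squared weights, skipping column 1 of each Theta (the bias units).
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;
```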
Step 5: You'll need to prepare the sigmoid gradient function g′(), as shown in ex4.pdf, Page 7.
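A minimal sketch of that function, using the identity g′(z) = g(z)(1 − g(z)) from the lectures (the exercise provides a `sigmoidGradient.m` stub for this):

```matlab
% Sigmoid gradient: g'(z) = g(z) .* (1 - g(z)), element-wise.
function g = sigmoidGradient(z)
  s = 1.0 ./ (1.0 + exp(-z));  % the sigmoid g(z)
  g = s .* (1 - s);
end
```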
You can submit Part 3 to the grader.
Step 6: Implement the random initialization function as instructed in ex4.pdf, top of Page 8.
You do not submit this function to the grader.
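A sketch, assuming the `randInitializeWeights(L_in, L_out)` signature from the exercise and the ε value suggested in ex4.pdf:

```matlab
% Initialize weights uniformly in [-epsilon_init, epsilon_init]
% to break symmetry between hidden units.
function W = randInitializeWeights(L_in, L_out)
  epsilon_init = 0.12;  % value suggested in ex4.pdf
  W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
```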
Step 7: Now we work from the output layer back to the hidden layer, calculating how bad the errors are. See ex4.pdf, Page 9, for reference (a vectorized sketch follows the lists below).

- δ3 equals the difference between `a3` and the `y_matrix`.
- δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units), then multiplied element-wise by the g′() of `z2` (computed back in Step 2).

Note that at this point, the instructions in ex4.pdf are specific to looping implementations, so the notation there is different.

- Δ2 equals the product of δ3 (transposed) and `a2`. This step calculates the product and sum of the errors.
- Δ1 equals the product of δ2 (transposed) and `a1`. This step calculates the product and sum of the errors.
Step 8: Now we calculate the non-regularized theta gradients, using the sums of the errors we just computed (see ex4.pdf, bottom of Page 11).

- Θ1 gradient equals Δ1 scaled by `1/m`.
- Θ2 gradient equals Δ2 scaled by `1/m`.
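In code this is just a scalar division (assuming the `Theta1_grad`/`Theta2_grad` variable names from the `nnCostFunction.m` stub):

```matlab
% Unregularized gradients: average the accumulated errors over m examples.
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;
```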
The `ex4.m` script will also perform gradient checking for you, using a smaller test case than the full character classification example. So if you're debugging your `nnCostFunction()` using the `keyboard` command during this, you'll suddenly see some much smaller sizes of `X` and the Θ values. Do not be alarmed.
If the feedback provided to you by `ex4.m` for gradient checking seems OK, you can now submit Part 4 to the grader.

For reference, see ex4.pdf, top of Page 12, for the right-most terms of the equation for j >= 1.
Step 9: Now we calculate the regularization terms for the theta gradients. The goal is that regularization of the gradient should not change the theta gradient `(:,1)` values (for the bias units) calculated in Step 8. There are several ways to implement this (Steps 9a and 9b below).

Method 1

- a. Calculate the regularization for indexes `(:,2:end)`.
- b. Add ☝️ them to the theta gradients `(:,2:end)`.

Method 2 (see the sketch after this list)

- a. Calculate the regularization for the entire theta gradient, then overwrite the `(:,1)` values with `0`.
- b. Add ☝️ to the entire matrix.

Pick a method, and calculate the regularization terms as follows:

- `(λ/m)*Θ1` (using either Method 1 or Method 2), and...
- `(λ/m)*Θ2` (using either Method 1 or Method 2)

Add these regularization terms to the appropriate Θ1 gradient and Θ2 gradient terms from Step 8 (using either Method 1 or Method 2). Avoid modifying the bias unit of the theta gradients.
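For instance, a minimal sketch of Method 2 (assuming the `Theta1_grad`/`Theta2_grad` variables from Step 8 and a scalar `lambda`):

```matlab
% Method 2: zero the bias column of a copy of each Theta,
% so the bias-unit gradients are left unregularized.
Theta1_reg = Theta1;   Theta1_reg(:, 1) = 0;
Theta2_reg = Theta2;   Theta2_reg(:, 1) = 0;
Theta1_grad = Theta1_grad + (lambda / m) * Theta1_reg;
Theta2_grad = Theta2_grad + (lambda / m) * Theta2_reg;
```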
⚠️ Note: there is an erratum in the lecture video and slides regarding some missing parentheses for this calculation. The ex4.pdf file is correct.
The `ex4.m` script will provide you feedback regarding the acceptable relative difference. If all seems well, you can submit Part 5 to the grader. Now pat yourself on the back.
Here are the sizes for the character recognition example, using the method described in this tutorial:

- `a1`: 5000x401
- `z2`: 5000x25
- `a2`: 5000x26
- `a3`: 5000x10
- `d3`: 5000x10
- `d2`: 5000x25
- `Theta1`, `Delta1` and `Theta1_grad`: 25x401
- `Theta2`, `Delta2` and `Theta2_grad`: 10x26
ℹ️ Note: The `ex4.m` script uses several test cases of different sizes, and the submit grader uses yet another different test case.