Coursera: Machine Learning; Exercise 4 Tutorial
This document was adapted from the section titled Tutorial for Ex.4 Forward and Backpropagation (Spring 2014 session) from the Coursera Programming Exercise 4 document on 2018-05-27.
I've reformatted the document and made a few minor edits for legibility and consistency.
I do not know who the original author was; if you know, please email [email protected], or just leave a comment in this GitHub gist. Thanks! 🙇
Step 1: Expand the `y` output values into a matrix of single values (see ex4.pdf, Page 5). This is most easily done using an `eye()` matrix of size `num_labels`, with vectorized indexing by `y`, as in `eye(num_labels)(y,:)`. A typical variable name would be `y_matrix`.
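As a minimal sketch (assuming `y` is an m x 1 vector of labels in 1..`num_labels`):

```matlab
% Expand y into y_matrix: row i is the one-hot encoding of the label y(i).
eye_matrix = eye(num_labels);
y_matrix = eye_matrix(y, :);   % m x num_labels
```

Splitting it into two lines as above also works in MATLAB, where indexing a function call's result directly (as in `eye(num_labels)(y,:)`) is not allowed.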
Step 2: Perform the forward propagation (a vectorized sketch follows the list below):

- `a1` equals the `X` input matrix with a column of 1's added (bias units)
- `z2` equals the product of `a1` and Θ1 (transposed)
- `a2` is the result of passing `z2` through `g()`
- `a2` then has a column of 1's added (bias units)
- `z3` equals the product of `a2` and Θ2 (transposed)
- `a3` is the result of passing `z3` through `g()`
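Here is one possible vectorized form, assuming `X`, `Theta1`, `Theta2`, `m` (the number of training examples), and a `sigmoid()` function (the `g()` above) are in scope; the sizes in the comments match the table at the end of this tutorial:

```matlab
% Sketch of vectorized forward propagation over all m examples.
a1 = [ones(m, 1) X];    % add bias column: m x 401
z2 = a1 * Theta1';      % m x 25
a2 = sigmoid(z2);       % m x 25
a2 = [ones(m, 1) a2];   % add bias column: m x 26
z3 = a2 * Theta2';      % m x 10
a3 = sigmoid(z3);       % m x 10
```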
Step 3: Compute the unregularized cost according to ex4.pdf (top of Page 5).

ℹ️ Note: I had a hard time understanding this equation, mainly because I had the misconception that y_k^(i) is a vector; it is actually just a single number.
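For reference, the unregularized cost equation (K is `num_labels`; the inner index k picks out one scalar entry per example):

```latex
J(\Theta) = \frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ -y_k^{(i)} \log\!\left( (h_\Theta(x^{(i)}))_k \right)
         - \left(1 - y_k^{(i)}\right) \log\!\left( 1 - (h_\Theta(x^{(i)}))_k \right) \right]
```

Here each y_k^(i) is a single 0/1 entry of `y_matrix`, and (h_Θ(x^(i)))_k is the corresponding entry of `a3`.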
- Use `a3`, `y_matrix`, and `m` (the number of training examples).
- Cost should be a scalar value. If you get a vector of cost values, you can sum that vector to get the cost.
- Remember to use element-wise multiplication with the `log()` function.
- Now you can run `ex4.m` to check that the unregularized cost is correct, then you can submit Part 1 to the grader.
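One possible vectorized form (a sketch, assuming `a3` and `y_matrix` are both m x num_labels):

```matlab
% Double-sum over examples and labels collapses the cost to a scalar.
J = (1 / m) * sum(sum(-y_matrix .* log(a3) - (1 - y_matrix) .* log(1 - a3)));
```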
Step 4: Compute the regularized component of the cost according to ex4.pdf, Page 6, using Θ1 and Θ2 (ignoring the columns of bias units), along with λ and m. The easiest method to do this is to compute the regularization terms separately, then add them to the unregularized cost from Step 3.

You can run `ex4.m` to check the regularized cost, then you can submit Part 2 to the grader.
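A sketch of that approach (assuming a scalar `lambda`, plus `Theta1`, `Theta2`, and the unregularized `J` from Step 3):

```matlab
% Sum of squared weights, skipping column 1 of each Theta (the bias units).
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;
```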
Step 5: You'll need to prepare the sigmoid gradient function g′(), as shown in ex4.pdf, Page 7.
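A minimal sketch of that function, using the identity g′(z) = g(z)(1 − g(z)) from the lectures (the exercise provides a `sigmoidGradient.m` stub for this):

```matlab
% Sigmoid gradient: g'(z) = g(z) .* (1 - g(z)), element-wise.
function g = sigmoidGradient(z)
  s = 1.0 ./ (1.0 + exp(-z));  % the sigmoid g(z)
  g = s .* (1 - s);
end
```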
You can submit Part 3 to the grader.
Step 6: Implement the random initialization function as instructed in ex4.pdf, top of Page 8.
You do not submit this function to the grader.
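A sketch, assuming the `randInitializeWeights(L_in, L_out)` signature from the exercise and the ε value suggested in ex4.pdf:

```matlab
% Initialize weights uniformly in [-epsilon_init, epsilon_init]
% to break symmetry between hidden units.
function W = randInitializeWeights(L_in, L_out)
  epsilon_init = 0.12;  % value suggested in ex4.pdf
  W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
end
```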
Step 7: Now we work from the output layer back to the hidden layer, calculating how bad the errors are. See ex4.pdf, Page 9, for reference (a vectorized sketch follows the lists below).

- δ3 equals the difference between `a3` and the `y_matrix`.
- δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units), then multiplied element-wise by the g′() of `z2` (computed back in Step 2).

Note that at this point, the instructions in ex4.pdf are specific to looping implementations, so the notation there is different.

- Δ2 equals the product of δ3 (transposed) and `a2`. This step calculates the product and sum of the errors.
- Δ1 equals the product of δ2 (transposed) and `a1`. This step calculates the product and sum of the errors.
Step 8: Now we calculate the non-regularized theta gradients, using the sums of the errors we just computed (see ex4.pdf, bottom of Page 11).

- Θ1 gradient equals Δ1 scaled by `1/m`.
- Θ2 gradient equals Δ2 scaled by `1/m`.
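In code this is just a scalar division (assuming the `Theta1_grad`/`Theta2_grad` variable names from the `nnCostFunction.m` stub):

```matlab
% Unregularized gradients: average the accumulated errors over m examples.
Theta1_grad = Delta1 / m;
Theta2_grad = Delta2 / m;
```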
The `ex4.m` script will also perform gradient checking for you, using a smaller test case than the full character classification example. So if you're debugging your `nnCostFunction()` using the `keyboard` command during this, you'll suddenly see some much smaller sizes of `X` and the Θ values. Do not be alarmed.
If the feedback provided to you by `ex4.m` for gradient checking seems OK, you can now submit Part 4 to the grader.

For reference, see ex4.pdf, top of Page 12, for the right-most terms of the equation for j >= 1.
Step 9: Now we calculate the regularization terms for the theta gradients. The goal is that regularization of the gradient should not change the theta gradient `(:,1)` values (for the bias units) calculated in Step 8. There are several ways to implement this (Steps 9a and 9b below).

Method 1

- a. Calculate the regularization for indexes `(:,2:end)`.
- b. Add ☝️ them to the theta gradients `(:,2:end)`.

Method 2 (see the sketch after this list)

- a. Calculate the regularization for the entire theta gradient, then overwrite the `(:,1)` values with `0`.
- b. Add ☝️ to the entire matrix.

Pick a method, and calculate the regularization terms as follows:

- `(λ/m)*Θ1` (using either Method 1 or Method 2), and...
- `(λ/m)*Θ2` (using either Method 1 or Method 2)

Add these regularization terms to the appropriate Θ1 gradient and Θ2 gradient terms from Step 8 (using either Method 1 or Method 2). Avoid modifying the bias unit of the theta gradients.
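For instance, a minimal sketch of Method 2 (assuming the `Theta1_grad`/`Theta2_grad` variables from Step 8 and a scalar `lambda`):

```matlab
% Method 2: zero the bias column of a copy of each Theta,
% so the bias-unit gradients are left unregularized.
Theta1_reg = Theta1;   Theta1_reg(:, 1) = 0;
Theta2_reg = Theta2;   Theta2_reg(:, 1) = 0;
Theta1_grad = Theta1_grad + (lambda / m) * Theta1_reg;
Theta2_grad = Theta2_grad + (lambda / m) * Theta2_reg;
```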
⚠️ Note: there is an erratum in the lecture video and slides regarding some missing parentheses for this calculation. The ex4.pdf file is correct.
The `ex4.m` script will provide you feedback regarding the acceptable relative difference. If all seems well, you can submit Part 5 to the grader. Now pat yourself on the back.
Here are the sizes for the character recognition example, using the method described in this tutorial:

- `a1`: 5000x401
- `z2`: 5000x25
- `a2`: 5000x26
- `a3`: 5000x10
- `d3`: 5000x10
- `d2`: 5000x25
- `Theta1`, `Delta1` and `Theta1_grad`: 25x401
- `Theta2`, `Delta2` and `Theta2_grad`: 10x26
ℹ️ Note: The `ex4.m` script uses several test cases of different sizes, and the submit grader uses yet another different test case.