ML Exam Cheat Sheet — Worked Examples

Formulas

Topic	Formula	Variables
Z-score	`z = (x - μ) / σ`	x=value, μ=mean, σ=std dev
Min-max	`x' = (x - xmin) / (xmax - xmin)`	xmin/xmax=dataset min/max
Accuracy	`(TP + TN) / N`	N=total samples
Precision	`TP / (TP + FP)`	"of all predicted positive, how many were right"
Recall	`TP / (TP + FN)`	"of all actual positive, how many did we catch"
F1	`2 × (Precision × Recall) / (Precision + Recall)`	harmonic mean of P and R
OLS slope	`m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²`	x̄=mean of x, ȳ=mean of y
OLS intercept	`b = ȳ - m × x̄`	plug in after finding m
MSE	`(1/n) × Σ(yᵢ - ŷᵢ)²`	yᵢ=actual, ŷᵢ=predicted
R²	`1 - SS_res/SS_tot`	SS_res=Σ(yᵢ-ŷᵢ)², SS_tot=Σ(yᵢ-ȳ)²
GD Loss	`L = 1/(2n) × Σ(ŷᵢ - yᵢ)²`	n=number of points
GD ∂L/∂m	`(1/n) × Σ(ŷᵢ - yᵢ)xᵢ`	error × x for each point
GD ∂L/∂b	`(1/n) × Σ(ŷᵢ - yᵢ)`	just the errors
GD update	`m₁ = m₀ - α × ∂L/∂m`, `b₁ = b₀ - α × ∂L/∂b`	α=learning rate; update m and b from the same gradients
K-Means distance	`d = √((x₂-x₁)² + (y₂-y₁)²)`	Euclidean distance
K-Means centroid	`new_c = mean of all points in cluster`	x and y averaged separately
Naive Bayes	`P(class|features) ∝ P(class) × P(f1|class) × ...`	∝ means "proportional to"
Gini impurity	`Gini = 1 - Σ(pᵢ)²`	pᵢ=proportion of class i at node
Gini gain	`Gain = Gini(parent) - weighted avg Gini(children)`	pick feature with highest gain

Topic 1: Feature Scaling

A dataset contains the following recorded temperatures (°C) from a weather station over 5 days: [23, 45, 31, 52, 18]

Apply z-score normalization and min-max scaling to all values.

Z-score

Thinking: "I need mean and std first. Mean = sum/n."

(23+45+31+52+18)/5 = 169/5 = 33.8

std = √[((23-33.8)² + (45-33.8)² + (31-33.8)² + (52-33.8)² + (18-33.8)²) / 5] = √[(116.64 + 125.44 + 7.84 + 331.24 + 249.64) / 5] = √[830.8/5] = √166.16 = 12.890

Thinking: "Now plug each value into (x - mean)/std."

z1 = (23-33.8)/12.890 = -0.838
z2 = (45-33.8)/12.890 = +0.869
z3 = (31-33.8)/12.890 = -0.217
z4 = (52-33.8)/12.890 = +1.412
z5 = (18-33.8)/12.890 = -1.226

Check: values centered around 0, some negative some positive. ✓
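The z-score steps can be double-checked with a short Python sketch. It assumes the population standard deviation (divide by n, as in the hand calculation); the names `temps` and `z_scores` are illustrative only.

```python
import math

temps = [23, 45, 31, 52, 18]  # temperatures from the example

n = len(temps)
mean = sum(temps) / n
# Population standard deviation: divide by n, not n - 1
# (assumption chosen to match the hand calculation)
std = math.sqrt(sum((x - mean) ** 2 for x in temps) / n)

z_scores = [(x - mean) / std for x in temps]

print(f"mean={mean:.1f}, std={std:.3f}")
print([round(z, 3) for z in z_scores])
```

A quick sanity check: the z-scores should sum to zero, up to floating-point error.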

Min-max

Thinking: "Find min and max first. Min=18, Max=52. Range = 52-18 = 34."

x1 = (23-18)/34 = 0.147
x2 = (45-18)/34 = 0.794
x3 = (31-18)/34 = 0.382
x4 = (52-18)/34 = 1.0
x5 = (18-18)/34 = 0.0

Check: min maps to 0, max maps to 1. ✓
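The min-max steps can be sketched the same way. The zero-range guard is an addition for safety, not part of the worked example.

```python
temps = [23, 45, 31, 52, 18]

lo, hi = min(temps), max(temps)
value_range = hi - lo
# Guard against a zero range (all values identical), which would divide by zero
scaled = [(x - lo) / value_range if value_range else 0.0 for x in temps]

print([round(s, 3) for s in scaled])  # [0.147, 0.794, 0.382, 1.0, 0.0]
```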

Topic 2: Confusion Matrix

A spam detection model was tested on 200 emails. The results are:

45 spam emails correctly identified as spam
12 spam emails missed (predicted as not spam)
18 legitimate emails wrongly flagged as spam
125 legitimate emails correctly identified as not spam

Calculate accuracy, precision, recall, and F1. Is this dataset balanced or imbalanced? If FN is more costly than FP, which metric matters most?

Thinking: "First map the story to TP, TN, FP, FN. Spam = positive class."

TP = 45 (spam correctly caught)
FN = 12 (spam missed — predicted not spam)
FP = 18 (legit flagged as spam)
TN = 125 (legit correctly cleared)
N = 200

Accuracy = (45+125)/200 = 170/200 = 0.85

Precision = 45/(45+18) = 45/63 = 0.714

Recall = 45/(45+12) = 45/57 = 0.789

F1 = 2 × (0.714 × 0.789)/(0.714 + 0.789) = 2 × 0.563/1.503 = 0.750

Thinking: "Balanced? Actual spam = TP + FN = 57, not spam = FP + TN = 143. 57/200 = 28.5% positives vs 71.5% negatives, far from a 50/50 split. Imbalanced."

Thinking: "FN more costly = missing spam is worse than false alarms. That means Recall is the key metric — it measures how many actual spams we caught."

Recall = 0.789 → catching ~79% of spam. Decent but not great for a high-stakes filter.
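A minimal Python sketch of the same metrics, starting from the four counts. Variable names are illustrative.

```python
tp, fn, fp, tn = 45, 12, 18, 125  # counts from the spam example
n = tp + fn + fp + tn

accuracy = (tp + tn) / n
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Share of the positive class (actual spam) in the test set
positive_rate = (tp + fn) / n

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
print(f"positive class share={positive_rate:.1%}")
```

F1 can also be written as 2TP / (2TP + FP + FN) = 90/120, which gives 0.75 without intermediate rounding.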

Topic 3: OLS Linear Regression

A researcher records hours of sunlight per day and crop yield (kg) for 5 farms:

Hours of sunlight (x)	Crop yield (y)
3	42
5	61
7	78
8	85
11	103

Fit a linear regression model. Predict yield for 9 hours. If actual yield at 9 hours was 91 kg, what is the residual?

Thinking: "Calculate means first."

x̄ = (3+5+7+8+11)/5 = 34/5 = 6.8
ȳ = (42+61+78+85+103)/5 = 369/5 = 73.8

Thinking: "Build a table for slope numerator and denominator."

xᵢ	yᵢ	xᵢ-x̄	yᵢ-ȳ	(xᵢ-x̄)(yᵢ-ȳ)	(xᵢ-x̄)²
3	42	-3.8	-31.8	120.84	14.44
5	61	-1.8	-12.8	23.04	3.24
7	78	0.2	4.2	0.84	0.04
8	85	1.2	11.2	13.44	1.44
11	103	4.2	29.2	122.64	17.64
Σ				280.8	36.8

m = 280.8 / 36.8 = 7.63

Thinking: "Now intercept: b = ȳ - m×x̄"

b = 73.8 - 7.63×6.8 = 73.8 - 51.88 = 21.92

Model: ŷ = 7.63x + 21.92

Thinking: "Predict for x=9."

ŷ = 7.63×9 + 21.92 = 68.67 + 21.92 = 90.59

Thinking: "Residual = actual - predicted."

Residual = 91 - 90.59 = +0.41

Positive residual = model slightly underestimated.
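The OLS steps can be reproduced in a few lines of Python. This sketch keeps m and b at full precision instead of rounding between steps, so intermediate digits can differ slightly from the hand values.

```python
xs = [3, 5, 7, 8, 11]        # hours of sunlight
ys = [42, 61, 78, 85, 103]   # crop yield (kg)

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope numerator and denominator (the last two table columns)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)

m = s_xy / s_xx
b = y_bar - m * x_bar

y_hat = m * 9 + b        # prediction for 9 hours
residual = 91 - y_hat    # actual - predicted

print(f"m={m:.2f}, b={b:.2f}, prediction={y_hat:.2f}, residual={residual:.2f}")
```

Carrying full precision gives b ≈ 21.91 instead of the hand-rounded 21.92; the prediction and residual still round to 90.59 and +0.41.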

Topic 4: MSE + R²

Using the crop yield model from Topic 3 (ŷ = 7.63x + 21.92), calculate MSE and R² for the 5 training points. Interpret the R² value.

Thinking: "Generate predictions for all 5 points, then compute errors."

xᵢ	yᵢ	ŷᵢ	(yᵢ-ŷᵢ)	(yᵢ-ŷᵢ)²
3	42	44.81	-2.81	7.90
5	61	60.07	0.93	0.86
7	78	75.33	2.67	7.13
8	85	82.96	2.04	4.16
11	103	105.85	-2.85	8.12

MSE = (7.90+0.86+7.13+4.16+8.12)/5 = 28.17/5 = 5.63

Thinking: "For R² I need SS_res and SS_tot. SS_res = sum of squared errors above. SS_tot uses ȳ=73.8."

SS_res = 28.17

yᵢ	yᵢ-ȳ	(yᵢ-ȳ)²
42	-31.8	1011.24
61	-12.8	163.84
78	4.2	17.64
85	11.2	125.44
103	29.2	852.64

SS_tot = 2170.8

R² = 1 - 28.17/2170.8 = 1 - 0.013 = 0.987

Thinking: "R²=0.987 means the model explains 98.7% of the variance in crop yield. Excellent fit."
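A short Python check of MSE and R², assuming the rounded coefficients m=7.63 and b=21.92 so the numbers line up with the tables.

```python
xs = [3, 5, 7, 8, 11]
ys = [42, 61, 78, 85, 103]
m, b = 7.63, 21.92  # rounded coefficients from Topic 3

preds = [m * x + b for x in xs]

# Residual sum of squares and mean squared error
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
mse = ss_res / len(ys)

# Total sum of squares around the mean of y
y_bar = sum(ys) / len(ys)
ss_tot = sum((y - y_bar) ** 2 for y in ys)

r2 = 1 - ss_res / ss_tot
print(f"MSE={mse:.2f}, SS_res={ss_res:.2f}, SS_tot={ss_tot:.1f}, R2={r2:.3f}")
```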

Topic 5: Gradient Descent

A dataset shows hours studied vs exam score:

x	y
2	35
4	55
6	70
8	88

Start at m=0, b=0 with learning rate α=0.1 and perform one full iteration of gradient descent.

Part a: Initial loss L(0,0)

Thinking: "All predictions are 0 since m=0, b=0. Errors are just -y."

L = 1/(2×4) × [(0-35)² + (0-55)² + (0-70)² + (0-88)²] = 1/8 × [1225 + 3025 + 4900 + 7744] = 1/8 × 16894 = 2111.75

Part b: Gradient formulas

Thinking: "Write these down — the 2 from the power cancels the 2 in 1/(2n)."

∂L/∂m = (1/n) × Σ(ŷᵢ - yᵢ) × xᵢ
∂L/∂b = (1/n) × Σ(ŷᵢ - yᵢ)

Part c: Evaluate gradients at m=0, b=0

xᵢ	yᵢ	ŷᵢ-yᵢ	(ŷᵢ-yᵢ)×xᵢ
2	35	-35	-70
4	55	-55	-220
6	70	-70	-420
8	88	-88	-704

∂L/∂m = (1/4) × (-70-220-420-704) = (1/4) × (-1414) = -353.5

∂L/∂b = (1/4) × (-35-55-70-88) = (1/4) × (-248) = -62

Part d: Update parameters

Thinking: "Negative gradient → step in positive direction."

m₁ = 0 - 0.1 × (-353.5) = 35.35
b₁ = 0 - 0.1 × (-62) = 6.2

Part e: New loss at m=35.35, b=6.2

Thinking: "Build new predictions with ŷ = 35.35x + 6.2"

xᵢ	yᵢ	ŷᵢ	ŷᵢ-yᵢ	(ŷᵢ-yᵢ)²
2	35	76.9	41.9	1755.61
4	55	147.6	92.6	8574.76
6	70	218.3	148.3	21992.89
8	88	289.0	201.0	40401.00

L = 1/8 × 72724.26 = 9090.53

Thinking: "Loss went up — α=0.1 is too large, causing overshoot. Just report the number on the exam."
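Parts a to e can be replayed with a small Python sketch. The function names `loss` and `gradients` are illustrative; the loss is the 1/(2n) form from the formula table.

```python
xs = [2, 4, 6, 8]       # hours studied
ys = [35, 55, 70, 88]   # exam scores
alpha = 0.1

def loss(m, b):
    """Half mean squared error: 1/(2n) * sum((y_hat - y)^2)."""
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradients(m, b):
    """Partial derivatives of the loss with respect to m and b."""
    n = len(xs)
    errors = [m * x + b - y for x, y in zip(xs, ys)]
    grad_m = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return grad_m, grad_b

m0, b0 = 0.0, 0.0
grad_m, grad_b = gradients(m0, b0)

# One update step: both parameters use gradients evaluated at (m0, b0)
m1 = m0 - alpha * grad_m
b1 = b0 - alpha * grad_b

print(f"initial loss={loss(m0, b0):.2f}")
print(f"grad_m={grad_m}, grad_b={grad_b}")
print(f"m1={m1:.2f}, b1={b1:.2f}, new loss={loss(m1, b1):.2f}")
```

Rerunning with a much smaller α (for example 0.01) is a quick way to see the loss decrease instead of overshoot.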

Topic 6: K-Means Clustering

A gym tracks 6 members by age and weekly workout hours. Perform one full iteration of K-Means with initial centroids C1=(2,20) and C2=(8,50).

Member	Workout hours/week (x)	Age (y)
A	1	15
B	3	25
C	4	30
D	7	45
E	9	55
F	10	60

Step 1: Assign to nearest centroid

Thinking: "For every point compute d=√((x-cx)²+(y-cy)²) to both centroids. Assign to closer one."

Member	d to C1=(2,20)	d to C2=(8,50)	Assigned
A(1,15)	√(1+25)=5.10	√(49+1225)=35.69	C1
B(3,25)	√(1+25)=5.10	√(25+625)=25.50	C1
C(4,30)	√(4+100)=10.20	√(16+400)=20.40	C1
D(7,45)	√(25+625)=25.50	√(1+25)=5.10	C2
E(9,55)	√(49+1225)=35.69	√(1+25)=5.10	C2
F(10,60)	√(64+1600)=40.79	√(4+100)=10.20	C2

Step 2: Update centroids

Thinking: "New centroid = average of all points in that cluster."

C1 cluster: A(1,15), B(3,25), C(4,30)

x: (1+3+4)/3 = 2.67
y: (15+25+30)/3 = 23.33
New C1 = (2.67, 23.33)

C2 cluster: D(7,45), E(9,55), F(10,60)

x: (7+9+10)/3 = 8.67
y: (45+55+60)/3 = 53.33
New C2 = (8.67, 53.33)
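One K-Means iteration (assign, then update) as a minimal Python sketch. It assumes every cluster keeps at least one member; an empty cluster would need separate handling. Cluster index 0 stands for C1 and 1 for C2.

```python
import math

points = {"A": (1, 15), "B": (3, 25), "C": (4, 30),
          "D": (7, 45), "E": (9, 55), "F": (10, 60)}
centroids = [(2, 20), (8, 50)]  # initial C1, C2

def euclidean(p, q):
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

# Step 1: assign each point to the index of its nearest centroid
assignments = {
    name: min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
    for name, p in points.items()
}

# Step 2: recompute each centroid as the coordinate-wise mean of its members
new_centroids = []
for i in range(len(centroids)):
    members = [points[name] for name, c in assignments.items() if c == i]
    new_centroids.append((
        sum(p[0] for p in members) / len(members),
        sum(p[1] for p in members) / len(members),
    ))

print(assignments)
print([(round(x, 2), round(y, 2)) for x, y in new_centroids])
```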

Topic 7: Naive Bayes

A doctor wants to predict if a patient has a disease based on 3 symptoms:

Patient	Fever	Cough	Fatigue	Disease
1	Yes	Yes	Yes	Yes
2	No	Yes	No	No
3	Yes	No	Yes	Yes
4	No	No	No	No
5	Yes	Yes	No	Yes
6	No	Yes	Yes	No

New patient: Fever=Yes, Cough=Yes, Fatigue=No. Does this patient have the disease?

Step 1: Priors

Thinking: "Count each class. Yes=3, No=3, total=6."

P(Yes) = 3/6 = 0.5
P(No) = 3/6 = 0.5

Step 2: Likelihoods

Thinking: "For each class, count how often each feature value appears. Yes patients: 1,3,5. No patients: 2,4,6."

For Yes (patients 1, 3, 5):

P(Fever=Yes | Yes) = 3/3 = 1.0
P(Cough=Yes | Yes) = 2/3 = 0.667
P(Fatigue=No | Yes) = 1/3 = 0.333

For No (patients 2, 4, 6):

P(Fever=Yes | No) = 0/3 = 0
P(Cough=Yes | No) = 2/3 = 0.667
P(Fatigue=No | No) = 2/3 = 0.667

Step 3: Multiply

Thinking: "Multiply prior × all likelihoods for each class."

P(Yes | features) ∝ 0.5 × 1.0 × 0.667 × 0.333 = 0.111

P(No | features) ∝ 0.5 × 0 × 0.667 × 0.667 = 0

Thinking: "0.111 > 0. One zero kills the entire No product — expected, just report it."

Prediction: Disease = Yes
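The three steps (priors, likelihoods, multiply) can be sketched in Python by counting rows. No smoothing is applied, to match the hand calculation; the helper name `score` is illustrative.

```python
# Rows: (fever, cough, fatigue, disease)
data = [
    ("Yes", "Yes", "Yes", "Yes"),
    ("No", "Yes", "No", "No"),
    ("Yes", "No", "Yes", "Yes"),
    ("No", "No", "No", "No"),
    ("Yes", "Yes", "No", "Yes"),
    ("No", "Yes", "Yes", "No"),
]
new_patient = ("Yes", "Yes", "No")  # fever, cough, fatigue

def score(label):
    """Unnormalized posterior: prior times the product of per-feature likelihoods."""
    rows = [r for r in data if r[3] == label]
    result = len(rows) / len(data)  # prior P(class)
    for i, value in enumerate(new_patient):
        matches = sum(1 for r in rows if r[i] == value)
        result *= matches / len(rows)  # P(feature_i = value | class), no smoothing
    return result

scores = {label: score(label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()}, prediction)
```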

Topic 8: Gini Impurity + Gini Gain

A dataset of 10 patients, predicting if they need surgery (Yes/No) based on 3 features:

Patient	Age	BMI	Smoker	Surgery
1	Old	High	Yes	Yes
2	Young	Low	No	No
3	Old	Low	Yes	Yes
4	Young	High	No	No
5	Old	High	Yes	Yes
6	Young	Low	No	No
7	Old	Low	No	No
8	Young	High	Yes	Yes
9	Old	High	Yes	Yes
10	Young	Low	No	No

Which feature should be the root node?

Step 1: Gini of whole dataset

Thinking: "Count Yes and No. Yes=5, No=5, total=10."

Gini(root) = 1 - (5/10)² - (5/10)² = 1 - 0.25 - 0.25 = 0.5

Step 2: Gini for Age split

Thinking: "Split into Old and Young. Count Yes/No in each group."

Old: patients 1,3,5,7,9 → Yes=4, No=1
Young: patients 2,4,6,8,10 → Yes=1, No=4

Gini(Old) = 1 - (4/5)² - (1/5)² = 1 - 0.64 - 0.04 = 0.32
Gini(Young) = 1 - (1/5)² - (4/5)² = 0.32

Gini(Age) = (5/10)×0.32 + (5/10)×0.32 = 0.32

Gini Gain(Age) = 0.5 - 0.32 = 0.18

Step 3: Gini for BMI split

High: patients 1,4,5,8,9 → Yes=4, No=1
Low: patients 2,3,6,7,10 → Yes=1, No=4

Gini(High) = 1 - (4/5)² - (1/5)² = 0.32
Gini(Low) = 1 - (1/5)² - (4/5)² = 0.32

Gini(BMI) = (5/10)×0.32 + (5/10)×0.32 = 0.32

Gini Gain(BMI) = 0.5 - 0.32 = 0.18

Step 4: Gini for Smoker split

Thinking: "Smoker=Yes: patients 1,3,5,8,9 → Yes=5, No=0. Smoker=No: patients 2,4,6,7,10 → Yes=0, No=5."

Gini(Smoker=Yes) = 1 - (5/5)² - (0/5)² = 1 - 1 - 0 = 0.0
Gini(Smoker=No) = 1 - (0/5)² - (5/5)² = 0.0

Thinking: "Both groups are perfectly pure — Gini=0 means no mixing at all."

Gini(Smoker) = (5/10)×0 + (5/10)×0 = 0.0

Gini Gain(Smoker) = 0.5 - 0.0 = 0.50

Conclusion

Thinking: "Compare all three Gini Gains. Pick the highest — that feature splits the data most cleanly."

Feature	Gini Gain
Age	0.18
BMI	0.18
Smoker	0.50

Root node = Smoker — it perfectly separates the classes.
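The per-feature Gini gain can be checked with a short Python sketch. The helpers `gini` and `gini_gain` are illustrative names; the rows are copied from the patient table.

```python
# Columns: age, bmi, smoker, surgery
rows = [
    ("Old", "High", "Yes", "Yes"),
    ("Young", "Low", "No", "No"),
    ("Old", "Low", "Yes", "Yes"),
    ("Young", "High", "No", "No"),
    ("Old", "High", "Yes", "Yes"),
    ("Young", "Low", "No", "No"),
    ("Old", "Low", "No", "No"),
    ("Young", "High", "Yes", "Yes"),
    ("Old", "High", "Yes", "Yes"),
    ("Young", "Low", "No", "No"),
]
features = {"Age": 0, "BMI": 1, "Smoker": 2}

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(col):
    """Parent impurity minus the size-weighted impurity of the children."""
    parent = [r[3] for r in rows]
    weighted = 0.0
    for value in {r[col] for r in rows}:
        child = [r[3] for r in rows if r[col] == value]
        weighted += len(child) / len(rows) * gini(child)
    return gini(parent) - weighted

gains = {name: round(gini_gain(col), 2) for name, col in features.items()}
print(gains)                      # {'Age': 0.18, 'BMI': 0.18, 'Smoker': 0.5}
print(max(gains, key=gains.get))  # Smoker
```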

Topic 9: Cosine Similarity + Item-Based Collaborative Filtering

4 users rated 4 movies (0 = not rated):

User	M1	M2	M3	M4
U1	5	3	0	1
U2	4	0	4	1
U3	1	1	0	5
U4	0	0	5	4

(a) Calculate cosine similarity between M1 and M2, M1 and M3, M1 and M4. (b) Predict U1's rating for M3 using item-based collaborative filtering.

Part a: Cosine Similarity

Thinking: "Treat each movie as a vector of user ratings. Cosine similarity = dot product divided by product of magnitudes."

Formula: cos(A,B) = (A·B) / (||A|| × ||B||)

M1 = [5, 4, 1, 0]
M2 = [3, 0, 1, 0]
M3 = [0, 4, 0, 5]
M4 = [1, 1, 5, 4]

sim(M1, M2):

A·B = (5×3)+(4×0)+(1×1)+(0×0) = 15+0+1+0 = 16
||M1|| = √(25+16+1+0) = √42 = 6.48
||M2|| = √(9+0+1+0) = √10 = 3.16
sim = 16/(6.48×3.16) = 16/20.48 = 0.781

sim(M1, M3):

A·B = (5×0)+(4×4)+(1×0)+(0×5) = 0+16+0+0 = 16
||M3|| = √(0+16+0+25) = √41 = 6.40
sim = 16/(6.48×6.40) = 16/41.47 = 0.386

sim(M1, M4):

A·B = (5×1)+(4×1)+(1×5)+(0×4) = 5+4+5+0 = 14
||M4|| = √(1+1+25+16) = √43 = 6.56
sim = 14/(6.48×6.56) = 14/42.51 = 0.329

Part b: Predict U1's rating for M3

Thinking: "U1 rated M1=5, M2=3, M4=1. Use similarities of those movies to M3 as weights. Need sim(M2,M3) and sim(M3,M4) as well."

sim(M2, M3):

A·B = (3×0)+(0×4)+(1×0)+(0×5) = 0 → no overlap

sim(M3, M4):

A·B = (0×1)+(4×1)+(0×5)+(5×4) = 0+4+0+20 = 24
sim = 24/(6.40×6.56) = 24/41.98 = 0.572

Formula: predicted = Σ(sim(M3,Mₓ) × rating(U1,Mₓ)) / Σ|sim(M3,Mₓ)|

Thinking: "Drop M2 — sim=0 contributes nothing."

predicted(U1,M3) = (0.386×5 + 0×3 + 0.572×1) / (0.386 + 0 + 0.572) = (1.93 + 0 + 0.572) / 0.958 = 2.502 / 0.958 = 2.61

U1 would rate M3 around 2.6, which suggests below-average interest.
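Both parts can be reproduced with a minimal Python sketch. As in the hand calculation, a 0 ("not rated") is treated as a literal zero inside the movie vectors; the helpers `column`, `cosine`, and `predict` are illustrative names. Similarities are kept at full precision, so the result can differ in the last digit from values rounded at each step.

```python
import math

# Rows are users U1..U4, columns are movies M1..M4; 0 means "not rated"
ratings = [
    [5, 3, 0, 1],
    [4, 0, 4, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
]

def column(j):
    """Ratings vector for movie j across all users."""
    return [row[j] for row in ratings]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict(user, item):
    """Similarity-weighted average of the user's existing ratings."""
    num = den = 0.0
    for j, r in enumerate(ratings[user]):
        if j == item or r == 0:
            continue  # skip the target item and unrated items
        s = cosine(column(item), column(j))
        num += s * r
        den += abs(s)
    return num / den if den else 0.0

print(round(cosine(column(0), column(1)), 3))  # sim(M1, M2)
print(round(predict(0, 2), 2))                 # U1's predicted rating for M3
```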

Topic 10: MLP Forward Pass with ReLU

A neural network has:

Input: X = [2, 3]
Activation: ReLU(z) = max(0, z)
No bias terms

W1 = [[ 0.5, -1.0],
      [-0.5,  0.8],
      [ 1.0,  0.3]]

W2 = [[0.6, -0.4, 0.9]]

Perform a full forward pass and compute ŷ.

Step 1: Compute Z1 = W1 · X

Thinking: "W1 is (3×2), X is (2×1). Each row of W1 dot-products with X. Result is 3×1 — one value per hidden neuron."

Neuron 1: (0.5×2) + (-1.0×3) = 1.0 - 3.0 = -2.0
Neuron 2: (-0.5×2) + (0.8×3) = -1.0 + 2.4 = +1.4
Neuron 3: (1.0×2) + (0.3×3) = 2.0 + 0.9 = +2.9

Z1 = [-2.0, 1.4, 2.9]

Step 2: Apply ReLU → H = ReLU(Z1)

Thinking: "ReLU = max(0, z). Negatives become 0. Positives pass through unchanged."

ReLU(-2.0) = 0
ReLU(1.4) = 1.4
ReLU(2.9) = 2.9

H = [0, 1.4, 2.9]

Thinking: "Neuron 1 was suppressed — only neurons 2 and 3 pass signal forward."

Step 3: Compute Z2 = W2 · H

Thinking: "W2 is (1×3), H is (3×1). Result is a single scalar."

Z2 = (0.6×0) + (-0.4×1.4) + (0.9×2.9) = 0 + (-0.56) + 2.61 = 2.05

Output

Thinking: "No activation on output layer — return Z2 directly."

ŷ = 2.05
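The forward pass can be sketched in plain Python with no libraries. The helpers `matvec` and `relu` are illustrative names; weights and input are copied from the problem statement.

```python
X = [2, 3]
W1 = [[0.5, -1.0],
      [-0.5, 0.8],
      [1.0, 0.3]]          # shape (3 hidden neurons x 2 inputs)
W2 = [[0.6, -0.4, 0.9]]    # shape (1 output x 3 hidden neurons)

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v):
    """Element-wise max(0, z)."""
    return [max(0.0, z) for z in v]

Z1 = matvec(W1, X)   # hidden pre-activations, one per neuron
H = relu(Z1)         # negatives clipped to 0
Z2 = matvec(W2, H)   # output layer, no activation
y_hat = Z2[0]

print([round(z, 2) for z in Z1], [round(h, 2) for h in H], round(y_hat, 2))
```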

Key things to remember

W1 shape: (hidden_nodes × input_size)
W2 shape: (output_nodes × hidden_nodes)
Always apply ReLU after Z1, before passing to next layer
ReLU kills negatives — check each value carefully
Final output has no activation unless the question specifies one

BexTuychiev/ml-exam-cheatsheet.md