@BexTuychiev
Last active June 11, 2026 03:31
ML Exam Cheat Sheet — Worked Examples


Formulas

| Topic | Formula | Variables |
|---|---|---|
| Z-score | z = (x - μ) / σ | x=value, μ=mean, σ=std dev |
| Min-max | x' = (x - xmin) / (xmax - xmin) | xmin/xmax=dataset min/max |
| Accuracy | (TP + TN) / N | N=total samples |
| Precision | TP / (TP + FP) | "of all predicted positive, how many were right" |
| Recall | TP / (TP + FN) | "of all actual positive, how many did we catch" |
| F1 | 2 × (Precision × Recall) / (Precision + Recall) | harmonic mean of P and R |
| OLS slope | m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² | x̄=mean of x, ȳ=mean of y |
| OLS intercept | b = ȳ - m × x̄ | plug in after finding m |
| MSE | (1/n) × Σ(yᵢ - ŷᵢ)² | yᵢ=actual, ŷᵢ=predicted |
| R² | 1 - SS_res/SS_tot | SS_res=Σ(yᵢ-ŷᵢ)², SS_tot=Σ(yᵢ-ȳ)² |
| GD Loss | L = 1/(2n) × Σ(ŷᵢ - yᵢ)² | n=number of points |
| GD ∂L/∂m | (1/n) × Σ(ŷᵢ - yᵢ)xᵢ | error × x for each point |
| GD ∂L/∂b | (1/n) × Σ(ŷᵢ - yᵢ) | just the errors |
| GD update | m₁ = m₀ - α × ∂L/∂m | α=learning rate; b updates the same way with ∂L/∂b |
| K-Means distance | d = √((x₂-x₁)² + (y₂-y₁)²) | Euclidean distance |
| K-Means centroid | new_c = mean of all points in cluster | x and y averaged separately |
| Naive Bayes | P(class \| features) ∝ P(class) × P(f1 \| class) × ... | ∝ means "proportional to" |
| Gini impurity | Gini = 1 - Σ(pᵢ)² | pᵢ=proportion of class i at node |
| Gini gain | Gain = Gini(parent) - weighted avg Gini(children) | pick feature with highest gain |

Topic 1: Feature Scaling

A dataset contains the following recorded temperatures (°C) from a weather station over 5 days: [23, 45, 31, 52, 18]

Apply z-score normalization and min-max scaling to all values.

Z-score

Thinking: "I need mean and std first. Mean = sum/n."

(23+45+31+52+18)/5 = 169/5 = 33.8

std = √[((23-33.8)² + (45-33.8)² + (31-33.8)² + (52-33.8)² + (18-33.8)²) / 5]
    = √[(116.64 + 125.44 + 7.84 + 331.24 + 249.64) / 5]
    = √[830.8/5] = √166.16 = 12.890

This is the population std (divide by n = 5); a sample std would divide by n-1 instead.

Thinking: "Now plug each value into (x - mean)/std."

  • z1 = (23-33.8)/12.890 = -0.838
  • z2 = (45-33.8)/12.890 = +0.869
  • z3 = (31-33.8)/12.890 = -0.217
  • z4 = (52-33.8)/12.890 = +1.412
  • z5 = (18-33.8)/12.890 = -1.226

Check: values centered around 0, some negative some positive. ✓

Min-max

Thinking: "Find min and max first. Min=18, Max=52. Range = 52-18 = 34."

  • x1 = (23-18)/34 = 0.147
  • x2 = (45-18)/34 = 0.794
  • x3 = (31-18)/34 = 0.382
  • x4 = (52-18)/34 = 1.0
  • x5 = (18-18)/34 = 0.0

Check: min maps to 0, max maps to 1. ✓
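The two scalings above can be checked with a few lines of standard-library Python. This is a minimal sketch, assuming the population standard deviation (divide by n) is the one intended; the function names `z_scores` and `min_max` are illustrative and not part of the original notes.

```python
from statistics import mean, pstdev

temps = [23, 45, 31, 52, 18]

def z_scores(values):
    """Standardize using the population std (pstdev divides by n)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(x - mu) / sigma for x in values]

def min_max(values):
    """Rescale so the minimum maps to 0 and the maximum maps to 1."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

z = z_scores(temps)
scaled = min_max(temps)

print([round(v, 3) for v in z])
print([round(v, 3) for v in scaled])
```

Swapping `pstdev` for `statistics.stdev` would give the sample version (divide by n-1), which produces slightly smaller z-scores in absolute value.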


Topic 2: Confusion Matrix

A spam detection model was tested on 200 emails. The results are:

  • 45 spam emails correctly identified as spam
  • 12 spam emails missed (predicted as not spam)
  • 18 legitimate emails wrongly flagged as spam
  • 125 legitimate emails correctly identified as not spam

Calculate accuracy, precision, recall, and F1. Is this dataset balanced or imbalanced? If FN is more costly than FP, which metric matters most?

Thinking: "First map the story to TP, TN, FP, FN. Spam = positive class."

  • TP = 45 (spam correctly caught)
  • FN = 12 (spam missed — predicted not spam)
  • FP = 18 (legit flagged as spam)
  • TN = 125 (legit correctly cleared)
  • N = 200

Accuracy = (45+125)/200 = 170/200 = 0.85

Precision = 45/(45+18) = 45/63 = 0.714

Recall = 45/(45+12) = 45/57 = 0.789

F1 = 2 × (0.714 × 0.789)/(0.714 + 0.789) = 2 × 0.563/1.503 = 0.750

Thinking: "Balanced? Spam = TP + FN = 57, not spam = FP + TN = 143. That is 28.5% spam vs 71.5% not spam, well short of a 50/50 split. Imbalanced."

Thinking: "FN more costly = missing spam is worse than false alarms. That means Recall is the key metric — it measures how many actual spams we caught."

Recall = 0.789 → catching ~79% of spam. Decent but not great for a high-stakes filter.
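The four metrics follow directly from the confusion-matrix counts, so a short helper is enough to double-check the arithmetic. This is a sketch only: `classification_metrics` is a hypothetical name, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from raw counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero-division handling).
    """
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=45, fn=12, fp=18, tn=125)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.850
# precision: 0.714
# recall: 0.789
# f1: 0.750
```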


Topic 3: OLS Linear Regression

A researcher records hours of sunlight per day and crop yield (kg) for 5 farms:

Hours of sunlight (x) Crop yield (y)
3 42
5 61
7 78
8 85
11 103

Fit a linear regression model. Predict yield for 9 hours. If actual yield at 9 hours was 91kg, what is the residual?

Thinking: "Calculate means first."

x̄ = (3+5+7+8+11)/5 = 34/5 = 6.8

ȳ = (42+61+78+85+103)/5 = 369/5 = 73.8

Thinking: "Build a table for slope numerator and denominator."

xᵢ yᵢ xᵢ-x̄ yᵢ-ȳ (xᵢ-x̄)(yᵢ-ȳ) (xᵢ-x̄)²
3 42 -3.8 -31.8 120.84 14.44
5 61 -1.8 -12.8 23.04 3.24
7 78 0.2 4.2 0.84 0.04
8 85 1.2 11.2 13.44 1.44
11 103 4.2 29.2 122.64 17.64
Σ 280.8 36.8

m = 280.8 / 36.8 = 7.63

Thinking: "Now intercept: b = ȳ - m×x̄"

b = 73.8 - 7.63×6.8 = 73.8 - 51.88 = 21.92

Model: ŷ = 7.63x + 21.92

Thinking: "Predict for x=9."

ŷ = 7.63×9 + 21.92 = 68.67 + 21.92 = 90.59

Thinking: "Residual = actual - predicted."

Residual = 91 - 90.59 = +0.41

Positive residual = model slightly underestimated.
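The slope and intercept formulas can be reproduced in plain Python as a sanity check. This is a minimal sketch with an illustrative helper name (`fit_ols`); it keeps full precision, so the intercept differs slightly from the hand calculation, which rounds m to 7.63 before computing b.

```python
xs = [3, 5, 7, 8, 11]
ys = [42, 61, 78, 85, 103]

def fit_ols(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

m, b = fit_ols(xs, ys)
pred = m * 9 + b          # predicted yield at 9 hours of sunlight
residual = 91 - pred      # actual minus predicted

print(f"m={m:.2f}, b={b:.2f}")                      # m=7.63, b=21.91
print(f"pred={pred:.2f}, residual={residual:.2f}")  # pred=90.59, residual=0.41
```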


Topic 4: MSE + R²

Using the crop yield model from Topic 3 (ŷ = 7.63x + 21.92), calculate MSE and R² for the 5 training points. Interpret the R² value.

Thinking: "Generate predictions for all 5 points, then compute errors."

xᵢ yᵢ ŷᵢ (yᵢ-ŷᵢ) (yᵢ-ŷᵢ)²
3 42 44.81 -2.81 7.90
5 61 60.07 0.93 0.86
7 78 75.33 2.67 7.13
8 85 82.96 2.04 4.16
11 103 105.85 -2.85 8.12

MSE = (7.90+0.86+7.13+4.16+8.12)/5 = 28.17/5 = 5.63

Thinking: "For R² I need SS_res and SS_tot. SS_res = sum of squared errors above. SS_tot uses ȳ=73.8."

SS_res = 28.17

yᵢ yᵢ-ȳ (yᵢ-ȳ)²
42 -31.8 1011.24
61 -12.8 163.84
78 4.2 17.64
85 11.2 125.44
103 29.2 852.64

SS_tot = 2170.8

R² = 1 - 28.17/2170.8 = 1 - 0.013 = 0.987

Thinking: "R²=0.987 means the model explains 98.7% of the variance in crop yield. Excellent fit."
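The MSE and R² calculation can be sketched in a few lines. The snippet below deliberately uses the rounded coefficients from the worked model (7.63 and 21.92) so that it follows the hand calculation; the variable names are illustrative.

```python
xs = [3, 5, 7, 8, 11]
ys = [42, 61, 78, 85, 103]
m, b = 7.63, 21.92  # rounded coefficients from the Topic 3 model

preds = [m * x + b for x in xs]

# Residual sum of squares and mean squared error
ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
mse = ss_res / len(ys)

# Total sum of squares around the mean of y
y_bar = sum(ys) / len(ys)
ss_tot = sum((y - y_bar) ** 2 for y in ys)

r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.2f}, SS_res={ss_res:.2f}, SS_tot={ss_tot:.1f}, R2={r2:.3f}")
# MSE=5.63, SS_res=28.17, SS_tot=2170.8, R2=0.987
```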


Topic 5: Gradient Descent

A dataset shows hours studied vs exam score:

x y
2 35
4 55
6 70
8 88

Starting at m=0, b=0, learning rate α=0.1. Perform one full iteration of gradient descent.

Part a: Initial loss L(0,0)

Thinking: "All predictions are 0 since m=0, b=0. Errors are just -y."

L = 1/(2×4) × [(0-35)² + (0-55)² + (0-70)² + (0-88)²] = 1/8 × [1225 + 3025 + 4900 + 7744] = 1/8 × 16894 = 2111.75

Part b: Gradient formulas

Thinking: "Write these down — the 2 from the power cancels the 2 in 1/(2n)."

  • ∂L/∂m = (1/n) × Σ(ŷᵢ - yᵢ) × xᵢ
  • ∂L/∂b = (1/n) × Σ(ŷᵢ - yᵢ)

Part c: Evaluate gradients at m=0, b=0

xᵢ yᵢ ŷᵢ-yᵢ (ŷᵢ-yᵢ)×xᵢ
2 35 -35 -70
4 55 -55 -220
6 70 -70 -420
8 88 -88 -704

∂L/∂m = (1/4) × (-70-220-420-704) = (1/4) × (-1414) = -353.5

∂L/∂b = (1/4) × (-35-55-70-88) = (1/4) × (-248) = -62

Part d: Update parameters

Thinking: "Negative gradient → step in positive direction."

m₁ = 0 - 0.1 × (-353.5) = 35.35

b₁ = 0 - 0.1 × (-62) = 6.2

Part e: New loss at m=35.35, b=6.2

Thinking: "Build new predictions with ŷ = 35.35x + 6.2"

xᵢ yᵢ ŷᵢ ŷᵢ-yᵢ (ŷᵢ-yᵢ)²
2 35 76.9 41.9 1755.61
4 55 147.6 92.6 8574.76
6 70 218.3 148.3 21992.89
8 88 289.0 201.0 40401.00

L = 1/8 × 72724.26 = 9090.53

Thinking: "Loss went up — α=0.1 is too large, causing overshoot. Just report the number on the exam."
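One iteration of gradient descent on this dataset can be sketched as below. The function names are illustrative, and the second learning rate (0.01) is a hypothetical value added only to show the contrast with α=0.1; it is not part of the exam question.

```python
xs = [2, 4, 6, 8]
ys = [35, 55, 70, 88]
n = len(xs)

def loss(m, b):
    """Half mean squared error: 1/(2n) * sum of squared errors."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gradients(m, b):
    """Partial derivatives of the loss with respect to m and b."""
    errors = [m * x + b - y for x, y in zip(xs, ys)]
    dm = sum(e * x for e, x in zip(errors, xs)) / n
    db = sum(errors) / n
    return dm, db

def step(m, b, alpha):
    """One gradient descent update with learning rate alpha."""
    dm, db = gradients(m, b)
    return m - alpha * dm, b - alpha * db

initial_loss = loss(0, 0)        # 2111.75
dm0, db0 = gradients(0, 0)       # -353.5, -62.0
m1, b1 = step(0, 0, alpha=0.1)   # about 35.35 and 6.2
new_loss = loss(m1, b1)

# Hypothetical smaller learning rate, for comparison only
m_small, b_small = step(0, 0, alpha=0.01)
small_step_loss = loss(m_small, b_small)

print(initial_loss, new_loss > initial_loss, small_step_loss < initial_loss)
```

With α=0.1 the loss after the step is larger than the initial loss, while the smaller hypothetical step lowers it, which is consistent with the overshoot diagnosis.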


Topic 6: K-Means Clustering

A gym tracks 6 members by age and weekly workout hours. Perform one full iteration of K-Means with initial centroids C1=(2,20) and C2=(8,50).

Member Age (x) Workouts/week (y)
A 1 15
B 3 25
C 4 30
D 7 45
E 9 55
F 10 60

Step 1: Assign to nearest centroid

Thinking: "For every point compute d=√((x-cx)²+(y-cy)²) to both centroids. Assign to closer one."

Member d to C1=(2,20) d to C2=(8,50) Assigned
A(1,15) √(1+25)=5.10 √(49+1225)=35.70 C1
B(3,25) √(1+25)=5.10 √(25+625)=25.50 C1
C(4,30) √(4+100)=10.20 √(16+400)=20.40 C1
D(7,45) √(25+625)=25.50 √(1+25)=5.10 C2
E(9,55) √(49+1225)=35.70 √(1+25)=5.10 C2
F(10,60) √(64+1600)=40.79 √(4+100)=10.20 C2

Step 2: Update centroids

Thinking: "New centroid = average of all points in that cluster."

C1 cluster: A(1,15), B(3,25), C(4,30)

  • x: (1+3+4)/3 = 2.67
  • y: (15+25+30)/3 = 23.33
  • New C1 = (2.67, 23.33)

C2 cluster: D(7,45), E(9,55), F(10,60)

  • x: (7+9+10)/3 = 8.67
  • y: (45+55+60)/3 = 53.33
  • New C2 = (8.67, 53.33)
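The assign-then-update iteration can be sketched with `math.dist` (Python 3.8+). The helper names `assign` and `update` are illustrative, and the sketch assumes every cluster receives at least one point, since an empty cluster would cause a division by zero.

```python
from math import dist

points = {
    "A": (1, 15), "B": (3, 25), "C": (4, 30),
    "D": (7, 45), "E": (9, 55), "F": (10, 60),
}
centroids = [(2, 20), (8, 50)]  # initial C1 and C2

def assign(points, centroids):
    """Map each point name to the index of its nearest centroid."""
    return {
        name: min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
        for name, p in points.items()
    }

def update(points, assignment, k):
    """Recompute each centroid as the coordinate-wise mean of its members."""
    new_centroids = []
    for c in range(k):
        members = [points[name] for name, a in assignment.items() if a == c]
        new_centroids.append(
            tuple(sum(coord) / len(members) for coord in zip(*members))
        )
    return new_centroids

assignment = assign(points, centroids)
new_centroids = update(points, assignment, k=len(centroids))

print(assignment)
print([tuple(round(v, 2) for v in c) for c in new_centroids])
# [(2.67, 23.33), (8.67, 53.33)]
```

Index 0 corresponds to C1 and index 1 to C2. Repeating `assign` and `update` until the assignment stops changing would complete the algorithm.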

Topic 7: Naive Bayes

A doctor wants to predict if a patient has a disease based on 3 symptoms:

Patient Fever Cough Fatigue Disease
1 Yes Yes Yes Yes
2 No Yes No No
3 Yes No Yes Yes
4 No No No No
5 Yes Yes No Yes
6 No Yes Yes No

New patient: Fever=Yes, Cough=Yes, Fatigue=No. Does this patient have the disease?

Step 1: Priors

Thinking: "Count each class. Yes=3, No=3, total=6."

  • P(Yes) = 3/6 = 0.5
  • P(No) = 3/6 = 0.5

Step 2: Likelihoods

Thinking: "For each class, count how often each feature value appears. Yes patients: 1,3,5. No patients: 2,4,6."

For Yes (patients 1, 3, 5):

  • P(Fever=Yes | Yes) = 3/3 = 1.0
  • P(Cough=Yes | Yes) = 2/3 = 0.667
  • P(Fatigue=No | Yes) = 1/3 = 0.333

For No (patients 2, 4, 6):

  • P(Fever=Yes | No) = 0/3 = 0
  • P(Cough=Yes | No) = 2/3 = 0.667
  • P(Fatigue=No | No) = 2/3 = 0.667

Step 3: Multiply

Thinking: "Multiply prior × all likelihoods for each class."

P(Yes | features) ∝ 0.5 × 1.0 × 0.667 × 0.333 = 0.111

P(No | features) ∝ 0.5 × 0 × 0.667 × 0.667 = 0

Thinking: "0.111 > 0. One zero kills the entire No product — expected, just report it."

Prediction: Disease = Yes
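The prior-times-likelihoods calculation can be sketched by counting rows directly. This follows the worked example exactly, so it applies no Laplace smoothing and a single zero count still zeroes out a class; the `score` helper is an illustrative name.

```python
# Columns: Fever, Cough, Fatigue, Disease
rows = [
    ("Yes", "Yes", "Yes", "Yes"),
    ("No",  "Yes", "No",  "No"),
    ("Yes", "No",  "Yes", "Yes"),
    ("No",  "No",  "No",  "No"),
    ("Yes", "Yes", "No",  "Yes"),
    ("No",  "Yes", "Yes", "No"),
]

def score(rows, query, label):
    """Unnormalized posterior: P(label) times the product of P(feature | label)."""
    in_class = [r for r in rows if r[-1] == label]
    result = len(in_class) / len(rows)  # prior
    for i, value in enumerate(query):
        matches = sum(1 for r in in_class if r[i] == value)
        result *= matches / len(in_class)  # likelihood, no smoothing
    return result

query = ("Yes", "Yes", "No")  # Fever=Yes, Cough=Yes, Fatigue=No
scores = {label: score(rows, query, label) for label in ("Yes", "No")}
prediction = max(scores, key=scores.get)

print({k: round(v, 3) for k, v in scores.items()}, prediction)
# {'Yes': 0.111, 'No': 0.0} Yes
```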


Topic 8: Gini Impurity + Gini Gain

A dataset of 10 patients, predicting if they need surgery (Yes/No) based on 3 features:

Patient Age BMI Smoker Surgery
1 Old High Yes Yes
2 Young Low No No
3 Old Low Yes Yes
4 Young High No No
5 Old High Yes Yes
6 Young Low No No
7 Old Low No No
8 Young High Yes Yes
9 Old High Yes Yes
10 Young Low No No

Which feature should be the root node?

Step 1: Gini of whole dataset

Thinking: "Count Yes and No. Yes=5, No=5, total=10."

Gini(root) = 1 - (5/10)² - (5/10)² = 1 - 0.25 - 0.25 = 0.5

Step 2: Gini for Age split

Thinking: "Split into Old and Young. Count Yes/No in each group."

  • Old: patients 1,3,5,7,9 → Yes=4, No=1
  • Young: patients 2,4,6,8,10 → Yes=1, No=4

Gini(Old) = 1 - (4/5)² - (1/5)² = 1 - 0.64 - 0.04 = 0.32

Gini(Young) = 1 - (1/5)² - (4/5)² = 0.32

Gini(Age) = (5/10)×0.32 + (5/10)×0.32 = 0.32

Gini Gain(Age) = 0.5 - 0.32 = 0.18

Step 3: Gini for BMI split

  • High: patients 1,4,5,8,9 → Yes=4, No=1
  • Low: patients 2,3,6,7,10 → Yes=1, No=4

Gini(High) = 1 - (4/5)² - (1/5)² = 0.32

Gini(Low) = 1 - (1/5)² - (4/5)² = 0.32

Gini(BMI) = (5/10)×0.32 + (5/10)×0.32 = 0.32

Gini Gain(BMI) = 0.5 - 0.32 = 0.18

Step 4: Gini for Smoker split

Thinking: "Smoker=Yes: patients 1,3,5,8,9 → Yes=5, No=0. Smoker=No: patients 2,4,6,7,10 → Yes=0, No=5."

Gini(Smoker=Yes) = 1 - (5/5)² - (0/5)² = 1 - 1 - 0 = 0.0

Gini(Smoker=No) = 1 - (0/5)² - (5/5)² = 0.0

Thinking: "Both groups are perfectly pure — Gini=0 means no mixing at all."

Gini(Smoker) = (5/10)×0 + (5/10)×0 = 0.0

Gini Gain(Smoker) = 0.5 - 0.0 = 0.50

Conclusion

Thinking: "Compare all three Gini Gains. Pick the highest — that feature splits the data most cleanly."

Feature Gini Gain
Age 0.18
BMI 0.18
Smoker 0.50

Root node = Smoker — it perfectly separates the classes.
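The comparison of Gini gains can be sketched with `collections.Counter`. The helper names `gini` and `gini_gain` are illustrative; the sketch handles categorical features only, as in the worked example.

```python
from collections import Counter

# Columns: Age, BMI, Smoker, Surgery
rows = [
    ("Old",   "High", "Yes", "Yes"),
    ("Young", "Low",  "No",  "No"),
    ("Old",   "Low",  "Yes", "Yes"),
    ("Young", "High", "No",  "No"),
    ("Old",   "High", "Yes", "Yes"),
    ("Young", "Low",  "No",  "No"),
    ("Old",   "Low",  "No",  "No"),
    ("Young", "High", "Yes", "Yes"),
    ("Old",   "High", "Yes", "Yes"),
    ("Young", "Low",  "No",  "No"),
]
features = ("Age", "BMI", "Smoker")

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(rows, col):
    """Parent impurity minus the size-weighted impurity of the children."""
    parent = [r[-1] for r in rows]
    weighted = 0.0
    for value in {r[col] for r in rows}:
        child = [r[-1] for r in rows if r[col] == value]
        weighted += len(child) / len(rows) * gini(child)
    return gini(parent) - weighted

gains = {name: gini_gain(rows, i) for i, name in enumerate(features)}
root = max(gains, key=gains.get)

print({k: round(v, 2) for k, v in gains.items()}, root)
# {'Age': 0.18, 'BMI': 0.18, 'Smoker': 0.5} Smoker
```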


Topic 9: Cosine Similarity + Item-Based Collaborative Filtering

4 users rated 4 movies (0 = not rated):

User M1 M2 M3 M4
U1 5 3 0 1
U2 4 0 4 1
U3 1 1 0 5
U4 0 0 5 4

(a) Calculate cosine similarity between M1 and M2, M1 and M3, M1 and M4. (b) Predict U1's rating for M3 using item-based collaborative filtering.

Part a: Cosine Similarity

Thinking: "Treat each movie as a vector of user ratings. Cosine similarity = dot product divided by product of magnitudes."

Formula: cos(A,B) = (A·B) / (||A|| × ||B||)

  • M1 = [5, 4, 1, 0]
  • M2 = [3, 0, 1, 0]
  • M3 = [0, 4, 0, 5]
  • M4 = [1, 1, 5, 4]

sim(M1, M2):

  • A·B = (5×3)+(4×0)+(1×1)+(0×0) = 15+0+1+0 = 16
  • ||M1|| = √(25+16+1+0) = √42 = 6.48
  • ||M2|| = √(9+0+1+0) = √10 = 3.16
  • sim = 16/(6.48×3.16) = 16/20.48 = 0.781

sim(M1, M3):

  • A·B = (5×0)+(4×4)+(1×0)+(0×5) = 0+16+0+0 = 16
  • ||M3|| = √(0+16+0+25) = √41 = 6.40
  • sim = 16/(6.48×6.40) = 16/41.47 = 0.386

sim(M1, M4):

  • A·B = (5×1)+(4×1)+(1×5)+(0×4) = 5+4+5+0 = 14
  • ||M4|| = √(1+1+25+16) = √43 = 6.56
  • sim = 14/(6.48×6.56) = 14/42.51 = 0.329

Part b: Predict U1's rating for M3

Thinking: "U1 rated M1=5, M2=3, M4=1. Use similarities of those movies to M3 as weights. Need sim(M2,M3) and sim(M3,M4) as well."

sim(M2, M3):

  • A·B = (3×0)+(0×4)+(1×0)+(0×5) = 0 → no overlap

sim(M3, M4):

  • A·B = (0×1)+(4×1)+(0×5)+(5×4) = 0+4+0+20 = 24
  • sim = 24/(6.40×6.56) = 24/41.98 = 0.572

Formula: predicted = Σ(sim(M3,Mₓ) × rating(U1,Mₓ)) / Σ|sim(M3,Mₓ)|

Thinking: "Drop M2 — sim=0 contributes nothing."

predicted(U1,M3) = (0.386×5 + 0×3 + 0.572×1) / (0.386 + 0 + 0.572) = (1.93 + 0 + 0.572) / 0.958 = 2.502 / 0.958 = 2.61

U1 would rate M3 around 2.6 — below average interest.
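Both parts can be sketched in plain Python. As in the worked example, unrated entries are kept as 0 inside the movie vectors, and only movies the user actually rated contribute to the prediction; `cosine` and `predict` are illustrative names.

```python
from math import sqrt

# Each movie is a vector of ratings from U1..U4 (0 = not rated)
ratings = {
    "M1": [5, 4, 1, 0],
    "M2": [3, 0, 1, 0],
    "M3": [0, 4, 0, 5],
    "M4": [1, 1, 5, 4],
}

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def predict(user_idx, target):
    """Similarity-weighted average of the user's ratings on other movies."""
    num = den = 0.0
    for movie, vec in ratings.items():
        if movie == target or vec[user_idx] == 0:
            continue  # skip the target itself and movies the user has not rated
        sim = cosine(ratings[target], vec)
        num += sim * vec[user_idx]
        den += abs(sim)
    return num / den

sims_to_m1 = {m: cosine(ratings["M1"], ratings[m]) for m in ("M2", "M3", "M4")}
pred = predict(user_idx=0, target="M3")

print({k: round(v, 3) for k, v in sims_to_m1.items()})
# {'M2': 0.781, 'M3': 0.386, 'M4': 0.329}
print(round(pred, 2))  # 2.61
```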


Topic 10: MLP Forward Pass with ReLU

A neural network has:

  • Input: X = [2, 3]
  • Activation: ReLU(z) = max(0, z)
  • No bias terms

W1 = [[ 0.5, -1.0],
      [-0.5,  0.8],
      [ 1.0,  0.3]]

W2 = [[0.6, -0.4, 0.9]]

Perform a full forward pass and compute ŷ.

Step 1: Compute Z1 = W1 · X

Thinking: "W1 is (3×2), X is (2×1). Each row of W1 dot-products with X. Result is 3×1 — one value per hidden neuron."

  • Neuron 1: (0.5×2) + (-1.0×3) = 1.0 - 3.0 = -2.0
  • Neuron 2: (-0.5×2) + (0.8×3) = -1.0 + 2.4 = +1.4
  • Neuron 3: (1.0×2) + (0.3×3) = 2.0 + 0.9 = +2.9

Z1 = [-2.0, 1.4, 2.9]

Step 2: Apply ReLU → H = ReLU(Z1)

Thinking: "ReLU = max(0, z). Negatives become 0. Positives pass through unchanged."

  • ReLU(-2.0) = 0
  • ReLU(1.4) = 1.4
  • ReLU(2.9) = 2.9

H = [0, 1.4, 2.9]

Thinking: "Neuron 1 was suppressed — only neurons 2 and 3 pass signal forward."

Step 3: Compute Z2 = W2 · H

Thinking: "W2 is (1×3), H is (3×1). Result is a single scalar."

Z2 = (0.6×0) + (-0.4×1.4) + (0.9×2.9) = 0 + (-0.56) + 2.61 = 2.05

Output

Thinking: "No activation on output layer — return Z2 directly."

ŷ = 2.05

Key things to remember

  • W1 shape: (hidden_nodes × input_size)
  • W2 shape: (output_nodes × hidden_nodes)
  • Always apply ReLU after Z1, before passing to next layer
  • ReLU kills negatives — check each value carefully
  • Final output has no activation unless the question specifies one
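The forward pass can be sketched without any libraries by treating each weight matrix as a list of rows. The helpers `matvec` and `relu` are illustrative names; the network has no bias terms and no output activation, as stated in the question.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(z):
    """Element-wise ReLU: negatives become 0, positives pass through."""
    return [max(0.0, v) for v in z]

X = [2, 3]
W1 = [[0.5, -1.0],
      [-0.5, 0.8],
      [1.0, 0.3]]          # shape (3 hidden nodes x 2 inputs)
W2 = [[0.6, -0.4, 0.9]]    # shape (1 output node x 3 hidden nodes)

Z1 = matvec(W1, X)         # pre-activations of the hidden layer
H = relu(Z1)               # hidden activations
y_hat = matvec(W2, H)[0]   # linear output, no activation

print([round(v, 2) for v in Z1])  # [-2.0, 1.4, 2.9]
print([round(v, 2) for v in H])   # [0.0, 1.4, 2.9]
print(round(y_hat, 2))            # 2.05
```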