| Topic | Formula | Variables |
|---|---|---|
| Z-score | z = (x - μ) / σ | x=value, μ=mean, σ=std dev |
| Min-max | x' = (x - xmin) / (xmax - xmin) | xmin/xmax=dataset min/max |
| Accuracy | (TP + TN) / N | N=total samples |
| Precision | TP / (TP + FP) | "of all predicted positive, how many were right" |
| Recall | TP / (TP + FN) | "of all actual positive, how many did we catch" |
| F1 | 2 × (Precision × Recall) / (Precision + Recall) | harmonic mean of P and R |
| OLS slope | m = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)² | x̄=mean of x, ȳ=mean of y |
| OLS intercept | b = ȳ - m × x̄ | plug in after finding m |
| MSE | (1/n) × Σ(yᵢ - ŷᵢ)² | yᵢ=actual, ŷᵢ=predicted |
| R² | 1 - SS_res/SS_tot | SS_res=Σ(yᵢ-ŷᵢ)², SS_tot=Σ(yᵢ-ȳ)² |
| GD Loss | L = 1/(2n) × Σ(ŷᵢ - yᵢ)² | n=number of points |
| GD ∂L/∂m | (1/n) × Σ(ŷᵢ - yᵢ)xᵢ | error × x for each point |
| GD ∂L/∂b | (1/n) × Σ(ŷᵢ - yᵢ) | just the errors |
| GD update | m₁ = m₀ - α × ∂L/∂m | α=learning rate |
| K-Means distance | d = √((x₂-x₁)² + (y₂-y₁)²) | Euclidean distance |
| K-Means centroid | new_c = mean of all points in cluster | x and y averaged separately |
| Naive Bayes | P(class \| features) ∝ P(class) × P(f1 \| class) × ... | ∝ means "proportional to" |
| Gini impurity | Gini = 1 - Σ(pᵢ)² | pᵢ=proportion of class i at node |
| Gini gain | Gain = Gini(parent) - weighted avg Gini(children) | pick feature with highest gain |
A dataset contains the following recorded temperatures (°C) from a weather station over 5 days: [23, 45, 31, 52, 18]
Apply z-score normalization and min-max scaling to all values.
Thinking: "I need mean and std first. Mean = sum/n."
(23+45+31+52+18)/5 = 169/5 = 33.8
std = √[((23-33.8)² + (45-33.8)² + (31-33.8)² + (52-33.8)² + (18-33.8)²) / 5] = √[(116.64 + 125.44 + 7.84 + 331.24 + 249.64) / 5] = √[830.8/5] = √166.16 = 12.890
Thinking: "Now plug each value into (x - mean)/std."
- z1 = (23-33.8)/12.890 = -0.838
- z2 = (45-33.8)/12.890 = +0.869
- z3 = (31-33.8)/12.890 = -0.217
- z4 = (52-33.8)/12.890 = +1.412
- z5 = (18-33.8)/12.890 = -1.226
Check: values centered around 0, some negative some positive. ✓
Thinking: "Find min and max first. Min=18, Max=52. Range = 52-18 = 34."
- x1 = (23-18)/34 = 0.147
- x2 = (45-18)/34 = 0.794
- x3 = (31-18)/34 = 0.382
- x4 = (52-18)/34 = 1.0
- x5 = (18-18)/34 = 0.0
Check: min maps to 0, max maps to 1. ✓
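The two scalings above can be double-checked with a short script. This is a minimal sketch, assuming the population standard deviation (divide by n, as in the worked solution); the function names `z_scores` and `min_max` are illustrative, not from the notes.

```python
import math

def z_scores(values):
    # Population standard deviation: divide the squared deviations by n.
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return [(x - mean) / std for x in values]

def min_max(values):
    # Rescale so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

temps = [23, 45, 31, 52, 18]
print([round(z, 3) for z in z_scores(temps)])
print([round(v, 3) for v in min_max(temps)])
```

A quick sanity check on any z-score answer: the scores should sum to zero, up to rounding.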
A spam detection model was tested on 200 emails. The results are:
- 45 spam emails correctly identified as spam
- 12 spam emails missed (predicted as not spam)
- 18 legitimate emails wrongly flagged as spam
- 125 legitimate emails correctly identified as not spam
Calculate accuracy, precision, recall, and F1. Is this dataset balanced or imbalanced? If FN is more costly than FP, which metric matters most?
Thinking: "First map the story to TP, TN, FP, FN. Spam = positive class."
- TP = 45 (spam correctly caught)
- FN = 12 (spam missed — predicted not spam)
- FP = 18 (legit flagged as spam)
- TN = 125 (legit correctly cleared)
- N = 200
Accuracy = (45+125)/200 = 170/200 = 0.85
Precision = 45/(45+18) = 45/63 = 0.714
Recall = 45/(45+12) = 45/57 = 0.789
F1 = 2 × (0.714 × 0.789)/(0.714 + 0.789) = 2 × 0.563/1.503 = 0.750
Thinking: "Balanced? Spam = TP + FN = 57, Not spam = FP + TN = 143. 57/200 = 28.5%, well below 50%. Imbalanced."
Thinking: "FN more costly = missing spam is worse than false alarms. That means Recall is the key metric — it measures how many actual spams we caught."
Recall = 0.789 → catching ~79% of spam. Decent but not great for a high-stakes filter.
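The four metrics can be recomputed from the confusion-matrix counts as a check. A minimal sketch; `classification_metrics` is a hypothetical helper name, and spam is assumed to be the positive class as in the solution.

```python
def classification_metrics(tp, fn, fp, tn):
    # All four metrics come straight from the confusion-matrix counts.
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

metrics = classification_metrics(tp=45, fn=12, fp=18, tn=125)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```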
A researcher records hours of sunlight per day and crop yield (kg) for 5 farms:
| Hours of sunlight (x) | Crop yield (y) |
|---|---|
| 3 | 42 |
| 5 | 61 |
| 7 | 78 |
| 8 | 85 |
| 11 | 103 |
Fit a linear regression model. Predict yield for 9 hours. If actual yield at 9 hours was 91kg, what is the residual?
Thinking: "Calculate means first."
x̄ = (3+5+7+8+11)/5 = 34/5 = 6.8
ȳ = (42+61+78+85+103)/5 = 369/5 = 73.8
Thinking: "Build a table for slope numerator and denominator."
| xᵢ | yᵢ | xᵢ-x̄ | yᵢ-ȳ | (xᵢ-x̄)(yᵢ-ȳ) | (xᵢ-x̄)² |
|---|---|---|---|---|---|
| 3 | 42 | -3.8 | -31.8 | 120.84 | 14.44 |
| 5 | 61 | -1.8 | -12.8 | 23.04 | 3.24 |
| 7 | 78 | 0.2 | 4.2 | 0.84 | 0.04 |
| 8 | 85 | 1.2 | 11.2 | 13.44 | 1.44 |
| 11 | 103 | 4.2 | 29.2 | 122.64 | 17.64 |
| Σ | | | | 280.8 | 36.8 |
m = 280.8 / 36.8 = 7.63
Thinking: "Now intercept: b = ȳ - m×x̄"
b = 73.8 - 7.63×6.8 = 73.8 - 51.88 = 21.92
Model: ŷ = 7.63x + 21.92
Thinking: "Predict for x=9."
ŷ = 7.63×9 + 21.92 = 68.67 + 21.92 = 90.59
Thinking: "Residual = actual - predicted."
Residual = 91 - 90.59 = +0.41
Positive residual = model slightly underestimated.
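The slope and intercept formulas can be verified in a few lines. A minimal sketch; `fit_line` is an illustrative name. The script keeps full precision, so its values can differ from the hand-rounded ones in the third decimal.

```python
def fit_line(xs, ys):
    # Ordinary least squares for one feature: slope first, then intercept.
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    m = numerator / denominator
    b = y_bar - m * x_bar
    return m, b

sunlight = [3, 5, 7, 8, 11]
crop_yield = [42, 61, 78, 85, 103]

m, b = fit_line(sunlight, crop_yield)
prediction = m * 9 + b
residual = 91 - prediction  # actual minus predicted
print(round(m, 2), round(b, 2), round(prediction, 2), round(residual, 2))
```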
Using the crop yield model from Topic 3 (ŷ = 7.63x + 21.92), calculate MSE and R² for the 5 training points. Interpret the R² value.
Thinking: "Generate predictions for all 5 points, then compute errors."
| xᵢ | yᵢ | ŷᵢ | (yᵢ-ŷᵢ) | (yᵢ-ŷᵢ)² |
|---|---|---|---|---|
| 3 | 42 | 44.81 | -2.81 | 7.90 |
| 5 | 61 | 60.07 | 0.93 | 0.86 |
| 7 | 78 | 75.33 | 2.67 | 7.13 |
| 8 | 85 | 82.96 | 2.04 | 4.16 |
| 11 | 103 | 105.85 | -2.85 | 8.12 |
MSE = (7.90+0.86+7.13+4.16+8.12)/5 = 28.17/5 = 5.63
Thinking: "For R² I need SS_res and SS_tot. SS_res = sum of squared errors above. SS_tot uses ȳ=73.8."
SS_res = 28.17
| yᵢ | yᵢ-ȳ | (yᵢ-ȳ)² |
|---|---|---|
| 42 | -31.8 | 1011.24 |
| 61 | -12.8 | 163.84 |
| 78 | 4.2 | 17.64 |
| 85 | 11.2 | 125.44 |
| 103 | 29.2 | 852.64 |
SS_tot = 2170.8
R² = 1 - 28.17/2170.8 = 1 - 0.013 = 0.987
Thinking: "R²=0.987 means the model explains 98.7% of the variance in crop yield. Excellent fit."
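MSE and R² can be checked the same way. A minimal sketch that plugs in the rounded model ŷ = 7.63x + 21.92 from the problem statement; `mse_and_r2` is a hypothetical helper name.

```python
def mse_and_r2(xs, ys, m, b):
    # Residual sum of squares uses the model; total sum of squares uses the mean.
    n = len(ys)
    y_bar = sum(ys) / n
    predictions = [m * x + b for x in xs]
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return ss_res / n, 1 - ss_res / ss_tot

sunlight = [3, 5, 7, 8, 11]
crop_yield = [42, 61, 78, 85, 103]

mse, r2 = mse_and_r2(sunlight, crop_yield, m=7.63, b=21.92)
print(round(mse, 2), round(r2, 3))
```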
A dataset shows hours studied vs exam score:
| x | y |
|---|---|
| 2 | 35 |
| 4 | 55 |
| 6 | 70 |
| 8 | 88 |
Starting at m=0, b=0, learning rate α=0.1. Perform one full iteration of gradient descent.
Thinking: "All predictions are 0 since m=0, b=0. Errors are just -y."
L = 1/(2×4) × [(0-35)² + (0-55)² + (0-70)² + (0-88)²] = 1/8 × [1225 + 3025 + 4900 + 7744] = 1/8 × 16894 = 2111.75
Thinking: "Write these down — the 2 from the power cancels the 2 in 1/(2n)."
- ∂L/∂m = (1/n) × Σ(ŷᵢ - yᵢ) × xᵢ
- ∂L/∂b = (1/n) × Σ(ŷᵢ - yᵢ)
| xᵢ | yᵢ | ŷᵢ-yᵢ | (ŷᵢ-yᵢ)×xᵢ |
|---|---|---|---|
| 2 | 35 | -35 | -70 |
| 4 | 55 | -55 | -220 |
| 6 | 70 | -70 | -420 |
| 8 | 88 | -88 | -704 |
∂L/∂m = (1/4) × (-70-220-420-704) = (1/4) × (-1414) = -353.5
∂L/∂b = (1/4) × (-35-55-70-88) = (1/4) × (-248) = -62
Thinking: "Negative gradient → step in positive direction."
m₁ = 0 - 0.1 × (-353.5) = 35.35
b₁ = 0 - 0.1 × (-62) = 6.2
Thinking: "Build new predictions with ŷ = 35.35x + 6.2"
| xᵢ | yᵢ | ŷᵢ | ŷᵢ-yᵢ | (ŷᵢ-yᵢ)² |
|---|---|---|---|---|
| 2 | 35 | 76.9 | 41.9 | 1755.61 |
| 4 | 55 | 147.6 | 92.6 | 8574.76 |
| 6 | 70 | 218.3 | 148.3 | 21992.89 |
| 8 | 88 | 289.0 | 201.0 | 40401.00 |
L = 1/8 × 72724.26 = 9090.53
Thinking: "Loss went up — α=0.1 is too large, causing overshoot. Just report the number on the exam."
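One iteration of gradient descent can be reproduced with two small functions. A minimal sketch, assuming the 1/(2n) loss and the gradients from the formula table; `loss` and `gd_step` are illustrative names.

```python
def loss(xs, ys, m, b):
    # Half mean squared error: 1/(2n) * sum of squared errors.
    n = len(xs)
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

def gd_step(xs, ys, m, b, alpha):
    # Errors are predicted minus actual, matching the gradient formulas.
    n = len(xs)
    errors = [m * x + b - y for x, y in zip(xs, ys)]
    grad_m = sum(e * x for e, x in zip(errors, xs)) / n
    grad_b = sum(errors) / n
    return m - alpha * grad_m, b - alpha * grad_b

hours = [2, 4, 6, 8]
scores = [35, 55, 70, 88]

loss_before = loss(hours, scores, 0.0, 0.0)
m1, b1 = gd_step(hours, scores, 0.0, 0.0, alpha=0.1)
loss_after = loss(hours, scores, m1, b1)
print(loss_before, m1, b1, round(loss_after, 2))
```

For comparison, running the same step with α = 0.01 gives m₁ = 3.535 and b₁ = 0.62, and the loss drops to about 1022, which supports the overshoot explanation.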
A gym tracks 6 members by age and weekly workout hours. Perform one full iteration of K-Means with initial centroids C1=(2,20) and C2=(8,50).
| Member | Age (x) | Workouts/week (y) |
|---|---|---|
| A | 1 | 15 |
| B | 3 | 25 |
| C | 4 | 30 |
| D | 7 | 45 |
| E | 9 | 55 |
| F | 10 | 60 |
Thinking: "For every point compute d=√((x-cx)²+(y-cy)²) to both centroids. Assign to closer one."
| Member | d to C1=(2,20) | d to C2=(8,50) | Assigned |
|---|---|---|---|
| A(1,15) | √(1+25)=5.10 | √(49+1225)=35.69 | C1 |
| B(3,25) | √(1+25)=5.10 | √(25+625)=25.50 | C1 |
| C(4,30) | √(4+100)=10.20 | √(16+400)=20.40 | C1 |
| D(7,45) | √(25+625)=25.50 | √(1+25)=5.10 | C2 |
| E(9,55) | √(49+1225)=35.69 | √(1+25)=5.10 | C2 |
| F(10,60) | √(64+1600)=40.79 | √(4+100)=10.20 | C2 |
Thinking: "New centroid = average of all points in that cluster."
C1 cluster: A(1,15), B(3,25), C(4,30)
- x: (1+3+4)/3 = 2.67
- y: (15+25+30)/3 = 23.33
- New C1 = (2.67, 23.33)
C2 cluster: D(7,45), E(9,55), F(10,60)
- x: (7+9+10)/3 = 8.67
- y: (45+55+60)/3 = 53.33
- New C2 = (8.67, 53.33)
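The assign-then-average iteration above can be sketched as follows. This is an illustration only: `kmeans_iteration` is a hypothetical name, ties go to the first centroid, and the code assumes no cluster ends up empty.

```python
import math

def kmeans_iteration(points, centroids):
    # Step 1: assign each point to its nearest centroid (Euclidean distance).
    clusters = [[] for _ in centroids]
    for point in points:
        distances = [math.dist(point, c) for c in centroids]
        clusters[distances.index(min(distances))].append(point)
    # Step 2: move each centroid to the mean of its cluster, per coordinate.
    new_centroids = [
        tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
        for cluster in clusters
    ]
    return clusters, new_centroids

members = [(1, 15), (3, 25), (4, 30), (7, 45), (9, 55), (10, 60)]
clusters, new_centroids = kmeans_iteration(members, [(2, 20), (8, 50)])
print(clusters)
print([tuple(round(v, 2) for v in c) for c in new_centroids])
```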
A doctor wants to predict if a patient has a disease based on 3 symptoms:
| Patient | Fever | Cough | Fatigue | Disease |
|---|---|---|---|---|
| 1 | Yes | Yes | Yes | Yes |
| 2 | No | Yes | No | No |
| 3 | Yes | No | Yes | Yes |
| 4 | No | No | No | No |
| 5 | Yes | Yes | No | Yes |
| 6 | No | Yes | Yes | No |
New patient: Fever=Yes, Cough=Yes, Fatigue=No. Does this patient have the disease?
Thinking: "Count each class. Yes=3, No=3, total=6."
- P(Yes) = 3/6 = 0.5
- P(No) = 3/6 = 0.5
Thinking: "For each class, count how often each feature value appears. Yes patients: 1,3,5. No patients: 2,4,6."
For Yes (patients 1, 3, 5):
- P(Fever=Yes | Yes) = 3/3 = 1.0
- P(Cough=Yes | Yes) = 2/3 = 0.667
- P(Fatigue=No | Yes) = 1/3 = 0.333
For No (patients 2, 4, 6):
- P(Fever=Yes | No) = 0/3 = 0
- P(Cough=Yes | No) = 2/3 = 0.667
- P(Fatigue=No | No) = 2/3 = 0.667
Thinking: "Multiply prior × all likelihoods for each class."
P(Yes | features) ∝ 0.5 × 1.0 × 0.667 × 0.333 = 0.111
P(No | features) ∝ 0.5 × 0 × 0.667 × 0.667 = 0
Thinking: "0.111 > 0. One zero kills the entire No product — expected, just report it."
Prediction: Disease = Yes
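The counting procedure above can be written as a short script. A minimal sketch with no smoothing, so a zero count zeroes the whole product exactly as in the hand calculation; `naive_bayes_scores` is an illustrative name and the scores are unnormalized.

```python
# Columns: Fever, Cough, Fatigue, Disease (the label is the last entry).
patients = [
    ("Yes", "Yes", "Yes", "Yes"),
    ("No", "Yes", "No", "No"),
    ("Yes", "No", "Yes", "Yes"),
    ("No", "No", "No", "No"),
    ("Yes", "Yes", "No", "Yes"),
    ("No", "Yes", "Yes", "No"),
]

def naive_bayes_scores(rows, query):
    scores = {}
    for label in sorted({row[-1] for row in rows}):
        subset = [row for row in rows if row[-1] == label]
        score = len(subset) / len(rows)  # prior P(class)
        for i, value in enumerate(query):
            # Likelihood P(feature_i = value | class), estimated by counting.
            score *= sum(1 for row in subset if row[i] == value) / len(subset)
        scores[label] = score
    return scores

scores = naive_bayes_scores(patients, query=("Yes", "Yes", "No"))
prediction = max(scores, key=scores.get)
print(scores, prediction)
```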
A dataset of 10 patients, predicting if they need surgery (Yes/No) based on 3 features:
| Patient | Age | BMI | Smoker | Surgery |
|---|---|---|---|---|
| 1 | Old | High | Yes | Yes |
| 2 | Young | Low | No | No |
| 3 | Old | Low | Yes | Yes |
| 4 | Young | High | No | No |
| 5 | Old | High | Yes | Yes |
| 6 | Young | Low | No | No |
| 7 | Old | Low | No | No |
| 8 | Young | High | Yes | Yes |
| 9 | Old | High | Yes | Yes |
| 10 | Young | Low | No | No |
Which feature should be the root node?
Thinking: "Count Yes and No. Yes=5, No=5, total=10."
Gini(root) = 1 - (5/10)² - (5/10)² = 1 - 0.25 - 0.25 = 0.5
Thinking: "Split into Old and Young. Count Yes/No in each group."
- Old: patients 1,3,5,7,9 → Yes=4, No=1
- Young: patients 2,4,6,8,10 → Yes=1, No=4
Gini(Old) = 1 - (4/5)² - (1/5)² = 1 - 0.64 - 0.04 = 0.32
Gini(Young) = 1 - (1/5)² - (4/5)² = 0.32
Gini(Age) = (5/10)×0.32 + (5/10)×0.32 = 0.32
Gini Gain(Age) = 0.5 - 0.32 = 0.18
- High: patients 1,4,5,8,9 → Yes=4, No=1
- Low: patients 2,3,6,7,10 → Yes=1, No=4
Gini(High) = 1 - (4/5)² - (1/5)² = 0.32
Gini(Low) = 1 - (1/5)² - (4/5)² = 0.32
Gini(BMI) = (5/10)×0.32 + (5/10)×0.32 = 0.32
Gini Gain(BMI) = 0.5 - 0.32 = 0.18
Thinking: "Smoker=Yes: patients 1,3,5,8,9 → Yes=5, No=0. Smoker=No: patients 2,4,6,7,10 → Yes=0, No=5."
Gini(Smoker=Yes) = 1 - (5/5)² - (0/5)² = 1 - 1 - 0 = 0.0
Gini(Smoker=No) = 1 - (0/5)² - (5/5)² = 0.0
Thinking: "Both groups are perfectly pure — Gini=0 means no mixing at all."
Gini(Smoker) = (5/10)×0 + (5/10)×0 = 0.0
Gini Gain(Smoker) = 0.5 - 0.0 = 0.50
Thinking: "Compare all three Gini Gains. Pick the highest — that feature splits the data most cleanly."
| Feature | Gini Gain |
|---|---|
| Age | 0.18 |
| BMI | 0.18 |
| Smoker | 0.50 |
Root node = Smoker — it perfectly separates the classes.
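The three Gini gains can be recomputed by grouping the labels per feature value. A minimal sketch; `gini` and `gini_gain` are illustrative names, and the rows are the ten patients from the table.

```python
from collections import Counter

# Columns: Age, BMI, Smoker, Surgery (the label is the last entry).
patients = [
    ("Old", "High", "Yes", "Yes"), ("Young", "Low", "No", "No"),
    ("Old", "Low", "Yes", "Yes"), ("Young", "High", "No", "No"),
    ("Old", "High", "Yes", "Yes"), ("Young", "Low", "No", "No"),
    ("Old", "Low", "No", "No"), ("Young", "High", "Yes", "Yes"),
    ("Old", "High", "Yes", "Yes"), ("Young", "Low", "No", "No"),
]

def gini(labels):
    # 1 minus the sum of squared class proportions at this node.
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_gain(rows, feature_index):
    # Parent impurity minus the size-weighted impurity of the child nodes.
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    weighted = sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return gini([row[-1] for row in rows]) - weighted

gains = {name: gini_gain(patients, i) for i, name in enumerate(["Age", "BMI", "Smoker"])}
print({name: round(gain, 2) for name, gain in gains.items()})
```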
4 users rated 4 movies (0 = not rated):
| User | M1 | M2 | M3 | M4 |
|---|---|---|---|---|
| U1 | 5 | 3 | 0 | 1 |
| U2 | 4 | 0 | 4 | 1 |
| U3 | 1 | 1 | 0 | 5 |
| U4 | 0 | 0 | 5 | 4 |
(a) Calculate cosine similarity between M1 and M2, M1 and M3, M1 and M4. (b) Predict U1's rating for M3 using item-based collaborative filtering.
Thinking: "Treat each movie as a vector of user ratings. Cosine similarity = dot product divided by product of magnitudes."
Formula: cos(A,B) = (A·B) / (||A|| × ||B||)
- M1 = [5, 4, 1, 0]
- M2 = [3, 0, 1, 0]
- M3 = [0, 4, 0, 5]
- M4 = [1, 1, 5, 4]
sim(M1, M2):
- A·B = (5×3)+(4×0)+(1×1)+(0×0) = 15+0+1+0 = 16
- ||M1|| = √(25+16+1+0) = √42 = 6.48
- ||M2|| = √(9+0+1+0) = √10 = 3.16
- sim = 16/(6.48×3.16) = 16/20.48 = 0.781
sim(M1, M3):
- A·B = (5×0)+(4×4)+(1×0)+(0×5) = 0+16+0+0 = 16
- ||M3|| = √(0+16+0+25) = √41 = 6.40
- sim = 16/(6.48×6.40) = 16/41.47 = 0.386
sim(M1, M4):
- A·B = (5×1)+(4×1)+(1×5)+(0×4) = 5+4+5+0 = 14
- ||M4|| = √(1+1+25+16) = √43 = 6.56
- sim = 14/(6.48×6.56) = 14/42.51 = 0.329
Thinking: "U1 rated M1=5, M2=3, M4=1. Use similarities of those movies to M3 as weights. Need sim(M2,M3) and sim(M3,M4) as well."
sim(M2, M3):
- A·B = (3×0)+(0×4)+(1×0)+(0×5) = 0 → no overlap, so sim = 0
sim(M3, M4):
- A·B = (0×1)+(4×1)+(0×5)+(5×4) = 0+4+0+20 = 24
- sim = 24/(6.40×6.56) = 24/41.98 = 0.572
Formula: predicted = Σ(sim(M3,Mₓ) × rating(U1,Mₓ)) / Σ|sim(M3,Mₓ)|
Thinking: "Drop M2 — sim=0 contributes nothing."
predicted(U1,M3) = (0.386×5 + 0×3 + 0.572×1) / (0.386 + 0 + 0.572) = (1.93 + 0 + 0.572) / 0.958 = 2.502 / 0.958 = 2.61
U1 would rate M3 around 2.6 — below average interest.
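Both parts of the problem can be checked with a short script. A minimal sketch that follows the worked solution's conventions: unrated entries stay as 0 inside the movie vectors, and only movies the user has rated contribute to the prediction. The names `cosine` and `predict_rating` are illustrative.

```python
import math

# Each movie is a vector of ratings from U1..U4 (0 = not rated).
ratings = {
    "M1": [5, 4, 1, 0],
    "M2": [3, 0, 1, 0],
    "M3": [0, 4, 0, 5],
    "M4": [1, 1, 5, 4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict_rating(user_index, target, ratings):
    # Similarity-weighted average over the movies this user has rated.
    numerator = denominator = 0.0
    for movie, vector in ratings.items():
        if movie == target or vector[user_index] == 0:
            continue
        sim = cosine(ratings[target], vector)
        numerator += sim * vector[user_index]
        denominator += abs(sim)
    return numerator / denominator

for other in ("M2", "M3", "M4"):
    print("M1", other, round(cosine(ratings["M1"], ratings[other]), 3))
predicted = predict_rating(0, "M3", ratings)
print(round(predicted, 2))
```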
A neural network has:
- Input: X = [2, 3]
- Activation: ReLU(z) = max(0, z)
- No bias terms
W1 = [[0.5, -1.0], [-0.5, 0.8], [1.0, 0.3]]
W2 = [[0.6, -0.4, 0.9]]
Perform a full forward pass and compute ŷ.
Thinking: "W1 is (3×2), X is (2×1). Each row of W1 dot-products with X. Result is 3×1 — one value per hidden neuron."
- Neuron 1: (0.5×2) + (-1.0×3) = 1.0 - 3.0 = -2.0
- Neuron 2: (-0.5×2) + (0.8×3) = -1.0 + 2.4 = +1.4
- Neuron 3: (1.0×2) + (0.3×3) = 2.0 + 0.9 = +2.9
Z1 = [-2.0, 1.4, 2.9]
Thinking: "ReLU = max(0, z). Negatives become 0. Positives pass through unchanged."
- ReLU(-2.0) = 0
- ReLU(1.4) = 1.4
- ReLU(2.9) = 2.9
H = [0, 1.4, 2.9]
Thinking: "Neuron 1 was suppressed — only neurons 2 and 3 pass signal forward."
Thinking: "W2 is (1×3), H is (3×1). Result is a single scalar."
Z2 = (0.6×0) + (-0.4×1.4) + (0.9×2.9) = 0 + (-0.56) + 2.61 = 2.05
Thinking: "No activation on output layer — return Z2 directly."
ŷ = 2.05
- W1 shape: (hidden_nodes × input_size)
- W2 shape: (output_nodes × hidden_nodes)
- Always apply ReLU after Z1, before passing to next layer
- ReLU kills negatives — check each value carefully
- Final output has no activation unless the question specifies one
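The forward pass and the shape rules above can be sketched in plain Python. This is an illustration under the problem's assumptions (no bias terms, ReLU on the hidden layer only, no output activation); `relu`, `matvec`, and `forward` are hypothetical helper names.

```python
def relu(z):
    return max(0.0, z)

def matvec(weights, vector):
    # Each row of the weight matrix is dot-multiplied with the input vector.
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def forward(x, w1, w2):
    z1 = matvec(w1, x)          # hidden pre-activations, one per hidden neuron
    h = [relu(z) for z in z1]   # ReLU zeroes out the negatives
    z2 = matvec(w2, h)          # output layer, no activation applied
    return z1, h, z2

W1 = [[0.5, -1.0], [-0.5, 0.8], [1.0, 0.3]]  # shape (3 x 2)
W2 = [[0.6, -0.4, 0.9]]                      # shape (1 x 3)

z1, h, z2 = forward([2, 3], W1, W2)
print([round(v, 2) for v in z1], [round(v, 2) for v in h], round(z2[0], 2))
```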