The Two-proportion Z-test is a statistical method used to determine if there is a significant difference between the proportions of two groups. It's commonly applied in A/B testing to compare the performance of two versions of a product, feature, or webpage (e.g., version A and version B).
Here's a breakdown of the concept using an A/B test example:
You are testing two versions of a landing page:
- Version A (control group)
- Version B (treatment group)
The goal is to determine which version leads to a higher conversion rate (e.g., percentage of users who sign up).
Version A:
- $n_A = 500$ (number of visitors)
- $X_A = 50$ (number of conversions)
- Conversion rate ($p_A$): $\frac{X_A}{n_A} = \frac{50}{500} = 0.10$ (10%)

Version B:
- $n_B = 480$ (number of visitors)
- $X_B = 65$ (number of conversions)
- Conversion rate ($p_B$): $\frac{X_B}{n_B} = \frac{65}{480} \approx 0.1354$ (13.54%)
State the null and alternative hypotheses:
- Null hypothesis ($H_0$): $p_A = p_B$ (no difference in conversion rates).
- Alternative hypothesis ($H_1$): $p_A \neq p_B$ (conversion rates are different).
Calculate the pooled proportion ($p_{pooled}$): This assumes the null hypothesis is true, meaning both groups have the same underlying proportion.
$$p_{pooled} = \frac{X_A + X_B}{n_A + n_B} = \frac{50 + 65}{500 + 480} = \frac{115}{980} \approx 0.1173$$
Compute the standard error (SE):
The standard error measures the variability of the difference between proportions under the null hypothesis:
$$SE = \sqrt{p_{pooled} \cdot (1 - p_{pooled}) \cdot \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$
Substituting values:
$$SE = \sqrt{0.1173 \cdot (1 - 0.1173) \cdot \left(\frac{1}{500} + \frac{1}{480}\right)}$$
$$SE \approx \sqrt{0.1173 \cdot 0.8827 \cdot (0.002 + 0.002083)}$$
$$SE \approx \sqrt{0.1173 \cdot 0.8827 \cdot 0.004083} \approx \sqrt{0.000423} \approx 0.02057$$
Calculate the Z-score:
The Z-score measures how many standard errors the observed difference is from the null hypothesis:
$$Z = \frac{p_B - p_A}{SE}$$ Substituting values:
$$Z = \frac{0.1354 - 0.10}{0.02057} \approx \frac{0.0354}{0.02057} \approx 1.72$$
Determine the p-value:
Use the Z-score to find the p-value from a standard normal distribution table or software. For $ Z = 1.72 $:
- The p-value for a two-tailed test is approximately $ 2 \cdot (1 - \Phi(1.72)) $, where $ \Phi $ is the cumulative distribution function of the standard normal distribution.
From tables, $\Phi(1.72) \approx 0.9573$. Thus:
$$p\text{-value} \approx 2 \cdot (1 - 0.9573) = 2 \cdot 0.0427 = 0.0854$$
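Putting the pooled calculation together, here is a small self-contained Python sketch; it uses SciPy's standard normal distribution for the p-value and should reproduce the hand calculation above (up to rounding):

```python
import math
from scipy.stats import norm

n_A, x_A = 500, 50
n_B, x_B = 480, 65
p_A, p_B = x_A / n_A, x_B / n_B

p_pooled = (x_A + x_B) / (n_A + n_B)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_A + 1 / n_B))

# Z-score: observed difference measured in standard-error units
z = (p_B - p_A) / se_pooled

# Two-tailed p-value: probability of a |Z| at least this large under H0
p_value = 2 * norm.sf(abs(z))

print(f"Z ≈ {z:.2f}, p-value ≈ {p_value:.4f}")  # Z ≈ 1.72, p ≈ 0.085
```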
Draw conclusions:
- If $ p\text{-value} < \alpha $ (e.g., $ \alpha = 0.05 $), reject $ H_0 $ (significant difference).
- Here, $ p\text{-value} \approx 0.0854 > 0.05 $, so we fail to reject the null hypothesis: the observed difference between the two conversion rates is not statistically significant at the 5% level.
The Two-proportion Z-test allows you to determine if the observed difference between the two conversion rates is statistically significant. In this example, the evidence was insufficient to conclude that Version B is better than Version A at the 5% significance level.
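As a cross-check, `statsmodels` ships a two-proportion Z-test; the sketch below assumes the same data and should give a matching result (the sign of the Z statistic simply depends on the order in which the groups are passed):

```python
from statsmodels.stats.proportion import proportions_ztest

count = [50, 65]   # conversions in A and B
nobs = [500, 480]  # visitors in A and B

z_stat, p_val = proportions_ztest(count, nobs, alternative="two-sided")
print(f"|z| ≈ {abs(z_stat):.2f}, p ≈ {p_val:.4f}")  # expect |z| ≈ 1.72, p ≈ 0.085
```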
Yes, it is possible to calculate the standard error (SE) separately for each group and then combine them to calculate the Z-score. This method is known as the unpooled Z-test or the separate-variance Z-test. It is often used when you do not want to assume a common underlying proportion for the two groups, rather than working with a single pooled proportion.
Calculate the standard error (SE) for each group:
- For Group A: $$SE_A = \sqrt{\frac{p_A \cdot (1 - p_A)}{n_A}}$$
- For Group B: $$SE_B = \sqrt{\frac{p_B \cdot (1 - p_B)}{n_B}}$$

Combine the standard errors: The total standard error is calculated as:
$$SE_{total} = \sqrt{SE_A^2 + SE_B^2}$$

Calculate the Z-score: The Z-score is then computed as:
$$Z = \frac{p_B - p_A}{SE_{total}}$$
Using the same A/B test data as before:
- $ n_A = 500 $, $ X_A = 50 $, $ p_A = 0.10 $
- $ n_B = 480 $, $ X_B = 65 $, $ p_B \approx 0.1354 $
Calculate SE for each group:
- For Group A: $$ SE_A = \sqrt{\frac{0.10 \cdot (1 - 0.10)}{500}} = \sqrt{\frac{0.10 \cdot 0.90}{500}} = \sqrt{0.00018} \approx 0.01342 $$
- For Group B: $$ SE_B = \sqrt{\frac{0.1354 \cdot (1 - 0.1354)}{480}} = \sqrt{\frac{0.1354 \cdot 0.8646}{480}} = \sqrt{0.0002438} \approx 0.01562 $$
Combine the SEs:
$$SE_{total} = \sqrt{SE_A^2 + SE_B^2} = \sqrt{(0.01342)^2 + (0.01562)^2}$$
$$SE_{total} = \sqrt{0.0001802 + 0.0002439} = \sqrt{0.0004241} \approx 0.0206$$
Calculate the Z-score:
$$Z = \frac{p_B - p_A}{SE_{total}} = \frac{0.1354 - 0.10}{0.0206} \approx \frac{0.0354}{0.0206} \approx 1.72$$
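A minimal sketch of the unpooled calculation in Python, reusing the example data:

```python
import math

n_A, x_A = 500, 50
n_B, x_B = 480, 65
p_A, p_B = x_A / n_A, x_B / n_B

# Per-group standard errors, each based on its own observed proportion
se_A = math.sqrt(p_A * (1 - p_A) / n_A)
se_B = math.sqrt(p_B * (1 - p_B) / n_B)

# Combined (unpooled) standard error and Z-score
se_total = math.sqrt(se_A**2 + se_B**2)
z_unpooled = (p_B - p_A) / se_total

print(f"SE_A ≈ {se_A:.5f}, SE_B ≈ {se_B:.5f}")             # ≈ 0.01342, ≈ 0.01562
print(f"SE_total ≈ {se_total:.4f}, Z ≈ {z_unpooled:.2f}")  # ≈ 0.0206, Z ≈ 1.72
```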
The Z-score and conclusions are the same as in the pooled test in this case because the sample sizes are relatively large, and the proportions are not drastically different.
The unpooled approach is typically preferred:
- If you believe the null hypothesis does not imply equal proportions.
- When the sample sizes are small, or the group proportions differ substantially.
This method accounts for the variability of each group independently, making it more flexible than the pooled approach.
The Minimum Detectable Effect (MDE) is the smallest difference between the proportions of two groups that can be detected as statistically significant, given the desired statistical power and significance level. It's a critical metric for designing A/B tests, as it defines the sensitivity of the test.
The MDE is calculated as:
$$MDE = (Z_{\alpha/2} + Z_{\beta}) \cdot SE$$
Where:
- $Z_{\alpha/2}$: The critical Z-value for the significance level ($\alpha$), typically $1.96$ for a 5% two-tailed test.
- $Z_{\beta}$: The critical Z-value for the desired power ($1 - \beta$), typically $0.84$ for 80% power.
- $SE$: The standard error of the difference between proportions.
For the pooled approach, $SE$ is given by:
$$SE = \sqrt{p_{pooled} \cdot (1 - p_{pooled}) \cdot \left(\frac{1}{n_A} + \frac{1}{n_B}\right)}$$
For the unpooled approach, $SE$ is calculated separately for each group and combined as:
$$SE = \sqrt{\frac{p_A \cdot (1 - p_A)}{n_A} + \frac{p_B \cdot (1 - p_B)}{n_B}}$$
Using the A/B test data:
- $ n_A = 500 $, $ X_A = 50 $, $ p_A = 0.10 $
- $ n_B = 480 $, $ X_B = 65 $, $ p_B = 0.1354 $
- $ \alpha = 0.05 $ (5% significance level, $ Z_{\alpha/2} = 1.96 $)
- Power = 80% ($ Z_{\beta} = 0.84 $)
For the pooled approach, with $SE \approx 0.02057$:
$$MDE = (1.96 + 0.84) \cdot 0.02057 \approx 2.8 \cdot 0.02057 \approx 0.0576$$
The minimum detectable effect (MDE) is approximately 5.76 percentage points. This means the A/B test is designed to detect a difference of at least 5.76 percentage points in conversion rates between Version A and Version B, with 80% power and a 5% significance level. If the actual difference is smaller, the test might fail to detect it as statistically significant.
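A small Python sketch of this MDE calculation, deriving the critical values from the standard normal distribution instead of hard-coding them (the SE value is taken from the pooled calculation above):

```python
from scipy.stats import norm

se_pooled = 0.02057  # pooled standard error from the example above
alpha = 0.05         # two-tailed significance level
power = 0.80         # desired power (1 - beta)

z_alpha = norm.ppf(1 - alpha / 2)  # ≈ 1.96
z_beta = norm.ppf(power)           # ≈ 0.84

mde = (z_alpha + z_beta) * se_pooled
print(f"MDE ≈ {mde:.4f}")  # ≈ 0.0576, i.e. about 5.76 percentage points
```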
In the context of Minimum Detectable Effect (MDE), the Power refers to the probability of correctly rejecting the null hypothesis ($H_0$) when the alternative hypothesis ($H_1$) is actually true.
Definition: Power is the probability of avoiding a Type II error ($\beta$), where a Type II error occurs when the test fails to reject $H_0$ even though $H_1$ is true.
$$\text{Power} = 1 - \beta$$
Typical Value:
- A commonly used value for power is 80% ($ \beta = 0.20 $).
- This means there's an 80% chance of detecting a true difference if it exists, and a 20% chance of failing to detect it.
Critical Z-value ($Z_\beta$):
- The critical Z-value associated with power is used in the MDE calculation.
- For 80% power, $ Z_\beta \approx 0.84 $ (from the standard normal distribution).
- For higher power, such as 90%, $ Z_\beta \approx 1.28 $.
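These critical values are just quantiles of the standard normal distribution, so they can be checked with one line of Python each:

```python
from scipy.stats import norm

# Z_beta is the standard-normal quantile at the desired power level
print(round(norm.ppf(0.80), 2))  # 0.84 -> 80% power
print(round(norm.ppf(0.90), 2))  # 1.28 -> 90% power
```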
The power influences the minimum detectable effect (MDE): for a fixed sample size, demanding higher power increases $Z_\beta$ and therefore increases the MDE, while detecting smaller differences at the same power requires larger samples. In the MDE formula:
- $Z_{\alpha/2}$: Reflects the chosen significance level ($\alpha$, e.g., 5%, $Z_{\alpha/2} = 1.96$).
- $Z_\beta$: Reflects the desired power ($1 - \beta$, e.g., 80%, $Z_\beta = 0.84$).
- $SE$: The standard error.
For example:
- At 80% power ($Z_\beta = 0.84$), the MDE is lower than at 90% power ($Z_\beta = 1.28$), assuming all other parameters are the same.
- Low Power: Increases the risk of missing true effects (high Type II error).
- High Power: Requires a larger sample size, which increases cost and time but reduces the risk of missing true effects.
In most practical A/B tests, 80% power is used as a balance between sensitivity and cost.
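To make the trade-off concrete, the following sketch compares the MDE at 80% and 90% power for the same standard error as in the example (higher power means a larger MDE for a fixed sample size):

```python
from scipy.stats import norm

se = 0.02057               # pooled standard error from the example
z_alpha = norm.ppf(0.975)  # ≈ 1.96 for a 5% two-tailed test

for power in (0.80, 0.90):
    mde = (z_alpha + norm.ppf(power)) * se
    print(f"power = {power:.0%}: MDE ≈ {mde:.4f}")
# power = 80%: MDE ≈ 0.0576
# power = 90%: MDE ≈ 0.0667
```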