# Statistical Methodology Behind the ab-test-analysis Skill: Z-Tests, Confidence Intervals, and Power Analysis

> Unlock the statistical methodology behind ab-test-analysis. Learn about Z-tests, confidence intervals, and power analysis for rigorous A/B testing.

- Repository: [Pawel Huryn/pm-skills](https://github.com/phuryn/pm-skills)
- Tags: deep-dive
- Published: 2026-07-01

---

**The ab-test-analysis skill implements a frequentist framework for A/B test analysis: a two-tailed Z-test or chi-squared test for significance, a 95% confidence interval for the effect estimate, and a sample-size check based on the binomial formula to confirm the experiment has adequate power.**

The **ab-test-analysis** skill in the `phuryn/pm-skills` repository provides a statistical audit for product experiments. Its methodology is defined in [`pm-data-analytics/skills/ab-test-analysis/SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/skills/ab-test-analysis/SKILL.md). The skill applies classical hypothesis testing to compare conversion rates, calculates relative lift, and checks that an experiment meets its power requirement before recommending a product decision.

## Hypothesis Testing and P-Value Calculation

The skill's primary significance test relies on comparing proportions between two independent groups. In [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md) lines 34-36, the methodology specifies either a **two-tailed Z-test** or a **chi-squared test** to generate the p-value for the null hypothesis that both groups share identical conversion rates.

The **Z-test for proportions** treats the control and variant as independent binomial samples. Under the null hypothesis the test statistic is approximately standard normal for large samples, so the resulting p-value is an approximation of the probability of observing a difference at least as large as the measured one if the groups were truly identical. Alternatively, the **chi-squared test** can be used: for a 2×2 table without continuity correction, the Pearson chi-squared statistic is the square of the pooled Z statistic, so both tests return the same two-tailed p-value. The chi-squared test also extends naturally to comparisons with more than two groups.
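To make the mechanics concrete, here is a minimal sketch of a pooled two-proportion Z-test using only the Python standard library. The function name and the example counts are hypothetical and are not taken from `SKILL.md`; the sketch assumes samples large enough for the normal approximation to hold.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x_control, n_control, x_variant, n_variant):
    """Two-tailed pooled Z-test for the difference of two proportions."""
    p_control = x_control / n_control
    p_variant = x_variant / n_variant
    # Pooled rate under the null hypothesis of equal conversion rates
    p_pool = (x_control + x_variant) / (n_control + n_variant)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    z = (p_variant - p_control) / se
    # Two-tailed p-value from the standard normal distribution
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 200/2000 control conversions, 250/2000 variant conversions
z, p_value = two_proportion_ztest(200, 2000, 250, 2000)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

The pooled standard error is used because the null hypothesis assumes both groups share one conversion rate.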

## Conversion Metrics and Effect Size

Before significance testing, the skill computes descriptive metrics that contextualize the practical impact.

### Conversion Rate and Relative Lift

According to lines 34-35 of [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md), the skill first calculates the **conversion rate** for each group as successes divided by total participants. It then computes the **relative lift** using the formula:

```python
lift = (p_variant - p_control) / p_control * 100
```

This percentage change indicates the business impact of the variant relative to the baseline control experience.

### Confidence Interval Construction

Line 37 of `SKILL.md` specifies the construction of a **95% confidence interval** for the difference between the two conversion rates. This interval gives a range of plausible values for the true population difference, complementing the p-value with information about how precisely the effect is estimated. The skill interprets intervals that exclude zero as evidence of a statistically significant directional effect.
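One common way to build such an interval is the Wald (normal-approximation) interval for a difference of proportions. The sketch below assumes that method; the source does not state which interval construction the skill uses, and the function name and counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(x_control, n_control, x_variant, n_variant,
                             confidence=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    p_control = x_control / n_control
    p_variant = x_variant / n_variant
    diff = p_variant - p_control
    # Unpooled standard error: each group contributes its own variance
    se = sqrt(p_control * (1 - p_control) / n_control
              + p_variant * (1 - p_variant) / n_variant)
    # Two-sided critical value, about 1.96 for 95% confidence
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: 200/2000 control conversions, 250/2000 variant conversions
ci_low, ci_high = diff_confidence_interval(200, 2000, 250, 2000)
print(f"95% CI for difference: {ci_low:.2%} to {ci_high:.2%}")
```

Both groups contribute to the standard error, which is why the interval for the difference is wider than an interval built from the variant alone.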

## Sample Size Validation and Statistical Power

The skill prevents false conclusions from under-powered experiments through rigorous sample size validation. Lines 26-28 of [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md) implement the classic binomial sample-size formula:

```
n = (Z_{α/2}^2 * 2 * p * (1-p)) / MDE^2
```

Where:

- **Z_{α/2}** represents the critical value for the desired confidence level (1.96 for 95% confidence)
- **p** is the pooled conversion rate
- **MDE** is the minimum detectable effect, expressed as an absolute difference in conversion rates (the same units as **p**)

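The skill flags tests with less than **80% power** (a Type II error rate β above 0.20), warning users when sample sizes are insufficient to detect meaningful differences reliably. As quoted, the formula has no explicit power term; the common textbook form replaces Z_{α/2}^2 with (Z_{α/2} + Z_β)^2, where Z_β ≈ 0.84 for 80% power.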
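The formula can be sketched as a small helper. The function name and example inputs are hypothetical. The optional `power` argument is an addition that is not part of the quoted formula: when supplied, the helper adds the Z_β term from the textbook version.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(p, mde, alpha=0.05, power=None):
    """Per-group sample size for detecting an absolute difference `mde`.

    With power=None this is the formula as quoted:
        n = Z_{alpha/2}^2 * 2 * p * (1 - p) / MDE^2
    With a power value (assumption, not in the quoted formula) it uses
        n = (Z_{alpha/2} + Z_beta)^2 * 2 * p * (1 - p) / MDE^2
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde ** 2
    return ceil(n)


# Hypothetical inputs: 10% baseline rate, 2 percentage point MDE
n_quoted = required_sample_size(p=0.10, mde=0.02)
n_powered = required_sample_size(p=0.10, mde=0.02, power=0.80)
print(f"Quoted formula: {n_quoted} per group")
print(f"With 80% power term: {n_powered} per group")
```

Because the MDE is squared in the denominator, halving the MDE roughly quadruples the required sample size.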

## Decision Framework and Implementation

The statistical outputs feed into a decision matrix defined in lines 49-55 of [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md). The skill renders a "Ship it" recommendation only when three conditions align: **p < 0.05** (line 38), the confidence interval excludes zero, and the relative lift meets business relevance thresholds.
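The three conditions can be expressed as a single check. This is only a sketch of the rule described above, not the skill's actual decision matrix: the function name, the `min_lift_pct` threshold, and the reading of "excludes zero" as "lies entirely above zero" for a positive change are all assumptions.

```python
def should_ship(p_value, ci_low, ci_high, lift_pct, min_lift_pct, alpha=0.05):
    """Return True only when all three conditions hold."""
    # 1. Statistical significance at the chosen alpha
    significant = p_value < alpha
    # 2. Interval for (variant - control) lies entirely above zero (assumed reading)
    ci_above_zero = 0 < ci_low <= ci_high
    # 3. Relative lift reaches the business threshold (hypothetical parameter)
    relevant = lift_pct >= min_lift_pct
    return significant and ci_above_zero and relevant


# Hypothetical inputs: significant result, 25% relative lift, 5% minimum lift
print(should_ship(p_value=0.012, ci_low=0.005, ci_high=0.045,
                  lift_pct=25.0, min_lift_pct=5.0))
```

Keeping the business threshold as an explicit parameter makes it clear that practical relevance is a product judgment rather than a statistical output.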

Below is a runnable Python implementation using the `statsmodels` library that replicates the skill's core statistical logic:

```python
import pandas as pd
import statsmodels.stats.proportion as smp

# Load raw experiment data (must contain `group` and `converted` columns)
df = pd.read_csv('experiment.csv')

# Aggregate conversions and participants per group
control = df[df['group'] == 'control']['converted'].sum()
control_n = df[df['group'] == 'control'].shape[0]
variant = df[df['group'] == 'variant']['converted'].sum()
variant_n = df[df['group'] == 'variant'].shape[0]

# Conversion rates and relative lift (in percent)
p_control = control / control_n
p_variant = variant / variant_n
lift = (p_variant - p_control) / p_control * 100

# Two-tailed Z-test (proportion comparison)
z_stat, p_val = smp.proportions_ztest([variant, control],
                                      [variant_n, control_n],
                                      alternative='two-sided')

# 95% Wald confidence interval for the difference (variant - control).
# This accounts for the sampling variance of both groups
# (confint_proportions_2indep requires statsmodels 0.12 or later).
ci_low, ci_upp = smp.confint_proportions_2indep(variant, variant_n,
                                                control, control_n,
                                                method='wald', compare='diff',
                                                alpha=0.05)

print(f'Control rate: {p_control:.2%}')
print(f'Variant rate: {p_variant:.2%}')
print(f'Relative lift: {lift:.2f}%')
print(f'p-value (Z-test): {p_val:.4f}')
print(f'95% CI for difference: {ci_low:.2%} to {ci_upp:.2%}')
```

## Summary

- **The ab-test-analysis skill** in `phuryn/pm-skills` applies frequentist statistical methods including two-tailed Z-tests or chi-squared tests to evaluate A/B experiments.
- **Significance testing** uses a standard **p < 0.05** threshold, while **95% confidence intervals** quantify the uncertainty around observed conversion rate differences.
- **Sample size validation** employs the binomial formula to ensure adequate statistical power (≥80%) before recommending decisions.
- **Practical and statistical significance** are both required: the skill checks for p-value significance, confidence interval exclusion of zero, and business-relevant lift magnitude before generating "Ship it" recommendations.

## Frequently Asked Questions

### What is the difference between the Z-test and chi-squared test in this skill?

Both tests examine whether conversion rates differ between control and variant groups. The **two-tailed Z-test** compares two proportions directly and reports a signed statistic, which makes the direction of the effect easy to read in A/B contexts. For two groups, the **chi-squared test** without continuity correction yields the same p-value, because its statistic is the square of the pooled Z statistic; it also extends to multi-variant analysis. According to [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md) line 36, the skill accepts either method to calculate the p-value.
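The relationship between the two tests can be checked numerically. The sketch below computes the Pearson chi-squared statistic for a 2×2 table by hand (no continuity correction) and compares it with the squared pooled Z statistic. Function names and counts are hypothetical.

```python
from math import erfc, sqrt


def chi2_2x2(x_control, n_control, x_variant, n_variant):
    """Pearson chi-squared test on a 2x2 table, no continuity correction."""
    observed = [[x_control, n_control - x_control],
                [x_variant, n_variant - x_variant]]
    total = n_control + n_variant
    row_totals = [n_control, n_variant]
    col_totals = [x_control + x_variant, total - x_control - x_variant]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


def pooled_z(x_control, n_control, x_variant, n_variant):
    """Pooled two-proportion Z statistic."""
    p_pool = (x_control + x_variant) / (n_control + n_variant)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    return (x_variant / n_variant - x_control / n_control) / se


# Hypothetical counts: 200/2000 control conversions, 250/2000 variant conversions
chi2, p_chi2 = chi2_2x2(200, 2000, 250, 2000)
z = pooled_z(200, 2000, 250, 2000)
print(f"chi2 = {chi2:.4f}, z^2 = {z ** 2:.4f}, p = {p_chi2:.4f}")
```

Libraries that apply a continuity correction by default will report a slightly larger p-value than the Z-test on the same data.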

### How does the skill handle sample size validation?

The skill uses the binomial sample-size formula found in [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md) lines 26-28: `n = (Z_{α/2}^2 * 2 * p * (1-p)) / MDE^2`. This gives the required sample size per variant for a given minimum detectable effect (MDE), and the skill pairs it with an 80% power requirement. If the actual sample size falls below this threshold, the skill flags the test as under-powered and advises against drawing conclusions.

### Why does the skill use a 95% confidence level?

The 95% confidence level balances Type I error control (false positives) with practical decision-making needs. As implemented in line 37 of [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md), the **95% confidence interval** provides a range of plausible values for the true difference between groups. If this interval excludes zero, the skill confirms a statistically significant directional effect, complementing the p-value with effect size information.

### What triggers the "Ship it" recommendation in the decision matrix?

The decision matrix in [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/SKILL.md) lines 49-55 requires three simultaneous conditions: a **p-value below 0.05** (indicating statistical significance), a **95% confidence interval** that excludes zero (confirming directional consistency), and a **relative lift** magnitude that meets business relevance thresholds. The skill renders "Ship it" only when statistical rigor aligns with practical business impact.