Statistical Methodology Behind the ab-test-analysis Skill: Z-Tests, Confidence Intervals, and Power Analysis
The ab-test-analysis skill implements a frequentist statistical framework that combines two-tailed Z-tests or chi-squared tests for significance, 95% confidence intervals for effect estimation, and binomial sample-size power validation to ensure rigorous A/B test analysis.
The ab-test-analysis skill in the phuryn/pm-skills repository provides a statistical audit for product experiments. Its methodology is defined in pm-data-analytics/skills/ab-test-analysis/SKILL.md and follows a frequentist framework: classical hypothesis tests evaluate conversion rates, relative lift quantifies the change, and a power check confirms the experiment is large enough before the skill recommends a product decision.
Hypothesis Testing and P-Value Calculation
The skill's primary significance test relies on comparing proportions between two independent groups. In SKILL.md lines 34-36, the methodology specifies either a two-tailed Z-test or a chi-squared test to generate the p-value for the null hypothesis that both groups share identical conversion rates.
The Z-test for proportions treats the control and variant as independent binomial samples. For large samples, the test statistic approximately follows the standard normal distribution under the null hypothesis, so the p-value approximates the probability of observing a difference at least as extreme as the measured one if the groups were truly identical. For a two-group comparison, Pearson's chi-squared test without continuity correction yields the same p-value, because its statistic is the square of the Z statistic. The chi-squared test also extends naturally to comparisons of more than two variants.
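The relationship between the two tests can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the example counts (200 of 4,000 versus 250 of 4,000) are illustrative assumptions, not values taken from SKILL.md.

```python
import math

def two_proportion_ztest(x_a, n_a, x_b, n_b):
    """Pooled two-tailed Z-test for the difference of two proportions."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)           # shared rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value

def pearson_chi2_2x2(x_a, n_a, x_b, n_b):
    """Pearson chi-squared test (no continuity correction) on the 2x2 table."""
    observed = [[x_a, n_a - x_a], [x_b, n_b - x_b]]
    row_totals = [n_a, n_b]
    col_totals = [x_a + x_b, (n_a - x_a) + (n_b - x_b)]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))     # chi-squared tail area, 1 df
    return chi2, p_value

# Hypothetical counts: control 200/4000, variant 250/4000
z, p_z = two_proportion_ztest(200, 4000, 250, 4000)
chi2, p_chi2 = pearson_chi2_2x2(200, 4000, 250, 4000)
print(f"z = {z:.3f}, z^2 = {z**2:.3f}, chi2 = {chi2:.3f}")
print(f"p (Z-test) = {p_z:.4f}, p (chi-squared) = {p_chi2:.4f}")
```

The two p-values agree up to floating-point error, which is why either test can serve as the significance check for a two-group experiment.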
Conversion Metrics and Effect Size
Before significance testing, the skill computes descriptive metrics that contextualize the practical impact.
Conversion Rate and Relative Lift
According to lines 34-35 of SKILL.md, the skill first calculates the conversion rate for each group as successes divided by total participants. It then computes the relative lift using the formula:
lift = (p_variant - p_control) / p_control * 100
This percentage change indicates the business impact of the variant relative to the baseline control experience.
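As a worked example of the lift formula, here is a short sketch in Python. The helper names and the counts are hypothetical; the zero-rate guard is an added assumption, since relative lift is undefined when the control never converts.

```python
def conversion_rate(conversions, participants):
    """Successes divided by total participants."""
    if participants <= 0:
        raise ValueError("participants must be positive")
    return conversions / participants

def relative_lift(p_control, p_variant):
    """Percentage change of the variant rate relative to the control rate."""
    if p_control == 0:
        raise ValueError("relative lift is undefined when the control rate is zero")
    return (p_variant - p_control) / p_control * 100

# Hypothetical counts: control 200/4000 (5.00%), variant 250/4000 (6.25%)
p_control = conversion_rate(200, 4000)
p_variant = conversion_rate(250, 4000)
lift = relative_lift(p_control, p_variant)
print(f"Relative lift: {lift:.1f}%")  # prints "Relative lift: 25.0%"
```

Note that a 1.25 percentage point absolute difference corresponds to a 25% relative lift here, so the two figures should not be confused when reporting results.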
Confidence Interval Construction
Line 37 specifies the construction of a 95% confidence interval for the difference between the two conversion rates. This interval provides a range of plausible values for the true population difference, complementing the p-value with information about effect size precision. The skill interprets intervals that exclude zero as evidence of a statistically significant directional effect.
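One common way to build such an interval is the Wald interval, which uses the unpooled standard error of the difference. The sketch below assumes that method; the article does not state which interval construction SKILL.md prescribes, and the function name and counts are illustrative.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(x_c, n_c, x_v, n_v, confidence=0.95):
    """Wald confidence interval for (variant rate - control rate)."""
    p_c, p_v = x_c / n_c, x_v / n_v
    # Unpooled standard error: each group contributes its own variance
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    diff = p_v - p_c
    return diff - z_crit * se, diff + z_crit * se

# Hypothetical counts: control 200/4000, variant 250/4000
ci_low, ci_high = diff_confidence_interval(200, 4000, 250, 4000)
excludes_zero = ci_low > 0 or ci_high < 0
print(f"95% CI for difference: {ci_low:.2%} to {ci_high:.2%}")
print(f"Excludes zero: {excludes_zero}")
```

The interval is centered on the observed difference, and its width shrinks as either group's sample size grows.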
Sample Size Validation and Statistical Power
The skill prevents false conclusions from under-powered experiments through rigorous sample size validation. Lines 26-28 of SKILL.md implement the classic binomial sample-size formula:
n = (Z_{α/2}^2 * 2 * p * (1-p)) / MDE^2
Where:
- Z_{α/2} represents the critical value for the desired confidence level (1.96 for 95% confidence)
- p is the pooled conversion rate
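- MDE is the minimum detectable effect, expressed as an absolute difference in conversion rates (for example, 0.01 for one percentage point)
The skill flags tests with less than 80% power (a Type II error rate β above 0.20), warning users when sample sizes are insufficient to detect meaningful differences reliably. As written, the formula contains only the significance term Z_{α/2}; the common textbook variant that targets a specific power replaces Z_{α/2}^2 with (Z_{α/2} + Z_β)^2, where Z_β ≈ 0.84 for 80% power.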
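The sample-size calculation can be sketched in a few lines of Python. With `power=None` the function evaluates the formula exactly as quoted above. The optional `power` argument adds the Z_β term from the standard textbook formula; that extension is an assumption here, not something the article attributes to SKILL.md. The function name, baseline rate, and MDE are likewise illustrative.

```python
import math
from statistics import NormalDist

def required_sample_size(p, mde, alpha=0.05, power=None):
    """Per-group sample size for comparing two proportions.

    p     : baseline (pooled) conversion rate
    mde   : minimum detectable effect as an absolute difference in rates
    power : if None, use only the significance term, as in the quoted formula;
            otherwise add the z_beta term (textbook extension, an assumption)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power) if power is not None else 0.0
    n = (z_alpha + z_beta) ** 2 * 2 * p * (1 - p) / mde ** 2
    return math.ceil(n)

# Hypothetical planning inputs: 5% baseline rate, 1 percentage point MDE
n_as_quoted = required_sample_size(p=0.05, mde=0.01)
n_with_power = required_sample_size(p=0.05, mde=0.01, power=0.80)
print(f"Per-group n (formula as quoted): {n_as_quoted}")
print(f"Per-group n (80% power term included): {n_with_power}")
```

Including the power term roughly doubles the required sample in this example, so it matters which version of the formula an experiment was planned against.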
Decision Framework and Implementation
The statistical outputs feed into a decision matrix defined in lines 49-55 of SKILL.md. The skill renders a "Ship it" recommendation only when three conditions align: p < 0.05 (line 38), the confidence interval excludes zero, and the relative lift meets business relevance thresholds.
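The three-condition rule can be expressed as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, the `min_lift_pct` threshold is a placeholder for whatever a team considers business-relevant, and the check requires the interval to lie above zero because shipping only makes sense for a positive effect. The other outcomes of the decision matrix are not described in this article and are not modeled.

```python
def should_ship(p_value, ci_low, ci_high, lift_pct,
                alpha=0.05, min_lift_pct=2.0):
    """Return True only when all three 'Ship it' conditions hold.

    min_lift_pct is a hypothetical business-relevance threshold (assumption).
    """
    statistically_significant = p_value < alpha
    # "Excludes zero" is narrowed to "entirely above zero" (assumption)
    ci_above_zero = ci_low > 0 and ci_high > 0
    practically_relevant = lift_pct >= min_lift_pct
    return statistically_significant and ci_above_zero and practically_relevant

# Hypothetical results for three experiments
print(should_ship(p_value=0.015, ci_low=0.0024, ci_high=0.0226, lift_pct=25.0))
print(should_ship(p_value=0.200, ci_low=-0.0030, ci_high=0.0120, lift_pct=9.0))
print(should_ship(p_value=0.010, ci_low=0.0001, ci_high=0.0020, lift_pct=1.0))
```

The first call passes all three checks, the second fails on significance and on the interval, and the third is statistically significant but falls below the assumed relevance threshold.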
Below is a Python sketch using pandas and statsmodels that approximates the skill's core statistical logic. The file name and column names are placeholders, and the confidence interval uses the Wald method as one reasonable choice:
import math
import pandas as pd
import statsmodels.stats.proportion as smp

# Load raw experiment data (must contain `group` and `converted` columns,
# with `converted` coded as 0/1)
df = pd.read_csv('experiment.csv')

# Aggregate conversions and sample sizes per group
control = df[df.group == 'control']['converted'].sum()
control_n = df[df.group == 'control'].shape[0]
variant = df[df.group == 'variant']['converted'].sum()
variant_n = df[df.group == 'variant'].shape[0]

# Conversion rates and relative lift (in percent)
p_control = control / control_n
p_variant = variant / variant_n
lift = (p_variant - p_control) / p_control * 100

# Two-tailed Z-test (proportion comparison)
z_stat, p_val = smp.proportions_ztest([variant, control],
                                      [variant_n, control_n],
                                      alternative='two-sided')

# 95% confidence interval for the difference (Wald interval).
# The standard error combines the sampling variance of BOTH groups.
diff = p_variant - p_control
se_diff = math.sqrt(p_control * (1 - p_control) / control_n
                    + p_variant * (1 - p_variant) / variant_n)
ci_diff = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

print(f'Control rate: {p_control:.2%}')
print(f'Variant rate: {p_variant:.2%}')
print(f'Relative lift: {lift:.2f}%')
print(f'p-value (Z-test): {p_val:.4f}')
print(f'95% CI for difference: {ci_diff[0]:.2%} to {ci_diff[1]:.2%}')
Summary
- The ab-test-analysis skill in phuryn/pm-skills applies frequentist statistical methods including two-tailed Z-tests or chi-squared tests to evaluate A/B experiments.
- Significance testing uses a standard p < 0.05 threshold, while 95% confidence intervals quantify the uncertainty around observed conversion rate differences.
- Sample size validation employs the binomial formula to ensure adequate statistical power (≥80%) before recommending decisions.
- Practical and statistical significance are both required: the skill checks for p-value significance, confidence interval exclusion of zero, and business-relevant lift magnitude before generating "Ship it" recommendations.
Frequently Asked Questions
What is the difference between the Z-test and chi-squared test in this skill?
Both tests examine whether conversion rates differ between control and variant groups. The two-tailed Z-test specifically compares two proportions and is preferred for its direct interpretability in A/B contexts. For two groups, the chi-squared test without continuity correction gives the same p-value, and it also extends to multi-variant analysis. According to SKILL.md line 36, the skill accepts either method to calculate the p-value.
How does the skill handle sample size validation?
The skill uses the binomial sample-size formula found in SKILL.md lines 26-28: n = (Z_{α/2}^2 * 2 * p * (1-p)) / MDE^2. This estimates the required sample size per variant to detect a given minimum detectable effect (MDE) at the chosen confidence level, and the skill checks the test against an 80% power target. If the actual sample size falls below the required threshold, the skill flags the test as under-powered and advises against drawing conclusions.
Why does the skill use a 95% confidence level?
The 95% confidence level balances Type I error control (false positives) with practical decision-making needs. As implemented in line 37 of SKILL.md, the 95% confidence interval provides a range of plausible values for the true difference between groups. If this interval excludes zero, the skill confirms a statistically significant directional effect, complementing the p-value with effect size information.
What triggers the "Ship it" recommendation in the decision matrix?
The decision matrix in SKILL.md lines 49-55 requires three simultaneous conditions: a p-value below 0.05 (indicating statistical significance), a 95% confidence interval that excludes zero (confirming directional consistency), and a relative lift magnitude that meets business relevance thresholds. The skill renders "Ship it" only when statistical rigor aligns with practical business impact.