How to Perform A/B Test Statistical Analysis with PM Skills
PM Skills provides a dedicated A/B test analysis skill that automates rigorous statistical significance testing, sample-size validation, and product decision recommendations through the /analyze-test command.
The phuryn/pm-skills repository delivers a comprehensive analytics framework designed for product managers who need to evaluate experiments without hand-rolling statistical code. Its A/B test statistical analysis capability lives in pm-data-analytics/skills/ab-test-analysis/SKILL.md and executes via the command defined in pm-data-analytics/commands/analyze-test.md, orchestrating a four-phase workflow that validates design integrity before calculating significance.
How the A/B Test Analysis Skill Works
The skill is implemented as a markdown-based instruction set that the PM Skills engine parses, making it portable across chat interfaces, CLI tools, or Slack bots without additional dependencies beyond Python for generated scripts.
Core File Locations
Two files define the complete analysis pipeline:
- pm-data-analytics/skills/ab-test-analysis/SKILL.md – Contains the step-by-step logic, validation formulas, and decision matrix for interpreting results.
- pm-data-analytics/commands/analyze-test.md – Defines the /analyze-test command interface, input formats (summary stats, CSV, or free text), and output templates.
The Four-Phase Statistical Workflow
The skill enforces a rigorous sequence that prevents premature conclusions:
- Capture experiment context – Records the hypothesis, variant names, primary and guardrail metrics, duration, and traffic split.
- Validate test design – Automatically checks sample-size adequacy using power-analysis formulas, verifies test duration spans at least 1–2 business cycles, detects sample-ratio-mismatch (SRM) to catch randomization failures, and flags potential novelty effects.
- Calculate statistical significance – Computes conversion rates, relative lift, two-tailed z-test (or chi-squared) p-values, and 95% confidence intervals.
- Interpret and recommend – Maps the significance matrix to a clear product decision: Ship, Extend, Stop, or Investigate.
Running A/B Test Statistical Analysis with /analyze-test
The command accepts multiple input formats, triggering different analysis depths depending on data availability.
Method 1: Quick Analysis with Summary Statistics
For rapid decisions when you only have aggregated metrics, invoke the command with inline statistics:
/analyze-test Control: 4.2% conversion (n=5000), Variant: 4.8% conversion (n=5100)
The command parses these values, runs the built-in statistical routine, and returns a markdown summary following the template in analyze-test.md.
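As a rough illustration of what a calculation from summary statistics involves, the sketch below runs a two-sided pooled z-test on the numbers from the command above. The function name `two_proportion_ztest` is hypothetical and the code uses only the Python standard library; it is not taken from the skill itself.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(p_control, n_control, p_variant, n_variant):
    """Two-sided pooled z-test from conversion rates and sample sizes."""
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (p_control * n_control + p_variant * n_variant) / (n_control + n_variant)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_control + 1 / n_variant))
    z = (p_variant - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    relative_lift = (p_variant - p_control) / p_control
    return z, p_value, relative_lift


# Values from the example command: 4.2% (n=5000) vs 4.8% (n=5100)
z, p_value, lift = two_proportion_ztest(0.042, 5000, 0.048, 5100)
print(f"z = {z:.2f}, p-value = {p_value:.3f}, lift = {lift:+.1%}")
```

With these particular sample sizes the p-value comes out at roughly 0.15, so a 0.6 percentage point difference is not significant at the 5% level on about 5,000 users per arm. The same rates would need considerably more traffic to reach significance.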
Method 2: Deep Analysis with Raw CSV Data
For full statistical rigor, supply raw event data. The skill expects a CSV with columns user_id, variant, and converted:
/analyze-test
# attach file: ab_test_results.csv
When raw data is detected, the skill generates and executes a Python script that performs the complete calculation pipeline:
import pandas as pd
from scipy import stats
df = pd.read_csv('ab_test_results.csv')
control = df[df.variant == 'control']
variant = df[df.variant == 'variant']
# conversion rates
p_control = control.converted.mean()
p_variant = variant.converted.mean()
# lift calculation
lift = (p_variant - p_control) / p_control * 100
# two-tailed z-test with pooled standard error (valid under the null of no difference)
n_control = len(control)
n_variant = len(variant)
p_pool = (control.converted.sum() + variant.converted.sum()) / (n_control + n_variant)
se_pooled = (p_pool * (1 - p_pool) * (1/n_control + 1/n_variant)) ** 0.5
z = (p_variant - p_control) / se_pooled
p_value = 2 * stats.norm.sf(abs(z))
# 95% confidence interval for the difference, using the unpooled standard error
se_diff = (p_control * (1 - p_control) / n_control + p_variant * (1 - p_variant) / n_variant) ** 0.5
ci_low = (p_variant - p_control) - 1.96 * se_diff
ci_high = (p_variant - p_control) + 1.96 * se_diff
print(f"Control CR: {p_control:.3%} (n={n_control})")
print(f"Variant CR: {p_variant:.3%} (n={n_variant})")
print(f"Lift: {lift:.2f}%")
print(f"P-value: {p_value:.4f}")
print(f"95% CI: [{ci_low:.3%}, {ci_high:.3%}]")
This script yields the exact metrics inserted into the final recommendation table.
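The workflow mentions a chi-squared test as an alternative to the z-test. For a 2x2 table the two are equivalent: the Pearson chi-squared statistic without continuity correction equals the square of the pooled z statistic. The sketch below checks this with the standard library only. The conversion counts and helper names are hypothetical, chosen for illustration.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared statistic (no continuity correction) for a 2x2 table."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, total - conv_a - conv_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return chi2, erfc(sqrt(chi2 / 2))


def pooled_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic from raw counts."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    return (p_b - p_a) / sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))


# Hypothetical counts: 210 of 5000 control users and 245 of 5100 variant users converted
chi2, p_chi2 = pearson_chi2_2x2(210, 5000, 245, 5100)
z = pooled_z(210, 5000, 245, 5100)
p_z = erfc(abs(z) / sqrt(2))  # two-sided normal p-value

print(f"chi2 = {chi2:.4f}, z^2 = {z * z:.4f}")
print(f"p (chi2) = {p_chi2:.4f}, p (z) = {p_z:.4f}")
```

If you swap in `scipy.stats.chi2_contingency`, note that it applies Yates' continuity correction to 2x2 tables by default, so its p-value will differ slightly from the z-test unless you pass `correction=False`.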
Validating Statistical Rigor Before Calculation
The skill automatically guards against false positives by enforcing pre-calculation checks defined in SKILL.md:
- Sample Size Adequacy – Validates that enrollment meets the minimum required to detect the minimum detectable effect (MDE) using standard power-analysis formulas.
- Duration Check – Ensures the test ran for at least 1–2 full business cycles to account for daily and weekly seasonality.
- Randomization Validation – Calculates the sample-ratio-mismatch (SRM) p-value to detect bugs in traffic allocation that could invalidate results.
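A sample-ratio-mismatch check of the kind described above can be sketched as a chi-squared goodness-of-fit test against the intended traffic split. The function name, the 50/50 default split, and the 0.001 alert threshold are assumptions for illustration; the exact rule in SKILL.md may differ.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for the observed traffic split (1 df)."""
    total = n_control + n_variant
    exp_control = total * expected_control_share
    exp_variant = total - exp_control
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_variant - exp_variant) ** 2 / exp_variant)
    # Survival function of chi-squared with 1 degree of freedom
    return erfc(sqrt(chi2 / 2))


SRM_ALPHA = 0.001  # assumed alert threshold, stricter than the usual 0.05

p = srm_p_value(5000, 5100)
status = "possible SRM" if p < SRM_ALPHA else "split looks consistent"
print(f"SRM p-value = {p:.3f}: {status}")
```

A strict threshold is a common choice here because an SRM alert questions the validity of the whole experiment, so false alarms are costly.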
Interpreting Results: The Decision Matrix
After calculation, the skill maps the statistical output to a product decision using the built-in decision table:
## A/B Test Results: Checkout CTA Experiment
**Hypothesis**: New CTA increases checkout conversion.
| Metric | Control | Variant | Lift | P-value | Significant? |
|--------|---------|---------|------|---------|--------------|
| Conversion | 4.20% | 4.80% | +14.3% | 0.0012 | Yes |
| Revenue (guardrail) | $1.20 | $1.18 | -1.7% | – | No concern |
**Recommendation:** **SHIP** – lift is statistically and practically significant, guardrails unchanged.
The decision logic follows four outcomes:
- Ship – Statistically significant positive lift with no guardrail degradation.
- Extend – Directionally positive but underpowered; requires more samples.
- Stop – Statistically significant negative impact or failed guardrails.
- Investigate – Inconclusive results with anomalies requiring deeper segmentation.
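The four outcomes above can be read as a small rule table. The sketch below is one possible encoding, assuming a hypothetical `recommend` function and boolean inputs for power, guardrails, and anomalies; the actual decision matrix in SKILL.md may order or weight these conditions differently.

```python
def recommend(lift, p_value, adequately_powered, guardrails_ok,
              anomalies=False, alpha=0.05):
    """Map test results to Ship / Extend / Stop / Investigate (illustrative rules)."""
    significant = p_value < alpha
    if not guardrails_ok:
        return "Stop"          # failed guardrails override any primary-metric win
    if significant and lift < 0:
        return "Stop"          # significant negative impact
    if significant and lift > 0:
        return "Ship"          # significant positive lift, guardrails intact
    if lift > 0 and not adequately_powered and not anomalies:
        return "Extend"        # directionally positive but underpowered
    return "Investigate"       # inconclusive, or anomalies need segmentation


print(recommend(lift=0.143, p_value=0.0012, adequately_powered=True, guardrails_ok=True))
print(recommend(lift=0.143, p_value=0.146, adequately_powered=False, guardrails_ok=True))
```

The first call corresponds to the results table above and returns Ship. The second shows the same lift on an underpowered sample, which falls through to Extend.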
Summary
- A/B test statistical analysis in PM Skills is defined in pm-data-analytics/skills/ab-test-analysis/SKILL.md and invoked via /analyze-test from pm-data-analytics/commands/analyze-test.md.
- The workflow enforces four phases: context capture, design validation, significance calculation, and decision recommendation.
- Input methods include summary statistics for quick checks or raw CSV data for full Python-based analysis using pandas and scipy.stats.
- Automatic validation checks for sample-ratio-mismatch, sample size adequacy, and business-cycle duration prevent false positives.
- Output follows a structured decision matrix recommending Ship, Extend, Stop, or Investigate based on statistical and business significance.
Frequently Asked Questions
How does PM Skills handle sample size validation?
The skill applies standard power-analysis formulas to verify that the enrolled population can detect the defined minimum detectable effect (MDE) with adequate statistical power. If the sample is underpowered, the recommendation shifts to Extend rather than interpreting the current p-value.
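For reference, a common textbook power-analysis formula for comparing two proportions is sketched below. The function name and the defaults (two-sided alpha of 0.05, 80% power) are assumptions; the source does not state which formula or defaults SKILL.md uses.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_per_arm(p1, p2, alpha=0.05, power=0.80):
    """Approximate users needed per arm to detect a change from rate p1 to rate p2."""
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Baseline 4.2% conversion, minimum detectable effect of +0.6 percentage points
n = required_n_per_arm(0.042, 0.048)
print(f"Required sample size per arm: {n}")
```

Under these assumptions the requirement is roughly 18,700 users per arm, which is why a test with about 5,000 users per arm at these rates would be treated as underpowered.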
Can I analyze continuous metrics like revenue per user, not just conversion rates?
While the provided Python template focuses on binary conversion outcomes (using z-tests for proportions), the skill architecture supports custom metrics. You would modify the generated script to use t-tests or Mann-Whitney U tests for continuous distributions, though the core command structure and decision framework remain identical.
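As a minimal sketch of such a modification, the code below computes Welch's t statistic for a continuous metric using only the standard library. It approximates the p-value with the normal distribution, which is reasonable only for large samples; this is a simplification, not the skill's method. The revenue data is simulated and the function name is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_large_sample(control, variant):
    """Welch's t statistic with a normal-approximation two-sided p-value."""
    se = sqrt(variance(control) / len(control) + variance(variant) / len(variant))
    t = (mean(variant) - mean(control)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, p_value


# Simulated revenue per user (skewed, as revenue data often is); not real data
rng = random.Random(42)
control = [rng.expovariate(1 / 1.20) for _ in range(2000)]
variant = [rng.expovariate(1 / 1.18) for _ in range(2000)]

t, p_value = welch_large_sample(control, variant)
print(f"Welch t = {t:.2f}, approximate p-value = {p_value:.3f}")
```

With SciPy available, `scipy.stats.ttest_ind(control, variant, equal_var=False)` gives the exact Welch p-value, and `scipy.stats.mannwhitneyu(control, variant)` is the rank-based alternative for heavily skewed metrics.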
What are guardrail metrics, and how do they affect the Ship decision?
Guardrail metrics are secondary health indicators (e.g., revenue per session, page load time, support tickets) that must not degrade significantly even if the primary metric improves. Before issuing a Ship recommendation, the skill requires that no guardrail shows a statistically significant degradation, meaning its p-value stays above your significance threshold (typically 0.05). This prevents optimizing one metric at the expense of overall user experience.
Does PM Skills require a specific Python environment to run the statistical scripts?
The skill generates standalone Python scripts that import only pandas and scipy.stats, so they need no PM Skills runtime: a standard Python 3 interpreter with pandas and SciPy installed is enough. This allows the analysis to run locally, in CI/CD pipelines, or within containerized chatbot environments.