# How to Perform A/B Test Statistical Analysis with PM Skills

> Master A/B test statistical analysis using PM Skills. Automate significance testing, validate sample sizes, and get product decision recommendations with the `/analyze-test` command.

- Repository: [Pawel Huryn/pm-skills](https://github.com/phuryn/pm-skills)
- Tags: how-to-guide
- Published: 2026-07-02

---

**PM Skills provides a dedicated A/B test analysis skill that automates rigorous statistical significance testing, sample-size validation, and product decision recommendations through the `/analyze-test` command.**

The `phuryn/pm-skills` repository delivers a comprehensive analytics framework designed for product managers who need to evaluate experiments without hand-rolling statistical code. Its **A/B test statistical analysis** capability lives in [`pm-data-analytics/skills/ab-test-analysis/SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/skills/ab-test-analysis/SKILL.md) and executes via the command defined in [`pm-data-analytics/commands/analyze-test.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/commands/analyze-test.md), orchestrating a four-phase workflow that validates design integrity before calculating significance.

## How the A/B Test Analysis Skill Works

The skill is implemented as a Markdown instruction set that the PM Skills engine parses, so it is portable across chat interfaces, CLI tools, and Slack bots. The only additional dependency is Python, which is needed to run the scripts the skill generates.

### Core File Locations

Two files define the complete analysis pipeline:

- **[`pm-data-analytics/skills/ab-test-analysis/SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/skills/ab-test-analysis/SKILL.md)** – Contains the step-by-step logic, validation formulas, and decision matrix for interpreting results.
- **[`pm-data-analytics/commands/analyze-test.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/commands/analyze-test.md)** – Defines the `/analyze-test` command interface, input formats (summary stats, CSV, or free text), and output templates.

### The Four-Phase Statistical Workflow

The skill enforces a rigorous sequence that prevents premature conclusions:

1. **Capture experiment context** – Records the hypothesis, variant names, primary and guardrail metrics, duration, and traffic split.
2. **Validate test design** – Automatically checks **sample-size adequacy** using power-analysis formulas, verifies test duration spans at least 1–2 business cycles, detects **sample-ratio-mismatch** (SRM) to catch randomization failures, and flags potential novelty effects.
3. **Calculate statistical significance** – Computes conversion rates, relative lift, **two-tailed z-test** (or chi-squared) **p-values**, and **95% confidence intervals**.
4. **Interpret and recommend** – Maps the significance matrix to a clear product decision: **Ship**, **Extend**, **Stop**, or **Investigate**.
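To make phase 1 and the duration check in phase 2 concrete, here is a minimal sketch of how the captured context could be represented. The `ExperimentContext` class, its field names, and the assumption that one business cycle equals a 7-day week are all hypothetical and not taken from the repository; the example values are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ExperimentContext:
    """Hypothetical container for the phase-1 inputs (not part of PM Skills)."""
    hypothesis: str
    variants: tuple[str, str]                      # (control name, treatment name)
    primary_metric: str
    guardrail_metrics: list[str] = field(default_factory=list)
    duration_days: int = 0
    traffic_split: tuple[float, float] = (0.5, 0.5)

    def covers_business_cycles(self, min_cycles: int = 1, cycle_days: int = 7) -> bool:
        """True if the test ran for at least `min_cycles` full cycles.

        Assumes one business cycle is a 7-day week.
        """
        return self.duration_days >= min_cycles * cycle_days


# Illustrative values only
ctx = ExperimentContext(
    hypothesis="New CTA increases checkout conversion.",
    variants=("control", "variant"),
    primary_metric="checkout conversion",
    guardrail_metrics=["revenue"],
    duration_days=14,
)
print(ctx.covers_business_cycles(min_cycles=2))
```

Keeping the context in one structure makes it easy to run the design checks before any significance calculation touches the data.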

## Running A/B Test Statistical Analysis with /analyze-test

The command accepts multiple input formats, triggering different analysis depths depending on data availability.

### Method 1: Quick Analysis with Summary Statistics

For rapid decisions when you only have aggregated metrics, invoke the command with inline statistics:

```text
/analyze-test Control: 4.2% conversion (n=5000), Variant: 4.8% conversion (n=5100)
```

The command parses these values, runs the built-in statistical routine, and returns a markdown summary following the template in [`analyze-test.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/commands/analyze-test.md).
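As a rough illustration of what a quick analysis involves, the sketch below runs a two-sided pooled z-test directly on the summary statistics from the example invocation. The function name `two_proportion_z_test` is hypothetical, and this is an approximation of the technique, not the skill's actual routine.

```python
from math import sqrt

from scipy.stats import norm


def two_proportion_z_test(p_c, n_c, p_v, n_v):
    """Two-sided pooled z-test from summary statistics. Returns (z, p_value)."""
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (p_c * n_c + p_v * n_v) / (n_c + n_v)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    # Survival function gives the upper tail; double it for a two-sided test
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


z, p_value = two_proportion_z_test(0.042, 5000, 0.048, 5100)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these inputs the p-value comes out at about 0.15, above the usual 0.05 threshold. A 14% relative lift can therefore still be inconclusive at roughly 5,000 users per arm, which is why the design checks matter.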

### Method 2: Deep Analysis with Raw CSV Data

For full statistical rigor, supply raw event data. The skill expects a CSV with columns `user_id`, `variant`, and `converted`:

```text
/analyze-test
# attach file: ab_test_results.csv
```

When raw data is detected, the skill generates and executes a Python script that performs the complete calculation pipeline:

```python
import pandas as pd
from scipy import stats

df = pd.read_csv('ab_test_results.csv')
control = df[df.variant == 'control']
variant = df[df.variant == 'variant']

# Sample sizes and conversion rates (assumes `converted` is coded 0/1)
n_control = len(control)
n_variant = len(variant)
p_control = control.converted.mean()
p_variant = variant.converted.mean()
diff = p_variant - p_control

# Relative lift of the variant over control, in percent
lift = diff / p_control * 100

# Two-tailed z-test: pooled standard error under the null hypothesis
p_pool = (control.converted.sum() + variant.converted.sum()) / (n_control + n_variant)
se_pooled = (p_pool * (1 - p_pool) * (1/n_control + 1/n_variant)) ** 0.5
z = diff / se_pooled
p_value = 2 * stats.norm.sf(abs(z))

# 95% confidence interval for the difference: unpooled standard error,
# because the interval should not assume the two rates are equal
se_diff = (p_control * (1 - p_control) / n_control
           + p_variant * (1 - p_variant) / n_variant) ** 0.5
z_crit = stats.norm.ppf(0.975)
ci_low = diff - z_crit * se_diff
ci_high = diff + z_crit * se_diff

print(f"Control CR: {p_control:.3%} (n={n_control})")
print(f"Variant CR: {p_variant:.3%} (n={n_variant})")
print(f"Lift: {lift:.2f}%")
print(f"P-value: {p_value:.4f}")
print(f"95% CI: [{ci_low:.3%}, {ci_high:.3%}]")
```

This script yields the exact metrics inserted into the final recommendation table.

## Validating Statistical Rigor Before Calculation

Before calculating significance, the skill runs the pre-calculation checks defined in [`SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/skills/ab-test-analysis/SKILL.md) to reduce the risk of false positives:

- **Sample Size Adequacy** – Validates that enrollment meets the minimum required to detect the minimum detectable effect (MDE) using standard power-analysis formulas.
- **Duration Check** – Ensures the test ran for at least 1–2 full business cycles to account for daily and weekly seasonality.
- **Randomization Validation** – Calculates the **sample-ratio-mismatch** (SRM) p-value to detect bugs in traffic allocation that could invalidate results.
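The randomization check can be sketched as a chi-squared goodness-of-fit test on the observed group sizes. The helper name `srm_p_value` and the 0.001 alert cutoff are assumptions for illustration (a strict cutoff is a common convention for SRM alerts, but it is not taken from `SKILL.md`).

```python
from scipy.stats import chisquare


def srm_p_value(n_control, n_variant, expected_split=(0.5, 0.5)):
    """Chi-squared goodness-of-fit p-value for the observed traffic split."""
    total = n_control + n_variant
    # Expected counts if traffic was allocated exactly as designed
    expected = [total * share for share in expected_split]
    _, p_value = chisquare([n_control, n_variant], f_exp=expected)
    return p_value


p_srm = srm_p_value(5000, 5100)
# Cutoff of 0.001 is an assumption, chosen to avoid flagging ordinary noise
verdict = "investigate allocation" if p_srm < 0.001 else "split looks consistent"
print(f"SRM p-value: {p_srm:.3f} -> {verdict}")
```

For a 5,000 vs. 5,100 split under an intended 50/50 allocation, the p-value is far above the cutoff, so the imbalance is consistent with chance. A split such as 5,000 vs. 6,000 would be flagged.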

## Interpreting Results: The Decision Matrix

After calculation, the skill maps the statistical output to a product decision using the built-in decision table:

```markdown
## A/B Test Results: Checkout CTA Experiment

**Hypothesis**: New CTA increases checkout conversion.

| Metric | Control | Variant | Lift | P-value | Significant? |
|--------|---------|---------|------|---------|--------------|
| Conversion | 4.20% | 4.80% | +14.3% | 0.0012 | Yes |
| Revenue (guardrail) | $1.20 | $1.18 | -1.7% | – | No concern |

**Recommendation:** **SHIP** – lift is statistically and practically significant, guardrails unchanged.
```

The decision logic follows four outcomes:
- **Ship** – Statistically significant positive lift with no guardrail degradation.
- **Extend** – Directionally positive but underpowered; requires more samples.
- **Stop** – Statistically significant negative impact or failed guardrails.
- **Investigate** – Inconclusive results with anomalies requiring deeper segmentation.
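The four outcomes above can be expressed as a small rule function. This is a sketch under stated assumptions: the `recommend` helper, its parameter names, the order in which rules are checked, the default `alpha`, and the fallback to **Investigate** are all hypothetical and not taken from `SKILL.md`.

```python
def recommend(p_value, lift, adequately_powered, guardrails_ok,
              anomalies=False, alpha=0.05):
    """Map test results to Ship / Extend / Stop / Investigate.

    Hypothetical helper: rule ordering and defaults are assumptions.
    """
    significant = p_value < alpha
    if not guardrails_ok:
        return "Stop"            # failed guardrails
    if anomalies:
        return "Investigate"     # e.g. SRM or inconsistent segments
    if significant:
        return "Ship" if lift > 0 else "Stop"
    if lift > 0 and not adequately_powered:
        return "Extend"          # directionally positive but underpowered
    return "Investigate"         # inconclusive; this fallback is an assumption


print(recommend(p_value=0.0012, lift=0.143,
                adequately_powered=True, guardrails_ok=True))
```

Checking guardrails and anomalies before looking at the primary p-value mirrors the workflow's intent: a significant lift should not be shipped if the test design or secondary metrics are in doubt.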

## Summary

- **A/B test statistical analysis** in PM Skills is defined in [`pm-data-analytics/skills/ab-test-analysis/SKILL.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/skills/ab-test-analysis/SKILL.md) and invoked via `/analyze-test` from [`pm-data-analytics/commands/analyze-test.md`](https://github.com/phuryn/pm-skills/blob/main/pm-data-analytics/commands/analyze-test.md).
- The workflow enforces four phases: context capture, design validation, significance calculation, and decision recommendation.
- **Input methods** include summary statistics for quick checks or raw CSV data for full Python-based analysis using `pandas` and `scipy.stats`.
- Automatic validation checks for **sample-ratio-mismatch**, sample size adequacy, and business-cycle duration reduce the risk of false positives.
- Output follows a structured decision matrix recommending **Ship**, **Extend**, **Stop**, or **Investigate** based on statistical and business significance.

## Frequently Asked Questions

### How does PM Skills handle sample size validation?

The skill applies standard power-analysis formulas to verify that the enrolled population can detect the defined minimum detectable effect (MDE) with adequate statistical power. If the sample is underpowered, the recommendation shifts to **Extend** rather than interpreting the current p-value.
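For reference, one standard power-analysis formula for comparing two proportions can be sketched as follows. The function name `required_sample_size` and the defaults (two-sided α = 0.05, 80% power) are assumptions; the skill may use a different variant of the formula.

```python
from math import ceil

from scipy.stats import norm


def required_sample_size(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-variant sample size for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)      # rate if the MDE is achieved
    z_alpha = norm.ppf(1 - alpha / 2)       # critical value for significance
    z_beta = norm.ppf(power)                # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Baseline 4.2% conversion, aiming to detect a 14.3% relative lift
n_required = required_sample_size(baseline=0.042, relative_mde=0.143)
print(f"Required users per variant: {n_required}")
```

Under these assumptions, detecting a lift from 4.2% to about 4.8% requires more than 18,000 users per variant, so a test with roughly 5,000 users per arm would be underpowered.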

### Can I analyze continuous metrics like revenue per user, not just conversion rates?

While the provided Python template focuses on binary conversion outcomes (using z-tests for proportions), the skill architecture supports custom metrics. You would modify the generated script to use **t-tests** or **Mann-Whitney U tests** for continuous distributions, though the core command structure and decision framework remain identical.
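A minimal sketch of such a modification is shown below, using Welch's t-test and the Mann-Whitney U test from `scipy.stats`. The revenue samples are synthetic and generated only for illustration; in practice the arrays would come from a per-user revenue column, whose name and format are not specified in the source.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic, right-skewed revenue-per-user samples (illustration only)
control_revenue = rng.exponential(scale=1.20, size=2000)
variant_revenue = rng.exponential(scale=1.18, size=2000)

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = stats.ttest_ind(variant_revenue, control_revenue, equal_var=False)

# Mann-Whitney U: rank-based, less sensitive to heavy tails and outliers
u_stat, p_mwu = stats.mannwhitneyu(variant_revenue, control_revenue,
                                   alternative="two-sided")

print(f"Welch t-test p-value: {p_welch:.4f}")
print(f"Mann-Whitney U p-value: {p_mwu:.4f}")
```

The two tests answer slightly different questions: Welch's test targets a difference in means, while Mann-Whitney U tests whether one distribution tends to produce larger values than the other. For heavily skewed revenue data it can be worth reporting both.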

### What are guardrail metrics, and how do they affect the Ship decision?

**Guardrail metrics** are secondary health indicators (e.g., revenue per session, page load time, support tickets) that must not degrade significantly even if the primary metric improves. Before issuing a **Ship** recommendation, the skill requires that no guardrail shows a statistically significant degradation, meaning each guardrail p-value stays above your significance threshold (typically α = 0.05). This prevents optimizing one metric at the expense of overall user experience.

### Does PM Skills require a specific Python environment to run the statistical scripts?

The skill generates standalone Python scripts that import only `pandas` and `scipy.stats`, so they need no PM Skills runtime: a standard Python 3 interpreter with pandas and SciPy installed is enough. The analysis can therefore run locally, in CI/CD pipelines, or within containerized chatbot environments.