HubSpot will happily declare an email A/B winner with 200 sends and a 1-point open-rate gap. That’s noise, not signal. Most teams promote the “winner,” then watch performance regress. Here’s how to test honestly.
Minimum sample size
For a meaningful open-rate test (assuming a 25% baseline and detecting a 3-point lift at 95% confidence and 80% power), you need roughly 3,400 contacts per variant. For click rate at a 4% baseline detecting a 1-point lift, you need around 6,700 per variant. Smaller list = no valid test.
If your test list is under 7,000 total, don’t A/B test. Run two single sends in different weeks, on the same weekday, and look at directional patterns.
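To check the minimums for your own baseline and target lift, a minimal sketch of the standard normal-approximation formula for comparing two proportions is below. The function name is hypothetical, and the 95% confidence and 80% power defaults are assumptions; looser or stricter settings change the answer substantially.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline, lift, alpha=0.05, power=0.80):
    """Contacts needed per variant to detect an absolute lift in a rate.

    Normal-approximation formula for a two-sided test of two proportions.
    alpha (false-positive rate) and power are assumed defaults.
    """
    p1 = baseline
    p2 = baseline + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)


open_rate_n = sample_size_per_variant(baseline=0.25, lift=0.03)
click_rate_n = sample_size_per_variant(baseline=0.04, lift=0.01)
print(f"Open-rate test:  {open_rate_n} contacts per variant")
print(f"Click-rate test: {click_rate_n} contacts per variant")
```

Note how the required sample grows with the inverse square of the lift: halving the lift you want to detect roughly quadruples the list you need.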
Test one variable, not three
Subject line OR send time OR preview text. Never all three. If you change three variables and one variant wins, you don’t know which change caused the lift. HubSpot’s UI will let you; your decision-making shouldn’t.
Pick your decision metric before you launch
Open rate has been misleading since iOS 15: Apple’s Mail Privacy Protection pre-loads email content, which registers opens that no person made. Use click rate or downstream conversion as the decision metric, and document it in the test name: 2026-04-pricing-cta-test-decisionmetric-CTR.
Wait the full cycle
Most B2B email engagement happens in the first 48 hours; long-tail engagement runs to 7 days. Don’t call the test at 24 hours because one variant is “ahead.” Wait the full week.
Account for day-of-week confounds
A Tuesday send and a Thursday send can perform differently with identical creative. HubSpot’s split test sends both variants at the same time, which controls for this. If you’re running serial tests instead of split tests, randomize day of week or you’re testing days, not creative.
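For serial tests, one way to keep send day from lining up with a variant is a crossed schedule: every variant goes out on every candidate day once, with the week order shuffled. This is a sketch of that idea, not a HubSpot feature; the function name, variant labels, and day list are all hypothetical.

```python
import itertools
import random


def serial_test_schedule(variants, days, seed=None):
    """Pair every variant with every send day once, in random week order."""
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    slots = list(itertools.product(variants, days))
    rng.shuffle(slots)
    return [(week, variant, day) for week, (variant, day) in enumerate(slots, start=1)]


schedule = serial_test_schedule(["A", "B"], ["Tue", "Thu"], seed=7)
for week, variant, day in schedule:
    print(f"Week {week}: send variant {variant} on {day}")
```

Because each variant sees each day equally often, a day-of-week effect lifts or drags both variants by the same amount instead of masquerading as a creative win.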
Document the loss
A null result is a result. Maintain an experiment_log.md:
Test: 2026-04 pricing CTA color
Hypothesis: Orange CTA outperforms blue
Result: No statistically significant difference (p=0.34, n=8200/variant)
Decision: Stay with blue, retest only if creative direction changes
Most teams forget the losing tests, which means they re-run them every 18 months.
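The p-value in a log entry like the one above can come from a pooled two-proportion z-test on raw click counts. The sketch below assumes that test; the function name is hypothetical and the counts are made up for illustration, not the data behind the logged result.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(clicks_a, sends_a, clicks_b, sends_b):
    """Two-sided p-value for the difference between two click rates (pooled z-test)."""
    rate_a = clicks_a / sends_a
    rate_b = clicks_b / sends_b
    pooled = (clicks_a + clicks_b) / (sends_a + sends_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / sends_a + 1 / sends_b))
    if std_err == 0:
        return 1.0  # no variation at all, e.g. zero clicks in both variants
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts for illustration only.
p = two_proportion_p_value(clicks_a=328, sends_a=8200, clicks_b=352, sends_b=8200)
print(f"p = {p:.2f}")
```

Recording the raw counts alongside the p-value in the log lets anyone recompute the result later without going back into HubSpot.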
Don’t test what doesn’t move the needle
Subject line emoji on a transactional email isn’t worth testing. Save A/B capacity for nurture-funnel CTAs and welcome series, where compounded lift matters.
What to do this week
Audit your last 10 A/B tests. Calculate whether each had the sample size to detect the lift you claimed. Discard the false positives and retest the ones that actually mattered.
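For the audit, a rough way to judge each past test is to compute the power it had to detect the lift you claimed. The sketch below uses the normal approximation for a two-sided two-proportion test at 95% confidence; the function name, the 80% power cutoff, and the example tests are assumptions for illustration.

```python
import math
from statistics import NormalDist


def achieved_power(baseline, claimed_lift, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion test at a given sample size."""
    p1 = baseline
    p2 = baseline + claimed_lift
    std_err = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(claimed_lift) / std_err - z_alpha)


# Hypothetical past tests: (name, baseline rate, claimed lift, contacts per variant)
past_tests = [
    ("subject-line-question", 0.04, 0.010, 900),
    ("cta-above-fold", 0.04, 0.010, 7000),
]
for name, baseline, lift, n in past_tests:
    power = achieved_power(baseline, lift, n)
    verdict = "adequately powered" if power >= 0.80 else "underpowered, treat the win as unproven"
    print(f"{name}: power={power:.2f} ({verdict})")
```

A test that comes out well under 80% power could not reliably detect the lift it reported, so its “winner” belongs on the retest list.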