Enter visitors and conversions for each variant. We run a pooled two-proportion z-test (two-sided) and Wald intervals at your chosen confidence level. All math runs locally—nothing is uploaded.
Variant A (often control)
Variant B (often challenger)
Educational tool only—not medical, legal, or financial advice. For experiment design, pair with your analytics stack and governance policies.
About this tool
Running an A/B test without statistical significance is like declaring a winner after a coin flip that stopped mid-air. You might feel momentum behind one variant, but random noise, weekday effects, and uneven traffic allocation can easily produce a temporary lead that would vanish with another thousand visitors. Statistical significance is the disciplined line we draw between “this could be luck” and “this pattern is unlikely if the two experiences were truly identical.” It does not prove causality in a philosophical sense, but it quantifies how surprised we should be to see a gap this large under a simple null model where both variants convert at the same underlying rate.
Marketing teams live with partial information: budgets end, launches cannot wait forever, and stakeholders want a headline. Significance testing gives you a shared vocabulary—p-values, confidence levels, intervals—so product, growth, and finance can argue about decisions using numbers instead of vibes alone. It also protects you from shipping a worse experience because a noisy week favored the wrong branch.
This free A/B Test Significance Calculator from SynthQuery compares two proportions with a classic pooled two-proportion z-test and Wald confidence intervals at ninety, ninety-five, or ninety-nine percent confidence. You enter visitors and conversions for Variant A and Variant B. We show conversion rates, relative uplift of B versus A, a two-sided p-value, whether the gap clears your alpha, the z-score, per-variant intervals, a visual bar comparison with error bars, a simple significance meter, and a post-hoc power readout that comments on whether your sample was typically large enough to detect an effect like the one you observed. Everything executes in your browser, so you can paste experiment readouts from your analytics or testing tool without uploading raw customer data to our servers.
What this tool does
The core inference engine is a two-sided test of equal conversion rates using a pooled standard error under the null hypothesis that both variants share a common probability of success. That standard textbook approach is easy to audit and matches what many teams learned in statistics primers; it is not the only valid framework—Bayesian tests, sequential methods, and exact binomial variants exist—but it is transparent and fast for spreadsheet-grade checks.
Wald confidence intervals accompany each variant’s observed rate. They center on the sample proportion and extend by a z-multiplier times the square root of p times one minus p divided by n. For very small samples or rates near zero or one hundred percent, Wilson or Clopper-Pearson intervals behave more conservatively; we surface that caveat in the chart caption so analysts know when to escalate to richer tooling.
The relative uplift percentage expresses how much higher or lower Variant B’s rate is compared to Variant A in proportional terms: (pB − pA) / pA × 100, when A’s rate is positive. If A converted at zero percent, uplift is undefined in the interface; that edge case is rare in live tests but appears in synthetic homework examples.
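That uplift rule is simple enough to sketch in a few lines of Python. This is an illustrative helper with hypothetical names, not the calculator's actual source:

```python
def relative_uplift(rate_a: float, rate_b: float):
    """Percent change of B's rate relative to A's; undefined when A is 0."""
    if rate_a == 0:
        return None  # mirrors the interface's undefined edge case
    return (rate_b - rate_a) / rate_a * 100

# A move from a 2.0% rate to a 2.5% rate is a 25% relative uplift.
uplift = relative_uplift(0.02, 0.025)
```

Note the asymmetry: the same absolute gap reads as a much larger uplift on a small baseline, which is why the page always pairs uplift with the underlying rates.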
The significance meter is a visual companion to the p-value, scaled so that smaller p-values relative to your alpha read as stronger evidence against the “no difference” assumption. It is intentionally simple—your written conclusion should still cite the numeric p-value and the business impact, not the bar alone.
Post-hoc power answers a retrospective question: given the sample sizes you actually ran and the rates you actually saw, how often would a test like this detect a difference of that magnitude if it were real? Low post-hoc power suggests your study was under-fueled for effects of that size, which helps explain borderline outcomes. It is not a license to keep peeking until significance appears; governance and pre-registration still matter. The bar chart plots each variant’s observed conversion rate in percent with asymmetric error-bar offsets derived from the Wald interval endpoints so stakeholders can see overlap at a glance.
Technical details
Let xA and nA denote conversions and visitors for Variant A, with p̂A = xA / nA, and similarly for B. Under the null hypothesis of equal population proportions, we form a pooled estimate p̂pool = (xA + xB) / (nA + nB). The standard error is SE = sqrt(p̂pool × (1 − p̂pool) × (1/nA + 1/nB)). The z-statistic is z = (p̂B − p̂A) / SE. The two-sided p-value is twice the upper-tail probability of the standard normal beyond |z|, assuming the normal approximation is adequate.
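The whole computation fits in a short Python sketch using the error function for the normal CDF. Names here are illustrative, not the page's source:

```python
from math import sqrt, erf

def pooled_z_test(x_a: int, n_a: int, x_b: int, n_b: int):
    """Two-sided pooled two-proportion z-test; returns (z, p_value)."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)  # pooled estimate under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via erf; p-value is twice the upper tail.
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
    return z, 2 * (1 - phi(abs(z)))

# Synthetic example: 2.0% vs 2.5% conversion on 10,000 visitors each.
z, p = pooled_z_test(200, 10_000, 250, 10_000)  # z ≈ 2.38, p ≈ 0.017
```

At a ninety-five percent confidence level this synthetic example clears alpha = 0.05, matching what the calculator would report.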
The Wald interval for each variant is p̂ ± z(α/2) × sqrt(p̂ × (1 − p̂) / n), clipped to the unit interval. A ninety-five percent confidence level corresponds to α = 0.05 and the familiar 1.96 multiplier for large samples.
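A minimal sketch of that interval, including the clipping to [0, 1] described above (illustrative helper, not the calculator's source):

```python
from math import sqrt

def wald_interval(x: int, n: int, z_crit: float = 1.96):
    """Wald CI for a single proportion, clipped to [0, 1]; 1.96 ≈ 95%."""
    p = x / n
    half = z_crit * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

# 250 conversions out of 10,000 visitors.
lo, hi = wald_interval(250, 10_000)  # roughly (0.0219, 0.0281)
```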
Post-hoc power uses a normal approximation with the alternative standard error based on separate p-hat values rather than the pooled null, comparing the rejection region implied by the pooled critical ratio to the shift under the observed rates. This is a teaching-grade estimate; production experimentation platforms often add continuity corrections, hierarchical shrinkage, or sequential designs.
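One common way to implement that normal-approximation power estimate is sketched below: the pooled standard error sets the rejection threshold and the unpooled standard error models the shift under the observed rates. This mirrors the description above but is not guaranteed to match the page's exact source:

```python
from math import sqrt, erf

def posthoc_power(x_a, n_a, x_b, n_b, z_crit=1.96):
    """Approximate two-sided post-hoc power under the observed rates.

    Pooled SE for the critical threshold, unpooled SE for the shift;
    the negligible far tail of the two-sided test is omitted.
    """
    p_a, p_b = x_a / n_a, x_b / n_b
    p_pool = (x_a + x_b) / (n_a + n_b)
    se_null = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    se_alt = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    phi = lambda t: 0.5 * (1 + erf(t / sqrt(2)))
    return phi((abs(p_b - p_a) - z_crit * se_null) / se_alt)

power = posthoc_power(200, 10_000, 250, 10_000)  # ≈ 0.66 for this example
```

A power of roughly two-thirds for the example above is the kind of readout that tells a team their test was only moderately fueled for an effect of that size.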
Independence assumptions matter: repeated visits from the same user can violate independence if your denominator is sessions but your randomization is user-based. Always mirror the unit of assignment in your analytics export when possible.
Use cases
Landing page experiments are the classic use case. Growth engineers ship a new hero, pricing copy, or social proof block, split traffic fifty-fifty, and need a read on whether the challenger improved signups. This calculator turns the final visitor and conversion tallies into a p-value and interval summary you can attach to a launch review without exporting everything to R on day one.
Email subject-line and preheader tests benefit from the same structure when the outcome is binary—clicked, ordered, or upgraded. Creative teams iterate quickly; having a neutral calculator reduces arguments about whether a two-point lift is signal or noise, especially when list segments differ subtly between sends.
Pricing page and plan-layout tests often pair with revenue guardrails. Statistical significance on a micro-conversion such as “expand pricing table” does not replace checks on average revenue per user or churn at thirty days, but it tells you whether the immediate response pattern diverged beyond chance.
Call-to-action optimization—button color micro-tests aside—frequently targets clarity of label, placement above the fold, and mobile thumb reach. When traffic is mobile-heavy, sample sizes balloon; use the post-hoc power line to explain to leadership why you need another week of collection even though the point estimate looks exciting.
Paid media teams that randomize at the ad set or landing-page level can paste totals after a flight ends, then connect implications to the CPA Calculator and ROI Calculator when finance asks what the lift means in dollars. Product-led SaaS teams testing onboarding modals and B2B teams testing gated content offers both need a lingua franca; this page supplies the statistical sentence that sits next to qualitative user quotes in the retrospective.
How SynthQuery compares
Enterprise experimentation platforms bundle statistics with assignment, targeting, and governance. Comparing approaches helps you decide when a lightweight browser calculator is enough.
SynthQuery calculator vs Optimizely-style reporting
SynthQuery: Paste final counts from any source; see z-test p-value, Wald intervals, uplift, and a power comment instantly in the browser.
Typical alternatives: Full-stack tools compute statistics inside the product, enforce sample ratio mismatch checks, and tie to feature flags—heavier, but integrated.

SynthQuery vs VWO built-in stats engines
SynthQuery: Transparent formulas you can reproduce in a spreadsheet; no account or snippet required.
Typical alternatives: VWO and peers offer Bayesian and frequentist modes with guardrails for peeking; use those when the experiment runs inside their stack.

Pooled z-test vs Bayesian probability of best
SynthQuery: Frequentist p-value answers how unusual the data would be if rates were equal; it is not the probability that B is better.
Typical alternatives: Bayesian tools report credible intervals and chance-to-win; interpret definitions carefully when mixing frameworks in one deck.

Client-side privacy
SynthQuery: Inputs never leave your device unless you copy them yourself.
Typical alternatives: Hosted platforms log assignments and events on their servers by design—often desirable for audit trails.
How to use this tool effectively
Start by aligning definitions with your experiment platform. “Visitors” might mean unique users, sessions, or randomization units depending on whether you use client-side bucketing, server-side assignment, or email splits. Conversions should match the primary metric your team pre-registered—purchase, signup, qualified lead, or click on a module. Mixing definitions between variants or changing the goal mid-flight invalidates simple calculators; this page assumes you already exported consistent counts for A and B over the same time window and with the same attribution rules.
For a landing page test, open your A/B report and copy total exposures and goal completions for control versus challenger. Type control under Variant A and challenger under Variant B if that mirrors your internal language; the math is symmetric, but uplift is always expressed as B relative to A, so swap rows if you want A to be the new experience. Choose a confidence level: ninety-five percent is the common default in marketing; ninety-nine percent is stricter and reduces false positives at the cost of more false negatives; ninety percent is sometimes used for directional early reads—document the choice in your experiment log.
For an email subject-line test, use sends or delivered messages as the visitor denominator and unique clicks or orders as conversions, depending on whether your hypothesis is about engagement or revenue. If your ESP deduplicates opens unreliably, prefer clicks or downstream purchases for cleaner binary outcomes. Enter counts, click Calculate, and read the p-value alongside the confidence intervals. If intervals overlap heavily and the p-value is high, do not rewrite the playbook yet; if B clears significance with meaningful uplift and acceptable guardrails (refunds, spam complaints, support load), you have a stronger case to roll out.
After calculation, use Copy results to paste a summary into Notion, Jira, or a slide appendix. Use Reset when switching between unrelated tests so you do not mix traffic from different campaigns. Pair this page with the Conversion Rate Calculator when you need to reconcile rates from partial exports, the CTR Calculator when the top of the funnel is impressions-to-clicks, and the PPC Budget Calculator when paid traffic supplied the audience for the experiment.
Limitations and best practices
This calculator assumes a simple two-proportion z-test with independent observations and large enough samples for the normal approximation. It does not replace sequential stopping rules, multiple-comparison correction across dozens of variants, or stratified variance when traffic mix shifts mid-test. When rates are extreme and counts are small, consider exact tests or Wilson intervals.
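When that small-sample caveat applies, a Wilson score interval is easy to compute by hand. The sketch below is illustrative and not part of this calculator:

```python
from math import sqrt

def wilson_interval(x: int, n: int, z: float = 1.96):
    """Wilson score interval; stays inside (0, 1) without clipping."""
    p = x / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - half, center + half

# 2 conversions in 50 visitors: the Wald lower bound would clip at 0,
# while Wilson keeps a strictly positive lower bound.
lo, hi = wilson_interval(2, 50)  # roughly (0.011, 0.135)
```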
Peeking inflates false positives: if you calculate, then wait, then calculate again whenever you are curious, you are not operating at the nominal alpha you selected. Use fixed horizons or platform-native sequential methods when decisions are made continuously.
The significance meter is illustrative only. Always report numeric p-values, sample sizes, and the business cost of false wins and false losses alongside statistical output.
For related planning math, use the Conversion Rate Calculator, CTR Calculator, CPA Calculator, ROI Calculator, and PPC Budget Calculator; browse the Free tools hub for sample-size utilities and the full marketing set as new calculators roll out.
Frequently asked questions
What does “statistically significant” actually mean?
Statistical significance is a threshold decision rule applied after you collect data. You posit a null hypothesis—often “Variant A and Variant B have the same underlying conversion rate”—and ask how extreme your observed split would be if that null were true. If the outcome is sufficiently unlikely (as measured by a p-value compared to a chosen alpha such as 0.05), you reject the null and call the result statistically significant. Significance does not mean the lift is large in business terms, morally important, or guaranteed to repeat next quarter; it means the gap is hard to explain as mere sampling noise under the model you assumed.
What exactly is a p-value?
The p-value is the probability of seeing a difference at least as extreme as the one in your data, in the direction tested, if the null hypothesis were true and your modeling assumptions held. A small p-value means such an extreme split would be rare under equal underlying rates. A large p-value means splits like yours happen often even when variants are identical, so you lack strong evidence of a difference. The p-value is not the probability that the null is true, nor the probability that B is better—that is a common misread. It is a measure of compatibility between your data and a simple null story.
When should I stop an A/B test?
Ideally, you stop at a pre-planned sample size or date chosen before the test, powered for a minimum detectable effect that matters financially. Early stopping because one variant looks ahead invites inflated false positives unless you use sequential methods designed for repeated looks. If external events break the test—site outage, promo code leak, tracking bug—pause, fix instrumentation, and restart with clean assignment rather than torturing the numbers. When business deadlines force a decision on inconclusive data, document that the result is directional and consider follow-up validation rather than claiming ninety-five percent confidence you did not earn.
Which confidence level should I choose?
Ninety-five percent corresponds to alpha = 0.05 and is the de facto marketing standard: fewer false positives than a loose ninety percent rule, less punishing than ninety-nine percent on sample size. Ninety-nine percent tightens the bar—useful when shipping a change is expensive to reverse, such as pricing infrastructure or compliance-sensitive flows. Ninety percent is more lenient; some teams use it for low-risk creative tests but must accept more noise wins. Whatever you pick, set it before peeking and align stakeholders so post-hoc changes do not erode trust.
What is relative uplift?
Relative uplift compares Variant B to Variant A as a percentage change in conversion rate: (rate B − rate A) / rate A × 100, when A’s rate is greater than zero. A fifty percent relative uplift from two percent to three percent is very different in absolute terms than fifty percent from twenty percent to thirty percent; always pair uplift with baseline rate and absolute incremental conversions when discussing revenue impact.
Why might the confidence intervals and the p-value seem to disagree?
Separate Wald intervals for each proportion are not identical to a formal test of equality; overlap rules are heuristics that can mislead, especially when sample sizes differ. The pooled z-test directly evaluates the difference between proportions. You may occasionally see surprising combinations near boundaries—when in doubt, trust the p-value from the intended difference test and visualize both rates with their intervals as context, not as a substitute for the joint test.
What does post-hoc power tell me?
Post-hoc power looks backward at the sample you actually ran and estimates how likely this design would have been to detect a difference similar to what you observed if that difference reflected a true underlying gap. Very low post-hoc power suggests your experiment was often too small to reliably detect effects of that magnitude, which helps interpret “not significant” outcomes—you may have missed a real but modest lift. High post-hoc power means a true gap like the one you saw would usually register as significant; if you still missed, the lift may be illusory or assumptions may be violated.
Can I test more than two variants or a multi-step funnel?
This page handles exactly two proportions. Multi-variant tests need multiple-comparison adjustments or hierarchical models. Multi-step funnels should define whether the metric is end-to-end conversion, stepwise progression, or revenue per visitor; each choice changes denominators and independence. Use specialized platforms or analysts when the design goes beyond a clean A versus B binary outcome.
How should I plan sample size before the test?
Sample size should ideally be chosen before data collection so you have enough power to detect a minimum effect that matters to revenue. This page reports post-hoc power after the fact; forward planning belongs in a dedicated sample-size workflow with baseline rate, minimum detectable effect, alpha, and target power as inputs. Check the Free tools hub for SynthQuery’s Sample Size Calculator when available, or use your experimentation platform’s planner, then validate outcomes here after the run. Enterprise stacks add peeking and stratification rules you should mirror when production stakes are high.
Is my data uploaded anywhere?
No. Like other SynthQuery free utilities, the arithmetic runs entirely in your browser. Counts you type remain on your device unless you copy or screenshot them elsewhere. Follow your company’s data policies when pasting customer metrics into any tool, even local ones.