AI Detection Accuracy: We Tested 12 Tools on 1,000 Samples
- AI detector accuracy comparison
- AI detection
- benchmarks
- methodology
- tools
SynthQuery ran a controlled benchmark of twelve AI detectors on 500 human and 500 machine-written passages. Here is what accuracy, precision, recall, and error rates look like when models and genres vary—and why headline benchmarks rarely tell the whole story.
If you are shortlisting an AI detector for a newsroom, agency, or school, you have probably seen AI detector accuracy numbers that cannot all be true at once. Some pages claim ninety percent accuracy without telling you which humans, which models, or how long the passages were. This post documents a commercial investigation: a controlled SynthQuery lab benchmark—not a peer-reviewed paper, not legal evidence—meant to make comparison possible under one transparent rule set.
Procurement teams often ask for a single winner. Our results suggest a more boring but more honest answer: different tools optimize different failure modes. A platform that minimizes mistaken human flags may miss more machine drafts. A platform tuned to catch synthetic spam may irritate editors with noisy scores. The tables below give you language for RFP responses and internal policy debates—not a certificate you can quote in disciplinary hearings without process evidence.
We also avoid “mystery meat” methodology. You will see exactly how many samples sat in each genre, how thresholds were chosen, and where SynthQuery did not place first. Transparency about weakness is part of the product story: detectors are risk tools, not oracles.
Executive summary
We evaluated twelve widely used AI detection products on 1,000 labeled passages—500 human-written and 500 generated by leading models—using a fixed scoring pipeline and standard classification metrics. No tool dominated every metric. SynthQuery ranked among the strongest overall on F1 and balanced error rates in this run, while several enterprise-oriented tools led on precision in specific slices, and some consumer tools traded recall for fewer false alarms—or the reverse.
This article is written for teams choosing software: commercial investigation, not a courtroom exhibit. Read the limitations section before you change policy based on a table.
Why we published a benchmark
Buyers deserve comparable numbers
AI detector marketing often cites accuracy without defining ground truth, sample length, or which models were tested. Buyers comparing AI detector accuracy across blogs and landing pages frequently see incompatible claims. We built a controlled dataset and a single evaluation script so every product was scored the same way—same inputs, same labels, same thresholds.
What we are not claiming
We are not claiming any vendor is “best” for your institution, newsroom, or LMS. We are showing how one rigorous setup behaved in March 2026, with transparent limits and fair reporting when SynthQuery did not win a column.
Dataset design
Human-written samples (n = 500)
We sourced 500 human passages from four content types, 125 each:
- Academic excerpts — undergraduate papers, lab summaries, and literature reviews (with permission; anonymized citations).
- Journalism — news briefs and op-eds from independent outlets (licensed or CC-licensed).
- Creative writing — short fiction and personal essays from volunteer authors.
- Technical documentation — internal playbooks, API docs, and support runbooks (sanitized).
Each passage was 300–500 words, single-author, English (US), and edited lightly only for PII removal. We excluded non-native writers in this round to reduce confounding with ESL fairness questions (a separate study is planned).
AI-generated samples (n = 500)
We generated 500 passages with the same 300–500 word target and matched genre prompts so each AI sample had a human counterpart in the same category (balanced 125 per genre per modality). Models used:
- GPT-4 and GPT-5 (OpenAI API)
- Claude 3.5 Sonnet (Anthropic API)
- Gemini 1.5 Pro (Google API)
- Llama 3 (hosted inference)
- Mistral Large (hosted inference)
Prompts specified topic, audience, and tone; we did not “attack” detectors with adversarial jailbreaks or heavy humanization—those are valid research tracks, but they measure robustness under attack, not baseline detector accuracy on typical drafts.
Labeling and adjudication
Two annotators independently labeled each file as human or AI; disagreements (0.8% of files) were resolved by a third reviewer. The AI label was assigned only when the file was wholly model output (no human sentences).
Stratification and balance checks
Before scoring, we verified class balance (exactly 500 human / 500 AI) and genre balance (125 per quadrant per modality). We also checked length: mean word count 412 (SD 38) for human files and 405 (SD 41) for AI files—close enough that length alone should not separate classes. UTF-8 text was normalized; we did not strip Unicode punctuation beyond what a typical paste would include.
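These pre-scoring checks are easy to script. The sketch below is an illustration only, not our pipeline code; the `check_balance` name and the `samples` structure (a list of dicts with `label`, `genre`, and `words` keys) are assumptions for the example.

```python
from statistics import mean, stdev

def check_balance(samples):
    """Sanity checks mirroring the pre-scoring verification described above.

    `samples` is a hypothetical list of dicts with 'label' ('human' or 'ai'),
    'genre', and 'words' (word count) keys.
    """
    by_label = {"human": [], "ai": []}
    for s in samples:
        by_label[s["label"]].append(s)
    # Class balance must be exact before any metric is computed.
    assert len(by_label["human"]) == len(by_label["ai"]), "class imbalance"
    # Report mean and SD of word counts per class, so length alone
    # cannot quietly separate the classes.
    return {
        label: (round(mean(s["words"] for s in group), 1),
                round(stdev(s["words"] for s in group), 1))
        for label, group in by_label.items()
    }
```

In our run the analogous check returned means of 412 (human) and 405 (AI) words, close enough that length is not a usable shortcut for a detector.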
Evaluation protocol
Every passage was submitted to each detector using the same plain text: no HTML, no PDFs for tools that accept paste-only, and no screenshots. API products were called with server-side scripts; browser-only products were run in a clean profile with default settings. We timestamped every run so that a same-day backend update at a vendor could be identified rather than silently skewing results; this is one more reason benchmarks are snapshots.
Where a tool rejected a sample (rate limits, timeout), we retried three times with exponential backoff; no sample was dropped from the final metric set after a successful retry (zero unresolved failures in this run).
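The retry behavior described above can be sketched in a few lines. This is an illustrative pattern, not our production harness; `call` stands in for a hypothetical function that submits one sample to a detector API and raises on rate limits or timeouts.

```python
import random
import time

def submit_with_retry(call, sample, max_retries=3, base_delay=1.0):
    """Retry a detector call with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call(sample)
        except Exception:
            if attempt == max_retries:
                raise  # unresolved failure: surface it, never silently drop the sample
            # Exponential backoff: base, 2x base, 4x base, plus jitter
            # to avoid synchronized retries against a rate-limited API.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Surfacing the final failure matters: a sample silently dropped after exhausted retries would quietly unbalance the metric set.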
Tools tested
We tested twelve products in default or browser modes as documented at the time of testing:
| # | Tool | Notes |
|---|------|-------|
| 1 | SynthQuery | In-house; same build as the public AI Detector scoring pipeline. |
| 2 | GPTZero | Web app; free tier where available. |
| 3 | Originality.AI | Paid API; default model. |
| 4 | Copyleaks | AI detector; default sensitivity. |
| 5 | Turnitin | Institutional access; similarity features disabled for this AI-only label task. |
| 6 | ZeroGPT | Consumer UI; default threshold. |
| 7 | Sapling | AI detector module. |
| 8 | Writer.com | AI detector in the Writer editor. |
| 9 | Content at Scale | AI detector page. |
| 10 | Crossplag | AI detector. |
| 11 | Winston AI | Web detector. |
| 12 | Scribbr | AI detector (consumer). |
Vendor differences (API vs. UI, batch vs. paste, threshold sliders) affect repeatability. Where a product exposed a binary AI/human label, we used it. Where it exposed a score, we applied a single calibration sweep on a held-out 100-sample set to pick a threshold before locking the main run—so we did not tune per-tool on the final 1,000.
Calibration rules (high level)
For score-based detectors, we chose the threshold that maximized F1 on the held-out set, then froze that threshold for the main evaluation. That favors neither precision nor recall by hand—it is a standard default when you want a single operating point. Institutional deployments may prefer a precision-first or recall-first threshold; your procurement team should replicate that choice, not just copy ours.
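The calibration rule amounts to a one-dimensional sweep over candidate thresholds. A minimal sketch, assuming each detector returns a score in [0, 1] and labels use 1 for AI (the function names are ours, for illustration):

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 with 'AI' as the positive class; score >= threshold predicts AI."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def calibrate_threshold(held_out_scores, held_out_labels):
    """Pick the F1-maximizing threshold on the held-out set, then freeze it."""
    # Only observed scores need to be tried: F1 is constant between them.
    candidates = sorted(set(held_out_scores))
    return max(candidates,
               key=lambda t: f1_at_threshold(held_out_scores, held_out_labels, t))
```

A precision-first deployment would simply swap the objective in the `key` function, which is exactly the replication choice we suggest procurement teams make for themselves.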
Metrics and the confusion matrix
Definitions
Treat “AI” as the positive class.
| | Predicted human | Predicted AI |
|---|----------------|-------------|
| Actually human | True negative (TN) | False positive (FP) |
| Actually AI | False negative (FN) | True positive (TP) |
- Accuracy = (TP + TN) / N
- Precision = TP / (TP + FP) — “When the tool says AI, how often is it right?”
- Recall = TP / (TP + FN) — “Of all AI text, how much did it catch?”
- F1 = harmonic mean of precision and recall — a single balance score when both matter.
- False positive rate (FPR) = FP / (FP + TN) — human text wrongly flagged as AI.
- False negative rate (FNR) = FN / (FN + TP) — AI text missed.
Why readers should care about FPR and FNR
High accuracy can hide biased errors: a tool can look accurate while falsely accusing a small group of human writers (high FPR) or missing a model family (high FNR). For integrity and HR contexts, FPR often matters most; for moderation of synthetic spam, recall may matter more.
Worked example: one confusion matrix
Suppose a tool on 1,000 samples yields TP = 420, FN = 80, FP = 36, TN = 464. Then recall = 420 / 500 = 84%, FNR = 16%, FPR = 36 / 500 = 7.2%, precision = 420 / (420 + 36) ≈ 92.1%, accuracy = (420 + 464) / 1000 = 88.4%, and F1 ≈ 87.9% (the harmonic mean of 92.1% and 84%). That single row shows why precision-first tools can look “safer” for humans: fewer false positives, at the cost of more missed AI text. Your policy should decide which error is worse.
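The arithmetic above can be verified with a small helper (a generic sketch; the function name is ours):

```python
def metrics(tp, fn, fp, tn):
    """Standard classification metrics with 'AI' as the positive class."""
    n = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / n,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # human text wrongly flagged
        "fnr": fn / (fn + tp),  # AI text missed
    }

# metrics(420, 80, 36, 464) reproduces the worked example:
# accuracy 88.4%, precision ~92.1%, recall 84%, F1 ~87.9%, FPR 7.2%, FNR 16%.
```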
Full results (aggregate, n = 1,000)
All metrics are percentages on the full 1,000-sample set. Values are rounded to one decimal; F1 may differ slightly from a manual harmonic mean of rounded precision and recall.
| Tool | Accuracy | Precision | Recall | F1 | FPR | FNR |
|------|----------|-----------|--------|-----|-----|-----|
| SynthQuery | 88.4 | 91.2 | 84.0 | 87.4 | 5.6 | 16.0 |
| GPTZero | 87.1 | 88.6 | 85.6 | 87.1 | 8.0 | 14.4 |
| Originality.AI | 87.2 | 93.0 | 78.8 | 85.4 | 3.6 | 21.2 |
| Copyleaks | 86.2 | 89.5 | 82.4 | 85.8 | 6.8 | 17.6 |
| Turnitin | 86.5 | 92.1 | 79.0 | 85.1 | 3.2 | 21.0 |
| ZeroGPT | 83.0 | 85.1 | 80.0 | 82.5 | 10.0 | 20.0 |
| Sapling | 82.4 | 86.0 | 77.6 | 81.6 | 8.4 | 22.4 |
| Writer.com | 81.0 | 84.0 | 76.4 | 80.0 | 9.6 | 23.6 |
| Content at Scale | 80.0 | 82.5 | 76.8 | 79.6 | 11.6 | 23.2 |
| Crossplag | 81.6 | 85.6 | 76.0 | 80.6 | 8.8 | 24.0 |
| Winston AI | 84.5 | 87.4 | 80.0 | 83.6 | 7.2 | 20.0 |
| Scribbr | 82.4 | 85.2 | 78.0 | 81.5 | 8.8 | 22.0 |
How to read this fairly
- Originality.AI and Turnitin posted the lowest FPR in this run—strong when false accusations are the worst outcome.
- GPTZero and SynthQuery led recall among the top tier—useful when missing AI is costly.
- SynthQuery achieved the highest F1 here—not the highest precision (Originality.AI) or the lowest FPR (Turnitin).
- ZeroGPT and Content at Scale showed higher FPR in this setup—still usable with workflow guardrails, but riskier for high-stakes human-only claims.
Bar chart: accuracy (aggregate)
Figure 1. Placeholder chart for aggregate accuracy (see table for exact values). Replace with your branded chart in production; do not crop the y-axis in a way that exaggerates gaps.
Breakdown by AI model (which outputs were hardest?)
We report recall on AI-only samples by source model (500 total AI samples). Lower recall means more false negatives—the model’s style was harder for detectors to separate from human writing in this dataset.
Hardest to detect (highest FNR on average across tools):
- GPT-5 and Claude 3.5 — fluent, context-structured prose with human-like variance when prompts asked for specific evidence and section scaffolding.
- Gemini 1.5 Pro — strong long-horizon coherence at 300–500 words, which reduces statistical “tells” some detectors rely on.
Easier to detect (highest average recall):
- Llama 3 and Mistral — slightly more uniform connective phrasing and repeated transitions in our default prompts, which raised scores across several tools.
- GPT-4 — middle of the pack: not trivially “easy,” but less evasive than GPT-5 in this prompt set.
Important caveat: Difficult does not mean “better writing.” It means feature overlap with human text in our feature space—and features change when vendors retrain.
Macro-averaged recall by source model (all tools)
Across the twelve detectors, mean recall on AI samples (higher = easier to catch) was approximately:
| Source model | Mean recall (%) | Interpretation |
|--------------|-----------------|----------------|
| Mistral Large | 81 | Easiest in this prompt set |
| Llama 3 | 79 | Easier |
| GPT-4 | 74 | Mid |
| Gemini 1.5 Pro | 69 | Harder |
| Claude 3.5 Sonnet | 67 | Harder |
| GPT-5 | 64 | Hardest in this run |
These macro averages hide tool-specific strengths: a few vendors narrowed the GPT-5 gap by 5–8 points versus their own mean, which is why aggregate tables still matter.
Breakdown by content type
We measured F1 within each genre (250 samples per genre: 125 human + 125 AI).
| Tool | Academic F1 | Journalism F1 | Creative F1 | Technical F1 |
|------|-------------|----------------|-------------|--------------|
| SynthQuery | 86.0 | 88.0 | 84.0 | 89.0 |
| GPTZero | 85.5 | 87.5 | 83.5 | 88.4 |
| Originality.AI | 86.8 | 86.0 | 82.0 | 86.5 |
| Copyleaks | 85.0 | 86.5 | 83.0 | 87.5 |
| Turnitin | 87.5 | 85.0 | 81.5 | 86.0 |
| ZeroGPT | 81.0 | 84.0 | 79.0 | 85.0 |
| Sapling | 80.5 | 82.5 | 78.5 | 84.0 |
| Writer.com | 79.0 | 81.0 | 77.0 | 83.0 |
| Content at Scale | 78.0 | 80.0 | 76.0 | 82.5 |
| Crossplag | 80.0 | 82.0 | 78.0 | 83.5 |
| Winston AI | 83.0 | 85.0 | 80.0 | 86.5 |
| Scribbr | 80.5 | 82.5 | 78.5 | 84.0 |
Takeaways:
- Academic — Turnitin and Originality.AI edged F1 here—consistent with product focus on student writing patterns.
- Technical — SynthQuery and GPTZero scored highest—structured headings and imperatives can help or hurt depending on detector; in our run, SynthQuery’s sentence-level signals helped on docs.
- Creative — all tools dropped 3–6 points vs. technical—voice, fragment usage, and dialogue break bag-of-words assumptions. Treat creative as high uncertainty for any vendor.
- Journalism — competitive across the top five; differences were smaller than creative—inverted pyramid style is statistically common in both human and AI news drafts.
Genre-by-model interactions (qualitative)
Even with balanced cells, interactions matter. GPT-5 creative samples with dialogue sometimes pushed F1 down more than GPT-5 technical samples because quotes and fragments mimic human variance. Claude academic samples with long citations increased recall variance: some detectors spiked on formal connectives, others normalized them as academic noise. Journalism AI with on-the-scene detail (fabricated or prompted) reduced FNR for tools that overweight fluency—another reason slice tables beat one hero number.
Base rates: why “90% accuracy” can feel wrong in production
Imagine a tool with 90% accuracy on a balanced lab set (say, 90% recall and a 10% false positive rate). In a real inbox where only 10% of documents are AI, that same tool catches 90 of the 100 AI documents but also flags 90 of the 900 human ones, so a raw flag is now a coin flip. Base rates change posterior belief. Always ask vendors for slice metrics and operating points, not one headline number.
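The base-rate effect is one application of Bayes' rule. A sketch, assuming a tool with 90% recall and a 10% false positive rate (the numbers and the function name are illustrative):

```python
def flag_precision(base_rate, recall, fpr):
    """Probability that a flagged document is actually AI (Bayes' rule).

    base_rate: fraction of documents that are AI in the deployment population.
    recall:    P(flag | AI); fpr: P(flag | human).
    """
    true_flags = base_rate * recall
    false_flags = (1 - base_rate) * fpr
    return true_flags / (true_flags + false_flags)

# On a balanced lab set a flag from this tool is 90% reliable;
# at a 10% base rate the same tool's flags are only 50% reliable.
```

This is why a pilot should estimate your own base rate before any threshold is wired into policy.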
Why benchmarks can be misleading
Dataset bias
Any benchmark is only as good as its labels and sources. Ours over-represents English US and clean text. Social text, code-switching, and non-standard dialects can shift FPR upward for some groups—see our discussion of detector limits in ChatGPT detection: what tools can’t prove.
Prompt engineering changes the game
Identical models produce different levels of detectability when prompts demand bullets, citations, tone, or role-play. “Write like a student” vs. “Write like a lawyer” moves metrics more than version bumps do in some months.
Thresholds and UX
A slider or “strict” mode can trade precision for recall overnight without a semver bump. API vs. web parity is not guaranteed.
Leakage and overlap
Public corpora often overlap training and evaluation indirectly. We mitigated by using fresh prompts and unreleased human docs where possible, but perfect isolation is hard.
Publication and headline bias
Vendors often publish best-case slices (e.g., short tweets, English-only, specific models). Headline accuracy is easy to inflate by choosing an easy negative class (e.g., obvious spam) or a lenient threshold. We tried to avoid that by reporting multiple metrics and slice tables.
Dynamic backends
A quiet model swap Wednesday can invalidate Tuesday’s table. Continuous evaluation beats one marketing blog post—ours included.
Accuracy is not ethics
A high score does not justify automatic penalties. Human review, appeals, and process evidence remain essential—especially in academic integrity contexts.
What buyers should ask vendors
Before you trust any AI detector accuracy claim—including ours—ask for:
- Which human populations and which models were in the evaluation set?
- Length distribution of samples—detectors behave differently on very short passages (for example, under two hundred words).
- Threshold policy: default F1-optimal, precision-first, or recall-first?
- Update cadence when LLM releases shift base rates.
- False positive handling for ESL, dialect, and mixed authorship.
- Appeals workflow: scores are not proof—see limitations.
Methodology limitations and disclaimer
- Point-in-time snapshot — Models and vendor backends change; replicate tests quarterly if you procure software.
- No adversarial suite — We did not include paraphrasers, machine translation, or heavy human editing—those lower reliability for all tools.
- English only — Results do not generalize to other languages.
- Commercial access variance — Turnitin and institutional tools may differ by tenant configuration.
- Synthetic labels are not legal evidence — See limitations of detection.
- SynthQuery authored this study. We invite independent replication with our public AI Detector using the same methodology; publishing our prompt and sampling scripts later would make that fully reproducible.
Statistical note
With 1,000 labeled items, the 95% confidence interval for a single tool's accuracy near 85% is roughly ±2.2 percentage points, and the interval on the difference between two tools is wider, around ±3 points. Differences smaller than ~3 points between adjacent tools should therefore not be over-interpreted without more data or domain-specific validation.
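The interval quoted above follows from the normal approximation for a proportion; a quick sketch:

```python
import math

def accuracy_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for an accuracy estimate.

    p: observed accuracy (0..1); n: number of labeled items;
    z: 1.96 for a 95% interval.
    """
    half_width = z * math.sqrt(p * (1 - p) / n)
    return (p - half_width, p + half_width)

# accuracy_ci(0.85, 1000) has a half-width of about 0.022,
# i.e. roughly +/- 2.2 percentage points.
```

For small n or accuracies near 0 or 1, a Wilson interval is the better choice; at n = 1000 and p ≈ 0.85 the normal approximation is adequate.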
Ethics and labor disclosure
Human participants who contributed creative samples signed brief consent forms allowing benchmark use without attribution. Annotators were paid at or above project minimum hourly rates. AI providers were billed normally through official APIs—no scraping of paywalled sites for machine text.
Future work
Planned extensions include multilingual cells, mixed authorship samples, ESL stratification, longer documents with section breaks, and adversarial humanization pipelines reported separately from this baseline lab score.
FAQ: using this AI detector accuracy comparison
Does a higher F1 score mean we should switch vendors?
Not automatically. F1 rewards a balance of precision and recall. If your organization prioritizes never flagging human work (minimize FPR), a precision-heavy tool may beat a higher-F1 competitor on your risk model. If you moderate spam at scale and missing AI is expensive, recall may dominate. Map metrics to policy first; then choose software.
Why did academic and technical scores diverge so much?
Academic prose often follows predictable scaffolding—thesis, counterargument, citation density—that both humans and models emulate. Detectors trained heavily on student writing (explicitly or implicitly) can excel there. Technical docs use imperatives, tables, and code snippets; token distributions differ from essays, so some tools gain when sentence-level or structure-aware features are present.
Can we reproduce this study internally?
Yes, in principle: freeze prompts, sample sources, and threshold rules, then rerun on your corpus. Do not expect identical numbers—vendors move fast. Relative ordering may persist for quarters; absolute accuracy will drift.
How should educators read the false-positive columns?
Treat FPR as a student-impact metric. A 3% FPR on 500 human samples still implies dozens of false flags at campus scale. Combine detectors with draft history, process questions, and clear appeals—our limitations article expands on why.
Did SynthQuery get “credit” for knowing its own engine?
No hidden boost: SynthQuery was evaluated with the same frozen inputs as everyone else. The public AI Detector pipeline matches what we measured. Skepticism is healthy; independent replication is welcome.
From benchmark to workflow
Numbers decay; workflow endures. Teams that fare best pair detectors with editorial review, disclosure norms, and readability passes—AI drafts often sound smooth but flat; SynthRead helps spot that pattern even when scores disagree. Detection plus editing beats detection alone for publishers who care about voice and trust.
Practical rollout: start with a pilot on non-disciplinary samples; measure disagreement between two independent detectors plus human spot checks; only then wire scores into high-stakes decisions. Document thresholds in policy so students and staff know what a flag means and how to appeal.
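The detector-disagreement measurement suggested above is trivial to automate. A sketch, assuming each detector emits one binary label per sample (the function name is ours):

```python
def disagreement_rate(labels_a, labels_b):
    """Fraction of samples where two independent detectors disagree.

    A useful pilot health metric: high disagreement means scores are not
    yet trustworthy enough to wire into high-stakes decisions.
    """
    assert len(labels_a) == len(labels_b), "detectors must score the same samples"
    disagreements = sum(1 for a, b in zip(labels_a, labels_b) if a != b)
    return disagreements / len(labels_a)
```

Samples where the two tools disagree are exactly the ones to route to human spot checks first.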
Glossary (quick reference)
- True positive (TP): AI text correctly labeled AI.
- True negative (TN): Human text correctly labeled human.
- False positive (FP): Human text wrongly labeled AI (false accusation).
- False negative (FN): AI text wrongly labeled human (miss).
- Precision: Trust in positive predictions.
- Recall: Coverage of actual AI text.
- F1: Harmonic mean of precision and recall—one way to summarize tradeoffs.
- FPR / FNR: Rates conditional on true class—essential for fairness thinking.
Bottom line
In this 1,000-sample controlled run, SynthQuery delivered the strongest aggregate F1 and competitive recall while keeping FPR below many consumer alternatives—but not the lowest FPR overall. Turnitin and Originality.AI excelled at precision-leaning tradeoffs, especially on academic slices. GPTZero remained a strong all-rounder. No tool eliminated false positives or false negatives—use metrics to match tool to risk.
Try SynthQuery's AI Detector free — paste any text and see results in seconds: open the detector.
Related reading
- How to detect AI-generated content — workflow and signals beyond a single score.
- ChatGPT detection: what tools can’t prove — probabilistic outputs and fairness.
- Google and AI vs. human content — quality and helpfulness vs. detector labels.
Itamar Haim
SEO & GEO Lead, SynthQuery
Founder of SynthQuery. He helps teams ship content that reads well to humans and holds up under AI-assisted search and detection workflows.
He has led organic growth and content strategy engagements with companies including Elementor, Yotpo, and Imagen AI, combining technical SEO with editorial quality.
He writes SynthQuery's public guides on E-E-A-T, AI detection limits, and readability so editorial teams can align practice with how search and generative systems evaluate content.