Last updated: March 22, 2026

Detection methodology & benchmarks

This page documents how we measure AI detection performance, what data we use, and how to interpret headline figures on the marketing site. You can link here when you need a citable source for our accuracy and usage claims.

Headline claims (homepage)

Detection accuracy: In our published English benchmark (n = 1,000: 500 human-written / 500 AI-generated samples, 300–500 words each), SynthQuery achieved 88.4% overall accuracy and 87.4% F1 after calibrating the score threshold on a held-out split before the main evaluation. Full tables (precision, recall, FPR, FNR) and limitations are in the benchmark article. The 88.4% aggregate is not a guarantee for every short snippet, genre, or edited model draft.
Audience: Built for writers, editors, and teams who need clear signals—not a single opaque score. We do not display a verified global user count on the homepage; any historical internal totals are not shown as social proof.
Tool count: The public tools hub lists 362 tools (362+ in copy). See All tools.
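
The detection accuracy claim above mentions calibrating the score threshold on a held-out split before the main evaluation. The sketch below illustrates that general idea only; it is not SynthQuery's actual procedure. The function names, the 0-to-1 score scale, the "score at or above threshold means AI" rule, and the sample data are all assumptions made for illustration.

```python
from typing import Sequence


def accuracy_at(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """Fraction of samples where (score >= threshold) matches the label.

    Assumed convention: label 1 = AI-generated, label 0 = human-written.
    """
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(labels)


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Return the candidate threshold with the highest accuracy on the calibration split.

    Candidates are the observed scores themselves; ties go to the lowest candidate.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: accuracy_at(scores, labels, t))


# Hypothetical held-out calibration split (not benchmark data).
calib_scores = [0.10, 0.35, 0.40, 0.62, 0.70, 0.91]
calib_labels = [0, 0, 0, 1, 1, 1]
threshold = calibrate_threshold(calib_scores, calib_labels)

# Hypothetical main evaluation split, scored with the frozen threshold.
eval_scores = [0.20, 0.65, 0.66, 0.95]
eval_labels = [0, 0, 1, 1]
eval_accuracy = accuracy_at(eval_scores, eval_labels, threshold)

print(f"threshold={threshold}, evaluation accuracy={eval_accuracy:.2f}")
```

The point of the split is that the threshold is chosen without looking at the evaluation samples, so the reported accuracy is not inflated by tuning on the same data it is measured on.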

How detection accuracy is measured

We evaluate the detector on labeled passages and documents where the source is known: human-written text from editorial and public domain samples, and AI-written text from modern LLMs (including GPT-class outputs) generated under controlled prompts. Labels are verified before inclusion in the benchmark.

We report accuracy as the fraction of predictions that match the held-out label at the document level, with additional sentence-level metrics for calibration. Standard mode uses our ensemble stack; DeepScan (paid plans) runs a stronger pass for mixed or edge-heavy content.
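
As a minimal sketch of how document-level accuracy and the related metrics named on this page (precision, recall, F1, FPR, FNR) follow from labels and predictions: the code below uses the standard definitions and assumes AI-generated is the positive class (label 1). The function name and the toy data are hypothetical, not taken from the benchmark.

```python
from typing import Dict, Sequence


def detection_metrics(labels: Sequence[int], predictions: Sequence[int]) -> Dict[str, float]:
    """Compute standard binary metrics; 1 = AI-generated (positive), 0 = human-written."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # AI text flagged as AI
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # human text passed as human
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # human text flagged as AI
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # AI text passed as human

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Toy example: 4 AI-generated and 4 human-written documents (illustrative only).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = detection_metrics(labels, predictions)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy alone hides the split between the two error types, which is why the false positive rate (human text flagged as AI) is worth reading alongside it.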

Datasets and limitations

The figures we cite on the marketing site come from our published English benchmark (n = 1,000): 500 human-written and 500 AI-generated passages (300–500 words), with stratification described in that article. It is not identical to public leaderboards or third-party test sets, and larger or newer internal evaluations may differ.
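
One way to read a figure measured on n = 1,000 samples is to attach a sampling interval to it. The sketch below computes a 95% Wilson score interval, assuming the samples are independent and taking 884 correct predictions as the count implied by 88.4% of 1,000. The interval is an illustration of sampling uncertainty, not a figure published in the benchmark, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# 884 of 1,000 is derived from the 88.4% headline accuracy (assumption: exact count).
low, high = wilson_interval(884, 1000)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 86% to 90%, which covers sampling noise only. It does not account for the distribution shifts described on this page, such as very short or heavily edited text.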

Real-world performance can differ from benchmark accuracy: very short text, heavy paraphrasing, translated text, or adversarial editing may reduce agreement with labels. For a research-level overview of detector possibilities and limits, see Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey (arXiv:2310.15264).

Comparison to other tools

Our published head-to-head fixes one corpus and protocol so readers can compare tools fairly on that slice. Vendors still use different datasets, label definitions, and thresholds elsewhere—when you see "industry average" claims, check which corpus and metric they use.

We focus on transparent methodology: headline accuracy matches the published n = 1,000 benchmark linked above. For product decisions, use the in-app verdict, heatmap, and your editorial judgment together.

Platform metrics (for citations)

Detection: In our published English benchmark (n = 1,000: 500 human-written / 500 AI-generated samples, 300–500 words each), SynthQuery achieved 88.4% overall accuracy and 87.4% F1 after calibrating the score threshold on a held-out split before the main evaluation. See the benchmark article for precision, recall, FPR, and limitations, and the sections above for methodology.
Audience scale: Built for writers, editors, and teams who need clear signals—not a single opaque score.
Languages: Detection and readability are strongest in English; several tools accept other languages with varying degrees of optimization. The Translator supports many language pairs—see the in-app language list for the current set.
API availability: Production deployments are monitored; Enterprise plans can include SLA terms. Contact sales for uptime commitments specific to your contract.

Questions about this page? See FAQ or About.
