Last updated: March 22, 2026
This page documents how we measure AI detection performance, what data we use, and how to interpret headline figures on the marketing site. You can link here when you need a citable source for our accuracy and usage claims.
We evaluate the detector on labeled passages and documents where the source is known: human-written text from editorial and public domain samples, and AI-written text from modern LLMs (including GPT-class outputs) generated under controlled prompts. Labels are verified before inclusion in the benchmark.
We report accuracy as the fraction of predictions that match the held-out label at the document level, with additional sentence-level metrics for calibration. Standard mode uses our ensemble stack; DeepScan (paid plans) runs a stronger pass for mixed or edge-heavy content.
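As a rough illustration of the document-level metric described above, the sketch below computes accuracy as the share of predictions that match the held-out labels. The function name `document_accuracy`, the label strings, and the toy data are assumptions made for this example; they are not taken from our evaluation code or benchmark.

```python
from typing import Sequence


def document_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Fraction of documents whose predicted label equals the held-out label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one labeled document is required")
    # Count exact label matches, one comparison per document.
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Hypothetical toy data, for illustration only (not benchmark results).
predicted_labels = ["ai", "human", "ai", "human"]
held_out_labels = ["ai", "human", "human", "human"]
print(document_accuracy(predicted_labels, held_out_labels))
```

Sentence-level metrics would apply the same comparison per sentence rather than per document.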
The figures we cite on the marketing site come from our published English benchmark (n = 1,000): 500 human-written and 500 AI-generated passages of 300–500 words each, with stratification described in the benchmark article. This benchmark is not identical to public leaderboards or third-party test sets, and larger or newer internal evaluations may produce different figures.
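Any accuracy measured on 1,000 passages carries some sampling uncertainty. One common way to express it is a Wilson score interval, sketched below. The count of 950 correct predictions is a made-up input chosen only to show the calculation; it is not a reported result, and the function name `wilson_interval` is our own for this example.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95% coverage)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Hypothetical input: 950 of 1,000 predictions correct (illustrative only).
low, high = wilson_interval(correct=950, total=1000)
print(f"{low:.3f} to {high:.3f}")
```

With n = 1,000, an interval of this kind spans a few percentage points at most, which is one reason figures from differently sized evaluations are not directly interchangeable.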
Real-world performance can differ from benchmark accuracy: very short text, heavy paraphrasing, translated text, or adversarial editing may reduce agreement with labels. For a research-level overview of detector possibilities and limits, see Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey (arXiv:2310.15264).
Our published head-to-head comparison fixes one corpus and one protocol so readers can compare tools fairly on that slice. Elsewhere, vendors still use different datasets, label definitions, and thresholds. When you see "industry average" claims, check which corpus and metric they use.
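To see why thresholds matter when comparing claims, the sketch below scores the same set of detector outputs at three different decision thresholds. The scores, labels, thresholds, and the function name `accuracy_at_threshold` are all hypothetical and chosen for illustration; they do not describe our detector or any vendor's.

```python
from typing import Sequence


def accuracy_at_threshold(
    scores: Sequence[float], is_ai: Sequence[bool], threshold: float
) -> float:
    """Accuracy when a document is called AI-written if its score >= threshold."""
    if len(scores) != len(is_ai) or len(scores) == 0:
        raise ValueError("scores and is_ai must be non-empty and equal in length")
    correct = sum((s >= threshold) == label for s, label in zip(scores, is_ai))
    return correct / len(scores)


# Hypothetical detector scores (higher = more likely AI) and true labels.
scores = [0.10, 0.35, 0.55, 0.62, 0.80, 0.95]
is_ai = [False, False, False, True, True, True]

for threshold in (0.5, 0.6, 0.9):
    print(threshold, accuracy_at_threshold(scores, is_ai, threshold))
```

The same scores yield different accuracy at each cut-off, so two headline numbers are only comparable when the corpus, label definitions, and threshold are all held fixed.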
We focus on transparent methodology: headline accuracy matches the published n = 1,000 benchmark linked above. For product decisions, use the in-app verdict, heatmap, and your editorial judgment together.