Last updated: March 22, 2026
This page documents how we measure AI detection performance, what data we use, and how to interpret headline figures on the marketing site. You can link here when you need a citable source for our accuracy and usage claims.
We evaluate the detector on labeled passages and documents where the source is known: human-written text from editorial and public domain samples, and AI-written text from modern LLMs (including GPT-class outputs) generated under controlled prompts. Labels are verified before inclusion in the benchmark.
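For illustration, a labeled benchmark record can be thought of along the lines of the sketch below. The field names are hypothetical and chosen for readability; they are not our actual data schema.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkSample:
    text: str    # the full passage or document being scored
    label: str   # "human" or "ai", verified before inclusion in the benchmark
    source: str  # e.g. an editorial corpus, a public domain archive, or the generating model family
    domain: str  # content domain, used later for stratification
```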
We report accuracy as the fraction of predictions that match the held-out label at the document level, with additional sentence-level metrics for calibration. Standard mode uses our ensemble stack; DeepScan (paid plans) runs a stronger pass for mixed or edge-heavy content.
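Concretely, document-level accuracy is the share of verdicts that agree with the verified label. The following is a minimal sketch of that calculation, not our evaluation code; the function name and example values are illustrative.

```python
def document_accuracy(predictions: list[str], labels: list[str]) -> float:
    """Fraction of document-level predictions that match the held-out label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    if not labels:
        return 0.0
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Example: 3 of 4 documents classified correctly -> 0.75 accuracy
print(document_accuracy(["ai", "human", "ai", "human"],
                        ["ai", "human", "human", "human"]))
```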
The internal benchmark uses roughly 10,000 labeled samples (human-written and AI-written combined), with stratified splits so that no single domain or model family dominates. It is not identical to public leaderboards or third-party test sets.
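One simple way to read "stratified" is a per-group cap on samples, as in the sketch below. The grouping key and cap value are assumptions for illustration and do not describe our actual pipeline; the fields match the hypothetical BenchmarkSample record above.

```python
import random
from collections import defaultdict

def stratified_subsample(samples, cap_per_group=200, seed=0):
    """Cap each (label, domain, source) group so no domain or model family dominates."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        # "source" is the generating model family for AI text, or the corpus for human text
        groups[(sample.label, sample.domain, sample.source)].append(sample)
    picked = []
    for group in groups.values():
        rng.shuffle(group)
        picked.extend(group[:cap_per_group])
    return picked
```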
Real-world performance can differ from benchmark accuracy: very short text, heavy paraphrasing, translated text, or adversarial editing may reduce agreement with labels. For a research-level overview of detector possibilities and limits, see Towards Possibilities & Impossibilities of AI-generated Text Detection: A Survey (arXiv:2310.15264).
We do not publish head-to-head competitor percentages on our private benchmark because vendors use different datasets, label definitions, and thresholds—numbers are rarely comparable without a shared, public test harness. When you see "industry average" claims elsewhere, check which corpus and metric they use.
We focus on transparent methodology: our headline accuracy is tied to the 10k-sample internal benchmark described above. For product decisions, we recommend using the in-app verdict, heatmap, and your editorial judgment together.