How AI Detectors Actually Work: The Technology Behind the Scenes
- ai-detection
- nlp
- machine-learning
- watermarking
A technical explainer of AI text detection: token probabilities, perplexity and burstiness, watermarks, classifiers, domain effects, multilingual limits, and why no score is ever mathematically certain.
If you paste text into an AI detector, you are not running a single magical test. You are usually looking at one or more statistical views of how “surprising” the words are under a language model, sometimes a neural classifier trained on human vs. machine examples, and—only when the generator cooperates—watermark statistics baked into the sampling process. This article walks through how those pieces fit together, where published benchmarks land, and why perfect accuracy is impossible in principle—not only in practice.
For a first pass on your own drafts, you can use SynthQuery’s AI Detector and pair it with SynthRead for readability and editing signals. This post is the companion “under the hood” read for teams that want more than a score.
How LLMs generate text and why distributions are detectable
A causal language model defines a conditional probability for each next token given everything that came before. If vocabulary size is V and token positions are t = 1, …, T, one common decomposition is:
P(w_1,\dots,w_T) = \prod_{t=1}^{T} P(w_t \mid w_{<t})
Each factor is a softmax over the vocabulary. Modern models are extremely good at assigning high probability to fluent continuations—so machine text often sits in high-density regions of that distribution: the model “likes” its own samples. Human writing is messier: rare word choices, idiosyncratic metaphors, and factual specificity can produce higher per-token surprise under any fixed scoring model, though not always.
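The chain-rule decomposition above can be made concrete with a toy vocabulary. This is an illustrative sketch, not any production detector: the logits and chosen tokens are invented, and a real model would produce them from learned weights.

```python
import math

def softmax(logits):
    # Stable softmax: convert raw scores into a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Invented per-step logits over a 4-token vocabulary; each row is the model's
# scores for the next token given the prefix so far.
step_logits = [
    [2.0, 0.5, -1.0, 0.0],
    [0.1, 3.0, 0.0, -0.5],
    [1.5, 0.2, 2.5, -1.0],
]
chosen = [0, 1, 2]  # the tokens actually emitted at each step

# Chain rule: the joint log-probability of the sequence is the sum of the
# conditional log-probabilities of each chosen token.
log_p = sum(math.log(softmax(row)[w]) for row, w in zip(step_logits, chosen))
print(log_p)
```

Detectors work with exactly these per-token log-probabilities; everything downstream (perplexity, burstiness, curvature) is a summary of them.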
That gap is statistical, not moral. It depends on which model you use as the judge, the domain (Reddit vs. legal brief), and whether a human is imitating “LLM tone.” Early interactive tools such as GLTR (Gehrmann et al., EMNLP 2019) visualized these probabilities to help analysts see when text looked “too green” (top-1 predictions) under GPT-2—still a useful mental model even as models grew larger.
Likelihood under an external judge
Practical detectors rarely have access to the exact model that wrote a passage. They approximate with a public LM (or a family of models). That mismatch is important: if the “judge” is GPT-2–class and the text is from a frontier system, absolute log-probabilities drift—but relative comparisons across large corpora can still rank machine-like fluency. Many systems therefore report calibrated probabilities or percentiles against a reference corpus rather than raw perplexity alone.
Why “random human typing” is not uniformly random
Humans do not sample from the same softmax as a temperature-1.0 model. We plan, backtrack, and inject content from memory. LLMs, without retrieval, interpolate statistical patterns from training data. Detection exploits distributional differences between those processes, not the philosophical question of creativity.
Perplexity and burstiness (with the actual formulas)
Perplexity compresses the average surprise of tokens into one number. Let ℓ_t be the negative log-likelihood of token w_t given all prior tokens (the usual next-token loss). Under some fixed scoring model:
\text{PP} = \exp\left( \frac{1}{T} \sum_{t=1}^{T} \ell_t \right)
Equivalently, log PP equals the average of the ℓ_t values: perplexity is the exponential of the average token cross-entropy. Lower perplexity means the scoring model finds the text more predictable—which is exactly what you often see when an LLM scores text it could have written itself.
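The formula translates directly into code. The per-token losses below are made-up numbers chosen to illustrate the contrast; in practice they would come from a scoring LM's next-token loss.

```python
import math

def perplexity(nlls):
    # nlls: per-token negative log-likelihoods (nats) under a fixed scoring
    # model; perplexity is the exponential of their mean.
    return math.exp(sum(nlls) / len(nlls))

# Invented losses: "machine-like" text is uniformly predictable, while
# "human-like" text mixes easy tokens with surprising ones.
machine_like = [1.1, 1.0, 1.2, 0.9, 1.0]
human_like = [0.4, 3.2, 0.8, 2.9, 1.1]

print(perplexity(machine_like), perplexity(human_like))  # lower vs. higher
```

Note that perplexity only summarizes the mean of the losses; two very different texts can share a perplexity, which is why detectors also look at variance.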
Burstiness captures unevenness in surprise or structure. Several papers operationalize it differently; a simple feature-level view is to look at the variance of token surprises (or of sentence lengths after segmentation). Let ℓ̄ denote the mean of the ℓ_t. A burstiness-style statistic can be:
B = \frac{1}{T} \sum_{t=1}^{T} (\ell_t - \bar{\ell})^2
Intuition: templated machine prose may keep per-token losses ℓ_t in a narrow band (uniform “polished” rhythm), while human text often alternates short punchy lines with long asides, yielding higher variance in both length and information content. Detectors that combine mean surprise with variance can separate some corpora—but the features are not universal: a careful human editor can flatten variance, and a stochastic decoding setting can reintroduce it.
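The variance statistic is equally short to compute. Same caveat as above: the loss sequences here are synthetic stand-ins for real per-token surprisals.

```python
def burstiness(nlls):
    # Population variance of per-token negative log-likelihoods:
    # low for uniformly "polished" text, high for uneven human rhythm.
    t = len(nlls)
    mean = sum(nlls) / t
    return sum((l - mean) ** 2 for l in nlls) / t

machine_like = [1.1, 1.0, 1.2, 0.9, 1.0]   # narrow band of surprise
human_like = [0.4, 3.2, 0.8, 2.9, 1.1]     # punchy lines plus long asides

print(burstiness(machine_like), burstiness(human_like))
```

Notice that both sequences have similar means; only the spread differs, which is exactly the signal burstiness features try to capture.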
Operational note
Products marketed as “burstiness detectors” often implement heuristic proxies: sentence-length histograms, dependency-depth statistics, or entropy of function-word bigrams. The branding is catchier than the math; ask vendors which features they log if you run compliance reviews.
Watermarking: OpenAI-class ideas, Google DeepMind SynthID, and Kirchenbauer et al.
Statistical watermarking for text (Kirchenbauer et al., “A Watermark for Large Language Models,” arXiv:2301.10226) biases sampling so certain “green” tokens are slightly more likely than they would be under raw logits. The bias is subtle enough to preserve quality, but over many tokens it becomes a detectable signal: a detector counts how often observed tokens fall in the green sets implied by a secret key, and runs a hypothesis test (the paper gives interpretable p-values). Strength trades off against robustness to paraphrase and adversarial rewriting—if you delete or scramble enough tokens, any statistical watermark weakens.
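A simplified version of the green-list test can be sketched in a few lines. Everything here is a stand-in: the hash-based partition, the key, and the tiny vocabulary are invented for illustration, and the real scheme in the paper handles context hashing, bias strength, and quality trade-offs far more carefully.

```python
import hashlib
import math
import random

def green_set(prev_token, key, vocab_size, gamma=0.5):
    # Deterministic pseudorandom partition of the vocabulary, seeded by the
    # previous token and a secret key (a simplified stand-in for the paper's
    # context hashing). Each token is "green" with probability gamma.
    green = set()
    for tok in range(vocab_size):
        digest = hashlib.sha256(f"{key}:{prev_token}:{tok}".encode()).digest()
        if digest[0] < 256 * gamma:
            green.add(tok)
    return green

def watermark_z(tokens, key, vocab_size, gamma=0.5):
    # Count how often each token lands in the green set implied by its
    # predecessor, then form a one-proportion z-statistic: a large z means
    # the green-token bias is present far beyond chance.
    n = len(tokens) - 1
    hits = sum(
        1 for prev, tok in zip(tokens, tokens[1:])
        if tok in green_set(prev, key, vocab_size, gamma)
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

random.seed(0)
V, KEY = 64, "secret-key"

# Simulated watermarked generation: always sample from the green set.
wm = [0]
for _ in range(200):
    wm.append(random.choice(sorted(green_set(wm[-1], KEY, V))))

# Unwatermarked text: tokens drawn uniformly at random.
plain = [random.randrange(V) for _ in range(201)]

print(watermark_z(wm, KEY, V), watermark_z(plain, KEY, V))
```

Run with the wrong key, the watermarked sequence scores like random text, which is why verification requires cooperation from whoever holds the key.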
SynthID (Google DeepMind) extends similar provenance and watermarking ideas across modalities (image, audio, video); the public materials emphasize scalable detection and responsible deployment alongside other safety layers. Text and pixel domains differ: images survive noise differently than Unicode strings, but the core theme is the same—embed a secret structure and test for it at scale. See DeepMind’s overview of SynthID for product-facing context and linked research.
OpenAI has discussed classifier-based detection and the limits of API-side scoring publicly; their research communications are a useful counterweight to vendor hype because they stress false positives on non-English text, short prompts, and edited material. See OpenAI’s discussion of AI-written text classifiers for the original framing and caveats.
Watermarking is powerful when the generator participates. Third-party sites that do not control model sampling cannot recover a secret green-token schedule unless the platform exposes verification APIs or metadata.
Fine-tuned classifier models: data, architectures, and failure modes
Commercial detectors often stack a Transformer encoder (RoBERTa, DeBERTa, or similar) with a classification head trained on:
- Positive class: text from known models (multiple families and decoding settings).
- Negative class: human-written text from books, forums, news, and academic corpora.
Training mixes matter. If the negative set skews toward informal English, formal human prose can look “machine.” If positives omit the latest model, new generators slip through until retraining. Data contamination is also real: popular web text already contains AI assistance; labels are noisy.
Architecturally, these models are ordinary text classifiers. Their limitation is not implementation quality—it is non-stationarity: the generative process changes every few months as vendors update RLHF, tool use, and base weights.
Training recipe (what papers and vendors actually tune)
A typical recipe looks like this:
- Stratify positives across model families (decoder-only, instruction-tuned, retrieval-augmented) and decoding settings (temperature, top-p, penalties).
- Stratify negatives by register: social media, journalism, STEM papers, student essays, ESL writers.
- Augment with back-translation, paraphrases, and light edits so the head does not memorize surface n-grams only.
- Calibrate logits with temperature scaling or isotonic regression so that “80% AI” means something stable on a holdout set.
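As a sketch of the calibration step, temperature scaling fits a single scalar T on held-out logits. The grid search below is a simplification of the usual LBFGS fit, and the (logit, label) pairs are invented for illustration.

```python
import math

def nll(pairs, T):
    # Average negative log-likelihood of binary "AI" labels under
    # temperature-scaled sigmoid scores sigma(z / T).
    total = 0.0
    for z, y in pairs:
        p = 1.0 / (1.0 + math.exp(-z / T))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(pairs)

def fit_temperature(pairs):
    # One-dimensional grid search on a held-out set; production code would
    # typically use LBFGS, but the objective here is smooth and easy to scan.
    grid = [t / 10.0 for t in range(1, 101)]
    return min(grid, key=lambda T: nll(pairs, T))

# Invented held-out (logit, label) pairs from an overconfident detector:
# two confident mistakes (3.8 -> human, -3.6 -> AI) among four correct calls.
holdout = [(4.0, 1), (3.5, 1), (3.8, 0), (-4.2, 0), (-3.9, 0), (-3.6, 1)]
T = fit_temperature(holdout)
print(T)  # T > 1: the raw logits were overconfident and get softened
```

Because T > 1 divides every logit, scaled probabilities shrink toward 0.5, which is precisely the correction an overconfident classifier needs.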
Even with care, domain shift at deployment time remains the dominant failure mode: the classifier optimizes a proxy that is not identical to your institution’s writing population.
Zero-shot detection vs. trained classifiers
Zero-shot methods (examples: probability curvature / perturbation tests in DetectGPT; GLTR-style visualization) use a scoring LM without a dedicated detector training set. Trained classifiers learn decision boundaries directly from labeled examples.
| Dimension | Zero-shot / scoring LM | Trained classifier | Watermark-based |
| --- | --- | --- | --- |
| Needs labeled human/AI corpus at train time | No (for basic score) | Yes | No for detection (needs watermark at generation) |
| Sensitive to scoring model mismatch | High | Medium | Low if verifier matches embedder |
| Works on short snippets | Often weak | Often weak | Needs enough tokens for test power |
| Robust to paraphrase | Varies | Varies | Degrades under heavy rewrite |
| Deployability for third parties | Easier (API logprobs) | Needs model hosting | Needs generator cooperation |
DetectGPT (Mitchell et al., ICML 2023; arXiv:2301.11305) estimates how curved the log-probability landscape is around a passage and uses perturbations from a generic model (for example T5) to approximate that geometry without training examples of AI text. In their reported experiments—one setting involves GPT-NeoX (20B) generations on a news-style task—DetectGPT reaches about 0.95 AUROC, substantially above a strong zero-shot baseline reported near 0.81 AUROC in the same table. Treat those numbers as paper conditions, not guarantees on your classroom essay or your SaaS landing page.
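The curvature idea can be illustrated with a one-dimensional toy: a "scoring model" whose log-probability peaks sharply at one point, a "perturbation" that jitters the input (standing in for T5 mask-and-fill rewrites), and the DetectGPT statistic log p(x) minus the mean log p of perturbations. Every number here is invented; real DetectGPT operates on token sequences, not scalars.

```python
import random
import statistics

random.seed(1)
MODE = 0.0  # stand-in for the scorer's high-density region

def log_p(x):
    # Toy scoring model: log-probability peaks sharply at MODE.
    return -abs(x - MODE)

def perturb(x):
    # Stand-in for mask-and-fill rewrites: a small random edit.
    return x + random.gauss(0, 0.5)

def detectgpt_score(x, n=50):
    # DetectGPT-style curvature statistic: machine text sits near a local
    # maximum of the scorer, so perturbations lower its log-probability
    # more than they do for off-mode (human-like) text.
    perturbed = [log_p(perturb(x)) for _ in range(n)]
    return log_p(x) - statistics.mean(perturbed)

machine_like = 0.0   # at the scorer's mode
human_like = 1.5     # plausible but off-mode

print(detectgpt_score(machine_like), detectgpt_score(human_like))
```

The machine-like point gets a clearly positive score (perturbing it always hurts), while the off-mode point scores near zero, mirroring the separation the paper reports on real text.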
When zero-shot wins or loses
Zero-shot scoring shines when you cannot collect a representative labeled dataset—e.g., a new model drops overnight—or when you worry that a supervised head overfits vendor-specific quirks. It weakens when the scoring LM is too small relative to the generator, or when passages are too short for stable curvature estimates. Supervised heads can win on in-distribution benchmarks yet crumble on out-of-domain human text; zero-shot methods sometimes generalize differently because they rely on generative mechanics rather than spurious corpus artifacts. Neither is universally dominant; teams evaluating vendors should test both styles on their own held-out writers.
Why accuracy varies by domain
Creative writing often mixes high perplexity with irregular structure; poets and dialog-heavy fiction can look “machine” to a model trained on expository web text—or the opposite, depending on the scorer.
Technical documentation repeats stock phrases (“click Save,” “see Table 2”). Both humans and LLMs produce low-surprise boilerplate, shrinking separability.
Academic papers carry citation patterns, nominalizations, and discipline-specific n-grams. Classifiers may latch onto style rather than provenance, which is why policy teams should never treat a score as evidence of misconduct by itself.
The adversarial arms race: humanizers and detector updates
Humanizer tools rewrite drafts to increase variance, insert specifics, swap transitions, and sometimes deliberately inject “burstier” sentence lengths. From a detection standpoint, that is distribution shift: the detector’s training data may not match the post-humanizer manifold.
Detectors respond with new training mixes, larger context windows, and ensemble scores (classifier + perplexity + stylistic features). This is an arms race, not a one-shot patch: any fixed rule becomes a training target for the next round of circumvention. For an editorial perspective on rewriting vs. disclosure, see our AI humanizer guide and keep ChatGPT detection limitations in mind.
What changes under the hood when humanizers run
Many humanizers chain: paraphrase → expansion/contraction → tone shift. Each step pushes token counts away from the original model’s typical co-occurrence statistics. Detectors that rely heavily on low-level n-grams see the biggest swings; classifiers with semantic layers (Transformer embeddings) may be more stable unless the rewrite preserves “AI register” phrases. None of this is an endorsement of evasion—policy teams should still prefer disclosure—but it explains why the same detector can disagree with itself before and after automated editing.
Multilingual detection challenges
Tokenizers differ across scripts; morphologically rich languages split differently into subwords. Reference data for “human” text may be scarcer or domain-skewed in languages beyond high-resource English. Translators also introduce mixed-language artifacts that confuse monolingual classifiers.
Peer-reviewed work in ACL/EMNLP and arXiv repeatedly finds lower F1 / AUROC for non-English benchmarks when models are trained primarily on English—sometimes dramatically so. Operationally: treat non-English scores as even noisier than English ones, and prefer longer passages when policies allow.
Script-specific effects
Logographic scripts (e.g., Chinese characters) pack more meaning per token than typical BPE chunks in Latin scripts; agglutinative languages may produce extremely long tokens under naive subwording. A detector’s tokenizer and scoring LM must match the language, or perplexity becomes a codec artifact rather than a style signal. If a vendor only documents English calibration, assume other languages are experimental.
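A tiny arithmetic demo shows why. If two tokenizers assign the same total cross-entropy to a sentence but split it into different numbers of tokens, per-token perplexity diverges purely as a segmentation artifact; the token counts below are invented for illustration.

```python
import math

def perplexity_from_total(total_nll, n_tokens):
    # Perplexity is exp of the *average* per-token cross-entropy, so the
    # token count in the denominator matters as much as the total loss.
    return math.exp(total_nll / n_tokens)

total_nll = 60.0  # same total information content for one sentence

# Tokenizer A: Latin-script BPE splits the sentence into 30 subwords.
pp_bpe = perplexity_from_total(total_nll, 30)
# Tokenizer B: a logographic segmentation covers it in 12 denser units.
pp_dense = perplexity_from_total(total_nll, 12)

print(pp_bpe, pp_dense)  # same text, very different scores
```

Any threshold tuned on one segmentation is meaningless under the other, which is the concrete sense in which perplexity becomes a codec artifact.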
Published benchmarks (how to read them)
When you read a table in an ACL or EMNLP paper:
- Which generator? GPT-3.5 vs. Llama vs. local fine-tunes shifts scores.
- Which prompt? Creative vs. summarization tasks change separability.
- How much text? 50 tokens vs. 500 tokens is not comparable.
- Human baseline quality? ELI5 Reddit prose ≠ graduate prose.
Representative anchors:
- GLTR-style probability visualization (Gehrmann et al., EMNLP 2019) established that LM probability features carry signal under older generators.
- DetectGPT (Mitchell et al., arXiv:2301.11305) reports strong zero-shot AUROC in several controlled evaluations; see their tables for task-by-task variation.
- Kirchenbauer et al. (arXiv:2301.10226) analyze detection p-values when watermarking is enabled and discuss robustness trade-offs.
Always compare the same metric (AUROC vs. accuracy at a fixed threshold) and note whether authors tune thresholds on validation data.
Negative results matter
Several EMNLP and ACL papers document detector collapse when adversaries use stronger paraphrasers or when the generator is unknown. A high AUROC on yesterday’s benchmark does not license automatic penalties tomorrow. Institutions should archive locally measured false-positive rates by cohort (ESL writers, engineering students, journalists) instead of trusting vendor marketing sheets.
Why no detector can be 100% accurate (mathematical sketch)
Informally, detection is binary classification between two overlapping distributions over text: human and machine. Even with infinite data, Bayes optimal error is:
P_e = \frac{1}{2} \int \min\{ p_H(x), p_M(x) \}\, dx
When p_H and p_M (human and machine densities over text) share support and overlap substantially—which they do whenever humans imitate “AI voice” and AIs imitate “human voice”—the integral of the minimum density is strictly positive, so perfect separation is impossible without extra information (watermark keys, provenance metadata) that the observer may not possess.
Finite text length adds sampling noise; mislabeled training data adds aleatoric label noise; and non-identifiability means multiple generative stories can produce the same string. Together, those forces cap calibrated accuracy below 100% even before adversarial rewriting.
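The integral above can be evaluated numerically for a one-dimensional caricature in which each population emits a Gaussian "detector feature"; the means, unit variance, and 1-D reduction are illustrative assumptions, not a model of real text.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    # Density of a Gaussian with the given mean and standard deviation.
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )

def bayes_error(mu_h, mu_m, lo=-10.0, hi=16.0, steps=50_000):
    # Midpoint-rule integration of (1/2) * min(p_H, p_M): the irreducible
    # error of any classifier when the two classes have equal priors.
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += min(normal_pdf(x, mu_h), normal_pdf(x, mu_m)) * dx
    return 0.5 * total

print(bayes_error(0.0, 6.0))  # well separated: error near zero
print(bayes_error(0.0, 1.0))  # heavy overlap: error stuck near 0.31
```

No amount of training data or model capacity moves the second number; only extra information (a watermark key, provenance metadata) changes the distributions being compared.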
Connection to confidence scores
If a detector outputs a “probability of AI” p̂, ideal calibration means that among all cases labeled 0.8, roughly 80% are truly AI-generated. Overlapping p_H and p_M force Bayes error above zero, so even a perfectly calibrated model cannot drive expected loss to zero at every threshold. That is distinct from software bugs: it is a limit of the feature information available from text alone.
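Calibration in this sense is checkable empirically with reliability-style binning; the predictions and labels below are synthetic, and real audits would use many more samples per bin.

```python
def bin_calibration(preds, labels, n_bins=5):
    # Group predicted "probability of AI" scores into confidence bins and
    # compare each bin's mean prediction with its empirical AI rate.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    report = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            report.append((round(mean_p, 2), round(rate, 2)))
    return report

# A well-calibrated bin: ten passages scored 0.8, eight of them truly AI.
print(bin_calibration([0.8] * 10, [1] * 8 + [0] * 2))
```

A calibrated detector produces bins where the two numbers match; it still cannot tell you which two of those ten passages were the human ones.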
Where watermarks change the equation
If generation includes a keyed pseudorandom green list, the hypothesis test sees extra random variables not available to generic LMs—information outside the raw Unicode. That can push error rates down conditional on the watermark surviving editing. Strip the watermark (or never embed it), and you are back to overlapping generative distributions.
Practical takeaway for teams
Use detectors as risk triage, not courtroom proof. Combine scores with editorial review, disclosure policy, and—where available—provenance tooling. For watermark research context beyond classifiers, our watermarking overview still applies; for workflow, see how to detect AI content.
Minimum viable evaluation checklist
Before you bake a detector into admissions, hiring, or SEO penalties:
- Sample size: measure stability on passages of the same length your policy will see (titles-only vs. full essays).
- Blind human labels: have annotators guess machine vs. human on disputed cases—if humans cannot agree, neither can a model.
- Appeals workflow: every automated flag needs a human path with clear turnaround; false positives are guaranteed by construction.
- Versioning: record model vendor, detector version, and date in audit logs so scores are comparable across semesters.
This mirrors how we describe limits elsewhere on the site: tools assist judgment; they do not replace it.
FAQ
Can AI detectors detect ChatGPT-4 and GPT-5 content?
ChatGPT / GPT-4-class text is detectable in aggregate when classifiers and scoring models are fresh and the sample is long enough—but no public tool offers a guaranteed verdict per passage. Future or unreleased “GPT-5-class” systems will shift token statistics again; detectors must retrain and may lag. Treat “which model” questions as versioned engineering, not fixed physics.
Why do AI detectors sometimes flag human-written text as AI?
Training skew (human negatives that don’t match your style), short inputs (statistics are unstable), genre effects (formulaic legal or technical prose looks “smooth”), and mismatch between the scorer LM and the true author all raise false positives. That is why OpenAI’s own classifier posts emphasized limitations on short and non-English text.
Are AI detectors accurate for non-English languages?
Generally less than for high-resource English, because tokenization, training data, and label quality vary. Use longer passages, native-language review, and lower confidence thresholds for automated actions.
Can paraphrasing tools fool AI detectors?
Often partially. Paraphrase and humanizer tools change n-gram and surprisal patterns; success depends on the detector’s features and how aggressively you rewrite. Heavy editing can also make human drafts look stranger to the scorer—another reason not to rely on a single number.
How accurate are free AI detectors vs paid ones?
Paid products sometimes run larger models, fresher training data, or ensemble features, but accuracy is not guaranteed by price. Evaluate on your domain with blind human labels; watch for false positives on staff writers. Free vs. paid is less informative than transparent methodology.
Will AI detection become impossible as models improve?
Perfect black-box detection of “human vs. machine” without side information is not a sensible endgame: distributions can be made arbitrarily close for a fixed judge. What remains valuable is risk scoring, provenance (watermarks, metadata), and process (disclosure, editing). The task evolves; it does not disappear as a policy problem.
Related tools on SynthQuery
- AI Detector — Model-assisted likelihood and stylistic signals for triage.
- SynthRead — Readability, structure, and human editing support after you’ve interpreted detection output.
Further reading on SynthQuery
- ChatGPT detection limitations: what to trust
- Watermarking AI text: what publishers explore
- AI humanizer guide
Sources and external references
- Mitchell et al., DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (ICML 2023 / arXiv).
- Kirchenbauer et al., A Watermark for Large Language Models.
- Gehrmann et al., GLTR (EMNLP 2019) — interactive visualization of LM probabilities.
- OpenAI, Research on AI-written text classifiers.
- Google DeepMind, SynthID — multimodal watermarking and detection overview.
Itamar Haim
SEO & GEO Lead, SynthQuery
Founder of SynthQuery and SEO/GEO lead. He helps teams ship content that reads well to humans and holds up under AI-assisted search and detection workflows.
He has led organic growth and content strategy engagements with companies including Elementor, Yotpo, and Imagen AI, combining technical SEO with editorial quality.
He writes SynthQuery's public guides on E-E-A-T, AI detection limits, and readability so editorial teams can align practice with how search and generative systems evaluate content.
Related Posts
What Is SynthID? Google's Multimodal AI Watermarking Explained
SynthID is Google DeepMind's watermarking and provenance technology for AI-generated images, audio, and video—not a generic 'AI detector.' Here's what it does, how it differs from statistical text checks, and what it means for publishers.
AI Content Detection in Journalism: How Newsrooms Verify Source Material
How journalism organizations use AI detection, wire-service policies, ethics codes, and workflows to protect trust—from breaking news to tips and comments—without treating classifiers as proof.
AI Detection API: How to Integrate AI Content Scanning Into Your Workflow
A developer-focused guide to integrating SynthQuery’s AI detection API: endpoints, auth, rate limits, Python/Node/cURL examples, WordPress and Google Docs patterns, batch jobs, score thresholds, and pricing-aware optimization.