Spotting AI Voice Synthesis Used to Be a Party Trick. It Isn't Anymore.

There's a sound editors used to listen for. Not a glitch exactly — an absence. AI-generated speech from a few years back didn't breathe. A human reading a paragraph pulls air between clauses, and you can hear it: a small intake, the chest resetting. Early synthetic voices skipped it. They ran clause into clause with a steadiness no lung could manage. If you knew to listen for the missing breath, you could call a fake in about four seconds.

That trick is gone. AI voice synthesis now models breath, micro-hesitation, the little glottal click before a hard consonant. The tell that a whole profession leaned on has been patched out, version by version, and most of the people who relied on it haven't updated their instincts. If you verify audio for a living — or you're about to publish a clip you can't fully vouch for — the gap between what you believe you can hear and what you can actually hear is now a professional liability.

Where "you can always hear it" came from

The confidence was earned, once. The first wave of text-to-speech that ordinary people encountered — phone trees, screen readers, the GPS voice — had a signature you could spot blindfolded. Prosody landed in the wrong places. Emphasis fell on prepositions. Sibilance came out flat and slightly metallic, like an "s" recorded through a screen door. Pauses were uniform, because they were inserted by rule, not by meaning.

A generation of journalists, podcasters, and audio engineers built a mental fingerprint kit from those artifacts. It became received wisdom in newsrooms and edit bays: synthetic speech has a smell, and a trained ear will catch it. The belief spread faster than the technology that justified it, the way useful rules of thumb tend to. It got repeated in trainings, in style guides, in offhand reassurances to nervous editors. Don't worry, we'd know.

Here's the thin part. That entire detector was calibrated against a specific, dated generation of models. The fingerprint kit catalogs flaws — and flaws are exactly what each release of a synthesis model is engineered to remove. The robotic prosody, the flat sibilance, the metronome pauses: those weren't permanent properties of synthetic speech. They were temporary symptoms of immature systems. Train your ear on symptoms, and you've trained it on something the field is actively curing.

Can you tell if a voice is AI-generated?

Honestly, by ear alone, often no — not reliably, not anymore, not for a clean clip from a current model. You can still catch lazy fakes: voices with unnatural room tone, mismatched reverb between sentences, emotional flatness across a long passage, or pronunciation that's too even. But a short, well-prompted clip rendered by a recent model and then re-encoded for social platforms can pass a casual listen and a careful one. The reliable methods are no longer perceptual. They're forensic, and they fall into two very different families.

What audio forensics actually checks

Detection works backward from the finished file. It analyzes the signal for traces a human ear misses — spectral patterns, phase artifacts, statistical regularities in how the waveform was constructed. Classifiers trained on known synthetic and known real audio score how likely a clip is generated. This is genuinely useful and genuinely fragile. A detector trained on last year's models can miss this year's. Compression, noise, and re-encoding degrade the very traces it hunts for. And it produces a probability, not a verdict — which is exactly the wrong shape for a decision you have to defend.
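Why a score is not a verdict can be made concrete with Bayes' rule. The sketch below is illustrative only: the function name `posterior_synthetic` and every number in it (a 2% base rate of fakes, a detector that flags 90% of synthetic clips and 10% of real ones) are assumptions chosen for the example, not measurements from any real detector.

```python
def posterior_synthetic(prior: float, true_positive_rate: float,
                        false_positive_rate: float) -> float:
    """Probability a clip is synthetic, given that the detector flagged it.

    All three inputs are hypothetical. A real detector's error rates
    shift with the model generation, compression, and noise.
    """
    flagged_and_fake = true_positive_rate * prior
    flagged_and_real = false_positive_rate * (1.0 - prior)
    return flagged_and_fake / (flagged_and_fake + flagged_and_real)


# Assumed scenario: 2 in 100 incoming clips are synthetic, the detector
# catches 90% of those and wrongly flags 10% of genuine recordings.
p = posterior_synthetic(prior=0.02, true_positive_rate=0.90,
                        false_positive_rate=0.10)
print(f"{p:.3f}")  # 0.155
```

Under those assumed numbers, a flagged clip is still far more likely to be real than fake, which is why the score has to be weighed alongside source and provenance instead of standing in for them.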

Provenance works the other direction. Instead of guessing after the fact, it marks the audio at the moment of creation. Two approaches matter here:

Embedded watermarking — a signal woven into the audio itself at generation time, designed to survive compression and minor edits. Google's SynthID and similar systems aim to keep the mark detectable even after a clip is trimmed, re-encoded, or run through a platform's pipeline. The mark is inaudible by design.
Signed metadata — standards like C2PA attach a tamper-evident record of how a file was made and altered, cryptographically signed. Strip the metadata and you lose the record, but you also lose the claim of authenticity, which is its own signal.

Neither is a wall. Watermarks can be attacked, and a determined adversary using an unmarked or self-trained model never embeds one in the first place. Metadata gets stripped the moment a file is re-recorded, transcoded into a new container, or re-uploaded through a platform that discards it. Provenance doesn't catch every fake. What it does is shift the burden: instead of proving a clip is synthetic, you can ask whether it can prove it's real.
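The tamper-evident idea behind signed metadata can be sketched in a few lines. This is not C2PA: the real standard defines its own manifest format and uses certificate-based signatures, while the sketch below substitutes a shared-key HMAC from the Python standard library purely for illustration. The names `make_manifest` and `verify_manifest`, the record fields, and the three result labels are all hypothetical.

```python
import hashlib
import hmac
import json
from typing import Optional


def make_manifest(audio: bytes, tool: str, key: bytes) -> dict:
    """Bind a record about the file to the file's own hash, then sign it."""
    record = {"tool": tool, "sha256": hashlib.sha256(audio).hexdigest()}
    payload = json.dumps(record, sort_keys=True).encode()
    signature = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"record": record, "signature": signature}


def verify_manifest(audio: bytes, manifest: Optional[dict], key: bytes) -> str:
    """Return "valid", "invalid", or "missing" (labels are illustrative)."""
    if manifest is None:
        # Stripped metadata: no record, and no claim of authenticity either.
        return "missing"
    payload = json.dumps(manifest["record"], sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, manifest["signature"]):
        return "invalid"  # the record itself was altered
    if hashlib.sha256(audio).hexdigest() != manifest["record"]["sha256"]:
        return "invalid"  # the audio no longer matches the signed record
    return "valid"


key = b"demo-key"  # stand-in secret, an assumption for this example
clip = b"\x00\x01fake-pcm-bytes"
manifest = make_manifest(clip, tool="example-generator", key=key)
print(verify_manifest(clip, manifest, key))            # valid
print(verify_manifest(clip + b"\x00", manifest, key))  # invalid
print(verify_manifest(clip, None, key))                # missing
```

The three outcomes map onto the argument above: a valid record is a positive claim, an invalid one is evidence of tampering, and a missing one proves nothing either way but is still worth noting.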

A working verification pass

When a clip lands on your desk and you have to decide today, run it in this order:

1. Check for a provenance signal first. Look for C2PA credentials or a platform-level "AI-generated" label. A signal that is present and valid is the strongest, fastest answer you'll get.
2. Trace the source upstream. Where did the file originate: a verified account, a known outlet, a stranger's repost? Chain of custody beats signal analysis.
3. Listen for environmental tells, not vocal ones. Room tone shifts between sentences, reverb that doesn't match, an impossibly clean noise floor.
4. Run a detector, but treat it as one input. A probability score informs your judgment; it doesn't make it.
5. When stakes are high, demand the unedited original. Re-encoding destroys evidence in both directions, so get the file closest to the source.

Notice what's load-bearing here: not your ears. The reliable signals are about where a file came from and what it carries, not how it sounds.
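The ordering of the pass above can be sketched as a small triage function. Everything here is a hypothetical illustration: the `ClipEvidence` fields, the `DETECTOR_FLAG` threshold of 0.8, and the returned strings are assumptions made for the example, not an established workflow or tool.

```python
from dataclasses import dataclass
from typing import Optional

DETECTOR_FLAG = 0.8  # assumed cut-off; tune to the detector actually in use


@dataclass
class ClipEvidence:
    provenance: str                  # "valid", "invalid", or "missing"
    source_traced: bool              # chain of custody reaches a known origin
    environmental_tells: int         # room-tone, reverb, or noise-floor oddities
    detector_score: Optional[float]  # 0..1 chance of synthetic, None if not run
    have_original: bool              # unedited original obtained from the source


def triage(e: ClipEvidence, high_stakes: bool = False) -> str:
    # Step 1: a valid provenance record settles the question fastest.
    if e.provenance == "valid":
        return "rely on provenance record"

    # Steps 2-4: collect concerns. The detector is one input among several.
    # A merely missing record is not counted as a concern in this sketch.
    concerns = []
    if e.provenance == "invalid":
        concerns.append("provenance record failed validation")
    if not e.source_traced:
        concerns.append("source not traced")
    if e.environmental_tells > 0:
        concerns.append("environmental tells")
    if e.detector_score is not None and e.detector_score >= DETECTOR_FLAG:
        concerns.append("detector flagged")

    # Step 5: high-stakes clips need the file closest to the source.
    if high_stakes and not e.have_original:
        concerns.append("unedited original not obtained")

    if not concerns:
        return "no red flags found"
    return "hold: " + "; ".join(concerns)


clip = ClipEvidence("missing", source_traced=False, environmental_tells=1,
                    detector_score=0.9, have_original=True)
print(triage(clip))
# hold: source not traced; environmental tells; detector flagged
```

Note that nothing in the function asks how the voice itself sounds, and a clean result means only that no red flags were found, not that the clip is authentic.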

The thing being protected isn't detection

The deeper shift is that this stopped being a perception problem and became an accountability one. Detection asks, "Is this fake?" Provenance asks, "Can the source stand behind this?" The second question scales; the first does not. You will never hear your way to certainty across thousands of clips a week. You can, increasingly, demand that authentic audio carry proof of its own origin, and treat the absence of that proof as a fact worth weighing.

So: the missing breath. For years it was the four-second tell that let a profession feel safe. The breath is back now, modeled and convincing, and the safety it offered was always borrowed against a flaw that wouldn't last. What replaces it isn't a sharper ear. It's the discipline to stop trusting the clip and start interrogating the file — because the next thing a model learns to fake is whatever you're currently relying on to catch it.

Rachel Dunmore

The Signal · City of Punk