AI Music Training and the Consent Problem: What Your Catalog Is Actually Worth

A vocalist I know got a screenshot from a fan last spring. It was a forum thread listing the records an AI model had reportedly been trained on, and her name was on it — a specific number of songs, more than she'd released commercially. Her reaction wasn't a press statement. It was a text to me at 1 a.m.: how do they have the demos.

That question is the whole story of AI music training right now. Not the model architecture, not the licensing fine print — the demos. The stuff that never came out. The version a producer can hear in a generated render and recognize, because they were in the room when it was tracked.

This piece is for the people whose work is the training data. If you own masters, write songs, or produce records that end up in someone else's dataset, here is what's actually happening, what the mechanism is underneath the outrage, and where the law currently leaves you — which is to say, not where you'd hope.

The myth: "It only learns style, not your songs"

You've heard this one, probably from someone who builds these tools. The argument goes: a generative model doesn't store your recordings. It learns statistical patterns — the shape of a genre, the way a chorus tends to lift, the timbre of a certain kind of vocal. It's no different, the pitch goes, than a session player who grew up on your records and absorbed your phrasing. Style isn't copyrightable. Influence is how music has always worked.

It's a clean argument. It's also doing a lot of work to skip the part where your specific recordings were copied to train the thing in the first place.

The myth survives because it conflates two separate questions. One: does the output infringe? That's genuinely murky and gets argued case by case. Two: was making the model — the act of ingesting your catalog to train it — authorized? That second question is where most of the actual fight is, and "it only learns style" doesn't answer it at all.

The evidence: artists are counting

What changed the conversation wasn't legal theory. It was artists doing inventory.

Over the past couple of years, high-profile musicians have gone public with specific, countable allegations: not "AI is stealing music" in the abstract, but "this model trained on this many of my songs, and some of them I never released." The specificity is the point. An exact count of unreleased tracks in a dataset is hard to explain as coincidental style absorption. It implies the files themselves were somewhere they shouldn't have been.

The pattern repeats across the industry, and it tends to cluster. Artists have noted that the genres getting ingested most aggressively are often the ones built by Black and Brown musicians — the same catalogs that the recorded-music business has historically underpaid. Whether that targeting is deliberate or a function of what's popular and therefore commercially valuable to imitate, the effect lands the same way on the people whose work it is.

None of these public claims has been fully adjudicated as of writing. Several lawsuits between major labels and AI music companies are working through courts in the US and elsewhere, and the discovery process in those cases — the part where companies have to disclose what's actually in their training sets — is where the durable answers will come from. Treat any specific allegation as a claim until a court or a settlement says otherwise. But the volume and the consistency of the claims are themselves the signal.

The mechanism: how your catalog gets in

Here's the part the style argument skips. Training a music model requires audio — a lot of it, typically as files, typically scraped or licensed or acquired from somewhere. The three rough sources are:

  • Licensed data, where the company paid for or struck a deal to use catalogs. Some firms now advertise this. It's the cleanest path and the rarest.
  • Public or "publicly available" data, a phrase doing enormous lifting. A track being streamable on a platform does not mean it's licensed for training. "I could find it" and "I'm allowed to copy it into a dataset" are different claims.
  • Scraped data, pulled from wherever, including leaked or pre-release material that was never meant to be public at all. This is how unreleased demos end up in a set.

Once the audio is in, training extracts patterns across all of it. The model doesn't keep a copy of your master in a folder. But it was trained on that copy, and the copy had to exist to do it. That's the act rights holders are pointing at. The legal question — is unauthorized copying for training a fair use, or infringement that requires a license — is unsettled and varies by jurisdiction. Don't let anyone, on either side, tell you it's settled.

And the safety nets you might assume cover this mostly don't. Content ID and similar systems compare a piece of audio against a database of registered recordings to flag matches in finished tracks. They were built to catch someone uploading your song, not to detect that your song was used to train a model that now generates something adjacent. A model output that sounds like your style but matches no specific waveform sails right past fingerprinting.
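If you want to see why that gap exists, here is a toy sketch in Python. It is not how Content ID or any real fingerprinting system is implemented. The function names, the pitch lists, and the four-frame window are all assumptions made up for illustration. Each "recording" is reduced to a list of numbers standing in for what the audio is doing frame by frame, and matching means sharing the same short runs of frames in the same order.

```python
def fingerprint(frames, n=4):
    """Toy fingerprint: the set of every run of n consecutive frames.

    Real systems hash spectral features; this stand-in only keeps the idea
    that a match depends on specific sequences, not on overall character.
    """
    return {tuple(frames[i:i + n]) for i in range(len(frames) - n + 1)}


def match_score(candidate, registered, n=4):
    """Fraction of the candidate's windows that also occur in the registered track."""
    cand = fingerprint(candidate, n)
    ref = fingerprint(registered, n)
    return len(cand & ref) / len(cand) if cand else 0.0


# Hypothetical registered recording, written as one pitch number per frame.
registered = [60, 62, 64, 65, 67, 65, 64, 62, 60, 67, 69, 67, 65, 64, 62, 60]

# Someone re-uploads a trimmed excerpt of the same recording.
reupload = registered[2:14]

# A new piece that uses exactly the same six pitches, in a different order.
style_alike = [60, 64, 62, 67, 65, 69, 67, 64, 65, 62, 64, 60, 62, 65, 67, 69]

print(match_score(reupload, registered))     # every window is found
print(match_score(style_alike, registered))  # no window is found
```

The excerpt matches completely because it is the same sequence. The second piece shares the whole pitch vocabulary and still scores nothing, because no run of four frames lines up. That is the shape of the problem: a system built to find copies has nothing to grab onto when the output is new audio in a familiar idiom.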

The honest takeaway

If you're a rights holder, the practical posture right now is documentation and attention, not panic. Register your works and keep your metadata clean; that's the baseline for any claim or future opt-out scheme. Read the terms on every platform you distribute through and look specifically for AI training and machine-learning clauses; some now include them, and some let you opt out. Where controls over training use exist, they're inconsistent and usually opt-out rather than opt-in, which puts the labor on you. That's worth knowing before you assume silence means safety.

If you're a producer reaching for these tools — and plenty of us are — the question to ask your provider is narrow and answerable: what was this trained on, and can you indemnify me if it wasn't clean. Some companies now train on licensed or owned catalogs specifically so they can answer that. City of Punk is built on that premise, and so are a handful of competitors; if a tool won't tell you what's in its dataset, that silence is your answer.

The technology is a real instrument. The consent question underneath it is a real debt. Both things are true, and the people owed are still counting.

Your catalog isn't a style. It's a file someone copied.

Rachel Dunmore

The Signal · City of Punk