The Training Data Question: What AI Music Generation Owes the Artists It Learned From

Last month I typed a friend's stage name into the search box of a public dataset that several music models were trained on. She makes ambient guitar records — small label, maybe forty thousand monthly listeners, the kind of artist who licenses a track to a documentary and calls it a good quarter. I expected nothing. The search returned eleven of her songs, each tagged with genre, mood, and tempo, each parsed down into the kind of metadata a model eats for breakfast. She had never been asked. She had never been paid. She found out from me, over text, on a Tuesday.

That small moment is the whole argument about AI music generation compressed into one screen. The systems that now produce a passable lo-fi loop or a cinematic swell on demand did not learn music from nothing. They learned it from recordings — millions of them — and the question of whose recordings, gathered how, and with what permission, is the unresolved fault line running under every tool in this space.

The 238-song discovery

The version of this that made headlines involved someone far larger than my friend. SZA, searching a training database herself, reported finding 238 of her songs used to train music models, and said some of the material was unreleased. That detail matters more than the count. A leaked stem or a scrapped demo isn't only a commercial asset; it's a creative decision the artist made not to share. Finding it inside a training set means the choice was made for her.

Her reaction was furious and public, and other artists have been raising the same complaint for two years now. The reporting around it tends to resist neat conclusions, and it should. What happened is documentable. What it legally means is not yet settled anywhere that counts.

What a training dataset actually is

Strip away the mystique and a music training dataset is a very large library of audio paired with descriptions. A model studies the relationship between the words — "warm Rhodes, 72 BPM, melancholy" — and the waveform, then learns to generate new audio when given new words. It does not store songs the way a sampler stores a break; it stores statistical patterns. That distinction is genuinely important and also genuinely insufficient, because the patterns were extracted from specific human work without that work's consent.

Most large datasets were assembled by scraping. Crawlers pull audio and metadata from wherever it sits in bulk: public repositories, streaming previews, archived uploads, smaller catalogs that never imagined themselves as raw material. The collection happens at a scale where individual permission was never part of the plan. By the time an artist searches for their name, the ingestion is years old and baked into models already shipping.

The part the law hasn't decided

Here is where I'll describe rather than promise, because anyone telling you the copyright outcome is settled is selling something.

Copyright protects specific recordings and compositions — the actual fixed work. Training a model involves copying those works to process them, and whether that copying is infringement or falls under a fair-use-style exception is the question currently moving through courts in multiple countries. The major labels have filed suit against AI music companies. Those cases will likely take years and may land differently in different jurisdictions. As of writing, there is no clean precedent that says "training on copyrighted audio is legal" or "illegal." There is only litigation in progress and a lot of confident people pretending otherwise.

A few things are clearer than the headline fight:

Output that closely imitates a specific recording or a recognizable voice is a different and more dangerous problem than training in the abstract. Voice cloning has already drawn cease-and-desist energy.
Consent and licensing are separable from copyright law. A company can choose to license training data and pay for it regardless of what a court eventually decides it was required to do. Some have started.
"It was publicly available" is not the same as "it was licensed for training." A song streaming on a platform is published, not donated.

Why the industry is split, and why both sides aren't wrong

The disagreement isn't artists versus technologists. It runs straight through music itself.

On one side: performers and writers whose catalogs were ingested without a conversation, who see a tool that competes with them built partly from their own labor. That grievance is real and doesn't require a court to validate it to be felt.

On the other: producers — including some prominent ones — who frame these tools as instruments, the latest in a long line of machines that musicians were told would end music and didn't. The sampler was theft, then it was hip-hop. The drum machine would put drummers out of work, then it made genres. There's a coherent argument that AI is the next sampler and the panic is the same panic.

What the sampler comparison skips is the consent step. When a producer chops a James Brown break, there's now a clearance system — imperfect, expensive, litigated into existence over decades, but a system. Sampling without clearance gets you sued. Training datasets arrived with no equivalent, and the gap between "we built clearance for samples" and "we built nothing for training" is the gap the whole fight lives in.

How to check whether your work was used

You can't audit every model, but you can do more than worry. Here's what's actually possible right now.

Search the public datasets directly. Several training corpora are openly browsable. Search your artist name, track titles, and label. If results come back, screenshot them with the dataset name and date — you want a record of what you found and when.
Listen for your fingerprint in outputs. Prompt a model with descriptions of your own sound. You're not looking for proof of infringement; you're gauging how closely the system can approximate your signature. It's an uncomfortable but informative test.
Read the data provenance, not the marketing. Tools increasingly publish where their training data came from. "Trained on licensed and proprietary data" is a meaningfully different sentence from silence. Ask. The serious companies answer.
Watch for opt-out mechanisms. Some datasets and platforms now offer removal or exclusion requests. They're inconsistent and often retroactive in name only, since a model already trained on your work doesn't unlearn it, but documenting your request matters if the law catches up.
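
For the first step, searching by hand works, but if a dataset offers its track listing as a downloadable file you can script the search and keep a dated record at the same time. What follows is a minimal sketch, not any dataset's real interface: it assumes a CSV export with columns named artist, title, and label, and the sample rows, the dataset name, and both function names are made up for illustration. Real corpora use their own formats, so expect to adjust the column names.

```python
import csv
import io
from datetime import date

# Hypothetical metadata export. Real datasets use their own column names.
SAMPLE_CSV = """artist,title,label,genre,tempo
Example Artist,Slow Tide,Small Label,ambient,72
Another Act,Night Bus,Other Label,lo-fi,85
example artist,Glass Hours,Small Label,ambient,60
"""

# Columns to search (assumed names, see above).
SEARCH_FIELDS = ("artist", "title", "label")


def find_matches(rows, terms):
    """Return rows where any search term appears in artist, title, or label."""
    # Case-insensitive matching; blank terms are ignored.
    needles = [t.casefold() for t in terms if t.strip()]
    matches = []
    for row in rows:
        haystack = " ".join(row.get(f, "") or "" for f in SEARCH_FIELDS).casefold()
        if any(n in haystack for n in needles):
            matches.append(row)
    return matches


def build_record(dataset_name, matches, found_on=None):
    """Bundle what was found with the dataset name and the date of the search."""
    return {
        "dataset": dataset_name,
        "searched_on": (found_on or date.today()).isoformat(),
        "match_count": len(matches),
        "tracks": [f"{m['artist']} - {m['title']}" for m in matches],
    }


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
record = build_record("example-dataset", find_matches(rows, ["Example Artist"]))
print(record)
```

Substring matching like this is deliberately loose. It will catch alternate capitalizations of a name but can also return false positives, so treat the output as a list of things to check by eye, and keep the screenshots alongside it.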

For those of us who use these tools to make things — game loops, edit beds, podcast intros — the provenance question cuts both ways. You want output you can ship without a clearance nightmare, which means you want a tool whose training you can actually account for. City of Punk is built on that premise, and it's the same standard I'd hold any competitor to: if a vendor can't tell you where the music it learned from came from, that silence is the answer.

Back to Tuesday

My friend with the eleven songs hasn't decided what to do. There's no obvious lever to pull. Suing alone over a model trained on millions of tracks would take resources she doesn't have, and the dataset is already downloaded onto machines she'll never find. Her music is in there. It will stay in there. The models that learned from it are out in the world making things that compete, faintly, with the thing she spent four years making.

She asked me whether using AI tools herself now would be hypocrisy or self-defense. I didn't have an answer, and I'm suspicious of anyone who does.

So here's the question I keep landing on, the one neither the courts nor the producers nor I can close yet: if a model learns the grammar of music from work it never asked to use, and then makes something genuinely new — is the debt to the original artists a legal one, a moral one, both, or a thing we haven't invented the word for?

Samuel Kenworth

The Signal · City of Punk