When an Artist's Catalog Becomes AI Music Training Datasets, Who Said Yes?

Last fall a Grammy-winning singer posted that hundreds of her songs — including tracks she had never released — appeared to have been swept into a generative music model without anyone calling her label, her lawyer, or her. The number she gave was specific enough to sting. The reaction online was the usual split: vindication from people who'd been warning about this for two years, shrugs from people who think a song is a song once it's on Spotify.

But underneath the noise was a question almost every working musician has now asked themselves in private, usually around 1 a.m.: was my catalog used to build one of these AI music training datasets, and if it was, did anyone ever ask me?

The honest answer is layered, and parts of it are genuinely unresolved. So let's go through it the way you'd want a colleague to explain it, not the way a press release would.

Was my work used to train the model — and how would I even know?

Here is the plain-language version, because that's what you came for: in most cases you cannot prove your specific song trained a specific model, the companies rarely publish their dataset contents, and current law has not settled whether they were required to ask you first. That uncertainty is not an accident. It is the load-bearing wall of the entire business.

A modern music model is trained on an enormous corpus of audio — somewhere between "a lot" and "effectively the open internet," depending on the company. That corpus gets assembled three ways, often blended:

Scraped or aggregated audio, pulled from public sources, video platforms, and bulk libraries.
Licensed catalogs, where a company pays a rights holder for training access. Several of these deals have happened quietly, and you usually find out after the fact.
User uploads and generated outputs, fed back in to refine the next version.

The problem for an artist is that once your recording is inside that statistical soup, it doesn't sit there as a file with your name on it. The model has learned patterns from it — phrasing, timbre, the way you bend a vowel — and patterns don't carry a chain of custody. That's why companies can say "the model doesn't store your song" and be technically correct while completely sidestepping the question of whether your song shaped it.

Why "they didn't ask" is the whole argument

Strip away the technical fog and the grievance is simple and old: you made something, somebody used it to build a commercial product that competes with you, and they never sought consent. The defense — that training is transformative, that the output isn't a copy — is an argument about the output. The artists are angry about the input.

Those are different conversations, and conflating them is the most common move you'll see from both sides. A company will answer an input complaint ("you took my work without asking") with an output reassurance ("our tool is built to create new music, not clone yours"). Even if the second part is true, it doesn't touch the first.

The part that turns a complaint into a pattern

The most pointed version of this critique isn't "my song got used." It's about whose songs get used, and who gets paid when the machine learns to sound like them.

The genres that anchor contemporary pop, including hip-hop, R&B, house, and most of what a sync editor reaches for when they want "energy," were largely built by Black artists, many of whom were underpaid for those recordings the first time around. If a model learns the feel of those records and a producer can then summon "a track in that pocket" with a text prompt, the value flows again toward whoever owns the model, not toward the lineage that created the sound.

That's the structural argument, and it's why this stopped being a tantrum about one catalog and became a policy question. When a single artist says "you used my 200 songs," it's a lawsuit. When you notice the same uncredited influences powering thousands of generated tracks, it's an extraction story, and we have seen that story in music before.

The company defense, weighed honestly

Generative music companies are not all saying the same thing, and it's worth being precise instead of lumping them. The common defenses, roughly in order of how much they actually address consent:

The defense	What it answers	What it dodges
"The model doesn't store or retrieve your song"	How the tech works	Whether your song trained it
"We built it to create, not to copy"	Output behavior	How the input was gathered
"Users can opt out / request removal"	Going-forward control	Everything already trained
"We have fingerprinting to catch copies" (ACR-style matching)	Blatant clones at output	Style and influence, which fingerprinting can't detect
"We've licensed catalogs from rights holders"	Some of the corpus	Whether you were in the licensed part — or the scraped part

Audio fingerprinting — the ACRCloud / Content ID family of tools — matches a recording against a reference database. It's good at catching someone uploading your actual master. It is useless at catching a model that learned your phrasing and produced something legally distinct but unmistakably yours in feel. That gap is exactly where the consent fight lives.
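To make that gap concrete, here is a toy sketch in Python. It is not how ACRCloud or Content ID work internally; real systems analyze the audio signal itself. The note lists, the `fingerprint` and `match_score` helpers, and the three-note window are all invented for illustration. The point is only the shape of the problem: matching looks for pieces of the same recording, so a re-upload scores high and a sound-alike built from different notes scores nothing.

```python
# Toy illustration only. Recordings are stood in for by short lists of
# pitch numbers (hypothetical data), not real audio.

def fingerprint(notes, n=3):
    """Return the set of all consecutive n-note windows in a sequence."""
    return {tuple(notes[i:i + n]) for i in range(len(notes) - n + 1)}

def match_score(query, reference, n=3):
    """Fraction of the query's windows that also appear in the reference."""
    q = fingerprint(query, n)
    r = fingerprint(reference, n)
    return len(q & r) / len(q) if q else 0.0

master = [60, 62, 64, 65, 67, 65, 64, 62, 60, 67]   # the reference recording
reupload = list(master)                              # someone uploads the same master
style_alike = [p + 5 for p in master]                # same melodic shape, different notes

print(match_score(reupload, master))     # 1.0
print(match_score(style_alike, master))  # 0.0
```

The sound-alike keeps the exact contour of the original, which is the part a listener would recognize, and still shares no windows with it. A matcher built this way has nothing to flag.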

The opt-out model deserves its own line. Opt-out means the default is "we took it, tell us if you want out." Opt-in means the default is "we didn't take it unless you said yes." Those are opposite moral starting points dressed in similar language, and most artists only learn which one they're living under after their work is already in the pile.
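If it helps to see the two defaults as logic, here is a minimal sketch in Python. The function name and the `"opt_out"` / `"opt_in"` labels are hypothetical, and it generously assumes a company checks your answer before training, which, as the table above notes, is not what happens to work already in the pile.

```python
from typing import Optional

def included_in_training(policy: str, artist_choice: Optional[bool]) -> bool:
    """Decide whether a track ends up in the training set.

    policy: "opt_out" or "opt_in" (hypothetical labels for the company default).
    artist_choice: True = said yes, False = said no, None = never answered.
    """
    if policy not in ("opt_out", "opt_in"):
        raise ValueError(f"unknown policy: {policy!r}")
    if artist_choice is not None:
        return artist_choice      # an explicit answer wins under either policy
    return policy == "opt_out"    # silence: only opt-out treats it as a yes

# Compare both policies for each possible artist answer.
for choice in (True, False, None):
    print(choice,
          included_in_training("opt_out", choice),
          included_in_training("opt_in", choice))
```

The two policies agree whenever the artist has actually answered. They differ only for the artist who was never asked or never heard about it, and that is most artists.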

Where the answer is honestly "it depends"

A responsible version of this story has to say what it doesn't know.

Whether scraping copyrighted recordings to train a model counts as fair use in the US, or fits a comparable copyright exception in most other jurisdictions, is not settled law as of this writing. Cases are working through the courts. Some will likely favor rights holders, some won't, and the outcome may differ for the training step versus the output step. Anyone telling you it's definitely illegal, or definitely fine, is selling certainty that doesn't exist yet.

It also depends heavily on your contracts. If you don't own your masters — and many artists don't — your label may already be free to license that catalog for training without consulting you, the same way they license for film and ads. The "they didn't ask me" anger is real, but in some cases the person who could have said no, and possibly already said yes, is your own label.

And it depends on the company. A tool trained on a fully licensed or proprietary library is a different ethical object from one trained on scraped audio, even if they produce similar-sounding loops. That distinction matters to anyone who builds responsibly, which is the position we take at City of Punk — the provenance of the training data is not a footnote, it's the product.

What you can actually check this week

You can't audit a private dataset. You can do a few concrete things:

Read your recording and publishing contracts for "new technologies," "derivative uses," and broad license-grant language. That's where training rights would hide. If you can't tell, that's a question for your lawyer, not a guess.
Search the major model companies' sites for an opt-out or artist-exclusion form. Several have them. Submitting one rarely undoes past training, but it documents that you objected, and a paper trail matters if the law catches up.
Run a few of your hooks through a generator yourself. Prompt it toward your style and listen. You're not gathering legal proof — you're gathering your own informed opinion about how exposed your sound is.
When you license your work, ask the buyer one written question: "Will this be used to train machine learning models?" Most won't have an answer ready. Asking it, in writing, starts changing the default from opt-out to opt-in one deal at a time.

That last one is small. It won't unwind anything that's already happened. But the entire consent crisis runs on the assumption that nobody will ask — so this week, in the next email you send about your own work, ask.

David Thornton

The Signal · City of Punk
