The Quiet Trap in AI Training Data Licensing: What a Dataset's Curation Owns That the Songs Don't

You can clear every single track in a dataset — every Creative Commons attribution, every share-alike clause, every artist credit logged and honored — and still lose a copyright case over that dataset. Not because of the music. Because of the order the music was put in.

That sentence sounds like a riddle. It is not. It is the most important thing a music AI founder, a rights holder, or the counsel advising either one can understand about AI training data licensing right now, and the rest of this piece exists to earn the right to have said it.

First, the disclosure, because you should have it before you trust a word of the analysis. City of Punk makes a generative music product. We train models. We have a direct commercial interest in how these questions resolve, and a competitor reading this could reasonably ask whether I have an axe to grind. So here is the deal: everything below cuts toward more caution for companies like ours, not less. If this analysis is wrong, it is wrong in the direction of making my own job harder. Read it that way and check my reasoning.

The trap has a friendly name

The trap wears the word "open." Someone — a university lab, a music platform's research arm, a nonprofit — assembles a corpus. Tens of thousands of tracks, all of them already under permissive licenses. They wrap it in documentation, post it to a research portal, and label it something like "released for non-commercial research use." The tracks underneath are genuinely free to use. The download button is right there. No paywall, no login, no signed agreement.

So a team building a commercial model grabs it. The logic is clean and, on its face, reasonable: the underlying songs are Creative Commons, we are honoring those licenses, and the dataset itself was published openly. Where is the harm?

The harm, if there is one, lives in a layer most engineers never look at. The dispute that has started surfacing in courts — including a suit against a major chip and AI company over a music corpus assembled by a platform's subsidiary — does not primarily allege that the songs were stolen. It alleges that the collection was used outside the terms it was offered under. Those are not the same claim. They do not even live in the same body of law.

Pull those two layers apart and the whole problem becomes legible.

Two licenses, stacked, pointing different directions

When you download a research music dataset, you are almost always receiving two distinct grants of permission, even if nobody draws them as separate documents.

Layer one — the underlying works. Each track carries its own license. Creative Commons Attribution, CC BY-SA, sometimes CC0. These were granted by the artists or their assignees, and they typically do permit commercial use, as long as you meet the attribution or share-alike conditions. This is the layer most people check. If you stop here, everything looks clear.

Layer two — the compilation. Somebody selected which of the millions of available tracks to include. Somebody decided how to tag them — genre, mood, tempo, key, instrumentation. Somebody coordinated the metadata schema, normalized the audio, structured the directory, wrote the documentation. That work of assembly and curation is offered under its own terms, and those terms are frequently narrower than the songs inside. "Non-commercial research only" is the phrase that does the damage. The free music is real. The free dataset may not be.

You can comply perfectly with layer one and breach layer two. That is the riddle, unriddled. The CC license on a track does not grant you anything about the curation that surrounds it, any more than buying a paperback of a public-domain novel lets you photocopy the annotated scholarly edition.
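One way to make the two-layer point concrete is to record both grants side by side and compute what survives. The sketch below is an illustration in Python, not a legal test: the `Grant` type, its fields, and the example values are all hypothetical, and real license terms carry conditions (attribution, share-alike) that a single boolean cannot capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One layer of permission. All fields are illustrative assumptions."""
    name: str
    commercial_use: bool


def effective_commercial_use(works: Grant, compilation: Grant) -> bool:
    # Using the dataset as a dataset needs permission from both layers,
    # so the narrower grant controls the outcome.
    return works.commercial_use and compilation.commercial_use


# Hypothetical example: permissive tracks inside a research-only corpus.
tracks = Grant(name="CC BY tracks", commercial_use=True)
corpus = Grant(name="non-commercial research corpus", commercial_use=False)

print(effective_commercial_use(tracks, corpus))  # False
```

The design choice is the `and`: a permissive answer at the track layer cannot override a restrictive answer at the collection layer.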

Which raises the question every skeptical reader should ask next: can curation actually be owned, when none of its contents are?

Feist, and the part everyone misremembers

In the United States the governing case is Feist Publications v. Rural Telephone Service, decided by the Supreme Court in 1991. It is taught badly. Most people remember it as the case that said you cannot copyright a phone book, and they walk away thinking compilations are unprotectable. That is the wrong lesson.

What Feist actually held is narrower and more useful. A compilation of uncopyrightable facts — names, addresses, phone numbers — can still receive copyright protection, but only in its selection, coordination, and arrangement, and only where those choices reflect some original judgment. Rural's white pages lost because listing every subscriber alphabetically involved no original choice at all. There was nothing to select (they took everyone) and nothing to arrange (alphabetical is mechanical). The effort of compiling — and the Court was explicit that effort alone earns no copyright — does not substitute for that spark of original choice.

Now hold a curated music research corpus next to a phone book. The phone book includes everyone in alphabetical order. The music corpus does something else entirely. It chose forty thousand tracks out of millions. It chose which genres to weight, which to exclude, which tagging taxonomy to impose, how to bucket tempo and mood, how to structure the relationships between audio and metadata. Those are exactly the editorial decisions Feist identified as protectable. The contents being free is beside the point. The arrangement of free contents can carry its own thin copyright, and "thin" is not "none."

This is why the strongest version of the claim against an AI company in these disputes is not "you used our songs." The songs may have been free. It is "you ingested the protected structure we built, and you did it under terms that forbade exactly the use you put it to." A model that trained on the corpus arguably consumed the selection and coordination — the part with the spark — not only the facts underneath.

I want to be careful here, because this is contested and a court has not finally resolved it. Reasonable lawyers disagree about whether training a model "copies" a compilation's protected expression in any legally cognizable way, or whether the model only learns from the unprotectable underlying tracks and discards the arrangement. That is a genuinely open doctrinal question, and anyone who tells you it is settled is selling something. But you do not get to wave it away because the music was Creative Commons. The Creative Commons grant never reached the curation in the first place.

Two theories, two very different bills

There is a second fork that matters for anyone pricing their exposure, and it is the difference between breach of contract and copyright infringement. These cases often plead both, because they are not the same animal.

If the dataset came with terms — even click-through or posted terms — saying "non-commercial research only," and a company used it commercially, that is a contract claim. The remedy is generally tied to the harm from the breach: what the license would have cost, what was lost. It is bounded, and it lives or dies on whether an enforceable agreement actually formed. A company that downloaded a file with no login, no signature, no clickwrap can credibly argue no contractual relationship existed — that it never agreed to anything. That defense has real teeth on the contract count.

But the copyright count does not need a contract. Copyright infringement is a violation of the rights holder's exclusive rights regardless of whether the two parties ever shook hands. And in the United States, where the compilation is registered, the plaintiff can elect statutory damages, a per-work range set by statute rather than tied to proven harm, and, if the infringement is found willful, the court may award up to a substantially higher per-work ceiling. Willfulness is where the documentation trail becomes radioactive. If discovery turns up an internal message acknowledging the "non-commercial" label and a decision to proceed anyway, that higher ceiling stops being theoretical.

So the same set of facts splits into two claims with two ceilings. The contract claim is capped and contestable. The copyright claim, if the compilation copyright theory holds, is potentially much larger and survives even a clean win on the contract. A defendant can plausibly beat the contract count and still be exposed on copyright, which is the opposite of how most engineers intuit the risk.

How I'd actually evaluate exposure

If you are deciding tonight whether a dataset in your training pipeline is safe to ship on commercially, here is the order I would run the checks. This is not legal advice — get counsel — but it is the triage I would do before counsel, so the conversation is short.

Provenance. Who assembled this, and what is their business? A corpus released by a commercial platform's research arm is a different risk profile than one from an academic group with no monetization motive. The party with a revenue model is the party with both the incentive and the standing to sue.

License scope — read the dataset terms, not the track terms. This is the whole ballgame. Find the document that governs the collection. If it says "research," "non-commercial," "academic," or "evaluation only," your clean track licenses do not save you. Note whether there is any explicit commercial carve-out, and whether it requires a separate paid license.

The gap between the two layers. Write down, in one line, what the underlying works permit and what the compilation permits. If those two lines disagree, the narrower one governs your use of the dataset as a dataset. The gap is your exposure.

The willfulness trail. Has anyone on the team, in writing, noted the non-commercial restriction? If yes, you no longer have a "we didn't know" posture, and you should assume that message will be read aloud in a deposition. Decide accordingly, and do not paper over it after the fact — that is worse.

Jurisdiction. US compilation copyright runs through Feist and statutory damages. The EU layers a sui generis database right on top of copyright, protecting substantial investment in a database even where the selection shows no originality, so a parallel action in, say, a Belgian or French court can turn on a different doctrine entirely. If you ship globally, you are exposed under more than one regime, and a settlement in one does not bind the others.

Registration. In the US, statutory damages and attorneys' fees generally require timely registration of the work. A registered compilation is a sharper weapon than an unregistered one. If you are the rights holder reading this, that cuts the other way: register the curation, not only the tracks, if you want the bigger remedy available.

Notice what is not on this list: "the music was free." It does not appear because it does not answer the question. It answers a different, easier question that nobody is actually suing over.
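The checklist above can be written down as a small record per dataset, so the answers travel with the data instead of living in someone's memory. This is a sketch under stated assumptions: `DatasetReview`, its field names, and the flag wording are hypothetical, and the output is a list of points to raise with counsel, not a risk score or a legal conclusion.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetReview:
    """Answers to the triage questions. All field names are hypothetical."""
    curator_has_revenue_model: bool     # provenance
    collection_allows_commercial: bool  # license scope of the collection
    works_allow_commercial: bool        # what the track licenses permit
    restriction_noted_in_writing: bool  # willfulness trail
    ships_outside_us: bool              # jurisdiction
    compilation_registered: bool        # registration


def triage_flags(review: DatasetReview) -> list[str]:
    """Collect the points to raise with counsel, in checklist order."""
    flags: list[str] = []
    if review.curator_has_revenue_model:
        flags.append("provenance: curator has a revenue model")
    if not review.collection_allows_commercial:
        flags.append("scope: collection terms do not cover commercial use")
    if review.works_allow_commercial != review.collection_allows_commercial:
        flags.append("gap: the two layers disagree; the narrower one governs")
    if review.restriction_noted_in_writing:
        flags.append("willfulness: restriction acknowledged in writing")
    if review.ships_outside_us:
        flags.append("jurisdiction: more than one legal regime applies")
    if review.compilation_registered:
        flags.append("registration: statutory damages may be available")
    return flags


# Hypothetical review of a research corpus pulled into a commercial pipeline.
review = DatasetReview(
    curator_has_revenue_model=True,
    collection_allows_commercial=False,
    works_allow_commercial=True,
    restriction_noted_in_writing=False,
    ships_outside_us=True,
    compilation_registered=False,
)

for flag in triage_flags(review):
    print(flag)
```

There is deliberately no field for "the music was free" on its own: the track-layer answer only matters in comparison with the collection-layer answer.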

Why this generalizes past one music corpus

It would be convenient to file this under "music industry problem" and move on. It does not stay filed.

The same two-layer structure sits underneath nearly every large training set that gets described as "open." LAION's image-text pairs are a curated index pointing at images, wrapped in their own terms and their own non-commercial research framing in places. Web-scale text corpora carry usage documentation that says one thing while the pages they point at say another. In-house research datasets that get quietly promoted into production pipelines carry whatever restriction the team agreed to when they first pulled the data — a restriction that the production use may blow past. In every one of these, the contents and the compilation can carry different permissions, and the commercial-versus-research line is exactly where the exposure concentrates.

The forward-looking part is the litigation pattern, and it is worth naming without conflating distinct cases. We have seen, in parallel music-AI suits, a rhythm: an aggressive filing, a vigorous public defense, and then — in several matters — a settlement before the hardest doctrinal questions get a final ruling. That pattern is informative even when the cases differ on their facts. It tells you that defendants with real resources are pricing the risk of an adverse compilation-copyright or training-data ruling highly enough to pay rather than test it to judgment. When the parties best positioned to litigate the open question keep declining to, that is data about the question.

It also means the law here may stay unsettled for a while precisely because the strongest test cases keep getting bought out of the docket before a court rules. So you cannot wait for a clean precedent to tell you what is safe. You have to price the ambiguity yourself, today, in your own pipeline.

Who should worry, and who can exhale

Worry if: you are training a commercial model and any input came from a corpus you did not assemble yourself and did not separately license for commercial use. Worry harder if that corpus came from an entity with a revenue model and a legal department, if your team's written record shows awareness of a non-commercial term, or if you ship into both the US and the EU. The "but the songs were Creative Commons" defense will not cover the compilation claim, and that is the claim with the bigger number attached.

You can breathe if: you are doing genuine non-commercial research and staying inside the dataset's stated scope. The non-commercial license exists to permit exactly what you are doing. The trap only springs when the use crosses into commercial territory the curation license never granted.

If you are the rights holder or the curator: your leverage is in the layer people overlook. Document your selection and coordination choices. Register the compilation, not only the tracks. State the commercial restriction in unambiguous language at the point of download. The clearer your terms, the more an opponent's later commercial use starts to look willful rather than innocent — and willful is where the remedy gets serious.

For founders specifically, the cleanest path is also the least glamorous: assemble or directly license your own training corpus, so that both layers — works and curation — are yours or licensed to you for the use you actually intend. It is slower and more expensive than grabbing a research set off a portal. It is also the only version where "the music was free" and "the dataset was free" finally mean the same thing.

The doctrine is unresolved, the cases keep settling before they teach us anything final, and the temptation to read an open download as an open license is strong precisely because it is so easy to act on. So here is the rule of thumb you can use tonight, before counsel, before the build ships: read the license on the collection, not the songs, and assume the narrower of the two governs everything you do.

Sophie Langford

The Signal · City of Punk