The "Open Source" Trap in AI Training Data Licensing: Why a Free Research Dataset Can Still Cost You Millions

A research dataset shows up on a university mirror. A few thousand tracks, neatly tagged by genre, key, and BPM, every file marked Creative Commons, the whole package described in the README as "open source." You download it on a Tuesday, point your training pipeline at it, and ship a model that writes ad music for paying clients. You did nothing furtive. You took something a researcher published for free.

Eighteen months later there is a complaint, and the word doing the damage is "open source."

This is the failure mode at the center of AI training data licensing right now, and it is not about whether the songs were free. It is about a quieter document — the license on the dataset itself, the curation, the metadata — that often says, in plain language, non-commercial research only. That clause can bind your commercial product even when every underlying track was given away.

One disclosure before the analysis: City of Punk makes an AI music tool, which makes us a competitor to the companies on the wrong end of these disputes. We license our training data, and we run the exact exposure analysis below on ourselves before we run it on anyone else. Read this as a producer who has signed the contracts, not as a brand taking a victory lap.

What most people do

Most teams collapse three distinct ideas into one feeling of "this is fine."

The first conflation is "publicly downloadable" with "cleared for any use." If the file transfers without a paywall, the assumption goes, the rights must follow. They do not. Access and permission are separate questions, and a free download answers only the first one.

The second is "open source" with "public domain." Open source is a distribution model — here is the thing, take it, study it. Public domain is a rights status — nobody owns this anymore. A dataset can be the first without being remotely close to the second. Open-source software ships under licenses (GPL, MIT, Apache) that impose obligations; nobody reads "open source" on a code repo and assumes zero conditions. Datasets work the same way, and the conditions are frequently stricter than anything in software.

The third, and the one that actually triggers the lawsuits, is assuming the underlying tracks' license controls the whole package. If a dataset is built from Creative Commons recordings, surely the dataset inherits that freedom. But a curated corpus has two license layers stacked on top of each other, and most teams only ever read the bottom one.

Here is the pattern, abstracted from the disputes now working through US and European courts: a music platform releases a research dataset. The audio is CC-licensed. The dataset's own documentation — the part covering the selection of tracks, the tagging, the arrangement, the metadata — carries a separate term restricting use to non-commercial research. An AI company trains a commercial model on it and describes the source as "open source." The platform's position is that the company breached the dataset license. The company's reported position is that no contractual relationship existed at all. That gap — between "you agreed to our terms by using our data" and "we never agreed to anything" — is the whole fight.

What most people do is never notice the gap exists, because they read the audio license and stopped.

What the evidence suggests

The evidence — meaning the way these complaints are actually pleaded, and the precedent they lean on — suggests the dangerous layer is the curation, not the catalog.

Think of the two layers cleanly:

Layer one: the underlying works. Individual recordings, each with its own license. Creative Commons, public domain, whatever the rightsholder chose. If these are genuinely CC-licensed for the use you're making, this layer may be clean.
Layer two: the compilation. The selection of which tracks to include, the coordination of how they're organized, the arrangement and the metadata. This is a separate copyrightable thing, and it can carry a separate, more restrictive license.

US copyright law recognizes layer two through compilation copyright, and the governing precedent is Feist Publications, Inc. v. Rural Telephone Service Co. (1991). Feist held that a phone book's raw facts weren't protectable, but that the selection, coordination, and arrangement of facts could be, provided there was a minimal degree of creativity in how they were chosen and ordered. The alphabetical white pages in Feist itself failed that test, so the bar is low but real. A curated music dataset is a near-textbook compilation: someone decided which tracks belonged, how to label them, how to structure the whole. That editorial work can be protected even when each individual file is free.

So the live question is not "were the songs free." It is whether a dataset's curators can release the compilation for one purpose — research — and exclude another — commercial product training — and have that exclusion stick against someone who downloaded the package and trained on it anyway.

There are two theories running in parallel, and they matter because they carry different remedies.

The contract theory: by accessing and using the dataset, you accepted its license, and commercial training breached the non-commercial term. This is breach, not infringement, and it turns on whether a binding agreement formed, which is exactly the point a "we had no contractual relationship" defense attacks.

The copyright theory: the compilation is protected, the license never granted commercial rights, so commercial training is infringement of the compilation copyright regardless of contract formation. This is where statutory damages and a willfulness multiplier enter. The numbers vary by jurisdiction and by how many works or registrations are counted, and I won't invent a figure — but willful infringement exposure is categorically worse than an innocent-mistake posture, and "the README said open source" is not, on its own, a reliable shield.

None of this is settled. Courts have not uniformly ruled on whether non-commercial research terms on a freely distributed dataset can fence out commercial AI training, and the cases moving through US and Belgian courts may land differently. Treat the doctrine as contested. Treat your exposure as real anyway, because the cost of being wrong arrives whether or not the law was clear when you trained.

What I actually do

I read the dataset card before I read anything else. Not the audio licenses — the document describing the dataset as a whole. That file is a contract until proven otherwise, and I treat the word "research" in it as a hard boundary, not flavor text.
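A rough first pass at that read can be automated, though only to decide where to start reading. The sketch below is a hypothetical Python helper, not a legal test: the pattern list and the name flag_restrictions are assumptions made for illustration, and an empty result does not mean a dataset is cleared.

```python
import re

# Hypothetical phrase list, assumed for illustration only.
# It is not a legal standard and will miss restrictions worded differently.
RESTRICTION_PATTERNS = {
    "non_commercial": re.compile(r"\bnon[-\s]?commercial\b", re.IGNORECASE),
    "research": re.compile(r"\bresearch\b", re.IGNORECASE),
}


def flag_restrictions(card_text: str) -> list[str]:
    """Return the names of restriction patterns that appear in a dataset card."""
    return sorted(
        name
        for name, pattern in RESTRICTION_PATTERNS.items()
        if pattern.search(card_text)
    )


if __name__ == "__main__":
    card = "Released as open source for non-commercial research only."
    print(flag_restrictions(card))
```

Any flag is a prompt to go read the full dataset license with counsel. No flag is a prompt to do the same thing, just with less urgency.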

Concretely, the workflow:

Separate the layers in writing. For every corpus we evaluate, the provenance ledger has two columns: the license on the underlying works, and the license on the compilation. A clean bottom layer with a "non-commercial research only" top layer is a stop, not a maybe. Commercial product training is commercial use; no amount of CC audio underneath changes that.
Flag "open source" as a description, not a grant. When source documentation says open source, I go find the actual license file and read what it permits. If it permits research and is silent on or excludes commercial use, I assume commercial use is not granted.
Get permission or buy the equivalent. If a dataset is genuinely useful and genuinely off-limits, the move is to license it directly, license a comparable cleared corpus, or build one from works where we hold the rights. Slower. Defensible.
Document good faith. Because willfulness is the multiplier that turns a manageable problem into a company-ending one, I keep records of the license review, the legal sign-off, and why we believed each use was permitted. A documented good-faith review is evidence against willfulness; "nobody checked" is the opposite.
Apply the same lens everywhere. This is not a music-only problem. LAION, Common Crawl, in-house research corpora — every one has a distribution story and a permission story, and they are not the same story. The two-layer reading travels.
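The two-column ledger and its stop rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production tooling: LedgerEntry, Decision, decide, and every field name are hypothetical, and the commercial-use flags stand in for a conclusion that a person reached by reading the actual license text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Decision(Enum):
    PROCEED = "proceed"
    STOP = "stop"


@dataclass(frozen=True)
class LedgerEntry:
    """One row of a hypothetical provenance ledger, with both license layers."""
    corpus: str
    works_license: str          # layer one: license on the underlying works
    compilation_license: str    # layer two: license on the dataset as a whole
    # True = commercial use granted, False = excluded, None = license is silent.
    works_allow_commercial: Optional[bool]
    compilation_allows_commercial: Optional[bool]
    review_notes: str = ""      # who reviewed, when, and why (good-faith record)


def decide(entry: LedgerEntry) -> Decision:
    """Proceed only if BOTH layers affirmatively grant commercial use.

    Silence (None) is treated the same as an exclusion.
    """
    if entry.works_allow_commercial is True and entry.compilation_allows_commercial is True:
        return Decision.PROCEED
    return Decision.STOP


if __name__ == "__main__":
    entry = LedgerEntry(
        corpus="example-research-corpus",  # made-up name
        works_license="CC BY 4.0",
        compilation_license="non-commercial research only",
        works_allow_commercial=True,
        compilation_allows_commercial=False,
        review_notes="Dataset card reviewed; compilation term excludes commercial use.",
    )
    print(decide(entry).value)
```

The design choice that matters is that silence maps to STOP. A clean bottom layer never overrides a restricted or silent top layer, and the review_notes field is where the good-faith record lives.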

The uncomfortable part for AI builders is that the cheapest, most convenient data is frequently the data carrying the heaviest restriction, because researchers release it under non-commercial terms precisely so it stays research. Convenience and clearance pull in opposite directions, and the pull does not show up until the complaint does.

So, the myth and the correction, side by side:

The myth is that a freely downloadable, "open source" research dataset built from Creative Commons tracks is cleared for commercial model training. The more accurate version is that "open source" describes how the dataset is distributed, while a separate compilation license governs whether you may train a commercial product on it. That license, not the free download, is what a court will read back to you.

Grace Ellerton

The Signal · City of Punk