From sheet music to source code
How copyright battles from a century ago shape today's fight over just compensation in the world of generative AI
Imagine a parlor in Manhattan, circa 1906. A soft bustle of conversation, the faint scent of cigar smoke, and in the corner, a polished mahogany upright grand piano. But this is not just any piano. At the flick of a switch, a mechanism clicks to life, the paper roll begins to turn, and out come the sounds of Victor Herbert’s hit tune Kiss Me Again, performed with uncanny precision by an empty bench.
The guests are charmed. Herbert, had he been there, would not have been pleased at all.
At the time, Herbert was one of the most prolific and popular composers in America. His music was everywhere: on stages, in homes, at parties, and, increasingly, in machines like player pianos. Piano rolls had become the "next big thing". Manufacturers churned out endless reproductions of popular tunes and sold them widely. The only catch: they weren’t paying the composers their fair share of the profits.
The manufacturers' defense was, in its way, ingenious. Piano rolls weren’t "copies" of the music, they argued. They weren’t readable by the human eye. There were no musical notes, no sheet music at all. Just a roll of perforated paper, nothing more than a set of machine instructions. This was mere automation, not a performance at all. Piano rolls were "dead art", they said. And dead art, by their reasoning, required no royalties.
Prompt windows
Jump to the present, and the perforated paper has been replaced by token sequences and transformer weights. Today’s machines don’t emit ragtime from piano rolls; they engage the user with fluent paragraphs, synthesize code, generate story lines, and replicate stylistic voice. As in the 1900s, what today's machines generate is often astonishing.
Where does this ability to entertain come from? It comes from something more familiar than the AI industry wants to admit. Entire datasets of books, blogs, software repos, lyrics, and journalism have been scraped and ingested by models like GPT-4 and LLaMA. And once again, the creators are told they don’t deserve any compensation.
The argument? Nearly identical to the one Herbert heard a century ago: this is not a copy. It's just math. The original work is diffuse, unrecognizable. Therefore, no payment is necessary.
The courts
Incensed by repeated theft of his works, Victor Herbert took his grievance to court. He was not alone. Composers had seen their work reproduced on piano rolls and sold in droves, without a penny of payment. The legal question seemed clear: if you sell music encoded on paper, shouldn’t the person who wrote it receive something?
The courts disagreed. In White-Smith Music Publishing v. Apollo Co. (1908), the U.S. Supreme Court ruled that piano rolls were not “copies” of a musical work because they were not readable by humans. The distinction was absurd in practice but plausible in law. The music industry had embraced mechanical reproduction, and composers were left behind.
Today’s creators find themselves in a familiar bind.
Authors like Sarah Silverman and Michael Chabon are plaintiffs in lawsuits against Meta Platforms, alleging that their books were used, without permission, to train Meta’s large language model, LLaMA. The authors maintain that this use constitutes copyright infringement. Meta, like Apollo Co. in 1908, insists otherwise.
According to Meta’s legal briefs, training a model on copyrighted books is a transformative act. The LLM doesn’t reproduce stories; it extracts patterns. Each individual book, they argue, contributes "infinitesimal value" to the final model. Licensing them all, Meta claims, would be cost-prohibitive and legally unnecessary.
The plaintiffs counter that this is sleight of hand. Copying books wholesale to build a model is not the same as simply quoting a passage in an essay. The act of ingestion is itself a use, they argue, and one that deserves licensing, attribution, and compensation. Whether the output is transformative or not doesn’t excuse the unlicensed input. And, importantly, they insist that the existence of a market for licensing books to train models is a question of fact, something a jury, not a judge, should decide.
So far, the courts have largely leaned toward caution: delaying, deferring, sidestepping. In a case involving Anthropic, a federal judge even hinted that training models on books might fall under fair use.
But one case has bucked the trend.
In Thomson Reuters v. Ross Intelligence, a U.S. federal court ruled in favor of the copyright holder. Ross had used Westlaw’s editorial headnotes to train an AI research tool. The court found that this was not protected by fair use. The system wasn’t generative in the ChatGPT sense, but the ruling matters: it affirms that using copyrighted material to train an AI can cross the legal line.
It is the only final U.S. court decision to do so. One decision among many. But a crack, nonetheless.
Congress
Back in 1908, after the Supreme Court refused to recognize piano rolls as copies, Victor Herbert refused to fade into the footnotes. Instead, he changed his tactics. If the courts couldn’t see the value of creative labor in the age of machines, maybe Congress could.
This was no small pivot. Courts interpret law; Congress makes it. And Herbert understood that the real battle wasn’t about how the law currently read; it was about how the law needed to adapt. The world had changed. Creativity was no longer limited to live performance or ink on paper. Now that music reproduction had become mechanical, scalable, and profitable, it was only right that the creators of that music be fairly compensated.
Herbert and his allies spent the next year lobbying lawmakers, publishing in the press, and organizing support. The result was the Copyright Act of 1909, a landmark revision that introduced a new legal mechanism: the compulsory mechanical license. For the first time, manufacturers who used a musical composition in a mechanical reproduction—player pianos, phonographs, and the like—were legally required to pay a fixed royalty to the composer.
As then-president Theodore Roosevelt stated:
"Our copyright laws urgently need revision; they omit provision for many articles which, under modern reproductive processes, are entitled to protection; they impose hardships upon the copyright proprietor which are not essential to the fair protection of the public."
This was a turning point. Notably, the 1909 law didn’t stop the progress of technology. It didn’t restrict invention, either. What it did was codify a financial foundation for the value of creative work. It set a rule:
If you use it, you pay for it.
Still, law alone wasn’t enough. Someone had to track the uses, collect the fees, and, importantly, enforce the obligation.
That’s where ASCAP came in.
Rights and royalties
Formed in 1914, the American Society of Composers, Authors, and Publishers wasn’t just a union or a guild. It was infrastructure. ASCAP created a centralized registry of works, issued blanket licenses to venues and broadcasters, monitored performances, and pursued enforcement. Restaurants, clubs, radio stations: anyone who played music in public was expected to pay for the privilege. And if they didn’t, ASCAP sent letters. Then it filed lawsuits.
This wasn’t theoretical. It was practical. And it worked.
In 1939, after a fee dispute, radio broadcasters formed a rival organization: BMI, or Broadcast Music, Inc. BMI expanded the scope of licensing to include a wider range of genres and creators, bringing popular music into the fold and making the system more representative of the broader creative landscape.
Together, ASCAP and BMI didn’t just protect composers; they created a workable relationship between art and industry. They made it easy for users to comply. One license covered thousands of works. The money flowed in. The royalties flowed back.
They turned a vague moral argument, "you should pay the artist", into a functioning economic model.
And they did it by building systems with teeth.
But ASCAP and BMI weren’t the only beneficiaries. Over the next century, new jobs, companies, and entire subsectors of the economy were created around the need to implement, monitor, and enforce copyright compliance. Software providers built royalty-tracking platforms. Lawyers carved out niches in licensing and rights management. Accountants specialized in royalty calculations. Agencies and standards bodies emerged to ensure transparency and fairness. And technologists built ever-more sophisticated systems for data provenance, usage monitoring, and reporting.
Implementing regulation didn’t just benefit artists; it fueled second-order economic growth.
As noted in a U.S. Senate-commissioned study, the compulsory license model “encouraged the formation of new record companies... increased composer, artist, and publisher royalties... and helped democratize access to recording.” It didn’t stifle the industry; it helped it scale. And there is no reason to think this pattern would not apply to today's AI licensing as well.
A modern copyright regime for AI won’t just ensure creators are paid. It will also catalyze entirely new categories of economic activity. From AI licensing platforms to data attribution engines, from model auditing software to opt-in registries and real-time usage monitors, we’ll see growth not only in content, but in compliance.
Because, as we’ve seen time and time again, regulation, when done properly, does not stifle innovation. It encourages it, setting the boundaries within which healthy markets can thrive and securing justice and opportunity for creators.
The uphill work ahead
So far the courts, for all their deliberation, seem just as short-sighted today as they were in 1908. Yes, we all agree that generative AI and its related technologies are impressive. The rights of the people whose work fuels them? Apparently, still up for debate. But before any licensing regime can work, we also need to answer a simple question: how would creators even know if their work is being used to train a model?
Victor Herbert could see his compositions printed on piano rolls. The infringement was physically tangible. Today’s creators face a more opaque challenge. Their works are scraped, copied, chunked, and transformed into training datasets consisting of billions or trillions of tokens. Once ingested, the connection between input and output becomes nearly invisible.
If there is to be any licensing regime with enforceable teeth, it requires some level of transparency and proof of use. Fortunately, this is not entirely theoretical. Several large training datasets already provide partial visibility into provenance. For example, collections like Common Crawl, LibGen, or Books3 contain full URLs, file hashes, or bibliographic data that point back to identifiable original works. And platforms such as the Hugging Face model registry, which indexes hundreds of thousands of AI models, show that tracking which model was trained on what data is increasingly feasible.
It's not too hard to see how these datasets could form the technical backbone of a system that calculates proportional contributions. For example, how much of a corpus came from a particular author, publisher, or creator? Once quantified, these contributions could drive royalty allocations much like mechanical royalties were assigned per song in the era of piano rolls.
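To make that concrete, here is a minimal sketch of how a pro-rata allocation might be computed, assuming a hypothetical dataset manifest that records, for each ingested work, an identifier, a rights holder, and a token count. The field names, works, and figures below are invented for illustration; they are not drawn from any real registry or dataset.

```python
from collections import defaultdict

def allocate_royalties(manifest, royalty_pool):
    """Split a fixed royalty pool across rights holders in proportion
    to how many tokens each contributed to the training corpus."""
    tokens_by_holder = defaultdict(int)
    for work in manifest:
        tokens_by_holder[work["rights_holder"]] += work["token_count"]

    total_tokens = sum(tokens_by_holder.values())
    return {
        holder: royalty_pool * tokens / total_tokens
        for holder, tokens in tokens_by_holder.items()
    }

# Hypothetical manifest entries; identifiers and counts are invented.
manifest = [
    {"work_id": "urn:isbn:0000000000", "rights_holder": "Author A", "token_count": 120_000},
    {"work_id": "urn:isbn:1111111111", "rights_holder": "Author B", "token_count": 80_000},
    {"work_id": "https://example.com/essay", "rights_holder": "Author A", "token_count": 50_000},
]

print(allocate_royalties(manifest, royalty_pool=10_000.0))
# Author A contributed 170,000 of 250,000 tokens -> $6,800 of the pool.
```

A real scheme would have to handle deduplication, disputed attribution, and weighting by something subtler than raw token counts, but the underlying arithmetic is no more exotic than the per-song mechanical royalty Congress set in 1909.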
When regulators, platforms, and creators can agree on standards for data attribution and disclosure, then automated licensing infrastructure becomes possible. Transparency creates enforceability. And enforceability creates markets. In the absence of coordinated effort, all this falls to individual authors filing lawsuits, writing op-eds, giving testimony. Lone creators are left carrying the banner for just compensation and basic protection.
The good news
But even if the courts eventually recognize that ingesting thousands of copyrighted books to train a generative model isn’t magically exempt from copyright law, it won’t be enough. And even if today’s creators, like Herbert, lose in court but prevail in Congress with a new legal framework, that still won’t be enough. We’ll still need the thing that made ASCAP work: licensing and enforcement infrastructure. The good news is that some groundwork is being laid, both in the U.S. and abroad.
In the European Union, recent regulation has begun carving out enforceable limits. Under the 2019 Copyright in the Digital Single Market (CDSM) directive, authors can opt out of text-and-data mining—limiting how their work can be used in AI training. And with the 2024 AI Act, general-purpose AI developers will soon be required to disclose detailed summaries of their training data and respect those opt-outs. It's not royalties, but it’s a step toward transparency and agency. It’s the first real signal that governments might require platforms to take responsibility for what they learn from.
In the U.S., a handful of initiatives are pushing in the same direction.
The Dataset Providers Alliance (DPA) is a coalition of licensing companies—including Rightsify, Pixta, and Vectara—working to promote ethical, opt-in data sourcing for AI training. Their goal is to create standards around consent and compensation, making it easier for developers to license content responsibly.
Calliope Networks has launched a “License to Scrape” initiative focused on user-generated content, particularly YouTube videos. Their program lets creators opt in and get paid when their content is used to train machine learning models. It’s a step toward compensating the millions of creators whose material is now quietly part of the training corpus.
The Authors Guild, the largest professional organization for writers in the U.S., is advocating for a blanket licensing scheme specifically for written works used in AI training. Drawing directly from the ASCAP model, their proposal seeks to give authors a collective voice and a path to compensation without requiring each individual to negotiate with billion-dollar tech firms.
And the U.S. Copyright Office, while not a licensing entity, has begun issuing formal guidance on generative AI, calling for transparency in training data and urging lawmakers to consider new rules that reflect the realities of modern content use.
All of this is encouraging progress. But it does not yet add up to a unified solution. The energy is real, but the moment has not yet coalesced.
Are we close to our Victor Herbert moment? Possibly. But it won’t arrive on its own. Herbert didn’t just grumble. He and his colleagues built the system that ensured creators were properly compensated.
If we want a future where authors, developers, educators, and artists retain a stake in the systems that learn from them, we’ll need more than courtroom wins or symbolic legislation. We’ll need coordination. We’ll need clarity. And yes, we’ll need infrastructure with teeth.
The framework is beginning to take shape. The question is whether we’re ready to finish building it.
Past is prologue
Back in the early 1900s, Herbert didn’t win in court. He won in Congress. And even then, he still needed to build ASCAP, a system of cooperation and enforcement that turned the new law into actual royalties for creators. That last part is what mattered most.
Today, the theft machines are faster, more subtle, and ever more impressive. Instead of playing catchy tunes, they now generate essays, images, code, and conversations.
But the principle behind all this commerce in creative works remains unchanged:
Just because your work is fed into a machine doesn’t mean you stop owning it.
We know what happens when we leave that principle unprotected. It erodes: quietly at first, then systemically. If we don’t actively and consciously build a framework that respects and rewards creative labor, we’ll default into one that exploits creators and cheapens their work.
A licensing framework for authors, artists, developers, and educators in the age of AI isn’t some romantic ideal or utopian fantasy. It’s overdue infrastructure. And like any good infrastructure, it will need policy, technology, and collective commitment to take shape. It will need open standards, opt-in registries, enforceable agreements, and institutions that serve creators, not just platforms.
The question isn’t whether we can build it. The question is whether we have the will to do it. We’ve been here before.
American computer scientist and technology pioneer Alan Kay has been quoted as saying:
"The best way to predict the future is to invent it."
That means the shape of the future is in our hands. What future do you predict?