Copyright Archives • Mass Law Blog

Reddit v. Anthropic: Contract Law as the New Frontier in AI Data Governance

by Lee Gesmer | Jun 23, 2025 | Contracts, Copyright, DMCA/CDA

At the heart of large language model (LLM) technology lies a deceptively simple triad: compute, algorithms, and data.

Compute powers training – vast arrays of graphics processing units crunching numbers at a scale measured in billions of parameters and trillions of tokens. Algorithms shape the intelligence – breakthroughs like the Transformer architecture enable models to understand, predict, and generate human-like text. Data is the raw material, the fuel that teaches the models everything they know about the world.

While compute and algorithms continue to advance at breakneck speed, data has emerged as a critical bottleneck. Training a powerful LLM requires unimaginably large volumes of text, and the web’s low-hanging fruit – books, Wikipedia, forums, blogs – has largely been harvested.

The large AI companies have gone to extraordinary lengths to collect new data. For example, Meta allegedly used bittorrent to download massive amounts of copyrighted content. OpenAI reportedly transcribed over a million hours of Youtube videos to train its models. Despite this, the industry may be approaching a data ceiling.

In this context, Reddit represents an unharvested goldmine – a sprawling archive containing terabytes of linguistically rich user-generated content.

And, according to Reddit, Anthropic, desperate for training data, couldn’t resist the temptation: it illegally scraped Reddit without a license and trained Claude on Reddit data.

The Strategic Calculus: Why Contract, Not Copyright

Reddit filed suit against Anthropic in San Francisco Superior Court on June 4, 2025. However, it chose not to follow the well-trodden path of federal copyright litigation in favor of a novel contract-centric strategy. This tactical pivot from The New York Times Co. v. OpenAI and similar copyright-based challenges signals a fundamental shift in how platforms may assert control over their data assets.

Reddit’s decision to anchor its complaint in state contract law reflects the limitations of copyright doctrine and the structural realities of user-generated content platforms. Unlike traditional media companies that own their content outright, Reddit operates under a licensing model where individual users retain copyright ownership while granting the platform non-exclusive rights under Section 5 of its User Agreement.

This ownership structure creates obstacles for copyright enforcement at scale. Reddit would face complex Article III standing challenges, since it lacks the requisite ownership interest to sue for direct infringement. Moreover, the copyright registration requirements (a precondition to filing suit) would prove prohibitively expensive and logistically impossible for millions of user posts. And, even if Reddit could establish standing, it would face the copyright fair use defenses that have been raised in the more than 40 pending AI copyright cases.

By pivoting to contract law, Reddit sidesteps these constraints. Contract claims require neither content ownership nor copyright registration – only proof of agreement formation and breach.

The Technical Architecture of Alleged Infringement

Reddit’s complaint alleges that Anthropic’s bot made over 100,000 automated requests after July 2024, when Anthropic publicly announced it had ceased crawling Reddit. This timeline is significant because it suggests deliberate circumvention of robots.txt exclusion protocols and platform-specific blocking mechanisms.

According to Reddit, the technical details reveal sophisticated evasion tactics. Unlike legitimate web crawlers that respect robots.txt directives, the alleged “ClaudeBot” activity employed distributed request patterns designed to avoid detection. Reddit’s complaint specifically references CDN bandwidth costs and engineering resources consumed by this traffic – technical details that will be crucial for establishing the “impairment” element required under California’s trespass to chattels doctrine.

The complaint’s emphasis on Anthropic’s “whitelist” of high-quality subreddits demonstrates technical sophistication in data curation. This selective approach undermines any defense that the scraping was merely incidental web browsing, instead revealing a targeted data extraction operation designed to maximize training value while minimizing detection risk.

Here’s a quick look at the three strongest counts in Reddit’s complaint: breach of contract, trespass to chattels and unjust enrichment.

Browse-Wrap Formation: The Achilles’ Heel

The most vulnerable aspect of Reddit’s contract theory lies in contract formation under browse-wrap principles. Reddit argues that each automated request constitutes acceptance of its User Agreement terms, which prohibit commercial exploitation under Section 3 and automated data collection under Section 7.

However, under Ninth Circuit precedent applying California law, browse-wrap contracts require reasonably conspicuous notice, and Reddit’s User Agreement link appears in small text at the page footer without prominent placement or mandatory acknowledgment – what some courts have termed “inquiry notice” rather than “actual notice.”

And, unlike human users who might scroll past terms of service, automated bots often access content endpoints directly, without rendering full page layouts. This raises fundamental questions about whether algorithmic agents can form contractual intent under traditional offer-and-acceptance doctrine.

California courts have been increasingly skeptical of browse-wrap enforcement, and have required more than mere website access to establish assent. Reddit’s theory will need to survive a motion to dismiss where Anthropic will likely argue that no reasonable bot operator would have constructive notice of buried terms.

The Trespass to Chattels Gambit

“Trespass to chattels” is the intentional, unauthorized interference with another’s tangible personal property (contrasted with real property) that impairs its condition, value, or use. Reddit asserts that Anthropic “trespassed” by scraping data from Reddit’s servers without permission.

Reddit’s trespass claim faces a high bar. California courts require proof of actual system impairment rather than mere unauthorized access. Reddit tries to meet this standard by citing CDN overage charges, server strain, and engineering time spent mitigating bot traffic. These bandwidth costs and engineering expenses, while real, may not rise to the level of system impairment that the courts demand.

The technical evidence will be crucial here. Reddit must demonstrate quantifiable performance degradation – slower response times, server crashes, or capacity limitations – rather than merely increased operational costs. This evidentiary burden may prove difficult given modern cloud infrastructure’s elastic scaling capabilities.

Unjust Enrichment and the Licensing Market

Reddit’s unjust enrichment claim rests on its data’s demonstrable market value, evidenced by licensing agreements with OpenAI, Google, and other AI companies. These deals, reportedly worth tens of millions annually, establish a market price for Reddit’s content.

The legal theory here is straightforward: Anthropic received the same valuable data as paying licensees but avoided the associated costs, creating an unfair competitive advantage. Under California law unjust enrichment requires showing that the defendant received a benefit that would be unjust to retain without compensation.

Reddit’s technically sophisticated Compliance API bolsters this claim. Licensed partners receive real-time deletion signals, content moderation flags, and structured data feeds that ensure training datasets remain current and compliant with user privacy preferences. Anthropic’s alleged automated data extraction bypassed these technical safeguards, potentially training on content that users had subsequently deleted or restricted.

Broader Implications for AI Governance

If Reddit’s contract theory succeeds, it would establish a powerful precedent allowing platforms to impose licensing requirements through terms of service. Every website with clear usage restrictions could potentially demand compensation from AI companies, fundamentally altering the economics of model training.

Conversely, if browse-wrap formation fails or federal preemption invalidates state law claims, AI developers would gain confidence that user generated web content remains accessible, subject to copyright limitations.

The Constitutional AI Paradox

Most damaging to Anthropic may be the reputational challenge to its “Constitutional AI” branding. The company has positioned itself as the ethical alternative in AI development, emphasizing safety and responsible practices. Reddit’s allegations create a narrative tension that extends beyond legal liability to market positioning.

This reputational dimension may drive settlement negotiations regardless of the legal merits, as Anthropic seeks to preserve its differentiated market position among enterprise customers increasingly focused on AI governance and compliance.

Conclusion

While Reddit’s legal claims face significant doctrinal challenges, the case underscores the importance of understanding both the technical architecture of web scraping and the evolving legal frameworks governing AI development. The outcome may determine whether platforms can use contractual mechanisms to assert control over their data assets, or whether AI companies can continue treating public web content as freely available training material subject only to copyright challenge.

Copyright Office Issues Long-Awaited Report on Generative AI Training – Register of Copyrights Is Fired Next Day

by Lee Gesmer | May 14, 2025 | Copyright, DMCA/CDA, General

The Copyright Office has been engaged in a multi-year study of how copyright law intersects with artificial intelligence. That process culminated in a series of three separate reports: Part 1 – Unauthorized Digital Replicas, Part 2 – Copyrightability, and now, the much-anticipated Part 3—Generative AI Training.

Many in the copyright community anticipated that the arrival of Part 3 would be the most important and controversial. It addresses a central legal question in the flood of recent litigation against AI companies: whether using copyrighted works as training data for generative AI qualifies as fair use. While the views of the Copyright Office are not binding on courts, they often carry persuasive weight with federal judges and legislators.

The Report Arrives, Along With Political Controversy

The Copyright Office finally issued its report – Copyright and Artificial Intelligence – Part 3 Generative AI Training (link on Copyright Office website; back-up link here) on Friday, May 9, 2025. But this was no routine publication. The document came with an unusual designation: “Pre-Publication Version.”

Then came the shock. The following day, Shira Perlmutter, the Register of Copyrights and nominal author of the report, was fired. Perlmutter had served in the role since October 2020, appointed by Librarian of Congress Carla Hayden—herself fired two days earlier, on May 8, 2025.

These abrupt, unexplained dismissals have rocked the copyright community. The timing has fueled speculation of political interference linked to concerns from the tech sector. Much of the conjecture has centered on Elon Musk and xAI, his artificial intelligence company, which may face copyright claims over the training of its Grok LLM model.

Adding to the mystery is the “pre-publication” label itself—something the Copyright Office has not used before. The appearance of this label, followed by Perlmutter’s termination, has prompted widespread belief that the report’s contents were viewed as too unfriendly to the AI industry’s legal position, and that her removal was a prelude to a potential retraction or revision.

What’s In That Report, Anyway?

Why might this report have rattled so many cages?

In short, it delivers a sharp rebuke to the AI industry’s prevailing fair use narrative. While the Office does not conclude that AI training is categorically infringing, its analytical framework casts deep doubt on the broad legality of using copyrighted works without permission to train generative models.

Here are key takeaways:

Transformative use? At the heart of the report is a skeptical view of whether using copyrighted works to train an AI model is “transformative” under Supreme Court precedent. The Office states that such use typically does not “comment on, criticize, or otherwise engage with” the copyrighted works in a way that transforms their meaning or message. Instead, it describes training as a “non-expressive” use that merely “extracts information about linguistic or aesthetic patterns” from copyrighted works—a use that courts may find insufficiently transformative.

Commercial use? The report flatly rejects the argument that AI training should be considered “non-commercial” simply because the outputs are new or the process is computational. Training large models is a commercial enterprise by for-profit companies seeking to monetize the results, and that, the Office emphasizes, weighs against fair use under the first factor.

Amount and substantiality? Many AI models are trained on entire books, images, or articles. The Office notes that this factor weighs against fair use when the entirety of a work is copied—even if only to extract patterns—particularly when that use is not clearly transformative.

Market harm? Here, the Office sounds the loudest alarm. It directly links unauthorized AI training to lost licensing opportunities, emerging collective licensing schemes, and potential market harm. The Office also notes that AI companies have begun entering into licensing deals with rightsholders—ironically undercutting their own arguments that licensing is impractical. As the Office puts it, the emergence of such markets suggests that fair use should not apply, because a functioning market for licenses is precisely what the fourth factor is meant to protect.

But Google Books? The report goes out of its way to distinguish training on entire works from cases like Authors Guild v. Google, where digitized snippets were used for a non-expressive, publicly beneficial purpose—search. AI training, by contrast, is described as for-profit, opaque, and producing outputs that may compete with the original works themselves.

Collectively, these conclusions paint a picture of AI training as a weak candidate for fair use protection. The report doesn’t resolve the issue, but it offers courts a comprehensive framework for rejecting broad fair use claims. And it sends a strong signal to Congress that licensing—statutory or voluntary—may be the appropriate policy response.

Conclusion

It didn’t take long for litigants to seize on the report. The plaintiffs in Kadrey v. Meta (which I recently wrote about here) filed a Statement of Supplemental Authority on May 12, 2025, the very next business day, citing Ninth Circuit authority that Copyright Office reports may be persuasive in arriving at judicial decisions (but failing to note that the report in question here is “pre-publication”). The report was submitted to judges in other active AI copyright cases as well.

The coming weeks may determine whether this report is a high-water mark in the Copyright Office’s independence or the opening move in its politicization. The “pre-publication” status may lead to a walk-back under new leadership. If, on the other hand, the report is published as final without substantive change, it may become a touchstone in the pending cases and influence future legislation.

If it survives, the legal debate over generative AI may have moved into a new phase—one where assertions of fair use must confront a detailed, skeptical, and institutionally backed counterargument.

As for the firing of Shira Perlmutter and Carla Hayden? No official explanation has been offered. But when the nation’s top copyright official is fired within 24 hours of issuing what could prove to be the most consequential copyright report in a generation, the message—intentional or not—is that politics may be catching up to policy.

Copyright, AI, and Meta’s Torrent Problem

by Lee Gesmer | Apr 28, 2025 | Copyright

“Move fast and break things.” Mark Zuckerberg’s famous motto seems especially apt when examining how Meta developed Llama, its flagship AI model.

Like OpenAI, Google, Anthropic, and others, Meta faces copyright lawsuits for using massive amounts of copyrighted material to train its large language models (LLMs). However, the claims against Meta go further. In Kadrey v. Meta, the plaintiffs allege that Meta didn’t just scrape data — it pirated it, using BitTorrent to pull hundreds of terabytes of copyrighted books from shadow libraries like LibGen and Z-Library.

This decision could significantly weaken Meta’s fair use defense and reshape the legal framework for AI training-data acquisition.

Meta’s BitTorrent Activities

In Kadrey v. Meta, plaintiffs allege that discovery has revealed that Meta’s GenAI team pivoted from tentative licensing discussions with publishers to mass BitTorrent downloading after receiving internal approvals that allegedly escalated “all the way to MZ”—Mark Zuckerberg.

BitTorrent is a peer-to-peer file-sharing protocol that efficiently distributes large files by breaking them into small pieces and sharing them across a decentralized “swarm” of users. Once a user downloads a piece, they immediately begin uploading it to others—a process known as “seeding.” While BitTorrent powers many legitimate projects like open source software distribution, it’s also the lifeblood of piracy networks. Courts have long treated unauthorized BitTorrent traffic as textbook copyright infringement (e.g., Glacier Films v. Turchin, 9th Cir. 2018).

The plaintiffs allege that Meta engineers, worried that BitTorrent “doesn’t feel right for a Fortune 500 company,” nevertheless torrented 267 terabytes between April and June 2024—roughly twenty Libraries of Congress worth of data. This included the entire LibGen non-fiction archive, Z-Library’s cache, and massive swaths of the Internet Archive. According to the plaintiffs’ forensic analysis, Meta’s servers re-seeded the files back into the swarm, effectively redistributing mountains of pirated works.

The Legal Framework and Why BitTorrent Matters

Meta’s alleged use of BitTorrent complicates its copyright defense in several critical ways:

1. Reproduction vs. Distribution Liability

Most LLM training involves reproducing copyrighted works, which defendants typically argue is protected as fair use. But BitTorrent introduces unauthorized distribution under § 106(3) of the Copyright Act. Even if the court finds Llama’s training to be fair use, unauthorized seeding could constitute a separate violation harder to defend as transformative.

2. Willfulness and Statutory Damages

Internal communications allegedly showed engineers warning about the legal risks, describing the pirated sources as “dodgy,” and joking about torrenting from corporate laptops. Plaintiffs allege that Meta ran the jobs on Amazon Web Services rather than Facebook servers, in a deliberate effort to make the traffic harder to trace back to Menlo Park. If proven, these facts could support a finding of willful infringement, exposing Meta to enhanced statutory damages of up to $150,000 per infringed work.

3. “Unclean Hands” and Fair Use Implications

The method of acquisition may significantly impact fair use analysis. Plaintiffs point to Harper & Row v. Nation Enterprise (1985), where the Supreme Court found that bad faith acquisition—stealing Gerald Ford’s manuscript—undermined the defendant’s fair use defense. They argue that torrenting from pirate libraries is today’s equivalent of exploiting a purloined manuscript.

Meta’s Defense and Its Vulnerabilities

Meta argues that its use of the plaintiffs’ books is transformative: it extracts statistical patterns, not expressive content. They rely on Authors Guild v. Google Books (2nd Cir. 2015) and emphasize that fair use focuses on how a work is used, not obtained. Meta claims that its engineers took steps to minimize seeding—however, the internal data logs that would prove this are missing.

The company also frames Llama’s outputs as new, non-infringing content—asserting that bad faith, even if proven, should not defeat fair use.

However, the plaintiffs counter that Llama differs from Google Books in key respects:

– Substitution risk: Llama is a commercial product capable of producing long passages that may mimic authors’ voices, not merely displaying snippets.

– Scale: The amount of copying—terabytes of entire book databases—dwarfs that upheld in Google Books.

– Market harm: Licensing markets for AI training datasets are emerging, and Meta’s decision to torrent pirated copies directly undermines that market.

Moreover, courts have routinely rejected defenses based on the idea that pirated material is “publicly available.” Downloading infringing content over BitTorrent has never been viewed kindly—even when defendants claimed to have good intentions.

Even if Meta persuades the court that its training of Llama is transformative, the torrenting evidence remains a serious threat because:

– The automatic seeding function of BitTorrent means Meta likely distributed copyrighted material, independent of any transformative use

– The apparent bad faith (jokes about piracy, euphemisms describing pirated archives as “public” datasets) and efforts to conceal traffic present a damaging narrative

– The deletion of torrent logs may support an adverse inference that distribution occurred

– Judge Vince Chhabria might prefer to decide the case on familiar grounds—traditional copyright infringement—rather than attempting to set sweeping precedent on AI fair use

Broader Implications

If the court rules that unlawful acquisition via BitTorrent taints subsequent transformative uses, the AI industry will face a paradigm shift. Companies will need to document clean sourcing for training datasets—or face massive statutory damages.

If Meta prevails, however, it may open the door for more aggressive data acquisition practices: anything “publicly available” online could become fair game for AI training, so long as the final product is sufficiently transformative.

Regardless of the outcome, the record in Kadrey v. Meta is already reshaping AI companies’ risk calculus. “Scrape now, pay later” is beginning to look less like a clever strategy and more like a legal time bomb.

Conclusion

BitTorrent itself isn’t on trial in Kadrey v. Meta, but its DNA lies at the center of the dispute. For decades, most fair use battles have focused on how a copyrighted work is exploited. This case asks a new threshold question: does how you got the work come first?

The answer could define how the next generation of AI is built.

SDNY Courts Split Over Copyright Management Information in AI Cases

by Lee Gesmer | Mar 17, 2025 | Copyright, DMCA/CDA

In my recent post—Postscript to my AI Series – Why Not Use the DMCA?—I discussed early developments in two cases pending against OpenAI in the U.S. District Federal District Court for the Southern District of New York (SDNY). Both cases focus on the claim that in the process of training its AI models, OpenAI illegally removed “copyright management information.” And, as I discuss below, they reach different outcomes.

What Is Copyright Management Information?

Many people who are familiar with the Digital Millennium Copyright Act’s (DMCA) “notice and takedown” provisions are unfamiliar with a part of the DMCA that makes it illegal to remove “copyright management information,” or “CMI.”

CMI includes copyright notices, information identifying the author, and details about the terms of use or rights associated with the work. It can be visible directly on the work, or metadata in the underlying code.

The CMI removal statute—Section 1202(b)(1) of the DMCA—is a “double scienter” law, requiring that a plaintiff prove that (1) CMI was intentionally removed from a copyrighted work, and (2) that the alleged infringer knew or had reasonable grounds to know that the removal of CMI would “induce, enable, facilitate, or conceal” copyright infringement.

Here is an example of how this law might work.

Assume that I have copied a work and that I have a legitimate fair use defense. However, assume further that I duplicated the work, removed the copyright notice and published the work without it. I have a fair use defense as to duplication and distribution, but could I still be liable for CMI removal?

The answer is yes. A violation of the DMCA is independent of my fair use defense. And, the penalty is not trivial. Liability for CMI removal can result in statutory damages ranging from $2,500 to $25,000 per violation, as well as attorneys’ fees and injunctive relief. Moreover, unlike infringement actions, a claim for CMI removal does not require prior registration of the copyright.

All of this adds up to a powerful tool for copyright plaintiffs, a fact that has not been lost on plaintiffs’ counsel in AI litigation.

CMI – Why Don’t AI Companies Want To Include It?

AI companies’ removal of CMI during training stems from both technical necessities and strategic considerations. From a technical perspective, large language model training requires standardized data preparation processes that typically strip metadata, formatting, and peripheral information to create uniform training examples. This preprocessing is fundamental to how neural networks learn from text—they require clean, consistent inputs to identify linguistic patterns effectively.

The computational overhead is also significant. Preserving and processing CMI for billions of training examples would increase storage requirements and computational costs. AI companies argue that this additional information provides minimal benefit to model performance while significantly increasing training complexity.

Content owners, however, contend that these technical justifications mask more strategic motivations. They argue that AI companies deliberately eliminate attribution information to obscure the provenance of training data, making it difficult to detect when copyrighted material has been incorporated into models. This removal, they claim, facilitates a form of “laundering” copyrighted content through AI systems, where original sources become untraceable.

More pointedly, content creators assert that CMI removal directly enables downstream infringement by making it impossible for users to identify when an AI output derives from or reproduces copyrighted works. Without embedded attribution information, neither the AI company nor end users can properly credit or license content that appears in generated outputs.

The technical reality and legal implications of this process sit at the heart of these emerging cases, with courts now being asked to determine whether standard machine learning preprocessing constitutes intentional CMI removal under the DMCA’s “double scienter” standard.

Raw Story Media v. OpenAI

In the first of the two SDNY cases, Raw Story Media v. OpenAI, federal district court judge Colleen McMahon dismissed Raw Story’s claim that when training ChatGPT, OpenAI had illegally removed CMI.

At the heart of Judge McMahon’s decision was her observation that although OpenAI removed CMI from Raw Story articles, Raw Story was unable to allege that the works from which CMI had been removed had ever been disseminated by ChatGPT to anyone. On these facts, Judge McMahon held that Raw Story lacked standing under the Article III standing principles established by the Supreme Court in Transunion v. Ramirez (2021). It’s worth noting her observation that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote” based on “the quantity of information contained in the [AI model].”

The Intercept Media v. OpenAI

In the second case, The Intercept Media v. OpenAI, The Intercept made the same allegation. It asserted that OpenAI had intentionally removed CMI—in this case authors, copyright notices, terms of use and title information—from its AI training set.

However, in this case Judge Jed Rakoff came to the opposite conclusion. In November 2024 he issued a bottom-line order declining to dismiss plaintiff’s CMI claim and stated that an opinion explaining his rationale would be forthcoming.

That opinion was issued on February 20, 2025.

At this early stage of the case (before discovery or trial) the judge found that The Intercept met the “double scienter” standard. As to the first part of the test, The Intercept alleged that the algorithm that OpenAI uses to build its AI training data sets can only capture an article’s main text, which excludes CMI. This satisfied the intentional removal element.

As to the second component of the standard, the court was persuaded by The Intercept’s theory of “downstream infringement,” which argues that OpenAI’s model might enable users to generate content based on The Intercept’s copyrighted works without proper attribution. And importantly, unlike in Raw Story, The Intercept was able to provide examples of verbatim regurgitation of its content from ChatGPT based on prompts from The Intercept’s data scientist.

The district court held that a copyright injury “does not require publication to a third party,” finding unpersuasive OpenAI’s argument that the Intercept failed to demonstrate a concrete injury because it had not conclusively established that users had actually accessed The Intercept’s articles via ChatGPT.

Curiously, Judge Rakoff’s decision failed to mention the earlier ruling in Raw Story Media, Inc. v. OpenAI, where Judge McMahon held, on similar facts, that the plaintiffs lacked standing to assert removal of CMI claims. Both cases were decided by SDNY district court judges. However, unlike the ruling in Raw Story Media Judge Rakoff concluded that The Intercept’s alleged injury was closely related to the property-based harms typically protected under copyright law, satisfying the Article III standing requirement.

Thus, while Raw Story’s CMI claims against OpenAI have been dismissed, The Intercept’s CMI removal case against OpenAI will proceed.

A Postscript to my AI Series – Why Not Use the DMCA?

by Lee Gesmer | Dec 11, 2024 | Copyright, DMCA/CDA

After reading my 3-part series on copyright and LLMs (start with Part 1, here) a couple of colleagues have asked me whether content owners could use the Digital Millennium Copyright Act (DMCA) to challenge the use of their copyright-protected content.

I’ll provide a short summary of the law on this issue, but the first thing to note is that the DMCA offers two potential avenues for content owners: Section 512(c)‘s widely-used ‘notice and takedown’ system and the lesser-known Section 1202(b)(1), which addresses the removal of copyright management information (CMI), like author names, titles, copyright notices and terms and conditions .

Section 1202(b)(1) – Removal or Alteration of CMI

First, let’s talk about the lesser-known DMCA provision. Several plaintiffs have tried an innovative approach under this provision, arguing that AI companies violated Section 1202(b)(1) by stripping away CMI in the training process.

In November, two federal judges in New York reached opposite conclusions on these claims. In Raw Story Media, Inc. v. OpenAI the plaintiff alleged that OpenAI had removed CMI during the training process, in violation of 1202(b)(1). The court applied the standing requirement established in Transunion v. Ramirez, a recent Supreme Court case that dramatically restricted standing to sue in federal courts to enforce federal statutes. The court held that the publisher lacked standing because it couldn’t prove that it had suffered “concrete harm” from the alleged CMI removal from ChatGPT. The court based this conclusion on the fact that Raw Story “did not allege that a copy of its work from which the CMI had been removed had been disseminated by ChatGPT to anyone in response to any specific query.” Absent dissemination Raw Media had no claim – under Transunion, “no concrete harm, no standing.”

But weeks later, in The Intercept Media v. OpenAI, a different judge issued a short order allowing similar claims to proceed. We are awaiting the opinion explaining his rationale.

The California federal courts have also been unwelcoming to 1202(b)(1) claims. In two cases – Anderson v. Stability AI and Doe 1 v. Gitub the courts dismissed 1202(b)(1) claims on the ground that the removal of CMI requires identicality between the original work and the copy, which the plaintiffs had failed to establish. However, the Github case has been certified for an interlocutory appeal to the Ninth Circuit, and that appeal is worth watching. I’ll note that the identicality requirement is not in the Copyright Act – it is an example of judge-made copyright doctrine.

Section 512(c) – Notice-and-Takedown

While you are likely familiar with the DMCA’s Section 512(c) notice-and-takedown system (think YouTube removing copyrighted videos or music), this law faces major hurdles in the AI context. A DMCA take-down notice must be specific about the location where the infringing material is hosted – typically a URL. In the case of an AI model the challenge is that data used by AI models is not accessible or identifiable, making it impossible for copyright owners to issue takedown notices.

Unsurprisingly, I can’t find any major AI case in which a plaintiff has alleged violation of Section 512(c).

Conclusion

The collision between AI technology and copyright law highlights a fundamental challenge: our existing legal framework, designed for the digital age of the late 1990s, struggles to address the unique characteristics of AI systems. The DMCA, enacted when peer-to-peer file sharing was the primary concern, now faces unprecedented questions about its applicability to AI training data.

Stay tuned.

Copyright and the Challenge of Large Language Models (Part 3)

by Lee Gesmer | Nov 27, 2024 | Copyright

In the first two parts of this series I examined how large language models (LLMs) work and analyzed whether their training process can be justified under copyright law’s fair use doctrine. (Part 1, Part 2). However, I also noted that LLMs sometimes “memorize” content during the training stage and then “regurgitate” that content verbatim or near-verbatim when the model is accessed by users.

The “memorization/regurgitation” issue is featured prominently in The New York Times Company v. Microsoft and OpenAI, pending in federal court in the Southern District of New York. (A reference to OpenAI includes Microsoft, unless the context suggests otherwise). Because the technical details of every LLM and AI model are different I’m going to focus my discussion of this issue mostly on the OpenAI case. However, the issue has implications for every generative LLM trained on copyrighted content without permission.

The Controversy

Here’s the problem faced by OpenAI.

OpenAI claims that the LLM models it creates are not copies of the training data it uses to develop its models, but uncopyrightable patterns. In other words, a ChatGPT LLM is not a conventional database or search engine that stores and retrieves content. As OpenAI’s co-defendant Microsoft explains, “A program called a ‘transformer’ evaluates massive amounts of text, converts that text into trillions of constituent parts, discerns the relationships among all of them, and yields a natural language machine that can respond to human prompts.” (Link, p. 2)

However, the Times has presented compelling evidence that challenges this narrative. The Times showed that it was able to prompt OpenAI’s ChatGPT-4 and Microsoft’s CoPilot to produce lengthy, near-verbatim excerpts from specific Times articles, which the Times then cited in its complaint as proof of infringement.

The Times’ First Amended Complaint includes an exhibit with over 100 examples of ChatGPT “regurgitating” Times content verbatim or near-verbatim in response to specific prompts (copied text highlighted in red; click to enlarge):

This evidence poses a fundamental question: If OpenAI’s models truly transform copyright-protected content into abstract patterns rather than storing it, how can they reproduce exact or nearly exact copies of that content?

The Times argues that this evidence reveals a crucial truth: actual copyrighted expression—not just abstract patterns—is encoded within the model’s parameters. This allegation strikes at the foundation of OpenAI’s legal position and weakens its fair use defense by suggesting its use of copyrighted material is more extensive and less transformative than claimed.

Just how big a problem this is for OpenAI and the AI industry is difficult to determine. I’ve tried to replicate it in a variety of cases on ChatGPT and several other frontier models without success. In fact I can’t get the models to give me the text of Moby Dick, Tom Sawyer or other literary works whose copyrights have long expired.

Nevertheless, the Times was able to do this one hundred times, and it’s safe to assume that it could have continued well past that number, but thought that 100 examples was enough to make the point in its lawsuit.

OpenAI: “The Times Hacked ChatGPT”

What’s OpenAI’s response to this?

To date, OpenAI and Microsoft have not filed answers to the Complaint. However, they have given an indication of how they view these allegations in partial motions to dismiss filed by both companies.

Microsoft’s motion (p. 2) argues that the NYT’s methods to demonstrate how its content could be regurgitated did not represent real-world usage of the GPT tools at issue. “The Times,”it argues, “crafted unrealistic prompts to try to coax the GPT-based tools to output snippets of text matching The Times’s content.” (Emphasis in original) To get the NYT content regurgitated, a user would need to know the “genesis of that content.” “And in any event, the outputs the Complaint cites are not copies of works at all, but mere snippets” that do not rise to the level of copyright infringement.

OpenAI’s motion (p. 12.) argues that the NYT “appears to have [used] prolonged and extensive efforts to hack OpenAI’s models”:

In the real world, people do not use ChatGPT or any other OpenAI product for that purpose, … Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will. . . . The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.

It appears that OpenAI is referring to the provisions in its Terms of Service that prohibit anyone from “Us[ing] our Services in a way that infringes, misappropriates or violates anyone’s rights” or to “extract data.” OpenAI has labeled these “adversarial attacks.”

Copyright owners don’t buy this tortured explanation. As OpenAI has admitted in a submission to the Patent and Trademark Office, “An author’s expression may be implicated . . . because of a similarity between her works and an output of an AI system.” (link, n. 71, emphasis added).

Rights holders claim that their ability to extract memorized content from these systems puts the lie to, for example, OpenAI’s assertion that “an AI system can eventually generate media that shares some commonalities with works in the corpus (in the same way that English sentences share some commonalities with each other by sharing a common grammar and vocabulary) but cannot be found in it.” (link, p. 10).

Moreover, OpenAI’s “hacking” defense would seem to inadvertently support the Times’ position. After all, you cannot hack something that isn’t there. The very fact that this content can be extracted, regardless of the method, suggests it exists in the form of an unauthorized reproduction within the model.

OpenAI: “We Are Protected by the Betamax Case”

How will OpenAI and Microsoft respond to these allegations under copyright law?

To date, OpenAI and Microsoft have yet to file formal answers to the Times’ complaint. However, they have given us a hint of their defense strategy in their motions to dismiss, and it is based in part on the Supreme Court’s 1984 decision in Sony v. Universal City Studios, a case often referred to as “the Betamax case.”

In the Betamax case a group of entertainment companies sued Sony for copyright infringement, arguing that consumers used Sony VCRs to infringe by recording programs broadcast on television. The Supreme Court held that Sony could not be held contributorily liable for infringements committed by VCR owners. “[T]he sale of copying equipment . . . does not constitute contributory infringement if the product is . . . capable of substantial noninfringing uses.”

The take-away from this case is that under copyright law if a product can be put to either a legal or illegal purpose by end-users (a “dual-use”), it is not infringing so long the opportunity for noninfringing use is substantial.

OpenAI and Microsoft assert that the Betamax case applies because, like the VCR, ChatGPT is a “dual-use technology.” While end users may be able to use “adversarial prompts” to “coax” a model to produce a verbatim copy of training data, the system itself is a neutral, general-purpose tool. In most instances it will be put to a non-infringing use. Citing the Betamax case Microsoft argues that “copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine)”—all dual-use technologies.

No doubt, in support of this argument OpenAI will place strong emphasis on ChatGPT’s many important non-infringing uses. The model can create original content, analyze public domain texts, process user-provided content, educate, generate software code, and more.

However, OpenAI’s reliance on the Betamax dual-use doctrine faces a challenge central to the doctrine itself. The Betamax case was based on secondary liability—whether Sony could be held responsible for consumers using VCRs to record television programs. The alleged infringements occurred through consumer action, not through any action taken by the device’s manufacturer.

But with generative LLMs such as ChatGPT the initial copying happens during training when the model memorizes copyrighted works. This is direct infringement by the AI company itself, not secondary infringement based on user prompts. When an AI company creates a model that memorizes and can reproduce copyrighted works, the company itself is doing the copying—making this fundamentally different from Betamax.

Before leaving this topic it’s important to note that the full scope of memorization within AI models of GPT-4’s scale may be technically unverifiable. While the models’ creators can detect some instances of memorization through testing, due to the scale and complexity of the models they cannot comprehensively examine their internal representations to determine the full extent of memorized copyrighted content. While the Times’ complaint demonstrated one hundred instances of verbatim copying, this could represent just the tip of the iceberg, or conversely, the outer limit of the problem. This uncertainty itself poses a significant challenge for courts attempting to apply traditional copyright principles.

Technical Solutions

While these legal issues work their way through the courts, AI companies aren’t standing still. They recognize that their long-term success may depend on their ability to prevent or minimize memorization, regardless of how courts ultimately rule on the legal issues.

Their approaches to this challenge vary. OpenAI has told the public that it is taking measures to prevent the types of copying illustrated in the Times’ lawsuit: “we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.” (link) This includes filtering or modifying user prompts to reject certain requests before they are submitted as prompts to the model and aligning the models to refuse to produce certain types of data. Try asking ChatGPT to give you the lyrics to Arlo Guthrie’s “Alice’s Restaurant Massacree” or Taylor Swift’s “Cruel Summer.” It will tell you that copyright law prohibits it from doing so.

And, it’s important to note that different AI companies are taking different approaches to this problem. For example, Google (which owns Gemini) uses supervised fine tuning (explained here). Anthropic (which owns Claude) focuses on what it calls “constitutional AI” – a training methodology that builds in constraints against certain behaviors, including the reproduction of copyrighted content. (link here). Meta (LLaMA models) has implemented what it calls “deduplication” during the training process – actively removing duplicate or near-duplicate content from training data to reduce the likelihood of memorization. Additionally, Meta has developed techniques to detect and filter out potential memorized content during the model’s response generation phase. (link here).

Conclusion

The AI industry faces a fundamental challenge that sits at the intersection of technology and law. Current research suggests that some degree of memorization may be inherent to large language models – raising a crucial question for courts: If memorization cannot be eliminated without sacrificing model performance, how should copyright law respond?

The answer could reshape both AI development and copyright doctrine. AI companies may need to accept reduced performance in exchange for legal compliance, while content creators must decide whether to license their works for AI training despite the risk of memorization. The industry’s ability to develop systems that truly learn patterns without memorizing specific expressions – or courts’ willingness to adapt copyright law to this technological reality – may determine its future.

The outcome of the Times lawsuit may establish crucial precedents for how copyright law treats AI systems that can memorize and reproduce protected content. At stake is not just the legality of current AI models, but the broader question of how to balance technological innovation with the rights of content creators in an era where the line between learning and copying has become increasingly blurred.

Copyright And The Challenge of Large Language Models (Part 2)

by Lee Gesmer | Oct 15, 2024 | Copyright

“Fair use is the great white whale of American copyright law. Enthralling, enigmatic, protean, it endlessly fascinates us even as it defeats our every attempt to subdue it.” – Paul Goldstein

__________________________

This is the second in a 3-part series of posts on Large Language Models (LLMs) and copyright. (Part 1 here)

In this post I’ll turn to a controversial and important legal question: does the use of copyrighted material in training LLMs for generative AI constitute fair use? This analysis requires a nuanced understanding of both copyright fair use and the technical aspects of LLM training (see Part 1). To examine this complex issue I’ll look at recent relevant case law and consider potential solutions to the legal challenges posed by AI technology.

Introduction

The issue is this: generative AI systems – systems that generate text, graphics, video, music – are being trained without permission on copies of millions of copyrighted books, artwork, software and music scraped from the internet. However, as I discussed in Part 1 of this series, the AI industry argues that the resulting models themselves are not infringing. Rightsholders argue that even if this is true (and they assert that it is not), the use of their content to train AI models is infringing, and that is the focus of this post.

To put this in perspective, consider where AI developers get their training data. It’s generally acknowledged that many of them have used resources such as Common Crawl, a digital archive containing 50 billion web pages, and Books3, a digital library of thousands of books. While these resources may contain works that are in the public domain, there’s no doubt that they contain a huge quantity of works that are protected by copyright.

In the AI industry, the thirst for this data is insatiable – the bigger the language models, the better they perform, and copyrighted works are an essential component of this data. In fact, the industry is already looking at a “data wall,” the time when they will run out of data. They may hit that wall in the next few years. If copyrighted works can’t be included in training data, it will be even sooner.

Rightsholders assert that the use of this content to train LLMs is outright, massive copyright infringement. The AI industry responds that fair use – codified in 17 U.S.C. § 107 – covers most types of model training where, as they assert, the resulting model functions differently than the input data. This is not just an academic difference – the issue is being litigated in more than a dozen lawsuits against AI companies, attracting a huge amount of attention from the copyright community.

No court has yet ruled on whether fair use protects the use of copyright-protected material as training material for LLMs. Eventually, the courts will answer this question by applying the language of the statute and the court decisions applying copyright fair use.

Legal Precedents Shaping the AI Copyright Landscape

To understand how the courts are likely to evaluate these cases we need to look at four recent cases that have shaped the fair use landscape: the two Google Books cases, Google v. Oracle, and Warhol Foundation v. Goldsmith. In addition the courts are likely to apply what is known as the “intermediate copying” line of cases.

The Google Books Cases. Let’s start with the two Google Books cases, which in many ways set the stage for the current AI copyright dilemma. The AI industry has put its greatest emphasis on these cases. (OpenAI: “Perhaps the most compelling case on point is Authors Guild v. Google”).

Authors Guild v. Google and Author’s Guild v. Hathitrust. In 2015, the Second Circuit Court of Appeals decided Authors Guild v. Google, a copyright case that had been winding through the courts for a decade. Google had scanned millions of books without permission from rightsholders, creating a searchable database.

The Second Circuit held that this was fair use. The court’s decision hinged on two key points. First, the court found Google’s use highly “transformative,” a concept central to fair use. Google wasn’t reproducing books for people to read; it was creating a new tool for search and analysis. While Google allowed users to see small “snippets” of text containing their search terms, this didn’t substitute for the actual books. Second, the court found that Google Books was more likely to enhance the market for books than harm it. The court also emphasized the immense public benefit of Google Books as a research tool.

A sister case in the Google Books saga was Authors Guild v. HathiTrust, decided by the Second Circuit in 2014. HathiTrust, a partnership of academic institutions, had created a digital library from book scans provided by Google. HathiTrust allowed researchers to conduct non-consumptive research, such as text mining and computational analysis, on the corpus of digitized works. Just as in Google Books, the court found the creation of a full-text searchable database to be a fair use, even though it involved copying entire works. Importantly, the court held this use of the copyrighted books to be transformative and “nonexpressive.”

The two cases were landmark fair use decisions, especially for their treatment of mass digitization and nonexpressive use of copyrighted works – a type of use that involves copying copyrighted works but does not communicate the expressive aspects of those works.

These two cases, while important, by no means guarantee the AI industry the fair use outcome they are seeking. Reliance on Google Books falters given the scope of potential output of AI models. Unlike Google Books’ limited snippets, LLMs can generate extensive text that may mirror the style and substance of copyrighted works in their training data. This raises concerns about market harm, a critical factor in fair use analysis, and whether LLM-generated content could eventually serve as a market substitute for the original works. The New York Times argues just this in its copyright infringement case against OpenAI and Microsoft.

Hathitrust is an even weaker precedent for LLM fair use. The Second Circuit held that HathiTrust’s full-text search “posed no harm to any existing or potential traditional market for the copyrighted works.” LLMs, in contrast, have the potential to generate content that could compete with or substitute for original works, potentially impacting markets for copyrighted material. Also, HathiTrust was created by universities and non-profit institutions for educational and research purposes. Commercial LLM development may not benefit from the same favorable consideration under fair use analysis.

In sum, the significant differences in purpose, scope, and potential market impact make both Google Books and Hathitrust imperfect authorities for justifying the comprehensive use of copyrighted materials in training LLMs.

Google v. Oracle. Fast forward to 2021 for another landmark fair use case, this time involving software code. In Google v. Oracle, the Supreme Court held that Google’s copying of 11,500 lines of code from Oracle’s Java API was intended to facilitate interoperability, and was fair use.

The Court found Google’s “purpose and character” was transformative because it “sought to create new products” and was “consistent with that creative ‘progress’ that is the basic constitutional objective of copyright itself.” The Court also downplayed the market harm to Oracle, noting that Oracle was “poorly positioned to succeed in the mobile phone market.”

This decision seemed to open the door for tech companies to make limited use of some copyrighted works in the name of innovation. However, the case’s focus on functional code limits its applicability to LLMs, which are trained on expressive works like books, articles, and images. The Supreme Court explicitly recognized the inherent differences between functional works, which lean towards fair use, and expressive creations at the heart of copyright protection. So, again, the AI industry will have difficulty deriving much support from this decision.

And, before we could fully digest Oracle’s implications for fair use, the Supreme Court threw a curveball.

Andy Warhol Foundation v. Goldsmith. In 2023, the Court decided Andy Warhol Foundation v. Goldsmith (Warhol), a case dealing with Warhol’s repurposing of a photograph of the musician Prince. While the case focused specifically on appropriation art, its core principles resonate with the ongoing debate surrounding LLMs’ use of copyrighted materials.

The Warhol decision emphasizes a use-based approach to fair use analysis, focusing on the purpose and character of the defendant’s use, particularly its commercial nature, and whether it serves as a market substitute for the original work. This emphasis on commerciality and market substitution poses challenges for LLM companies defending the fair use of copyrighted works in training data. The decision underscores the importance of considering potential markets for derivative works. As the use of copyrighted works for AI training becomes increasingly common, a market for licensing such data is emerging. The existence of such a market, even if nascent, could weaken the argument that using copyrighted materials for LLM training is a fair use, particularly when those materials are commercially valuable and readily licensable

The “Intermediate Copying” Cases. I also expect the AI industry to rely on the case law on “intermediate copying.” In this line of cases the users copied material to discover unprotectable information or as a minor step towards developing an entirely new product. So the final output – despite using copied material as an intermediate step – was noninfringing. In these cases the “intermediate use” was held to be fair use. See Sega v. Accolade (9th Cir. 1992) (defendant copied Sega’s copyrighted software to figure out the functional requirements to make games compatible with Sega’s gaming console). Sony v. Connectix (9th Cir. 2000)(defendant used a copy of Sony’s software to reverse engineer it and create a new gaming platform on which users could play games designed for Sony’s gaming system).

AI companies likely will argue that, just as in these cases, LLMs study language patterns as part of the process of transforming intermediate copying into noninfringing materials. Rightsholders likely will argue that whereas in those cases the copiers sought to study functionality or create compatibility, the scope and nature of use and the resulting product are vastly different from LLM fair use. I expect rightsholders will have the better argument on these cases.

Applying Legal Precedents to AI

So, where does this confusing collection of cases leave us? Here’s a summary:

The Content Industry Position – in a Nutshell: Rightsholders argue that – even assuming that the final LLM model does not contain expressive content (which they dispute) – the use of copyrighted works to train LLMs is an infringement not excused by fair use. They argue that all four fair use factors weigh against AI companies:

– Purpose and character: Many (but not all) AI applications are commercial, which cuts against the industries’ fair use argument, especially in light of Warhol’s emphasis on commercial purpose and the potential licensing market for training data. The existence of a licensing market for training datasets suggests that AI companies can obtain licenses rather than rely on fair use defenses. This last point – market effect – is particularly important in light of the Supreme Court’s holding in Andy Warhol.

– Nature of the work: Unlike the computer code in Google v. Oracle, which the Supreme Court noted receives “thin” protection, the content ingested by AI companies contains highly creative works like books, articles, and code. This distinguishes Oracle from AI training, and cuts against fair use.

– Amount used: Entire works are copied, a factor that weighs against fair use.

– Market effect: End users are able to extract verbatim content from LLMs, harming the market for original works and, as noted above , harming current and future AI training licensing markets.

The AI Industry Position – in a Nutshell. The AI industry will argue that the use of copyrighted works should be considered fair use:

– Transformative Use: The AI industry argues that AI training creates new tools with different purposes from the original works, using copyright material in a “nonexpressive” way. AI developers draw parallels to “context shifting” fair use cases dealing with search engines and digital libraries, such as the Google Books project, arguing AI use is even more transformative. I expect them to rely on Google v. Oracle to argue that, just as Google’s use of Oracle’s API code was found to be transformative because it created something new that expanded the use of the original code (the Android platform), AI training is transformative, as it creates new systems with different purposes from the original works. Just as the Supreme Court emphasized the public benefit of allowing programmers to use their acquired skills, similarly AI advocates are likely to highlight the broad societal benefits and innovation enabled by LLMs trained on diverse data.

– Intermediate Copying. AI proponents will support this argument by pointing to the “intermediate copying” line of cases, which hold that using copyrighted works for purposes incidental to a nonexpressive purpose (creating the non-infringing model itself), is permissible fair use.

– Market Impact: AI proponents will argue that AI training, and the models themselves, do not directly compete with or substitute for the original copyrighted works.

– Amount and Substantiality: Again, relying on Google v. Oracle, AI proponents will note that despite Google copying entire lines of code, the Court found fair use. This will support their argument that copying entire works for AI training doesn’t preclude fair use if the purpose is sufficiently transformative.

– Public Benefit: In Google v. Oracle the Court showed a willingness to interpret fair use flexibly to accommodate technological progress. AI proponents will rely on this, and argue that applying fair use to AI training has social benefits and aligns with copyright law’s goal of promoting progress. The alternative, restricting access to training data, could significantly hinder AI research and development. (AI “doomers” are unlikely to be persuaded by this argument).

– Practical Necessity: Given the vast amount of data needed, obtaining licenses for all copyrighted material used in training is impractical, impossible or would be so expensive that it would stifle AI development.

As I noted above, It’s important to note that, as alleged in several of the lawsuits filed to date, some generative AI models have “memorized” copyrighted materials and are able to output them in a way that could substitute for the copyrighted work. If the outputs of a system can infringe, the argument that the system itself does not implicate copyright’s purposes will be significantly weakened.

While Part 3 of this series will explore these output-related issues in depth, it’s important to recognize the intrinsic link between these concerns and input-side training challenges. In assessing AI’s impact on copyright law courts may adopt a holistic approach, considering the entire content lifecycle – from data ingestion to LLMs to final output. This interconnected perspective reflects the complex nature of AI systems, where training methods directly influence both the characteristics and potential infringement risks of generated content.

Potential Solutions and Future Directions

As challenging as these issues are, we need to start thinking about practical solutions that balance the interests of AI developers, content creators, and the public. Here are some possibilities, along with their potential advantages and drawbacks.

Licensing Schemes: One proposed solution is to develop comprehensive licensing systems for AI training data, similar to those that exist for certain music uses. This could provide a mechanism for compensating creators while ensuring AI developers have access to necessary training data.

Proponents argue that this approach would respect copyright holders’ rights and provide a clear framework for legal use. However, critics rightly point out that implementing such a system would be enormously complex and impractical. The sheer volume of content used in AI training, the difficulty of tracking usage, and the potential for exorbitant costs could stifle innovation, particularly for smaller AI developers.

New Copyright Exceptions: Another approach is to create specific exemptions for AI training, perhaps limited to non-commercial or research purposes. This could be similar to existing fair use exceptions for research and could promote innovation in AI development. The advantage of this approach is that it provides clarity and could accelerate AI research. However, defining the boundaries of “non-commercial” use in the rapidly evolving AI landscape could prove challenging.

International Harmonization: Given the global nature of AI development, the industry may need to work towards a unified international approach to copyright exceptions for AI. This could involve amendments to international copyright treaties or the development of new AI-specific agreements. However, international copyright negotiations are notoriously slow and complex. Different countries have varying interests and legal traditions, which could make reaching a consensus difficult.

Technological Solutions: We should also consider technological approaches to addressing these issues. For instance, AI companies could develop more sophisticated methods to anonymize or transform training data, making it harder to reconstruct original works on the “output” side. They could also implement filtering systems to prevent the output of copyrighted material. While promising, these solutions would require significant investment and might not fully address all legal concerns. There’s also a risk that overzealous filtering could limit the capabilities of AI systems.

Hybrid Approaches: Perhaps the most promising solutions will combine elements of the above approaches. For example, we could see a tiered system where certain uses are exempt, others require licensing, and still others are prohibited. This could be coupled with technological measures such as synthetic training data, and international guidelines.

Market-Driven Solutions: As the AI industry matures, we are likely to see the emergence of new business models that naturally address some of these copyright concerns. For instance, content creators might start producing AI-training-specific datasets, or AI companies might vertically integrate to produce their own training content. X’s Grok AI product and Meta are examples of this.

As we consider these potential solutions, it’s crucial to remember that the goal of copyright law is to foster innovation while fairly compensating creators and respecting intellectual property rights. Any solution will likely require compromise from all stakeholders and will need to be flexible enough to adapt to rapidly changing technology.

Moreover, these solutions will need to be developed with input from a diverse range of voices – not just large tech companies and major content producers, but also independent creators, smaller AI startups, legal experts, and public interest advocates. The path forward will require creativity, collaboration, and a willingness to rethink traditional approaches to copyright in the artificial intelligence age.

Conclusion – The Road Ahead

The intersection of AI and copyright law presents complex challenges that resist simple solutions. The Google Books cases provide some support for mass digitization and computational use of copyrighted works. Google v. Oracle suggests courts might look favorably on uses that promote new and beneficial AI technologies. But Warhol reminds us that transformative use has limits, especially in commercial contexts.

For AI companies, the path forward involves careful consideration of training data sources and potential licensing arrangements. It may also mean being prepared for legal challenges and working proactively with policymakers to develop workable solutions.

For content creators, it’s crucial to stay informed about how your work might be used in AI training. There may be new opportunities for licensing, but also new risks to consider.

For policymakers and courts, the challenge is to strike a balance that fosters innovation while protecting the rights and incentives of creators. This may require rethinking some fundamental aspects of copyright law.

The relationship between AI and copyright is likely to be a defining issue in intellectual property law for years to come. Stay tuned, stay informed, and be prepared for a wild ride.

Continue with Part 3 of this series here.

An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law

by Lee Gesmer | Oct 9, 2024 | Copyright

Google’s NotebookLM has been getting a lot of attention. You upload your sources (articles, Youtube videos, URLs, text documents, audio files) and NotebookLM can create a podcast based on the library you’ve created.

I thought I’d experiment with this a bit. I uploaded a variety of articles on copyright and AI and hit “go.” I didn’t give NotebookLM the subject or any prompts. It figured out the topic (correctly) and created the 11 minute podcast embedded below.

A few observations:

First, the speaker voices are natural and realistic – they interact fluidly, have natural intonation and use varied speech patterns.

Second, the content quality is very high – the podcast correctly highlights Google Books as the leading case on the issue and outlines the implications of the case for and against fair use.

It also discusses the New York Times v. Microsoft/OpenAI case in detail, and focuses on the fact that the NYT was able to force ChatGPT to regurgitate verbatim or near verbatim NYT content.

The podcast goes on to discuss StabilityAI, the four fair use factors (as applied) and the larger consequences of LLMs on the copyright system.

I downloaded the podcast and embedded it below, but I could just as easily have provided a link to the podcast in NotebookLM.

Copyright And The Challenge of Large Language Models (Part 1)

by Lee Gesmer | Jul 1, 2024 | Copyright

“AI models are what’s known in computer science as black boxes: You can see what goes in and what comes out; what happens in between is a mystery.”

Trust but Verify: Peeking Inside the “Black Box” of Machine Learning

In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging copyright infringement. This case, along with a number of similar cases filed against AI companies, brings to the forefront a fundamental challenge in applying traditional copyright law to a revolutionary technology: Large Language Models (LLMs). Perhaps more than any copyright case that precedes them, these cases grapple with a form of alleged infringement that defies conventional legal analysis.

This article is the first in a three-part series that will examine the copyright implications of the AI development process.

Disclaimer: I’m not a computer or AI scientist. However, neither are the judges and juries that will be asked to apply copyright law to this technology, or the legislators that may enact laws regulating it. It’s unlikely that they will go much beyond the level of detail I’ve used here.

What are Large Language Models (LLMs)?

Large Language Models, or LLMs, are gargantuan AI systems that use a vast corpus of training data and billions to trillions of parameters. They are designed to understand, generate, and manipulate human language. They learn patterns from the data, allowing them to perform a wide range of language tasks with remarkable fluency. Their inner workings are fundamentally different from any previous technology that has been the subject of copyright litigation, including traditional computer software.

LLMs typically use transformer-based neural networks: interconnected nodes organized into layers that can perform computations. The strengths of these connections—the influences that nodes have on another—are what is learned during training. These are called the model parameters or weights, and they are represented as numbers.

Here’s a simplified explanation of what happens when you use an AI like a large language model:

You input a prompt (your question or request).
The computer breaks down your prompt into smaller pieces called tokens. These can be words, parts of words, or even individual characters.
The AI processes these tokens through its neural network – imagine this like a complex web of connections. Each part of this network analyzes the tokens and figures out how they relate to each other.
As it processes, the AI predicts the probability distribution for the next token based on what it learned during its training.
The LLM selects tokens based on these probabilities and combines them to create a coherent response or output for you, the user.

The “large” in Large Language Models primarily refers to the enormous number of parameters these models contain – sometimes in the trillions. These parameters represent the model’s learned patterns and relationships, fine-tuned through exposure to massive amounts of text data. While larger and more diverse high-quality datasets can lead to better AI models, other factors such as model architecture, training techniques, and fine-tuning also play important roles in model performance.

How Do AI Companies Obtain Their Training Data?

AI companies employ various methods to acquire this data –

– Web scraping and crawling. One of the primary methods of data acquisition is web scraping – the automated process of extracting data from websites. AI companies deploy sophisticated crawlers that systematically browse the internet, copying text from millions of web pages. This method allows for the collection of diverse, up-to-date information but raises questions about the use of copyrighted material without explicit permission.

– Partnerships and licensing agreements. Some companies enter into partnerships or licensing agreements to access high-quality, curated datasets. For instance, OpenAI has partnered with organizations like the Associated Press to use its news archives for training purposes.

– Public datasets and academic corpuses. Many LLMs are trained, at least in part, on publicly available datasets and academic text collections. These might include Project Gutenberg’s collection of public domain books, scientific paper repositories, or curated datasets like the Common Crawl corpus.

– User-generated content. Platforms that interact directly with users, such as ChatGPT, can potentially use the conversations and inputs from users to further train and refine their models. This practice raises privacy concerns and questions about the ownership of user-contributed data.

In the context of the New York Times lawsuit, it’s worth noting that OpenAI, like many AI companies, has not publicly disclosed the full extent of its training data sources. However, it’s widely believed that the company uses a combination of publicly available web content, licensed datasets, and partnerships to build its training corpus. The lawsuit alleges that this corpus includes copyrighted New York Times articles, obtained without permission or compensation.

The Training Process: How Machines “Learn” From Data

Once acquired, the raw data undergoes several processing steps before it can be used to train an LLM –

– Data preprocessing and cleaning. The first step involves cleaning the raw data. This includes removing irrelevant information, correcting errors, and standardizing the format. This may involve stripping away HTML tags, removing advertisements, or filtering out low-quality content.

– Tokenization and encoding. Next, the text is broken down into smaller units called tokens. These might be words, parts of words, or even individual characters. Each token is then converted into a numerical representation that the AI can process. This step is crucial as it determines how the model will interpret and generate language.

During training, the LLM is exposed to this preprocessed data, learning to predict patterns and relationships between tokens. This is an iterative process where the model makes predictions, compares them to the actual data, and adjusts its internal parameters to improve accuracy. This process, known as “backpropagation,” is repeated billions of times across the entire dataset. In a large LLM this can take months, operating 24/7 on a massive system of graphics processing chips.

The Transformation From Text to Numbers

For purposes of copyright law, here’s the crux of the matter: the AI industry asserts that after this process, the original text no longer exists in any recognizable form within the LLM. The model becomes a vast sea of numbers, with no direct correspondence to the original text. If true, this transformation creates a fundamental challenge for copyright law –

– No Side-by-Side Comparison: In traditional copyright cases, courts rely heavily on comparing the original work side-by-side with the allegedly infringing material. With LLMs, this is impossible. You can’t “read” an LLM or print it out for comparison.

– Black Box Nature: The internal workings of LLMs are often referred to as a “black box.” Even the developers may not fully understand how the model arrives at its outputs.

– Dynamic Generation: The AI industry claims that LLMs don’t store and retrieve text in a conventional database format; they generate it dynamically based on learned patterns. This means that any similarity to copyrighted material in the output is a result of statistical prediction, not direct copying.

– Distributed Information: The AI industry claims that Information from any single source is distributed across countless parameters in the model, making it impossible to isolate the influence of any particular work.

However, copyright owners do not concede that completed AI models (as distinct from the training data) are only abstracted statistical patterns of the training data. Rightsholders assert that LLMs do indeed retain the expressions of the original works on which they have been trained. There are studies showing the LLM models are able to regurgitate their training materials, and the New York Times lawsuit against OpenAI and Microsoft shows 100 examples of this. See also Concord Music Group v. Anthropic (alleging that song lyrics can be accessed verbatim or near-verbatim from Claude). Rightsholders argue that this could only occur if the models encode the expressive content of these works.

Copyright Implications

Assuming the AI developers’ explanation to be correct (if its not the infringement case against them is strong), AI technology creates unprecedented challenges for copyright law –

– Proving Infringement: How can a plaintiff prove infringement when the allegedly infringing material can’t be directly observed or compared?

– Fair Use Analysis: Traditional fair use factors, such as the amount and substantiality of the portion used, become difficult to apply when the “portion used” is transformed beyond recognition.

– Substantial Similarity: The legal test of “substantial similarity” between works becomes almost meaningless in the context of LLMs.

– Expert Testimony: Courts will likely have to rely heavily on expert testimony to understand the technology, but even experts may struggle to definitively prove or disprove infringement.

For all of these reasons, to prove copyright infringement plaintiffs such as the New York Times may be limited to claiming copyright infringement based on the “intermediate” copies that are used in the training process and user-prompted output, rather than the LLM models themselves.

Conclusion

The NYT v. OpenAI case and others raising the same issue highlight a fundamental mismatch between traditional copyright law and the reality of LLM technology and the AI industries’ fair use defense. The outcome of this case could reshape our understanding of copyright in the digital age, potentially requiring new legal tests and standards that can account for the invisible, transformed nature of information within AI systems.

Part 2 in this series will focus on the legal issues around the “input problem” of using copyrighted material for training. Part 3 will look at the “output problem” of AI-generated content that may copy or resemble copyrighted works, including what the AI industry calls “memorization.” As we’ll see, each of these issues presents its own unique challenges in the context of a technology that defies traditional legal analysis.

Continue reading Part 2 and Part 3 of this series.

Secondary Liability and Sony v. Cox

by Lee Gesmer | May 26, 2024 | Copyright

Copyright secondary liability can be difficult to wrap your head around. This judge-made copyright doctrine allows copyright owners to seek damages from organizations that do not themselves engage in copyright infringement, but rather facilitate the infringing behavior of others. Often the target of these cases are internet service providers, or “ISPs.”

Secondary liability has three separate prongs, “contributory” and “vicarious” infringement, and “inducement.” The third prong – inducement – is important but seen infrequently. For the elements of this doctrine see my article here.

Here’s how I outlined the elements of contributory and vicarious liability when I was teaching CopyrightX:

These copyright rules were the key issue in the Fourth Circuit’s recent blockbuster decision in Sony v. Cox Communications (4th Cir. Feb. 20, 2024).

In a highly anticipated ruling the court reversed a $1 billion jury verdict against Cox for vicarious liability but affirmed the finding of contributory infringement. The decision is a significant development in the evolving landscape of ISP liability for copyright infringement.

Case Background

Cox Communications is a large telecommunications conglomerate based in Atlanta. In addition to providing cable television and phone services it acts as an internet service provider – an “ISP” – to millions of subscribers.

The case began when Sony and a coalition of record labels and music publishers sued Cox, arguing that the ISP should be held secondarily liable for the infringing activities of its subscribers. The plaintiffs alleged that Cox users employed peer-to-peer file-sharing platforms to illegally download and share a vast trove of copyrighted music, and that Cox fell short in its efforts to control this rampant infringement.

A jury found Cox liable under both contributory and vicarious infringement theories, levying a jaw-dropping $1 billion in statutory damages – $99,830.29 for each of the 10,017 infringed works. Cox challenged the verdict on multiple fronts, contesting the sufficiency of the evidence and the reasonableness of the damages award.

The Fourth Circuit Opinion

On appeal, the Fourth Circuit dissected the two theories of secondary liability, arriving at divergent conclusions. The court sided with Cox on the issue of vicarious liability, finding that the plaintiffs failed to establish that Cox reaped a direct financial benefit from its subscribers’ infringing conduct. Central to this determination was Cox’s flat-fee pricing model, which remained constant irrespective of whether subscribers engaged in infringing or non-infringing activities. The mere fact that Cox opted not to terminate certain repeat infringers, ostensibly to maintain subscription revenue, was deemed insufficient to prove Cox directly profited from the infringement itself.

However, the court took a different stance on contributory infringement, upholding the jury’s finding that Cox materially contributed to known infringement on its network. The court was unconvinced by Cox’s assertions that general awareness of infringement was inadequate, or that a level of intent tantamount to aiding and abetting was necessary for liability to attach. Instead, the court articulated that supplying a service with the knowledge that the recipient is highly likely to exploit it for infringing purposes meets the threshold for contributory liability.

Given the lack of differentiation between the two liability theories in the jury’s damages award, coupled with the potential influence of the now-overturned vicarious liability finding on the damages calculation, the court vacated the entire award. The case now returns to the lower court for a new trial, solely to determine the appropriate measure of statutory damages for contributory infringement.

Relationship to the DMCA

This article’s header graphic illustrates the relationship between the secondary liability doctrines and the protection of the Digital Millennium Copyright Act (DMCA), Section 512(c) of the Copyright Act. As the graphic reflects, all three theories of secondary liability lie outside the DMCA’s safe harbor protection for third-party copyright infringement. The DMCA requires that a defendant satisfy multiple safe harbor conditions (See my 2017 article – Mavrix v. LiveJournal: The Incredible Shrinking DMC for more on this). If a plaintiff can establish the elements of any one of the three theories of secondary liability the defendant will violate one or more safe harbor conditions and lose DMCA protection.

Implications

The court’s decision signals a notable shift in the contours of vicarious liability for ISPs in the context of copyright infringement. By requiring a causal nexus between the defendant’s financial gain and the infringing acts themselves, the court has raised the bar for plaintiffs seeking to prevail on this theory.

The ruling underscores that simply profiting from a service that may be used for both infringing and non-infringing ends is insufficient; instead, plaintiffs must demonstrate a more direct and meaningful link between the ISP’s revenue and the specific acts of infringement. This might entail evidence of premium fees for access to infringing content or a discernible correlation between the volume of infringement and subscriber growth or retention.

While Cox may take solace in the reversal of the $1 billion vicarious liability verdict, the specter of substantial contributory infringement damages looms large as the case heads back for a retrial.

For ISPs, the ruling serves as a warning to reevaluate and fortify their repeat infringer policies, ensuring they go beyond cosmetic compliance with the DMCA’s safe harbor provisions. Proactive monitoring, prompt responsiveness to specific infringement notices, and decisive action against recalcitrant offenders will be key to mitigating liability risks.

On the other side of the equation, copyright holders may need to recalibrate their enforcement strategies, recognizing the heightened evidentiary burden for establishing vicarious liability. While the contributory infringement pathway remains viable, particularly against ISPs that display willful blindness or tacit encouragement of infringement, the Sony v. Cox decision underscores the importance of marshaling compelling evidence of direct financial benefit to support vicarious liability claims.

As this case enters its next phase, the copyright and technology communities will be focused on the outcome of the damages retrial. Regardless of the ultimate figure, the Fourth Circuit’s decision has already left a mark on the evolving landscape of online copyright enforcement.

Header image is published under the Creative Commons Attribution 4.0 License.

Supreme Court Allows Copyright Damages Beyond 3 Years – But Leaves Key Question Open

by Lee Gesmer | May 17, 2024 | Copyright

Many aspects of copyright law are obscure and surprising, even to lawyers familiar with copyright’s peculiarities. An example of this is copyright law’s three-year statute of limitations.

The Copyright Act states that “no civil action shall be maintained under the provisions of this title unless it is commenced within three years after the claim accrued.” 17 U. S. C. §507(b). In the world of copyright practitioners this is understood to mean that so long as a copyright remains in effect and infringements continue, an owner’s rights are not barred by the statute of limitations. However, they may be limited to damages that accrued in the three years before the owner files suit. This is described variously as a “three-year look-back,” a “rolling limitations period” or the “separate-accrual rule.”

This is what allowed Randy Wolfe’s estate to sue Led Zeppelin in 2014 for an alleged infringement that began in 1971.

However, there is a nuance to this doctrine – what if the copyright owner isn’t aware of the infringement? Is the owner still limited to damages accrued in the three years before he files suit?

That is the scenario the Supreme Court addressed in Warner Chappell Music, Inc. v. Nealy (May 9, 2024).

Background Facts

Songwriter Sherman Nealy sued Warner Chappell in 2018 for infringing his music copyrights going back to 2008. Warner responded that under the “three year look-back” rule Nealy’s damages were limited to three years before he filed suit. Nealy argued that his damages period should extend back to 2008, since his claims were timely under the “discovery rule” – he was in prison during much of this period and only learned of the infringements in 2016.

Nealy lost on this issue in the district court, which limited his damages to the infringer’s profits during the 3 years before he filed suit. The 11th Circuit reversed, holding that Nealy could recover damages beyond 3 years if his claims were timely – meaning that the case was filed within three years of when Nealy discovered the infringement.

The Supreme Court Decision

The Supreme Court affirmed the 11th Circuit and resolved a circuit split, holding:

1 – The Copyright Act’s 3-year statute of limitations governs when a claim must be filed, not how far back damages can go.

2 – If a claim is timely, the plaintiff can recover damages for all infringements, even those occurring more than 3 years before suit. The Copyright Act places no separate time limit on damages.

However, lurking within this ruling is another copyright law doctrine that the Court did not address that could render its ruling in Nealy moot – that is the proper application of the “discovery rule” under the Copyright Act. Under the discovery rule a claim accrues when “the plaintiff discovers, or with due diligence should have discovered” the infringement. (Nealy, Slip Op. p. 2). Competing with this is the less liberal “occurrence” rule, which holds that, in the absence of fraud or concealment, the clock starts running when the infringement occurs. Under the discovery rule Nealy would be able to recover damages back to 2008. Under the occurrence rule his damages would be limited to the three years before he filed suit, since he does not allege fraud or concealment.

However, the question of which rule applies under the Copyright Act has never been addressed by the Supreme Court, and is itself the subject of a circuit split. The Court assumed, without deciding and solely for purposes of deciding the issue before it, that the discovery rule does apply to copyright claims. If the discovery rule applies Nealy has a claim to retroactive damages beyond three years. If it does not, Nealy’s damages would be limited to the three years before he filed suit.

Justice Gorsuch, joined by Justices Thomas and Alito, focused on this in his dissent, arguing the Court should not have decided the issue when the “discovery vs. occurrence” issue has not been addressed:

The Court discusses how a discovery rule of accrual should operate under the Copyright Act. But in doing so it sidesteps the logically antecedent question whether the Act has room for such a rule. Rather than address that question, the Court takes care to emphasize that its resolution must await a future case. The trouble is, the Act almost certainly does not tolerate a discovery rule. And that fact promises soon enough to make anything we might say today about the rule’s operational details a dead letter.

Clearly, in the view of at least three justices, if and when the discovery vs. occurrence rule issue comes before the Court it could decide against the discovery rule in copyright cases, rendering its decision on damages in the Nealy case, and cases like it, moot.

State of the Law Today

What does this all boil down to? Here are the rules as they exist today –

– A copyright owner has been aware of an infringing musical work for 20 years. She finally sues the infringer. Her damages are limited by the three year damages bar. They may be limited even further based on the laches doctrine.

– A copyright owner has been meditating alone in a cave in Tibet for 20 years. She’s had no access to information from the outside world. Upon her return she discovers that someone has been infringing her literary work for the last 20 years. Depending on whether the federal circuit applies the discovery or the occurrence rule, she may recover damages for the entire 20 period, or just the preceding three years. Her lawyers should do some careful forum shopping.

– A copyright owner discovers someone has secretly been infringing her copyright in computer source code for 20 years. The source code was non-public, and therefore the infringement was concealed. She may recover damages for the full 20 year period.

Implications

The decision is a win for copyright plaintiffs, allowing them to reach back and get damages beyond 3 years – assuming their claims are timely and they are in a circuit that apples the discovery rule. But the Court left the door open to decide the more important question of whether the discovery rule applies to the Copyright Act’s statute of limitations at all. If not, the window for both filing claims and recovering damages will shrink. When this issue will reach the Supreme Court is uncertain. However, the Court has the opportunity to take it up as soon as next term. See Hearst Newspapers, LLC v. Martinelli, No. 23-474 (U.S. petition for cert. filed Nov. 2, 2023). In the meantime, the outer boundary of damages is limited only by the discovery rule (if it apples), not any separate damages bar. Plaintiffs with older claims should take note, as should potential defendants doing due diligence on liability exposure.

Update: On May 20, 2024, the Supreme Court of the United States denied the petition for certiorari in Hearst Newspapers, L.L.C. v. Martinelli, thereby declining to decide whether the discovery rule applies to copyright infringement claims and leaving the rule intact.

Header image attribution: Resource by Nick Youngson CC BY-SA 3.0 Pix4free

Is It Legal To Use Copyrighted Works to Train Artificial Intelligence Systems?

by Lee Gesmer | Nov 27, 2023 | Copyright, General

If you follow developments in artificial intelligence, two recent items may have caught your attention. The first is a Copyright Office submission by the VC firm Andreessen Horowitz warning that billions of dollars in AI investments could be worth less if companies developing this technology are forced to pay for their use of copyrighted data. “The bottom line is this . . . imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development.”

The second item is OpenAI’s announcement that it would roll out a “Copyright Shield,” a program that will provide legal defense and cost-reimbursement for its business customers who face copyright infringement claims. OpenAI is following the trend set by other AI providers, like Microsoft and Adobe, who are promising to indemnify their customers who may fear copyright lawsuits from their use of generative AI.

Underlying these two news stories is the fact that the explosion of generative AI has the copyright community transfixed and the AI industry apprehensive. The issue is this: Does copyright fair use allow AI providers to ingest copyright-protected works, without authorization or compensation, to develop large language models, the data sets that are at the heart of generative artificial intelligence? Multiple lawsuits have been filed by content owners raising exactly this issue.

The Technology

The current breed of generative AIs is powered by large language models (LLMs), also known as Foundation Models. Examples of these systems are ChatGPT, DALL·E-3, MidJourney and Stable Diffusion.

This technology requires that developers collect enormous databases known as “training sets.” This almost always requires copying millions of images, videos, audio and text-based works, many of which are protected by copyright law. When the data is scraped from the web this is potentially a massive infringement of copyright. The risk for AI companies is that, depending on the content (text, images, music, movies), this could violate the exclusive rights of reproduction, distribution, public performance, and the right to create derivative works.

However, for purposes of copyright fair use analysis it’s important to recognize that the downloads are only an intermediate step in creating an LLM. Greatly simplified, here’s how it works:

In the process of creating an LLM model words are broken down into tokens, numerical representations of the word. Each token is a unique numerical ID. The numerical IDs are then transformed into high-dimensional vectors. These vectors are learned during the model’s training and capture semantic meanings and relationships.

Through multiple layers of transformation and abstraction the LLM identifies patterns and correlations within the data. Cutting edge systems like GPT-4 have trillions of parameters. Importantly, these are not copies or replications of the copyright-protected input data. This process of transformation minimizes the risk that any output will be infringing. A judge or jury viewing the data in an LLM would see no similarity between the original copyrighted text and the LLM.

Is Generative AI “Transformative“?

Because the initial downloads in this process are copies, they are technically a copyright infringement – a “reproduction.” Therefore, it’s up to the AI companies to present a legal defense that justifies the copying, and the AI development community has made it clear that this defense is based on copyright fair use. At the heart of the AI industry’s fair use argument is the assertion that AI training models are “non-expressive uses.” Copyright protects expression. Non-expressive use is the use of copyrighted material in a way that does not involve the expression of the original material.

For the reasons discussed above, the AI industry has a strong argument that a properly constructed LLM is a non-expressive use of the copied data.

However, depending on the specific technology this may be an oversimplification. Not all AI systems are the same. They may use different data sets. Some, but not all, are designed to minimize “memorization” which makes it easier for end users to retrieve blocks of text or infringing images. Some systems use output filters to prevent end users from utilizing the LLM to create infringing content.

For any given AI system the fair use defense turns on whether the LLM is trained and filtered in such a way that its outputs do not resemble protected inputs. If users can obtain the original content, the fair use defense is more difficult to sustain.

There is a widespread assumption in the AI industry that, assuming an AI is designed with adequate safety measures, using copyright-protected content to train LLMs is shielded by the fair use doctrine. After all, the reasoning goes, the Second Circuit allowed Google to create a searchable index of copyrighted books under fair use. (Google Books, Hathitrust). And the Supreme Court permitted Google to copy Oracle’s Java API computer code for a different use. (Oracle v. Google). AI companies also point to cases holding that search engines, intermediate copying for the purpose of reverse engineering and plagiarism-detection software are transformative and therefore allowed under fair use. (Perfect 10 v. Google; Sega Enterprises v. Accolade; A.V. et al. v. iParadigms)

In each of these cases the use was found to be “transformative.” So long as the act of copying did not communicate the original protected expression to end users it did not interfere with the original expression that copyright is designed to protect. The AI industry contends that LLM-based systems that are properly designed fall squarely under this line of cases.

How Does Generative AI Impact Content Owners?

In evaluating AI’s fair use defense the commercial impact on content owners is also important. This is particularly true under the Supreme Court’s decision earlier this year in Warhol Foundation v. Goldsmith. In Warhol the Court taught that, in a case that involved commercial copying of photographs, the fact that the copies were used in competition with the originals weighed against fair use.

AI developers will argue that, so long as users can’t use their generative AI systems to access protected works, there is no commercial impact on content owners. In other words, like in Google Books, the AI does not substitute for or compete with content owners’ original protected expression. No one can use a properly constructed AI to read a James Patterson novel or listen to a Joni Mitchell song.

The AIs should be able to distinguish Warhol by pointing out that they are not selling the actual copyrighted books or images in their data sets, and therefore – like in Google Books – they are causing the content owners no commercial harm. In other words, the AI developers will argue that the “intermediate copying” involved in creating and training an LLM is transformative where the resulting model does not substitute for any author’s original expression and the model targets a different economic market.

Does the authority of Google Books and the other intermediate copying cases extend to the type of machine learning that underpins generative AI? While the law regulating AI is in its infancy, several recent district court cases have given plaintiffs an unfriendly reception. In Thomson Reuters v. Ross Intelligence the defendant used West’s head notes and key number system to train a specialized natural language AI for lawyers. West claimed infringement. A Delaware federal district court judge denied Ross’s motion for summary judgment based on fair use, and held that the case must be decided by a jury. However, relying on the intermediate copying cases, the judge noted that Ross would have engaged in transformative fair use if its AI merely studied language patterns in the Westlaw headnotes and did not replicate the headnotes themselves. Since this is in fact how LLMs are trained on data, Ross’s fair use defense likely will succeed.

In a second case, Kadrey v. Meta, the plaintiffs, book authors, claimed that Meta’s inclusion of their books in its AI training data violated their exclusive ownership rights in derivative works. The Northern District of California federal judge dismissed this claim. The judge noted that the LLM models could not be viewed as recasting or adapting the plaintiff’s books. And, the plaintiffs had failed to allege that the content of any output was infringing. “The plaintiffs need to allege and ultimately to prove that the AI’s outputs incorporate in some form a portion of the plaintiffs’ books.” Another N.D. Cal. case, Andersen v. Stability AI is consistent with these rulings.

While these cases are early in the evolution of the law of artificial intelligence they suggest how AI developers can take precautions to insulate themselves from copyright liability. And, as discussed below, the industry is already taking steps in this direction.

The Industry Is Adapting To The Copyright Threat

In the face of legal uncertainty, the AI industry is adapting to legal risks. The potential damages for copyright infringement are massive, and the unofficial Silicon Valley motto – “move fast and break things” – doesn’t apply with the stakes this high.

ChatGPT4: Create an image showing Jack Nicholson in The Shining

Early in the current generative AI boom (only a year ago) it was possible to use some of these systems to generate copyright- protected content. However, the dominant AI companies seem to have plugged this hole. Today, if I ask OpenAI’s ChatGPT to provide the lyrics to “All Too Well” by Taylor Swift it declines to do so. When I ask for the text of the opening paragraph of Stephen King’s “The Shining,” again it refuses and tells me that it’s protected by copyright. When I ask OpenAI’s text-to-image creator Dall·E for an image of Batman, Dall·E refuses, and warns me that what it will create will be sufficiently different from the comic book character to avoid copyright infringement.

These technical filters are illustrative of the ways that the industry can address the copyright challenge, short of years of litigation in the federal courts.

The first, and most obvious, is to train the systems not to provide infringing output. As noted, Open AI is doing exactly this. The Shining may have been downloaded and used to create and train Chat GPT, but it won’t let me retrieve the text of even a small part of that novel.

ChatGPT4: Create an image of Taylor Swift performing her song “All Too Well”

Another technical measure is minimization of duplicates of the same work. Studies have found that the more duplicates that are downloaded and processed in an LLM the easier it is for end-users to retrieve verbatim protected content. “Deduplication” is a solution to this problem.

Another option is to license copyrighted content and pay its creators. While this would be logistically challenging, a challenge of similar complexity has been met in the music industry, which has complex licensing rules that address different types of music licensing and a centralized database system to make that process accessible. If the courts prove to be hostile to AI’s fair use defense the generative AI field could evolve into a licensing regime similar to that of music.

Another solution is for the industry to create “clean” databases, where there is no risk of copyright infringement. The material in the database will have been properly licensed or will be comprised of public domain materials. An example would be an LLM trained on Project Gutenberg, Wikipedia and government websites and documents.

Given the speed at which AI is advancing I expect a variety of yet-to-be conceived or discovered infringement mitigation strategies to evolve, perhaps even invented by artificial intelligence.

International Issues

Copyright laws vary significantly across countries. It’s worth noting that there has been more legislative activity on the topics discussed in this post in the EU than the US. That said, as of the date of this post near the close of 2023 there is no consensus on how LLMs should be treated under EU copyright law.

Under a recent proposal made in connection with the proposed EU “AI Act,” providers of LLMs would need to “prepare and make publicly available a sufficiently detailed summary of the content used to train the model or system and information on the provider’s internal policy for managing copyright-related aspects.”

Additionally, they would need to demonstrate “that adequate measures have been taken to ensure the training of the model or system is carried out in compliance with Union law on copyright and related rights . . .”

The second of these two provisions would allow rights holders to opt out of allowing their works to be used for LLM training.

In contrast, the recent US AI Executive Order orders the Copyright Office to conduct a study that would include “the treatment of copyrighted works in AI training,” but does not propose any changes to US copyright law or regulations. However, US AI companies will have to pay close attention to laws enacted in the EU (or elsewhere), since – as has been the case with the EU’s privacy laws (GDPR) – they have the potential to become a de facto minimal standard for legal compliance worldwide.

Andreessen Horowitz and the Copyright Shield

What about the two news items that I mentioned at the beginning of this post? With respect to the Andreessen Horowitz warning of the cost of copyright risk on AI developers, in my view the risk is overstated. If AI developers design their systems with the proper precautions, it seems likely that the courts will find them to qualify for fair use.

As to OpenAI’s promise to indemnify end users, the risk to OpenAI is slim, since its output is rarely similar to inputs in its training data and its filters are designed to frustrate users who try to output copyrighted content. In any event end users are rarely the targets of infringement suits, as seen in the many copyright suits that have been filed to date, which all target only AI companies as defendants.

The Future

The application of US copyright law to LLM-based AI systems is a complex topic. I expect more lawsuits to be filed as what appears to be a massive revolution in artificial intelligence continues at breakneck speed. While traditional copyright law seems to favor a fair use defense, the devil is in the details of these complex systems, and the legal outcome is by no means certain.

***

Selected pending cases:

Andersen v. Stability AI, N.D. Cal.

J.L. v. Alphabet Inc., N.D. Cal.

P.M. v. OpenAI, N. Dist. Cal.

Doe v. GitHub, N.D. Cal

Thomson Reuters Enter. Ctr. GmbH v. Ross Intel. Inc., D. Del.

Kadrey v. Meta, N.D. Cal.

Sancton v. OpenAI, S.D. N.Y.

Doe v. GitHub, N.D. Cal.

« Older Entries

Reddit v. Anthropic: Contract Law as the New Frontier in AI Data Governance

Copyright Office Issues Long-Awaited Report on Generative AI Training – Register of Copyrights Is Fired Next Day

Copyright, AI, and Meta’s Torrent Problem

SDNY Courts Split Over Copyright Management Information in AI Cases

A Postscript to my AI Series – Why Not Use the DMCA?

Copyright and the Challenge of Large Language Models (Part 3)

The Controversy

OpenAI: “The Times Hacked ChatGPT”

OpenAI: “We Are Protected by the Betamax Case”

Technical Solutions

Conclusion

Copyright And The Challenge of Large Language Models (Part 2)

An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law

Copyright And The Challenge of Large Language Models (Part 1)

Secondary Liability and Sony v. Cox

Supreme Court Allows Copyright Damages Beyond 3 Years – But Leaves Key Question Open

Is It Legal To Use Copyrighted Works to Train Artificial Intelligence Systems?

Recent Posts

Categories

Quote of the Day

Top Rated Attorney