Select Page
Copyright and the Challenge of Large Language Models (Part 3)

Copyright and the Challenge of Large Language Models (Part 3)

In the first two parts of this series I examined how large language models (LLMs) work and analyzed whether their training process can be justified under copyright law’s fair use doctrine. (Part 1, Part 2). However, I also noted that LLMs sometimes “memorize” content during the training stage and then “regurgitate” that content verbatim or near-verbatim when the model is accessed by users.

The “memorization/regurgitation” issue is featured prominently in The New York Times Company v. Microsoft and OpenAI, pending in federal court in the Southern District of New York. (A reference to OpenAI includes Microsoft, unless the context suggests otherwise). Because the technical details of every LLM and AI model are different I’m going to focus my discussion of this issue mostly on the OpenAI case. However, the issue has implications for every generative LLM trained on copyrighted content without permission.

The Controversy

Here’s the problem faced by OpenAI. 

OpenAI claims that the LLM models it creates are not copies of the training data it uses to develop its models, but uncopyrightable patterns. In other words, a ChatGPT LLM is not a conventional database or search engine that stores and retrieves content. As OpenAI’s co-defendant Microsoft explains, “A program called a ‘transformer’ evaluates massive amounts of text, converts that text into trillions of constituent parts, discerns the relationships among all of them, and yields a natural language machine that can respond to human prompts.” (Link, p. 2)

However, the Times has presented compelling evidence that challenges this narrative. The Times showed that it was able to prompt OpenAI’s ChatGPT-4 and Microsoft’s CoPilot to produce lengthy, near-verbatim excerpts from specific Times articles, which the Times then cited in its complaint as proof of infringement. 

The Times’ First Amended Complaint includes an exhibit with over 100 examples of ChatGPT “regurgitating” Times content verbatim or near-verbatim in response to specific prompts (copied text highlighted in red; click to enlarge):

 

 

This evidence poses a fundamental question: If OpenAI’s models truly transform copyright-protected content into abstract patterns rather than storing it, how can they reproduce exact or nearly exact copies of that content?

The Times argues that this evidence reveals a crucial truth: actual copyrighted expression—not just abstract patterns—is encoded within the model’s parameters. This allegation strikes at the foundation of OpenAI’s legal position and weakens its fair use defense by suggesting its use of copyrighted material is more extensive and less transformative than claimed.

Just how big a problem this is for OpenAI and the AI industry is difficult to determine. I’ve tried to replicate it in a variety of cases on ChatGPT and several other frontier models without success. In fact I can’t get the models to give me the text of Moby Dick, Tom Sawyer or other literary works whose copyrights have long expired. 

Nevertheless, the Times was able to do this one hundred times, and it’s safe to assume that it could have continued well past that number, but thought that 100 examples was enough to make the point in its lawsuit.

OpenAI: “The Times Hacked ChatGPT”

What’s OpenAI’s response to this?

To date, OpenAI and Microsoft have not filed answers to the Complaint. However, they have given an indication of how they view these allegations in partial motions to dismiss filed by both companies. 

Microsoft’s motion (p. 2) argues that the NYT’s methods to demonstrate how its content could be regurgitated did not represent real-world usage of the GPT tools at issue. “The Times,”it argues, “crafted unrealistic prompts to try to coax the GPT-based tools to output snippets of text matching The Times’s content.” (Emphasis in original) To get the NYT content regurgitated, a user would need to know the “genesis of that content.” “And in any event, the outputs the Complaint cites are not copies of works at all, but mere snippets” that do not rise to the level of copyright infringement. 

OpenAI’s motion (p. 12.) argues that the NYT “appears to have [used] prolonged and extensive efforts to hack OpenAI’s models”:

 In the real world, people do not use ChatGPT or any other OpenAI product for that purpose, … Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will. . . . The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.

It appears that OpenAI is referring to the provisions in its Terms of Service that prohibit anyone from “Us[ing] our Services in a way that infringes, misappropriates or violates anyone’s rights” or to “extract data.” OpenAI has labeled these “adversarial attacks.”

Copyright owners don’t buy this tortured explanation. As OpenAI has admitted in a submission to the Patent and Trademark Office, “An author’s expression may be implicated . . . because of a similarity between her works and an output of an AI system.” (link, n. 71, emphasis added). 

Rights holders claim that their ability to extract memorized content from these systems puts the lie to, for example, OpenAI’s assertion that “an AI system can eventually generate media that shares some commonalities with works in the corpus (in the same way that English sentences share some commonalities with each other by sharing a common grammar and vocabulary) but cannot be found in it.” (link, p. 10).

Moreover, OpenAI’s “hacking” defense would seem to inadvertently support the Times’ position. After all, you cannot hack something that isn’t there. The very fact that this content can be extracted, regardless of the method, suggests it exists in the form of an unauthorized reproduction within the model. 

OpenAI: “We Are Protected by the Betamax Case”

How will OpenAI and Microsoft respond to these allegations under copyright law? 

To date, OpenAI and Microsoft have yet to file formal answers to the Times’ complaint. However, they have given us a hint of their defense strategy in their motions to dismiss, and it is based in part on the Supreme Court’s 1984 decision in Sony v. Universal City Studios, a case often referred to as “the Betamax case.” 

In the Betamax case a group of entertainment companies sued Sony for copyright infringement, arguing that consumers used Sony VCRs to infringe by recording programs broadcast on television. The Supreme Court held that Sony could not be held contributorily liable for infringements committed by VCR owners. “[T]he sale of copying equipment . . . does not constitute contributory infringement if the product is . . . capable of substantial noninfringing uses.”

The take-away from this case is that under copyright law if a product can be put to either a legal or illegal purpose by end-users (a “dual-use”), it is not infringing so long the opportunity for noninfringing use is substantial. 

OpenAI and Microsoft assert that the Betamax case applies because, like the VCR, ChatGPT is a “dual-use technology.” While end users may be able to use “adversarial prompts” to “coax” a model to produce a verbatim copy of training data, the system itself is a neutral, general-purpose tool. In most instances it will be put to a non-infringing use. Citing the Betamax case Microsoft argues that “copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine)”—all dual-use technologies.

No doubt, in support of this argument OpenAI will place strong emphasis on ChatGPT’s many important non-infringing uses. The model can create original content, analyze public domain texts, process user-provided content, educate, generate software code, and more. 

However, OpenAI’s reliance on the Betamax dual-use doctrine faces a challenge central to the doctrine itself. The Betamax case was based on secondary liability—whether Sony could be held responsible for consumers using VCRs to record television programs. The alleged infringements occurred through consumer action, not through any action taken by the device’s manufacturer.

But with generative LLMs such as ChatGPT the initial copying happens during training when the model memorizes copyrighted works. This is direct infringement by the AI company itself, not secondary infringement based on user prompts. When an AI company creates a model that memorizes and can reproduce copyrighted works, the company itself is doing the copying—making this fundamentally different from Betamax.

Before leaving this topic it’s important to note that the full scope of memorization within AI models of GPT-4’s scale may be technically unverifiable. While the models’ creators can detect some instances of memorization through testing, due to the scale and complexity of the models they cannot comprehensively examine their internal representations to determine the full extent of memorized copyrighted content. While the Times’ complaint demonstrated one hundred instances of verbatim copying, this could represent just the tip of the iceberg, or conversely, the outer limit of the problem. This uncertainty itself poses a significant challenge for courts attempting to apply traditional copyright principles.

Technical Solutions

While these legal issues work their way through the courts, AI companies aren’t standing still. They recognize that their long-term success may depend on their ability to prevent or minimize memorization, regardless of how courts ultimately rule on the legal issues.

Their approaches to this challenge vary. OpenAI has told the public that it is taking measures to prevent the types of copying illustrated in the Times’ lawsuit: “we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.” (link) This includes filtering or modifying user prompts to reject certain requests before they are submitted as prompts to the model and aligning the models to refuse to produce certain types of data. Try asking ChatGPT to give you the lyrics to Arlo Guthrie’s “Alice’s Restaurant Massacree” or Taylor Swift’s “Cruel Summer.” It will tell you that copyright law prohibits it from doing so. 

And, it’s important to note that different AI companies are taking different approaches to this problem. For example, Google (which owns Gemini) uses supervised fine tuning (explained here). Anthropic (which owns Claude) focuses on what it calls “constitutional AI” – a training methodology that builds in constraints against certain behaviors, including the reproduction of copyrighted content. (link here). Meta (LLaMA models) has implemented what it calls “deduplication” during the training process – actively removing duplicate or near-duplicate content from training data to reduce the likelihood of memorization. Additionally, Meta has developed techniques to detect and filter out potential memorized content during the model’s response generation phase. (link here).

Conclusion

The AI industry faces a fundamental challenge that sits at the intersection of technology and law. Current research suggests that some degree of memorization may be inherent to large language models – raising a crucial question for courts: If memorization cannot be eliminated without sacrificing model performance, how should copyright law respond?

The answer could reshape both AI development and copyright doctrine. AI companies may need to accept reduced performance in exchange for legal compliance, while content creators must decide whether to license their works for AI training despite the risk of memorization. The industry’s ability to develop systems that truly learn patterns without memorizing specific expressions – or courts’ willingness to adapt copyright law to this technological reality – may determine its future.

The outcome of the Times lawsuit may establish crucial precedents for how copyright law treats AI systems that can memorize and reproduce protected content. At stake is not just the legality of current AI models, but the broader question of how to balance technological innovation with the rights of content creators in an era where the line between learning and copying has become increasingly blurred.

 

Copyright And The Challenge of Large Language Models (Part 2)

Copyright And The Challenge of Large Language Models (Part 2)

“Fair use is the great white whale of American copyright law. Enthralling, enigmatic, protean, it endlessly fascinates us even as it defeats our every attempt to subdue it.” – Paul Goldstein

__________________________

This is the second in a 3-part series of posts on Large Language Models (LLMs) and copyright. (Part 1 here

In this post I’ll turn to a controversial and important legal question: does the use of copyrighted material in training LLMs for generative AI constitute fair use? This analysis requires a nuanced understanding of both copyright fair use and the technical aspects of LLM training (see Part 1). To examine this complex issue I’ll look at recent relevant case law and consider potential solutions to the legal challenges posed by AI technology. 

Introduction

The issue is this: generative AI systems – systems that generate text, graphics, video, music – are being trained without permission on copies of millions of copyrighted books, artwork, software and music scraped from the internet. However, as I discussed in Part 1 of this series, the AI industry argues that the resulting models themselves are not infringing. Rightsholders argue that even if this is true (and they assert that it is not), the use of their content to train AI models is infringing, and that is the focus of this post.

To put this in perspective, consider where AI developers get their training data. It’s generally acknowledged that many of them have used resources such as Common Crawl, a digital archive containing 50 billion web pages, and Books3, a digital library of thousands of books. While these resources may contain works that are in the public domain, there’s no doubt that they contain a huge quantity of works that are protected by copyright. 

In the AI industry, the thirst for this data is insatiable – the bigger the language models, the better they perform, and copyrighted works are an essential component of this data. In fact, the industry is already looking at a “data wall,” the time when they will run out of data. They may hit that wall in the next few years. If copyrighted works can’t be included in training data, it will be even sooner.

Rightsholders assert that the use of this content to train LLMs is outright, massive copyright infringement. The AI industry responds that fair use – codified in 17 U.S.C. § 107 – covers most types of model training where, as they assert, the resulting model functions differently than the input data. This is not just an academic difference – the issue is being litigated in more than a dozen lawsuits against AI companies, attracting a huge amount of attention from the copyright community.

No court has yet ruled on whether fair use protects the use of copyright-protected material as training material for LLMs. Eventually, the courts will answer this question by applying the language of the statute and the court decisions applying copyright fair use.

Legal Precedents Shaping the AI Copyright Landscape

To understand how the courts are likely to evaluate these cases we need to look at four recent cases that have shaped the fair use landscape: the two Google Books cases, Google v. Oracle, and Warhol Foundation v. Goldsmith. In addition the courts are likely to apply what is known as the “intermediate copying” line of cases. 

The Google Books Cases. Let’s start with the two Google Books cases, which in many ways set the stage for the current AI copyright dilemma. The AI industry has put its greatest emphasis on these cases. (OpenAI: “Perhaps the most compelling case on point is Authors Guild v. Google”).

Authors Guild v. Google and Author’s Guild v. Hathitrust. In 2015, the Second Circuit Court of Appeals decided Authors Guild v. Google, a copyright case that had been winding through the courts for a decade. Google had scanned millions of books without permission from rightsholders, creating a searchable database.

The Second Circuit held that this was fair use. The court’s decision hinged on two key points. First, the court found Google’s use highly “transformative,” a concept central to fair use. Google wasn’t reproducing books for people to read; it was creating a new tool for search and analysis. While Google allowed users to see small “snippets” of text containing their search terms, this didn’t substitute for the actual books. Second, the court found that Google Books was more likely to enhance the market for books than harm it. The court also emphasized the immense public benefit of Google Books as a research tool.

A sister case in the Google Books saga was Authors Guild v. HathiTrust, decided by the Second Circuit in 2014. HathiTrust, a partnership of academic institutions, had created a digital library from book scans provided by Google. HathiTrust allowed researchers to conduct non-consumptive research, such as text mining and computational analysis, on the corpus of digitized works. Just as in Google Books, the court found the creation of a full-text searchable database to be a fair use, even though it involved copying entire works. Importantly, the court held this use of the copyrighted books to be transformative and “nonexpressive.”

The two cases were landmark fair use decisions, especially for their treatment of mass digitization and nonexpressive use of copyrighted works – a type of use that involves copying copyrighted works but does not communicate the expressive aspects of those works.

These two cases, while important, by no means guarantee the AI industry the fair use outcome they are seeking. Reliance on Google Books falters given the scope of potential output of AI models. Unlike Google Books’ limited snippets, LLMs can generate extensive text that may mirror the style and substance of copyrighted works in their training data. This raises concerns about market harm, a critical factor in fair use analysis, and whether LLM-generated content could eventually serve as a market substitute for the original works. The New York Times argues just this in its copyright infringement case against OpenAI and Microsoft.

Hathitrust is an even weaker precedent for LLM fair use. The Second Circuit held that HathiTrust’s full-text search “posed no harm to any existing or potential traditional market for the copyrighted works.” LLMs, in contrast, have the potential to generate content that could compete with or substitute for original works, potentially impacting markets for copyrighted material. Also, HathiTrust was created by universities and non-profit institutions for educational and research purposes. Commercial LLM development may not benefit from the same favorable consideration under fair use analysis. 

In sum, the significant differences in purpose, scope, and potential market impact make both Google Books and Hathitrust imperfect authorities for justifying the comprehensive use of copyrighted materials in training LLMs.

Google v. Oracle. Fast forward to 2021 for another landmark fair use case, this time involving software code. In Google v. Oracle, the Supreme Court held that Google’s copying of 11,500 lines of code from Oracle’s Java API was intended to facilitate interoperability, and was fair use. 

The Court found Google’s “purpose and character” was transformative because it “sought to create new products” and was “consistent with that creative ‘progress’ that is the basic constitutional objective of copyright itself.” The Court also downplayed the market harm to Oracle, noting that Oracle was “poorly positioned to succeed in the mobile phone market.” 

This decision seemed to open the door for tech companies to make limited use of some copyrighted works in the name of innovation. However, the case’s focus on functional code limits its applicability to LLMs, which are trained on expressive works like books, articles, and images. The Supreme Court explicitly recognized the inherent differences between functional works, which lean towards fair use, and expressive creations at the heart of copyright protection. So, again, the AI industry will have difficulty deriving much support from this decision. 

And, before we could fully digest Oracle’s implications for fair use, the Supreme Court threw a curveball.

Andy Warhol Foundation v. Goldsmith. In 2023, the Court decided Andy Warhol Foundation v. Goldsmith (Warhol), a case dealing with Warhol’s repurposing of a photograph of the musician Prince. While the case focused specifically on appropriation art, its core principles resonate with the ongoing debate surrounding LLMs’ use of copyrighted materials.

The Warhol decision emphasizes a use-based approach to fair use analysis, focusing on the purpose and character of the defendant’s use, particularly its commercial nature, and whether it serves as a market substitute for the original work. This emphasis on commerciality and market substitution poses challenges for LLM companies defending the fair use of copyrighted works in training data. The decision underscores the importance of considering potential markets for derivative works. As the use of copyrighted works for AI training becomes increasingly common, a market for licensing such data is emerging. The existence of such a market, even if nascent, could weaken the argument that using copyrighted materials for LLM training is a fair use, particularly when those materials are commercially valuable and readily licensable

The “Intermediate Copying” Cases. I also expect the AI industry to rely on the case law on “intermediate copying.” In this line of cases the users copied material to discover unprotectable information or as a minor step towards developing an entirely new product. So the final output – despite using copied material as an intermediate step – was noninfringing. In these cases the “intermediate use” was held to be fair use. See Sega v. Accolade (9th Cir. 1992) (defendant copied Sega’s copyrighted software to figure out the functional requirements to make games compatible with Sega’s gaming console). Sony v. Connectix (9th Cir. 2000)(defendant used a copy of Sony’s software to reverse engineer it and create a new gaming platform on which users could play games designed for Sony’s gaming system).

AI companies likely will argue that, just as in these cases, LLMs study language patterns as part of the process of transforming intermediate copying into noninfringing materials. Rightsholders likely will argue that whereas in those cases the copiers sought to study functionality or create compatibility, the scope and nature of use and the resulting product are vastly different from LLM fair use. I expect rightsholders will have the better argument on these cases. 

Applying Legal Precedents to AI

So, where does this confusing collection of cases leave us? Here’s a summary:

The Content Industry Position – in a Nutshell: Rightsholders argue that – even assuming that the final LLM model does not contain expressive content (which they dispute) – the use of copyrighted works to train LLMs is an infringement not excused by fair use. They argue that all four fair use factors weigh against AI companies:

      –  Purpose and character: Many (but not all) AI applications are commercial, which cuts against the industries’ fair use argument, especially in light of Warhol’s emphasis on commercial purpose and the potential licensing market for training data. The existence of a licensing market for training datasets suggests that AI companies can obtain licenses rather than rely on fair use defenses. This last point – market effect – is particularly important in light of the Supreme Court’s holding in Andy Warhol

      –  Nature of the work: Unlike the computer code in Google v. Oracle, which the Supreme Court noted receives “thin” protection, the content ingested by AI companies contains highly creative works like books, articles, and code. This distinguishes Oracle from AI training, and cuts against fair use.

      –  Amount used: Entire works are copied, a factor that weighs against fair use.

      –  Market effect: End users are able to extract verbatim content from LLMs, harming the market for original works and, as noted above , harming current and future AI training licensing markets. 

The AI Industry Position – in a Nutshell. The AI industry will argue that the use of copyrighted works should be considered fair use:

      –  Transformative Use: The AI industry argues that AI training creates new tools with different purposes from the original works, using copyright material in a “nonexpressive” way. AI developers draw parallels to “context shifting” fair use cases dealing with search engines and digital libraries, such as the Google Books project, arguing AI use is even more transformative. I expect them to rely on Google v. Oracle to argue that, just as Google’s use of Oracle’s API code was found to be transformative because it created something new that expanded the use of the original code (the Android platform), AI training is transformative, as it creates new systems with different purposes from the original works. Just as the Supreme Court emphasized the public benefit of allowing programmers to use their acquired skills, similarly AI advocates are likely to highlight the broad societal benefits and innovation enabled by LLMs trained on diverse data.

      –  Intermediate Copying. AI proponents will support this argument by pointing to the “intermediate copying” line of cases, which hold that using copyrighted works for purposes incidental to a nonexpressive purpose (creating the non-infringing model itself), is permissible fair use.  

      –  Market Impact: AI proponents will argue that AI training, and the models themselves, do not directly compete with or substitute for the original copyrighted works

      –  Amount and Substantiality: Again, relying on Google v. Oracle, AI proponents will note that despite Google copying entire lines of code, the Court found fair use. This will support their argument that copying entire works for AI training doesn’t preclude fair use if the purpose is sufficiently transformative.

      –  Public Benefit: In Google v. Oracle the Court showed a willingness to interpret fair use flexibly to accommodate technological progress. AI proponents will rely on this, and argue that applying fair use to AI training has social benefits and aligns with copyright law’s goal of promoting progress. The alternative, restricting access to training data, could significantly hinder AI research and development. (AI “doomers” are unlikely to be persuaded by this argument).

      –  Practical Necessity: Given the vast amount of data needed, obtaining licenses for all copyrighted material used in training is impractical, impossible or would be so expensive that it would stifle AI development.

As I noted above, It’s important to note that, as alleged in several of the lawsuits filed to date, some generative AI models have “memorized” copyrighted materials and are able to output them in a way that could substitute for the copyrighted work. If the outputs of a system can infringe, the argument that the system itself does not implicate copyright’s purposes will be significantly weakened.

While Part 3 of this series will explore these output-related issues in depth, it’s important to recognize the intrinsic link between these concerns and input-side training challenges. In assessing AI’s impact on copyright law courts may adopt a holistic approach, considering the entire content lifecycle – from data ingestion to LLMs to final output. This interconnected perspective reflects the complex nature of AI systems, where training methods directly influence both the characteristics and potential infringement risks of generated content.

Potential Solutions and Future Directions

As challenging as these issues are, we need to start thinking about practical solutions that balance the interests of AI developers, content creators, and the public. Here are some possibilities, along with their potential advantages and drawbacks.

Licensing Schemes: One proposed solution is to develop comprehensive licensing systems for AI training data, similar to those that exist for certain music uses. This could provide a mechanism for compensating creators while ensuring AI developers have access to necessary training data. 

Proponents argue that this approach would respect copyright holders’ rights and provide a clear framework for legal use. However, critics rightly point out that implementing such a system would be enormously complex and impractical. The sheer volume of content used in AI training, the difficulty of tracking usage, and the potential for exorbitant costs could stifle innovation, particularly for smaller AI developers.

 New Copyright Exceptions: Another approach is to create specific exemptions for AI training, perhaps limited to non-commercial or research purposes. This could be similar to existing fair use exceptions for research and could promote innovation in AI development. The advantage of this approach is that it provides clarity and could accelerate AI research. However, defining the boundaries of “non-commercial” use in the rapidly evolving AI landscape could prove challenging.

International Harmonization: Given the global nature of AI development, the industry may need to work towards a unified international approach to copyright exceptions for AI. This could involve amendments to international copyright treaties or the development of new AI-specific agreements. However, international copyright negotiations are notoriously slow and complex. Different countries have varying interests and legal traditions, which could make reaching a consensus difficult.

Technological Solutions: We should also consider technological approaches to addressing these issues. For instance, AI companies could develop more sophisticated methods to anonymize or transform training data, making it harder to reconstruct original works on the “output” side. They could also implement filtering systems to prevent the output of copyrighted material. While promising, these solutions would require significant investment and might not fully address all legal concerns. There’s also a risk that overzealous filtering could limit the capabilities of AI systems.

Hybrid Approaches: Perhaps the most promising solutions will combine elements of the above approaches. For example, we could see a tiered system where certain uses are exempt, others require licensing, and still others are prohibited. This could be coupled with technological measures such as synthetic training data, and international guidelines.

 Market-Driven Solutions: As the AI industry matures, we are likely to see the emergence of new business models that naturally address some of these copyright concerns. For instance, content creators might start producing AI-training-specific datasets, or AI companies might vertically integrate to produce their own training content. X’s Grok AI product and Meta are examples of this.

As we consider these potential solutions, it’s crucial to remember that the goal of copyright law is to foster innovation while fairly compensating creators and respecting intellectual property rights. Any solution will likely require compromise from all stakeholders and will need to be flexible enough to adapt to rapidly changing technology.

Moreover, these solutions will need to be developed with input from a diverse range of voices – not just large tech companies and major content producers, but also independent creators, smaller AI startups, legal experts, and public interest advocates. The path forward will require creativity, collaboration, and a willingness to rethink traditional approaches to copyright in the artificial intelligence age.

Conclusion – The Road Ahead

The intersection of AI and copyright law presents complex challenges that resist simple solutions. The Google Books cases provide some support for mass digitization and computational use of copyrighted works. Google v. Oracle suggests courts might look favorably on uses that promote new and beneficial AI technologies. But Warhol reminds us that transformative use has limits, especially in commercial contexts.

For AI companies, the path forward involves careful consideration of training data sources and potential licensing arrangements. It may also mean being prepared for legal challenges and working proactively with policymakers to develop workable solutions.

For content creators, it’s crucial to stay informed about how your work might be used in AI training. There may be new opportunities for licensing, but also new risks to consider.

For policymakers and courts, the challenge is to strike a balance that fosters innovation while protecting the rights and incentives of creators. This may require rethinking some fundamental aspects of copyright law. 

The relationship between AI and copyright is likely to be a defining issue in intellectual property law for years to come. Stay tuned, stay informed, and be prepared for a wild ride. 

Continue with Part 3 of this series here.

An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law

An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law

Google’s NotebookLM has been getting a lot of attention. You upload your sources (articles, Youtube videos, URLs, text documents, audio files) and NotebookLM can create a podcast based on the library you’ve created.

I thought I’d experiment with this a bit. I uploaded a variety of articles on copyright and AI and hit “go.” I didn’t give NotebookLM the subject or any prompts. It figured out the topic (correctly) and created the 11 minute podcast embedded below.

A few observations:

First, the speaker voices are natural and realistic – they interact fluidly, have natural intonation and use varied speech patterns.

Second, the content quality is very high – the podcast correctly highlights Google Books as the leading case on the issue and outlines the implications of the case for and against fair use.

It also discusses the New York Times v. Microsoft/OpenAI case in detail, and focuses on the fact that the NYT was able to force ChatGPT to regurgitate verbatim or near verbatim NYT content.

The podcast goes on to discuss StabilityAI, the four fair use factors (as applied) and the larger consequences of LLMs on the copyright system.

I downloaded the podcast and embedded it below, but I could just as easily have provided a link to the podcast in NotebookLM.

 

Anderson v. TikTok: A Potential Sea Change for § 230 Immunity

Anderson v. TikTok: A Potential Sea Change for § 230 Immunity

In late August the U.S. Third Circuit Court of Appeals released a far reaching decision, holding that § 230 of the Communications Decency Act (CDA) did not provide a safe harbor for the social media company TikTok when its algorithms recommended and promoted a video which allegedly led a minor to accidentally kill herself. Anderson v. TikTok (3rd Cir. Aug. 27, 2024).

Introduction

First, a brief reminder – § 230, which was enacted in 1996, has been the guardian angel of internet platform owners. The law prohibits courts from treating a provider of an “interactive computer service” i.e., a website, as the “publisher or speaker” of third-party content posted on its platform. 47 U.S.C. § 230(c)(1). Under § 230 websites have been given broad legal protection. § 230 has created what is, in effect, a form of legal exceptionalism for Internet publishers. Without it any social media site (such as Facebook, X) or review site (such as Amazon) would be sued into oblivion.

On the whole the courts have given the law liberal application, dismissing cases against Internet providers under many fact scenarios. However, there is a vocal group that argues that the broad immunity protection given to § 230 of the CDA is based on overzealous interpretations far beyond its original intent.

Right now § 230 has one particularly prominent critic – Supreme Court Justice Clarence Thomas. Justice Thomas has not held back when expressing disagreement with the broad protection the courts have provided under § 230. 

In Malwarebytes, Inc. v. Enigma Software (2020) a petition for writ of certiorari was denied, but Justice Thomas issued a “statement” – 

Nowhere does [§ 230] protect a company that is itself the information content provider . . . And an information content provider is not just the primary author or creator; it is anyone “responsible, in whole or in part, for the creation or development” of the content.

Again in Doe ex rel. Roe v. Snap, Inc. (2024), Justice Thomas dissented from the denial of certiorari and was critical of the scope of § 230, stating – 

In the platforms’ world, they are fully responsible for their websites when it results in constitutional protections, but the moment that responsibility could lead to liability, they can disclaim any obligations and enjoy greater protections from suit than nearly any other industry. The Court should consider if this state of affairs is what § 230 demands. 

With these judicial headwinds, Anderson v. TikTok sailed into the Third Circuit. Even one Supreme Court justice is enough to create a Category Two storm in the legal world. And boy, did the Third Circuit deliver, joining the § 230 opposition and potentially rewriting the rulebook on internet platform immunity.

Anderson v. TikTok

Nylah Anderson, a 10-year-old girl, died after attempting the “Blackout Challenge” she saw on TikTok. The challenge, which encourages users to choke themselves until losing consciousness, appeared on Nylah’s “For You Page”, a feed of videos curated by TikTok’s algorithm.

Nylah’s mother sued TikTok, alleging the company was aware of the challenge and promoted the videos to minors. TikTok defended itself using § 230, arguing that its algorithm shouldn’t strip away its immunity for content posted by others.

The district court dismissed the complaint, holding that TikTok was immunized by § 230. The Third Circuit reversed.

The Third Circuit Ruling

The Third Circuit took a novel approach to interpreting § 230, concluding that when internet platforms use algorithms to curate and recommend content, they are engaging in “first-party speech,” essentially creating their own expressive content.

The court reached this conclusion largely based on the Supreme Court’s recent decision in Moody v. NetChoice (2024). In that case the Court held that an internet platform’s algorithm that reflects “editorial judgments” about content compilation is the platform’s own “expressive product,” protected by the First Amendment. The Third Circuit reasoned that if algorithms are first-party speech under the First Amendment, they must be first-party speech under § 230 too.

Here is the court’s reasoning:

230 immunizes [web sites] only to the extent that they are sued for “information provided by another information content provider.” In other words, [web sites] are immunized only if they are sued for someone else’s expressive activity or content (i.e., third-party speech), but they are not immunized if they are sued for their own expressive activity or content (i.e., first-party speech) . . .. Given the Supreme Court’s observations that platforms engage in protected first-party speech under the First Amendment when they curate compilations of others’ content via their expressive algorithms, it follows that doing so amounts to first-party speech under § 230. . . . TikTok’s algorithm, which recommended the Blackout Challenge to Nylah on her FYP, was TikTok’s own “expressive activity,” and thus its first-party speech.

Accordingly, TikTok was not protected under § 230, and Anderson’s case could proceed.

Whether the Third Circuit’s logic will be adopted by other courts (including the Supreme Court, as I discuss below), is an open question. The court’s reasoning assumes that the definition of “speech” should be consistent across First Amendment and CDA § 230 contexts. However, these are distinct legal frameworks with different purposes. The First Amendment protects freedom of expression from government interference. CDA § 230 provides liability protection for internet platforms regarding third-party content. Treating them as interchangeable may oversimplify the nuanced legal distinctions between them.

Implications for Online Platforms

If this ruling stands, platforms may need to reassess their content curation and targeted recommendation algorithms. The more a platform curates or recommends content, the more likely it is to lose § 230 protection for that activity. For now, this decision has opened the doors for more lawsuits in the Third Circuit against platforms based on their recommendation algorithms. If the holding is adopted by other courts it could lead to a fundamental rethinking of how social media platforms operate.

As a consequence, platforms might become hesitant to use sophisticated algorithms for fear of losing immunity, potentially resulting in a less curated, more chaotic online environment. This could, paradoxically, lead to more harmful content being visible, contrary to the court’s apparent intent.

Where To From Here?

Given the potential far-reaching consequences of this decision, it’s likely that TikTok will  seek en banc review by the full Third Circuit, and given the potential impact of the ruling there’s a strong case for full circuit review.

If unsuccessful there, this case is a strong candidate for Supreme Court review, since it creates a circuit split, diverging from § 230 interpretations in other jurisdictions. The Third Circuit even helpfully cites pre-Moody diverging opinions from the 1st, 2nd, 5th, 6th, 8th, 9th, and DC Circuits, essentially teeing it up for Supreme Court review.

In fact, all indications are that the Supreme Court would be receptive to an appeal in this case. The Court recently accepted the appeal of a case in which it would determine whether algorithm-based recommendations were protected under § 230. However, after hearing oral argument it decided the case on different grounds and didn’t reach the § 230 issue. Gonzalez v. Google (USSC May 18, 2023). Anderson presents another opportunity for the Supreme Court to weigh in on this issue.

In the meantime, platforms may start experimenting with different forms of content delivery that could potentially fall outside the court’s definition of curated recommendations. This could lead to innovative new approaches to content distribution, or it could result in less personalized, less engaging online experiences.

Conclusion

Anderson v. TikTok represents a potential paradigm shift in § 230 jurisprudence. While motivated by a tragic case, the legal reasoning employed could have sweeping consequences for online platforms, content moderation, and user-generated content. The decision raises fundamental questions about the nature of online platforms and the balance between protecting free expression online and holding platforms accountable for harmful content. As we move further into the age of AI curated feeds and content curation, these questions will only become more pressing.

Anderson v. TikTok, Inc. (3d Cir. Aug. 27, 2024)

For two earlier posts on this topic see: Section 230 Supreme Court Argument in Gonzalez v. Google: Keep An Eye on Justice Thomas and Supreme Court Will Decide Whether Google’s Algorithm-Based Recommendations are Protected Under Section 230

Copyright And The Challenge of Large Language Models (Part 1)

Copyright And The Challenge of Large Language Models (Part 1)

“AI models are what’s known in computer science as black boxes: You can see what goes in and what comes out; what happens in between is a mystery.”

Trust but Verify: Peeking Inside the “Black Box” of Machine Learning

In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging copyright infringement. This case, along with a number of similar cases filed against AI companies, brings to the forefront a fundamental challenge in applying traditional copyright law to a revolutionary technology: Large Language Models (LLMs). Perhaps more than any copyright case that precedes them, these cases grapple with a form of alleged infringement that defies conventional legal analysis.

This article is the first in a three-part series that will examine the copyright implications of the AI development process.

Disclaimer: I’m not a computer or AI scientist. However, neither are the judges and juries that will be asked to apply copyright law to this technology, or the legislators that may enact laws regulating it. It’s unlikely that they will go much beyond the level of detail I’ve used here.

What are Large Language Models (LLMs)?

Large Language Models, or LLMs, are gargantuan AI systems that use a vast corpus of training data and billions to trillions of parameters. They are designed to understand, generate, and manipulate human language. They learn patterns from the data, allowing them to perform a wide range of language tasks with remarkable fluency. Their inner workings are fundamentally different from any previous technology that has been the subject of copyright litigation, including traditional computer software.

LLMs typically use transformer-based neural networks: interconnected nodes organized into layers that can perform computations. The strengths of these connections—the influences that nodes have on another—are what is learned during training. These are called the model parameters or weights, and they are represented as numbers.

Here’s a simplified explanation of what happens when you use an AI like a large language model:

  1. You input a prompt (your question or request).
  2. The computer breaks down your prompt into smaller pieces called tokens. These can be words, parts of words, or even individual characters.
  3. The AI processes these tokens through its neural network – imagine this like a complex web of connections. Each part of this network analyzes the tokens and figures out how they relate to each other.
  4. As it processes, the AI predicts the probability distribution for the next token based on what it learned during its training.
  5. The LLM selects tokens based on these probabilities and combines them to create a coherent response or output for you, the user.

The “large” in Large Language Models primarily refers to the enormous number of parameters these models contain – sometimes in the trillions. These parameters represent the model’s learned patterns and relationships, fine-tuned through exposure to massive amounts of text data. While larger and more diverse high-quality datasets can lead to better AI models, other factors such as model architecture, training techniques, and fine-tuning also play important roles in model performance.

How Do AI Companies Obtain Their Training Data?

AI companies employ various methods to acquire this data – 

– Web scraping and crawling. One of the primary methods of data acquisition is web scraping – the automated process of extracting data from websites. AI companies deploy sophisticated crawlers that systematically browse the internet, copying text from millions of web pages. This method allows for the collection of diverse, up-to-date information but raises questions about the use of copyrighted material without explicit permission.

– Partnerships and licensing agreements. Some companies enter into partnerships or licensing agreements to access high-quality, curated datasets. For instance, OpenAI has partnered with organizations like the Associated Press to use its news archives for training purposes.

– Public datasets and academic corpuses. Many LLMs are trained, at least in part, on publicly available datasets and academic text collections. These might include Project Gutenberg’s collection of public domain books, scientific paper repositories, or curated datasets like the Common Crawl corpus.

– User-generated content. Platforms that interact directly with users, such as ChatGPT, can potentially use the conversations and inputs from users to further train and refine their models. This practice raises privacy concerns and questions about the ownership of user-contributed data.

In the context of the New York Times lawsuit, it’s worth noting that OpenAI, like many AI companies, has not publicly disclosed the full extent of its training data sources. However, it’s widely believed that the company uses a combination of publicly available web content, licensed datasets, and partnerships to build its training corpus. The lawsuit alleges that this corpus includes copyrighted New York Times articles, obtained without permission or compensation.

The Training Process: How Machines “Learn” From Data

Once acquired, the raw data undergoes several processing steps before it can be used to train an LLM – 

– Data preprocessing and cleaning. The first step involves cleaning the raw data. This includes removing irrelevant information, correcting errors, and standardizing the format. This may involve stripping away HTML tags, removing advertisements, or filtering out low-quality content.

– Tokenization and encoding. Next, the text is broken down into smaller units called tokens. These might be words, parts of words, or even individual characters. Each token is then converted into a numerical representation that the AI can process. This step is crucial as it determines how the model will interpret and generate language.

During training, the LLM is exposed to this preprocessed data, learning to predict patterns and relationships between tokens. This is an iterative process where the model makes predictions, compares them to the actual data, and adjusts its internal parameters to improve accuracy. This process, known as “backpropagation,” is repeated billions of times across the entire dataset. In a large LLM this can take months, operating 24/7 on a massive system of graphics processing chips.

The Transformation From Text to Numbers

For purposes of copyright law, here’s the crux of the matter: the AI industry asserts that after this process, the original text no longer exists in any recognizable form within the LLM. The model becomes a vast sea of numbers, with no direct correspondence to the original text. If true, this transformation creates a fundamental challenge for copyright law – 

– No Side-by-Side Comparison: In traditional copyright cases, courts rely heavily on comparing the original work side-by-side with the allegedly infringing material. With LLMs, this is impossible. You can’t “read” an LLM or print it out for comparison.

– Black Box Nature: The internal workings of LLMs are often referred to as a “black box.” Even the developers may not fully understand how the model arrives at its outputs.

– Dynamic Generation: The AI industry claims that LLMs don’t store and retrieve text in a conventional database format; they generate it dynamically based on learned patterns. This means that any similarity to copyrighted material in the output is a result of statistical prediction, not direct copying.

– Distributed Information: The AI industry claims that Information from any single source is distributed across countless parameters in the model, making it impossible to isolate the influence of any particular work.

However, copyright owners do not concede that completed AI models (as distinct from the training data) are only abstracted statistical patterns of the training data. Rightsholders assert that LLMs do indeed retain the expressions of the original works on which they have been trained. There are studies showing the LLM models are able to regurgitate their training materials, and the New York Times lawsuit against OpenAI and Microsoft shows 100 examples of this. See also Concord Music Group v. Anthropic (alleging that song lyrics can be accessed verbatim or near-verbatim from Claude). Rightsholders argue that this could only occur if the models encode the expressive content of these works.

Copyright Implications

Assuming the AI developers’ explanation to be correct (if its not the infringement case against them is strong), AI technology creates unprecedented challenges for copyright law – 

– Proving Infringement: How can a plaintiff prove infringement when the allegedly infringing material can’t be directly observed or compared?

– Fair Use Analysis: Traditional fair use factors, such as the amount and substantiality of the portion used, become difficult to apply when the “portion used” is transformed beyond recognition.

– Substantial Similarity: The legal test of “substantial similarity” between works becomes almost meaningless in the context of LLMs.

– Expert Testimony: Courts will likely have to rely heavily on expert testimony to understand the technology, but even experts may struggle to definitively prove or disprove infringement.

For all of these reasons, to prove copyright infringement plaintiffs such as the New York Times may be limited to claiming copyright infringement based on the “intermediate” copies that are used in the training process and user-prompted output, rather than the LLM models themselves. 

Conclusion

The NYT v. OpenAI case and others raising the same issue highlight a fundamental mismatch between traditional copyright law and the reality of LLM technology and the AI industries’ fair use defense. The outcome of this case could reshape our understanding of copyright in the digital age, potentially requiring new legal tests and standards that can account for the invisible, transformed nature of information within AI systems.

Part 2 in this series will focus on the legal issues around the “input problem” of using copyrighted material for training. Part 3 will look at the “output problem” of AI-generated content that may copy or resemble copyrighted works, including what the AI industry calls “memorization.” As we’ll see, each of these issues presents its own unique challenges in the context of a technology that defies traditional legal analysis.

Continue reading Part 2 and Part 3 of this series.