Copyright And The Challenge of Large Language Models (Part 2)

“Fair use is the great white whale of American copyright law. Enthralling, enigmatic, protean, it endlessly fascinates us even as it defeats our every attempt to subdue it.” – Paul Goldstein

__________________________

This is the second in a 3-part series of posts on Large Language Models (LLMs) and copyright. (Part 1 is here.)

In this post I’ll turn to a controversial and important legal question: does the use of copyrighted material in training LLMs for generative AI constitute fair use? This analysis requires a nuanced understanding of both copyright fair use and the technical aspects of LLM training (see Part 1). To examine this complex issue I’ll look at recent relevant case law and consider potential solutions to the legal challenges posed by AI technology. 

Introduction

The issue is this: generative AI systems – systems that generate text, graphics, video, music – are being trained without permission on copies of millions of copyrighted books, artwork, software and music scraped from the internet. However, as I discussed in Part 1 of this series, the AI industry argues that the resulting models themselves are not infringing. Rightsholders argue that even if this is true (and they assert that it is not), the use of their content to train AI models is infringing, and that is the focus of this post.

To put this in perspective, consider where AI developers get their training data. It’s generally acknowledged that many of them have used resources such as Common Crawl, a digital archive containing 50 billion web pages, and Books3, a digital library of thousands of books. While these resources may contain works that are in the public domain, there’s no doubt that they contain a huge quantity of works that are protected by copyright. 

In the AI industry, the thirst for this data is insatiable – the bigger the language models, the better they perform – and copyrighted works are an essential component of this data. In fact, the industry is already looking at a “data wall,” the point at which it will run out of training data. It may hit that wall in the next few years, and if copyrighted works can’t be included in training data, it will hit it even sooner.

Rightsholders assert that the use of this content to train LLMs is outright, massive copyright infringement. The AI industry responds that fair use – codified in 17 U.S.C. § 107 – covers most types of model training where, as they assert, the resulting model functions differently than the input data. This is not just an academic difference – the issue is being litigated in more than a dozen lawsuits against AI companies, attracting a huge amount of attention from the copyright community.

No court has yet ruled on whether fair use protects the use of copyright-protected material as training data for LLMs. Eventually, the courts will answer this question by applying the language of the statute and the case law interpreting copyright fair use.

Legal Precedents Shaping the AI Copyright Landscape

To understand how the courts are likely to evaluate these cases, we need to look at four recent cases that have shaped the fair use landscape: the two Google Books cases, Google v. Oracle, and Warhol Foundation v. Goldsmith. In addition, the courts are likely to apply what is known as the “intermediate copying” line of cases.

The Google Books Cases. Let’s start with the two Google Books cases, which in many ways set the stage for the current AI copyright dilemma. The AI industry has put its greatest emphasis on these cases. (OpenAI: “Perhaps the most compelling case on point is Authors Guild v. Google”).

Authors Guild v. Google and Authors Guild v. HathiTrust. In 2015, the Second Circuit Court of Appeals decided Authors Guild v. Google, a copyright case that had been winding through the courts for a decade. Google had scanned millions of books without permission from rightsholders, creating a searchable database.

The Second Circuit held that this was fair use. The court’s decision hinged on two key points. First, the court found Google’s use highly “transformative,” a concept central to fair use. Google wasn’t reproducing books for people to read; it was creating a new tool for search and analysis. While Google allowed users to see small “snippets” of text containing their search terms, this didn’t substitute for the actual books. Second, the court found that Google Books was more likely to enhance the market for books than harm it. The court also emphasized the immense public benefit of Google Books as a research tool.

A sister case in the Google Books saga was Authors Guild v. HathiTrust, decided by the Second Circuit in 2014. HathiTrust, a partnership of academic institutions, had created a digital library from book scans provided by Google. HathiTrust allowed researchers to conduct non-consumptive research, such as text mining and computational analysis, on the corpus of digitized works. Just as in Google Books, the court found the creation of a full-text searchable database to be a fair use, even though it involved copying entire works. Importantly, the court held this use of the copyrighted books to be transformative and “nonexpressive.”

The two cases were landmark fair use decisions, especially for their treatment of mass digitization and nonexpressive use of copyrighted works – a type of use that involves copying copyrighted works but does not communicate the expressive aspects of those works.

These two cases, while important, by no means guarantee the AI industry the fair use outcome they are seeking. Reliance on Google Books falters given the scope of potential output of AI models. Unlike Google Books’ limited snippets, LLMs can generate extensive text that may mirror the style and substance of copyrighted works in their training data. This raises concerns about market harm, a critical factor in fair use analysis, and whether LLM-generated content could eventually serve as a market substitute for the original works. The New York Times argues just this in its copyright infringement case against OpenAI and Microsoft.

HathiTrust is an even weaker precedent for LLM fair use. The Second Circuit held that HathiTrust’s full-text search “posed no harm to any existing or potential traditional market for the copyrighted works.” LLMs, in contrast, have the potential to generate content that could compete with or substitute for original works, potentially impacting markets for copyrighted material. Also, HathiTrust was created by universities and non-profit institutions for educational and research purposes. Commercial LLM development may not benefit from the same favorable consideration under fair use analysis.

In sum, the significant differences in purpose, scope, and potential market impact make both Google Books and HathiTrust imperfect authorities for justifying the comprehensive use of copyrighted materials in training LLMs.

Google v. Oracle. Fast forward to 2021 for another landmark fair use case, this time involving software code. In Google v. Oracle, the Supreme Court held that Google’s copying of 11,500 lines of code from Oracle’s Java API, done to facilitate interoperability, was fair use.

The Court found Google’s “purpose and character” was transformative because it “sought to create new products” and was “consistent with that creative ‘progress’ that is the basic constitutional objective of copyright itself.” The Court also downplayed the market harm to Oracle, noting that Oracle was “poorly positioned to succeed in the mobile phone market.” 

This decision seemed to open the door for tech companies to make limited use of some copyrighted works in the name of innovation. However, the case’s focus on functional code limits its applicability to LLMs, which are trained on expressive works like books, articles, and images. The Supreme Court explicitly recognized the inherent differences between functional works, which lean towards fair use, and expressive creations at the heart of copyright protection. So, again, the AI industry will have difficulty deriving much support from this decision. 

And, before we could fully digest Oracle’s implications for fair use, the Supreme Court threw a curveball.

Andy Warhol Foundation v. Goldsmith. In 2023, the Court decided Andy Warhol Foundation v. Goldsmith (Warhol), a case dealing with Warhol’s repurposing of a photograph of the musician Prince. While the case focused specifically on appropriation art, its core principles resonate with the ongoing debate surrounding LLMs’ use of copyrighted materials.

The Warhol decision emphasizes a use-based approach to fair use analysis, focusing on the purpose and character of the defendant’s use, particularly its commercial nature, and whether it serves as a market substitute for the original work. This emphasis on commerciality and market substitution poses challenges for LLM companies defending the fair use of copyrighted works in training data. The decision underscores the importance of considering potential markets for derivative works. As the use of copyrighted works for AI training becomes increasingly common, a market for licensing such data is emerging. The existence of such a market, even if nascent, could weaken the argument that using copyrighted materials for LLM training is a fair use, particularly when those materials are commercially valuable and readily licensable.

The “Intermediate Copying” Cases. I also expect the AI industry to rely on the case law on “intermediate copying.” In this line of cases the users copied material to discover unprotectable information or as a minor step towards developing an entirely new product. So the final output – despite using copied material as an intermediate step – was noninfringing. In these cases the “intermediate use” was held to be fair use. See Sega v. Accolade (9th Cir. 1992) (defendant copied Sega’s copyrighted software to figure out the functional requirements for making games compatible with Sega’s gaming console); Sony v. Connectix (9th Cir. 2000) (defendant copied Sony’s software to reverse engineer it and create a new gaming platform on which users could play games designed for Sony’s system).

AI companies likely will argue that, just as in these cases, LLMs study language patterns only as an intermediate step toward creating a noninfringing final product. Rightsholders likely will argue that whereas in those cases the copiers sought to study functionality or create compatibility, the scope and nature of LLM training and the resulting product are vastly different. I expect rightsholders will have the better argument on these cases.

Applying Legal Precedents to AI

So, where does this confusing collection of cases leave us? Here’s a summary:

The Content Industry Position – in a Nutshell: Rightsholders argue that – even assuming that the final LLM model does not contain expressive content (which they dispute) – the use of copyrighted works to train LLMs is an infringement not excused by fair use. They argue that all four fair use factors weigh against AI companies:

      –  Purpose and character: Many (but not all) AI applications are commercial, which cuts against the industry’s fair use argument, especially in light of Warhol’s emphasis on commercial purpose and the potential licensing market for training data. The existence of a licensing market for training datasets suggests that AI companies can obtain licenses rather than rely on fair use defenses. This last point – market effect – is particularly important in light of the Supreme Court’s holding in Andy Warhol.

      –  Nature of the work: Unlike the computer code in Google v. Oracle, which the Supreme Court noted receives “thin” protection, the content ingested by AI companies includes highly creative works like books, articles, and images. This distinguishes Oracle from AI training, and cuts against fair use.

      –  Amount used: Entire works are copied, a factor that weighs against fair use.

      –  Market effect: End users are able to extract verbatim content from LLMs, harming the market for original works and, as noted above, harming current and future AI training licensing markets.

The AI Industry Position – in a Nutshell. The AI industry will argue that the use of copyrighted works should be considered fair use:

      –  Transformative Use: The AI industry argues that AI training creates new tools with different purposes from the original works, using copyright material in a “nonexpressive” way. AI developers draw parallels to “context shifting” fair use cases dealing with search engines and digital libraries, such as the Google Books project, arguing AI use is even more transformative. I expect them to rely on Google v. Oracle to argue that, just as Google’s use of Oracle’s API code was found to be transformative because it created something new that expanded the use of the original code (the Android platform), AI training is transformative, as it creates new systems with different purposes from the original works. Just as the Supreme Court emphasized the public benefit of allowing programmers to use their acquired skills, similarly AI advocates are likely to highlight the broad societal benefits and innovation enabled by LLMs trained on diverse data.

      –  Intermediate Copying: AI proponents will support this argument by pointing to the “intermediate copying” line of cases, which hold that copying copyrighted works as an incidental step toward a nonexpressive end product (here, the assertedly non-infringing model itself) is permissible fair use.

      –  Market Impact: AI proponents will argue that AI training, and the models themselves, do not directly compete with or substitute for the original copyrighted works.

      –  Amount and Substantiality: Again relying on Google v. Oracle, AI proponents will note that the Court found fair use even though Google copied 11,500 lines of code verbatim. This will support their argument that copying entire works for AI training doesn’t preclude fair use if the purpose is sufficiently transformative.

      –  Public Benefit: In Google v. Oracle the Court showed a willingness to interpret fair use flexibly to accommodate technological progress. AI proponents will rely on this, and argue that applying fair use to AI training has social benefits and aligns with copyright law’s goal of promoting progress. The alternative, restricting access to training data, could significantly hinder AI research and development. (AI “doomers” are unlikely to be persuaded by this argument).

      –  Practical Necessity: Given the vast amount of data needed, obtaining licenses for all copyrighted material used in training is impractical, if not impossible, or would be so expensive that it would stifle AI development.

As noted above, and as alleged in several of the lawsuits filed to date, some generative AI models have “memorized” copyrighted materials and are able to output them in a way that could substitute for the copyrighted work. If the outputs of a system can infringe, the argument that the system itself does not implicate copyright’s purposes will be significantly weakened.

While Part 3 of this series will explore these output-related issues in depth, it’s important to recognize the intrinsic link between these concerns and input-side training challenges. In assessing AI’s impact on copyright law, courts may adopt a holistic approach, considering the entire content lifecycle – from data ingestion to LLMs to final output. This interconnected perspective reflects the complex nature of AI systems, where training methods directly influence both the characteristics and potential infringement risks of generated content.

Potential Solutions and Future Directions

As challenging as these issues are, we need to start thinking about practical solutions that balance the interests of AI developers, content creators, and the public. Here are some possibilities, along with their potential advantages and drawbacks.

Licensing Schemes: One proposed solution is to develop comprehensive licensing systems for AI training data, similar to those that exist for certain music uses. This could provide a mechanism for compensating creators while ensuring AI developers have access to necessary training data. 

Proponents argue that this approach would respect copyright holders’ rights and provide a clear framework for legal use. However, critics rightly point out that implementing such a system would be enormously complex and impractical. The sheer volume of content used in AI training, the difficulty of tracking usage, and the potential for exorbitant costs could stifle innovation, particularly for smaller AI developers.

New Copyright Exceptions: Another approach is to create specific exemptions for AI training, perhaps limited to non-commercial or research purposes. This could be similar to existing fair use exceptions for research and could promote innovation in AI development. The advantage of this approach is that it provides clarity and could accelerate AI research. However, defining the boundaries of “non-commercial” use in the rapidly evolving AI landscape could prove challenging.

International Harmonization: Given the global nature of AI development, the industry may need to work towards a unified international approach to copyright exceptions for AI. This could involve amendments to international copyright treaties or the development of new AI-specific agreements. However, international copyright negotiations are notoriously slow and complex. Different countries have varying interests and legal traditions, which could make reaching a consensus difficult.

Technological Solutions: We should also consider technological approaches to addressing these issues. For instance, AI companies could develop more sophisticated methods to anonymize or transform training data, making it harder to reconstruct original works on the “output” side. They could also implement filtering systems to prevent the output of copyrighted material. While promising, these solutions would require significant investment and might not fully address all legal concerns. There’s also a risk that overzealous filtering could limit the capabilities of AI systems.

Hybrid Approaches: Perhaps the most promising solutions will combine elements of the above approaches. For example, we could see a tiered system where certain uses are exempt, others require licensing, and still others are prohibited. This could be coupled with technological measures such as synthetic training data, and international guidelines.

Market-Driven Solutions: As the AI industry matures, we are likely to see the emergence of new business models that naturally address some of these copyright concerns. For instance, content creators might start producing AI-training-specific datasets, or AI companies might vertically integrate to produce their own training content. X’s Grok AI product and Meta’s models, both trained in part on content from their own platforms, are examples of this.

As we consider these potential solutions, it’s crucial to remember that the goal of copyright law is to foster innovation while fairly compensating creators and respecting intellectual property rights. Any solution will likely require compromise from all stakeholders and will need to be flexible enough to adapt to rapidly changing technology.

Moreover, these solutions will need to be developed with input from a diverse range of voices – not just large tech companies and major content producers, but also independent creators, smaller AI startups, legal experts, and public interest advocates. The path forward will require creativity, collaboration, and a willingness to rethink traditional approaches to copyright in the artificial intelligence age.

Conclusion – The Road Ahead

The intersection of AI and copyright law presents complex challenges that resist simple solutions. The Google Books cases provide some support for mass digitization and computational use of copyrighted works. Google v. Oracle suggests courts might look favorably on uses that promote new and beneficial AI technologies. But Warhol reminds us that transformative use has limits, especially in commercial contexts.

For AI companies, the path forward involves careful consideration of training data sources and potential licensing arrangements. It may also mean being prepared for legal challenges and working proactively with policymakers to develop workable solutions.

For content creators, it’s crucial to stay informed about how your work might be used in AI training. There may be new opportunities for licensing, but also new risks to consider.

For policymakers and courts, the challenge is to strike a balance that fosters innovation while protecting the rights and incentives of creators. This may require rethinking some fundamental aspects of copyright law. 

The relationship between AI and copyright is likely to be a defining issue in intellectual property law for years to come. Stay tuned, stay informed, and be prepared for a wild ride. 

And watch for Part 3 of this 3-part blog post series.

An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law

Google’s NotebookLM has been getting a lot of attention. You upload your sources (articles, YouTube videos, URLs, text documents, audio files) and NotebookLM can create a podcast based on the library you’ve created.

I thought I’d experiment with this a bit. I uploaded a variety of articles on copyright and AI and hit “go.” I didn’t give NotebookLM the subject or any prompts. It figured out the topic (correctly) and created the 11-minute podcast embedded below.

A few observations:

First, the speaker voices are natural and realistic – they interact fluidly, have natural intonation and use varied speech patterns.

Second, the content quality is very high – the podcast correctly highlights Google Books as the leading case on the issue and outlines the implications of the case for and against fair use.

It also discusses the New York Times v. Microsoft/OpenAI case in detail, and focuses on the fact that the NYT was able to force ChatGPT to regurgitate verbatim or near-verbatim NYT content.

The podcast goes on to discuss Stability AI, the four fair use factors (as applied) and the larger consequences of LLMs for the copyright system.

I downloaded the podcast and embedded it below, but I could just as easily have provided a link to the podcast in NotebookLM.

 

Anderson v. TikTok: A Potential Sea Change for § 230 Immunity

In late August the U.S. Third Circuit Court of Appeals released a far-reaching decision, holding that § 230 of the Communications Decency Act (CDA) did not provide a safe harbor for the social media company TikTok when its algorithms recommended and promoted a video which allegedly led a minor to accidentally kill herself. Anderson v. TikTok (3rd Cir. Aug. 27, 2024).

Introduction

First, a brief reminder – § 230, which was enacted in 1996, has been the guardian angel of internet platform owners. The law prohibits courts from treating a provider of an “interactive computer service” (i.e., a website) as the “publisher or speaker” of third-party content posted on its platform. 47 U.S.C. § 230(c)(1). Under § 230, websites have been given broad legal protection. § 230 has created what is, in effect, a form of legal exceptionalism for Internet publishers. Without it, any social media site (such as Facebook or X) or review site (such as Amazon) would be sued into oblivion.

On the whole the courts have given the law liberal application, dismissing cases against Internet providers under many fact scenarios. However, there is a vocal group that argues that the broad immunity courts have extended under § 230 of the CDA rests on overzealous interpretations that go far beyond the statute’s original intent.

Right now § 230 has one particularly prominent critic – Supreme Court Justice Clarence Thomas. Justice Thomas has not held back when expressing disagreement with the broad protection the courts have provided under § 230. 

In Malwarebytes, Inc. v. Enigma Software (2020) a petition for writ of certiorari was denied, but Justice Thomas issued a “statement” – 

Nowhere does [§ 230] protect a company that is itself the information content provider . . . And an information content provider is not just the primary author or creator; it is anyone “responsible, in whole or in part, for the creation or development” of the content.

Again in Doe ex rel. Roe v. Snap, Inc. (2024), Justice Thomas dissented from the denial of certiorari and was critical of the scope of § 230, stating – 

In the platforms’ world, they are fully responsible for their websites when it results in constitutional protections, but the moment that responsibility could lead to liability, they can disclaim any obligations and enjoy greater protections from suit than nearly any other industry. The Court should consider if this state of affairs is what § 230 demands. 

With these judicial headwinds, Anderson v. TikTok sailed into the Third Circuit. Even one Supreme Court justice is enough to create a Category Two storm in the legal world. And boy, did the Third Circuit deliver, joining the § 230 opposition and potentially rewriting the rulebook on internet platform immunity.

Anderson v. TikTok

Nylah Anderson, a 10-year-old girl, died after attempting the “Blackout Challenge” she saw on TikTok. The challenge, which encourages users to choke themselves until losing consciousness, appeared on Nylah’s “For You Page”, a feed of videos curated by TikTok’s algorithm.

Nylah’s mother sued TikTok, alleging the company was aware of the challenge and promoted the videos to minors. TikTok defended itself using § 230, arguing that its algorithm shouldn’t strip away its immunity for content posted by others.

The district court dismissed the complaint, holding that TikTok was immunized by § 230. The Third Circuit reversed.

The Third Circuit Ruling

The Third Circuit took a novel approach to interpreting § 230, concluding that when internet platforms use algorithms to curate and recommend content, they are engaging in “first-party speech,” essentially creating their own expressive content.

The court reached this conclusion largely based on the Supreme Court’s recent decision in Moody v. NetChoice (2024). In that case the Court held that an internet platform’s algorithm that reflects “editorial judgments” about content compilation is the platform’s own “expressive product,” protected by the First Amendment. The Third Circuit reasoned that if algorithms are first-party speech under the First Amendment, they must be first-party speech under § 230 too.

Here is the court’s reasoning:

230 immunizes [web sites] only to the extent that they are sued for “information provided by another information content provider.” In other words, [web sites] are immunized only if they are sued for someone else’s expressive activity or content (i.e., third-party speech), but they are not immunized if they are sued for their own expressive activity or content (i.e., first-party speech) . . . . Given the Supreme Court’s observations that platforms engage in protected first-party speech under the First Amendment when they curate compilations of others’ content via their expressive algorithms, it follows that doing so amounts to first-party speech under § 230. . . . TikTok’s algorithm, which recommended the Blackout Challenge to Nylah on her FYP, was TikTok’s own “expressive activity,” and thus its first-party speech.

Accordingly, TikTok was not protected under § 230, and Anderson’s case could proceed.

Whether the Third Circuit’s logic will be adopted by other courts (including the Supreme Court, as I discuss below), is an open question. The court’s reasoning assumes that the definition of “speech” should be consistent across First Amendment and CDA § 230 contexts. However, these are distinct legal frameworks with different purposes. The First Amendment protects freedom of expression from government interference. CDA § 230 provides liability protection for internet platforms regarding third-party content. Treating them as interchangeable may oversimplify the nuanced legal distinctions between them.

Implications for Online Platforms

If this ruling stands, platforms may need to reassess their content curation and targeted recommendation algorithms. The more a platform curates or recommends content, the more likely it is to lose § 230 protection for that activity. For now, this decision has opened the doors for more lawsuits in the Third Circuit against platforms based on their recommendation algorithms. If the holding is adopted by other courts it could lead to a fundamental rethinking of how social media platforms operate.

As a consequence, platforms might become hesitant to use sophisticated algorithms for fear of losing immunity, potentially resulting in a less curated, more chaotic online environment. This could, paradoxically, lead to more harmful content being visible, contrary to the court’s apparent intent.

Where To From Here?

Given the potential far-reaching consequences of this decision, it’s likely that TikTok will seek en banc review by the full Third Circuit, and there is a strong case for the full court to grant it.

If unsuccessful there, this case is a strong candidate for Supreme Court review, since it creates a circuit split, diverging from § 230 interpretations in other jurisdictions. The Third Circuit even helpfully cites pre-Moody diverging opinions from the 1st, 2nd, 5th, 6th, 8th, 9th, and DC Circuits, essentially teeing it up for Supreme Court review.

In fact, all indications are that the Supreme Court would be receptive to an appeal in this case. The Court recently accepted the appeal of a case in which it would determine whether algorithm-based recommendations were protected under § 230. However, after hearing oral argument it decided the case on different grounds and didn’t reach the § 230 issue. Gonzalez v. Google (USSC May 18, 2023). Anderson presents another opportunity for the Supreme Court to weigh in on this issue.

In the meantime, platforms may start experimenting with different forms of content delivery that could potentially fall outside the court’s definition of curated recommendations. This could lead to innovative new approaches to content distribution, or it could result in less personalized, less engaging online experiences.

Conclusion

Anderson v. TikTok represents a potential paradigm shift in § 230 jurisprudence. While motivated by a tragic case, the legal reasoning employed could have sweeping consequences for online platforms, content moderation, and user-generated content. The decision raises fundamental questions about the nature of online platforms and the balance between protecting free expression online and holding platforms accountable for harmful content. As we move further into the age of AI-curated feeds and algorithmic content curation, these questions will only become more pressing.

Anderson v. TikTok, Inc. (3d Cir. Aug. 27, 2024)

For two earlier posts on this topic see: Section 230 Supreme Court Argument in Gonzalez v. Google: Keep An Eye on Justice Thomas and Supreme Court Will Decide Whether Google’s Algorithm-Based Recommendations are Protected Under Section 230

Copyright And The Challenge of Large Language Models

“AI models are what’s known in computer science as black boxes: You can see what goes in and what comes out; what happens in between is a mystery.”

Trust but Verify: Peeking Inside the “Black Box” of Machine Learning

In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging copyright infringement. This case, along with a number of similar cases filed against AI companies, brings to the forefront a fundamental challenge in applying traditional copyright law to a revolutionary technology: Large Language Models (LLMs). Perhaps more than any copyright case that precedes them, these cases grapple with a form of alleged infringement that defies conventional legal analysis.

This article is the first in a three-part series that will examine the copyright implications of the AI development process.

Disclaimer: I’m not a computer or AI scientist. However, neither are the judges and juries that will be asked to apply copyright law to this technology, or the legislators that may enact laws regulating it. It’s unlikely that they will go much beyond the level of detail I’ve used here.

What are Large Language Models (LLMs)?

Large Language Models, or LLMs, are gargantuan AI systems that use a vast corpus of training data and billions to trillions of parameters. They are designed to understand, generate, and manipulate human language. They learn patterns from the data, allowing them to perform a wide range of language tasks with remarkable fluency. Their inner workings are fundamentally different from any previous technology that has been the subject of copyright litigation, including traditional computer software.

LLMs typically use transformer-based neural networks: interconnected nodes organized into layers that perform computations. The strengths of the connections between nodes – the influence each node has on the others – are what is learned during training. These values are called the model’s parameters or weights, and they are represented as numbers.
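To make the idea of numeric “weights” concrete, here is a toy illustration in Python. The values are invented for illustration only; a real model holds billions or trillions of such numbers, organized into transformer layers.

```python
# Toy illustration: a model's "knowledge" is stored as numeric weights.
# All of the values below are made up.
inputs = [0.2, 0.7, 0.1]     # signals arriving from three nodes
weights = [0.9, -0.4, 1.5]   # connection strengths (learned parameters)
bias = 0.05                  # another learned parameter

# One node in the next layer combines its inputs as a weighted sum.
output = sum(x * w for x, w in zip(inputs, weights)) + bias
print(output)                # approximately 0.1
```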

Here’s a simplified explanation of what happens when you use an AI like a large language model:

  1. You input a prompt (your question or request).
  2. The computer breaks down your prompt into smaller pieces called tokens. These can be words, parts of words, or even individual characters.
  3. The AI processes these tokens through its neural network – imagine this like a complex web of connections. Each part of this network analyzes the tokens and figures out how they relate to each other.
  4. As it processes, the AI predicts the probability distribution for the next token based on what it learned during its training.
  5. The LLM selects tokens based on these probabilities and combines them to create a coherent response or output for you, the user.
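To make these steps concrete, here is a highly simplified sketch in Python. Everything in it is hypothetical: the tiny probability table stands in for the neural network, and a real LLM chooses among tens of thousands of tokens rather than a handful of words.

```python
import random

def next_token_probabilities(tokens):
    # Stand-in for the neural network's prediction step (step 4 above).
    # A real model computes this distribution from billions of parameters.
    table = {
        "large": {"language": 0.8, "scale": 0.15, "whale": 0.05},
        "language": {"model": 0.7, "models": 0.3},
    }
    return table.get(tokens[-1], {"the": 0.5, "large": 0.3, "a": 0.2})

def generate(prompt_tokens, new_tokens=2):
    tokens = list(prompt_tokens)                 # steps 1-2: the tokenized prompt
    for _ in range(new_tokens):
        probs = next_token_probabilities(tokens)          # steps 3-4: predict
        words, weights = zip(*probs.items())
        tokens.append(random.choices(words, weights)[0])  # step 5: select
    return " ".join(tokens)

print(generate(["a", "large"]))   # most often: "a large language model"
```

The point of the sketch is that generation is a loop of statistical prediction over tokens, not retrieval of stored text – a distinction that, as discussed below, is central to the copyright debate.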

The “large” in Large Language Models primarily refers to the enormous number of parameters these models contain – sometimes in the trillions. These parameters represent the model’s learned patterns and relationships, fine-tuned through exposure to massive amounts of text data. While larger and more diverse high-quality datasets can lead to better AI models, other factors such as model architecture, training techniques, and fine-tuning also play important roles in model performance.

How Do AI Companies Obtain Their Training Data?

AI companies employ various methods to acquire this data – 

– Web scraping and crawling. One of the primary methods of data acquisition is web scraping – the automated process of extracting data from websites. AI companies deploy sophisticated crawlers that systematically browse the internet, copying text from millions of web pages. This method allows for the collection of diverse, up-to-date information but raises questions about the use of copyrighted material without explicit permission. (A bare-bones scraping sketch appears after this list.)

– Partnerships and licensing agreements. Some companies enter into partnerships or licensing agreements to access high-quality, curated datasets. For instance, OpenAI has partnered with organizations like the Associated Press to use its news archives for training purposes.

– Public datasets and academic corpuses. Many LLMs are trained, at least in part, on publicly available datasets and academic text collections. These might include Project Gutenberg’s collection of public domain books, scientific paper repositories, or curated datasets like the Common Crawl corpus.

– User-generated content. Platforms that interact directly with users, such as ChatGPT, can potentially use the conversations and inputs from users to further train and refine their models. This practice raises privacy concerns and questions about the ownership of user-contributed data.
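Here is the bare-bones scraping sketch promised above. It is a sketch only: the URL is a placeholder, the requests and beautifulsoup4 libraries are assumed to be installed, and production crawlers also handle robots.txt, rate limiting, deduplication, and scale across millions of pages.

```python
import requests
from bs4 import BeautifulSoup   # assumes the beautifulsoup4 package

def fetch_page_text(url):
    """Download one web page and return its visible text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):   # strip non-content markup
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)

print(fetch_page_text("https://example.com")[:200])   # placeholder URL
```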

In the context of the New York Times lawsuit, it’s worth noting that OpenAI, like many AI companies, has not publicly disclosed the full extent of its training data sources. However, it’s widely believed that the company uses a combination of publicly available web content, licensed datasets, and partnerships to build its training corpus. The lawsuit alleges that this corpus includes copyrighted New York Times articles, obtained without permission or compensation.

The Training Process: How Machines “Learn” From Data

Once acquired, the raw data undergoes several processing steps before it can be used to train an LLM – 

– Data preprocessing and cleaning. The first step involves cleaning the raw data. This includes removing irrelevant information, correcting errors, and standardizing the format. This may involve stripping away HTML tags, removing advertisements, or filtering out low-quality content.

– Tokenization and encoding. Next, the text is broken down into smaller units called tokens. These might be words, parts of words, or even individual characters. Each token is then converted into a numerical representation that the AI can process. This step is crucial as it determines how the model will interpret and generate language.
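As a rough illustration of tokenization and encoding, here is a toy whitespace tokenizer in Python. Real systems use subword schemes such as byte-pair encoding, so the details differ, but the end result is the same: text becomes a sequence of numbers.

```python
text = "the model reads the text"

vocab = {}          # maps each distinct token to a numerical ID
token_ids = []
for word in text.split():          # toy tokenizer: split on whitespace
    if word not in vocab:
        vocab[word] = len(vocab)   # assign the next unused ID
    token_ids.append(vocab[word])

print(token_ids)    # [0, 1, 2, 0, 3]
print(vocab)        # {'the': 0, 'model': 1, 'reads': 2, 'text': 3}
```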

During training, the LLM is exposed to this preprocessed data, learning to predict patterns and relationships between tokens. This is an iterative process in which the model makes predictions, compares them to the actual data, and adjusts its internal parameters to improve accuracy; the adjustment step uses an algorithm known as “backpropagation.” This cycle is repeated billions of times across the entire dataset. For a large LLM this can take months, running 24/7 on massive clusters of graphics processing chips.
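For readers who want to see the predict-compare-adjust cycle in code, here is a toy training step written with the PyTorch library. The model and the token IDs are invented for illustration; a real LLM uses a transformer with billions of parameters, but the loop is conceptually the same.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 16           # tiny, made-up sizes
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # token IDs -> vectors
    nn.Linear(embed_dim, vocab_size),     # vectors -> scores for the next token
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.tensor([5, 42, 7, 99])     # a toy "document" of token IDs
inputs, targets = tokens[:-1], tokens[1:] # learn to predict each next token

logits = model(inputs)                    # make predictions
loss = loss_fn(logits, targets)           # compare them to the actual data
loss.backward()                           # backpropagation: compute gradients
optimizer.step()                          # adjust the weights to improve accuracy
optimizer.zero_grad()
print(loss.item())                        # the training loss for this single step
```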

The Transformation From Text to Numbers

For purposes of copyright law, here’s the crux of the matter: the AI industry asserts that after this process, the original text no longer exists in any recognizable form within the LLM. The model becomes a vast sea of numbers, with no direct correspondence to the original text. If true, this transformation creates a fundamental challenge for copyright law – 

– No Side-by-Side Comparison: In traditional copyright cases, courts rely heavily on comparing the original work side-by-side with the allegedly infringing material. With LLMs, this is impossible. You can’t “read” an LLM or print it out for comparison.

– Black Box Nature: The internal workings of LLMs are often referred to as a “black box.” Even the developers may not fully understand how the model arrives at its outputs.

– Dynamic Generation: The AI industry claims that LLMs don’t store and retrieve text in a conventional database format; they generate it dynamically based on learned patterns. This means that any similarity to copyrighted material in the output is a result of statistical prediction, not direct copying.

– Distributed Information: The AI industry claims that information from any single source is distributed across countless parameters in the model, making it impossible to isolate the influence of any particular work.

However, copyright owners do not concede that completed AI models (as distinct from the training data) are only abstracted statistical patterns of the training data. Rightsholders assert that LLMs do indeed retain the expressions of the original works on which they have been trained. There are studies showing that LLMs are able to regurgitate their training materials, and the New York Times lawsuit against OpenAI and Microsoft shows 100 examples of this. See also Concord Music Group v. Anthropic (alleging that song lyrics can be accessed verbatim or near-verbatim from Claude). Rightsholders argue that this could only occur if the models encode the expressive content of these works.

Copyright Implications

Assuming the AI developers’ explanation to be correct (if it’s not, the infringement case against them is strong), AI technology creates unprecedented challenges for copyright law – 

– Proving Infringement: How can a plaintiff prove infringement when the allegedly infringing material can’t be directly observed or compared?

– Fair Use Analysis: Traditional fair use factors, such as the amount and substantiality of the portion used, become difficult to apply when the “portion used” is transformed beyond recognition.

– Substantial Similarity: The legal test of “substantial similarity” between works becomes almost meaningless in the context of LLMs.

– Expert Testimony: Courts will likely have to rely heavily on expert testimony to understand the technology, but even experts may struggle to definitively prove or disprove infringement.

For all of these reasons, to prove copyright infringement plaintiffs such as the New York Times may be limited to claiming copyright infringement based on the “intermediate” copies that are used in the training process and user-prompted output, rather than the LLM models themselves. 

Conclusion

The NYT v. OpenAI case and others raising the same issue highlight a fundamental mismatch between traditional copyright law on the one hand, and the reality of LLM technology and the AI industry’s fair use defense on the other. The outcome of this case could reshape our understanding of copyright in the digital age, potentially requiring new legal tests and standards that can account for the invisible, transformed nature of information within AI systems.

Part 2 in this series will focus on the legal issues around the “input problem” of using copyrighted material for training. Part 3 will look at the “output problem” of AI-generated content that may copy or resemble copyrighted works, including what the AI industry calls “memorization.” As we’ll see, each of these issues presents its own unique challenges in the context of a technology that defies traditional legal analysis.

Missed Deadline Leads to Dismissal of the Case

Being a litigation attorney can be a scary business. You’re constantly thinking about how to organize the facts to fit your theory of the case, what legal precedent you may have overlooked, discovery, trial preparation and much, much more.

With all that pressure it’s not surprising that lawyers make mistakes, and one of the scariest things in litigation practice is the risk of missing a deadline. Depending on a lawyer’s caseload it may be difficult to keep track of deadlines. There are pleading deadlines, discovery deadlines, motion-briefing deadlines and appeal deadlines, to name just a few. And with some deadlines there is absolutely no court discretion available to save you, appeal deadlines being the best example of this. 

So, despite computerized docketing systems and best efforts, lawyers sometimes miss deadlines. That’s just a painful fact of life. Sometimes the courts will exercise their discretion and allow lawyers to make up a missed deadline. Many lawyers have spent many sleepless nights waiting to see if a court will overlook a missed deadline and give the lawyer a second chance.

But sometimes they won’t. A recent painful example of this is the 6th Circuit decision in RJ Control Consultants v. Multiject. That case involved a complex topic, the alleged illegal copying of computer source code. The case had been in litigation since 2016 and had already been the subject of two appeals. In other words, a lot of time and money had been invested. A glance at the docket sheet confirms this, with over 200 docket entries.

The mistake in that case was pedestrian: the court set a specific expert-disclosure deadline of February 26, 2021. By that date each party was obligated to provide its expert reports. In federal court an expert report must include a detailed summary of the expert’s qualifications, opinions, and the information the expert relied on in forming those opinions. Fed. R. Civ. P. 26(a)(2)(B). The rule is specific and onerous, and preparing expert disclosures often requires a great deal of time and effort.

In the Multiject case neither party submitted expert reports by February 26, 2021. However, the real burden to do so was on the plaintiff, which has a challenging and complex burden of proof in software copyright cases. In a software copyright case it’s up to the plaintiff’s expert to analyze the code and separate elements that may not be protected (by reason of scenes a faire and merger, for example) from those that are protected expression. As the 6th Circuit stated in an interim appeal in this case –  

The technology here is complex, as are the questions necessary to establish whether that technology is properly protected under the Copyright Act. Which aspects or lines of the software code are functional? Which are expressive? Which are commonplace or standard in the industry? Which elements, if any, are inextricably intertwined?

The defendant, on the other hand, had a choice: it could submit its own expert report or just wait until it saw the plaintiff’s report. It could challenge the plaintiff’s report before trial or the plaintiff’s expert’s testimony at trial. So the defendant’s failure to file an expert report was not fatal to its case – it could wait.

The plaintiff’s expert was David Lockhart, and when the plaintiff failed to submit his report on the due date, the defendant filed a motion to exclude the report, and for summary judgment. The plaintiff asked for a chance to file Lockhart’s report late, but the court showed no mercy – it denied the motion and, since the plaintiff would need an expert to establish illegal copying, granted the defendant’s motion for summary judgment.

In other words, end-of-case.

Why was the court unwilling to cut the plaintiff a break in this case? While the 6th Circuit discussed several issues justifying the denial, the one that strikes home for me is the plaintiff’s argument that they “reasonably misinterpreted” the court’s discovery order and made an “honest mistake” as to when the report was due. However, in the view of the trial judge this was not “harmless” error since it disrupted the court’s docket. The legal standard was “abuse of discretion,” and the Sixth Circuit held that the trial judge did not abuse his discretion in excluding Lockhart’s expert report after the missed deadline.

This is a sad way for a case to end, and the price is paid by the client, who likely had nothing to do with the missed deadline, but whose case was dismissed as a consequence. As I mentioned, the case began in 2016, and it was heavily litigated. There are seven reported decisions on Google Scholar, which is an unusually large number and suggests that a lot of time and money was invested by both sides. To make matters worse, not only did the plaintiff lose this case, but the court awarded the defendants more than $318,000 in attorneys’ fees.

Be careful out there.

RJ Control Consultants v. Multiject (6th Cir. April 3, 2024)

(Header image credit: Designed by Wannapik)