Mass Law Blog

Is It Legal To Use Copyrighted Works to Train Artificial Intelligence Systems?

Nov 27, 2023

If you follow developments in artificial intelligence, two recent items may have caught your attention. The first is a Copyright Office submission by the VC firm Andreessen Horowitz warning that billions of dollars in AI investments could be worth less if companies developing this technology are forced to pay for their use of copyrighted data. “The bottom line is this . . . imposing the cost of actual or potential copyright liability on the creators of AI models will either kill or significantly hamper their development.”

The second item is OpenAI’s announcement that it would roll out a “Copyright Shield,” a program that will provide legal defense and cost-reimbursement for its business customers who face copyright infringement claims. OpenAI is following the trend set by other AI providers, like Microsoft and Adobe, who are promising to indemnify their customers who may fear copyright lawsuits from their use of generative AI.

Underlying these two news stories is the fact that the explosion of generative AI has the copyright community transfixed and the AI industry apprehensive. The issue is this: Does copyright fair use allow AI providers to ingest copyright-protected works, without authorization or compensation, to train the large language models at the heart of generative artificial intelligence? Multiple lawsuits have been filed by content owners raising exactly this issue.

The Technology

The current breed of generative AIs is powered by large language models (LLMs), also known as foundation models. Examples of systems built on these models are ChatGPT, DALL·E 3, Midjourney and Stable Diffusion.

This technology requires developers to assemble enormous databases known as "training sets." Doing so almost always requires copying millions of images, videos, audio recordings and text-based works, many of which are protected by copyright. When that data is scraped from the web, the copying is potentially a massive infringement of copyright. The risk for AI companies is that, depending on the content (text, images, music, movies), it could implicate the exclusive rights of reproduction, distribution, public performance, and the right to prepare derivative works.

However, for purposes of copyright fair use analysis it’s important to recognize that the downloads are only an intermediate step in creating an LLM. Greatly simplified, here’s how it works:

In the process of creating an LLM, the training text is broken down into tokens, numerical stand-ins for words or pieces of words. Each token is assigned a unique numerical ID, and those numerical IDs are then mapped to high-dimensional vectors. These vectors are learned during the model's training and capture semantic meanings and relationships.

Through multiple layers of transformation and abstraction the LLM identifies patterns and correlations within the data, encoding them in its parameters. Cutting-edge systems like GPT-4 are reported to have trillions of parameters. Importantly, these parameters are not copies or replications of the copyright-protected input data. This process of transformation minimizes the risk that any output will be infringing: a judge or jury inspecting the parameters of an LLM would see no similarity between the original copyrighted text and the model.
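A toy sketch may make the token-and-vector pipeline described above concrete. The vocabulary, IDs and vectors below are illustrative only; real models learn their embedding vectors from the training data rather than deriving them from a hash, and they tokenize subwords rather than whole words:

```python
import hashlib

def tokenize(text, vocab):
    """Map each word to a unique integer ID, growing the vocabulary as needed."""
    ids = []
    for word in text.lower().split():
        if word not in vocab:
            vocab[word] = len(vocab)  # assign the next free ID
        ids.append(vocab[word])
    return ids

def embed(token_id, dim=8):
    """Derive a deterministic toy 'embedding' vector for a token ID.
    Real models learn these vectors during training."""
    digest = hashlib.sha256(str(token_id).encode()).digest()
    return [b / 255 for b in digest[:dim]]

vocab = {}
ids = tokenize("the cat sat on the mat", vocab)
print(ids)            # [0, 1, 2, 3, 0, 4] -- repeated words share one ID
print(embed(ids[0]))  # an 8-dimensional vector; no trace of the word "the"
```

Note that nothing resembling the original sentence survives in the vector: the expressive content has been replaced by numbers that are meaningful only in relation to one another.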

Is Generative AI Transformative?

Because the initial downloads in this process are copies, they are technically a copyright infringement – a “reproduction.” Therefore, it’s up to the AI companies to present a legal defense that justifies the copying, and the AI development community has made it clear that this defense is based on copyright fair use. At the heart of the AI industry’s fair use argument is the assertion that AI training models are “non-expressive uses.” Copyright protects expression. Non-expressive use is the use of copyrighted material in a way that does not involve the expression of the original material. 

For the reasons discussed above, the AI industry has a strong argument that a properly constructed LLM is a non-expressive use of the copied data.

However, depending on the specific technology this may be an oversimplification. Not all AI systems are the same. They may use different data sets. Some, but not all, are designed to minimize "memorization," the phenomenon that makes it easy for end users to retrieve blocks of text or near-copies of images. Some systems use output filters to prevent end users from utilizing the LLM to create infringing content.

For any given AI system the fair use defense turns on whether the LLM is trained and filtered in such a way that its outputs do not resemble protected inputs. If users can obtain the original content, the fair use defense is more difficult to sustain.
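A crude version of such an output filter can be sketched in a few lines: it refuses any generation that shares a long verbatim run of words with a known protected work. This is a toy illustration only; the filters in production systems are proprietary and far more sophisticated.

```python
def ngrams(text, n=5):
    """All runs of n consecutive words in the text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def blocks_verbatim_output(candidate, protected_corpus, n=5):
    """Return True if the candidate output reproduces any n-word run
    from a document in the protected corpus."""
    cand = ngrams(candidate, n)
    return any(cand & ngrams(doc, n) for doc in protected_corpus)

protected = ["all too well lyrics would appear here word for word"]
print(blocks_verbatim_output(
    "lyrics would appear here word for word today", protected))  # True
print(blocks_verbatim_output(
    "an original summary of the song", protected))               # False
```

The design choice matters legally: the shorter the matching run the filter tolerates, the harder it is for a user to extract protected expression, and the stronger the non-expressive-use argument becomes.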

There is a widespread assumption in the AI industry that, assuming an AI is designed with adequate safety measures, using copyright-protected content to train LLMs is shielded by the fair use doctrine. After all, the reasoning goes, the Second Circuit allowed Google to create a searchable index of copyrighted books under fair use. (Google Books; HathiTrust). And the Supreme Court permitted Google to copy Oracle's Java API computer code for a different use. (Google v. Oracle). AI companies also point to cases holding that search engines, intermediate copying for the purpose of reverse engineering, and plagiarism-detection software are transformative and therefore allowed under fair use. (Perfect 10 v. Google; Sega Enterprises v. Accolade; A.V. v. iParadigms.)

In each of these cases the use was found to be “transformative.” So long as the act of copying did not communicate the original protected expression to end users it did not interfere with the original expression that copyright is designed to protect. The AI industry contends that LLM-based systems that are properly designed fall squarely under this line of cases.

How Does Generative AI Impact Content Owners?

In evaluating AI's fair use defense the commercial impact on content owners is also important. This is particularly true under the Supreme Court's decision earlier this year in Warhol Foundation v. Goldsmith. In Warhol the Court held that, in a case involving commercial copying of photographs, the fact that the copies were used in competition with the originals weighed against fair use.

AI developers will argue that, so long as users can't use their generative AI systems to access protected works, there is no commercial impact on content owners. In other words, as in Google Books, the AI does not substitute for or compete with content owners' original protected expression. No one can use a properly constructed AI to read a James Patterson novel or listen to a Joni Mitchell song.

The AI companies should be able to distinguish Warhol by pointing out that they are not selling the actual copyrighted books or images in their data sets, and therefore – as in Google Books – they are causing the content owners no commercial harm. In other words, the AI developers will argue that the "intermediate copying" involved in creating and training an LLM is transformative where the resulting model does not substitute for any author's original expression and targets a different economic market.

Does the authority of Google Books and the other intermediate copying cases extend to the type of machine learning that underpins generative AI? While the law regulating AI is in its infancy, several recent district court decisions have given plaintiffs an unfriendly reception. In Thomson Reuters v. Ross Intelligence the defendant used Westlaw's headnotes and key number system to train a specialized natural language AI for lawyers. West claimed infringement. A Delaware federal district court judge denied Ross's motion for summary judgment based on fair use, holding that the case must be decided by a jury. However, relying on the intermediate copying cases, the judge noted that Ross would have engaged in transformative fair use if its AI merely studied language patterns in the Westlaw headnotes and did not replicate the headnotes themselves. Since this is in fact how LLMs are trained, Ross's fair use defense likely will succeed.

In a second case, Kadrey v. Meta, the plaintiffs, book authors, claimed that Meta's inclusion of their books in its AI training data violated their exclusive right to prepare derivative works. The Northern District of California federal judge dismissed this claim. The judge noted that the LLM models could not be viewed as recasting or adapting the plaintiffs' books. And the plaintiffs had failed to allege that the content of any output was infringing. "The plaintiffs need to allege and ultimately to prove that the AI's outputs incorporate in some form a portion of the plaintiffs' books." Another N.D. Cal. case, Andersen v. Stability AI, is consistent with these rulings.

While these cases are early in the evolution of the law of artificial intelligence they suggest how AI developers can take precautions to insulate themselves from copyright liability. And, as discussed below, the industry is already taking steps in this direction.

The Industry Is Adapting To The Copyright Threat

In the face of legal uncertainty, the AI industry is adapting to legal risks. The potential damages for copyright infringement are massive, and the unofficial Silicon Valley motto – “move fast and break things” – doesn’t apply with the stakes this high.

ChatGPT-4: Create an image showing Jack Nicholson in The Shining

Early in the current generative AI boom (only a year ago) it was possible to use some of these systems to generate copyright-protected content. However, the dominant AI companies seem to have plugged this hole. Today, if I ask OpenAI's ChatGPT to provide the lyrics to "All Too Well" by Taylor Swift, it declines to do so. When I ask for the text of the opening paragraph of Stephen King's "The Shining," again it refuses and tells me that it's protected by copyright. When I ask OpenAI's text-to-image generator DALL·E for an image of Batman, it declines, warning me that any image it does create will be sufficiently different from the comic book character to avoid copyright infringement.

These technical filters are illustrative of the ways that the industry can address the copyright challenge, short of years of litigation in the federal courts.

The first, and most obvious, is to train the systems not to provide infringing output. As noted, OpenAI is doing exactly this. The Shining may have been downloaded and used to create and train ChatGPT, but ChatGPT won't let me retrieve the text of even a small part of that novel.

ChatGPT-4: Create an image of Taylor Swift performing her song "All Too Well"

Another technical measure is minimizing duplicates of the same work in the training set. Studies have found that the more duplicates of a work are downloaded and processed in an LLM, the easier it is for end users to retrieve verbatim protected content. "Deduplication" is a solution to this problem.
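A minimal sketch of deduplication follows, using a content hash to drop exact duplicates before training. This is an assumption-laden simplification: real training pipelines also use fuzzy techniques such as MinHash to catch near-duplicates, not just byte-identical copies.

```python
import hashlib

def deduplicate(documents):
    """Keep only one copy of each distinct document, comparing by content hash."""
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:      # first time we've seen this exact content
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["chapter one ...", "chapter one ...", "chapter two ..."]
print(len(deduplicate(corpus)))  # 2 -- the duplicate is dropped before training
```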

Another option is to license copyrighted content and pay its creators. While this would be logistically challenging, a challenge of similar complexity has been met in the music industry, which has complex licensing rules that address different types of music licensing and a centralized database system to make that process accessible. If the courts prove to be hostile to AI’s fair use defense the generative AI field could evolve into a licensing regime similar to that of music.

Another solution is for the industry to create "clean" databases that pose no risk of copyright infringement. The material in such a database would be properly licensed or drawn from the public domain. An example would be an LLM trained on Project Gutenberg, Wikipedia and government websites and documents.
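As a toy sketch of the "clean database" approach, a training pipeline might keep only records whose provenance metadata matches an allowlist of public-domain or licensed sources. The source tags and record format below are hypothetical, invented for illustration:

```python
# Hypothetical provenance tags; a real pipeline would define its own schema.
ALLOWED_SOURCES = {"project_gutenberg", "wikipedia", "us_government"}

def build_clean_training_set(records):
    """Filter a raw corpus down to public-domain or licensed material only."""
    return [r for r in records if r["source"] in ALLOWED_SOURCES]

raw = [
    {"source": "project_gutenberg", "text": "Call me Ishmael."},
    {"source": "web_scrape",        "text": "A recent bestseller..."},
    {"source": "wikipedia",         "text": "An encyclopedia entry."},
]
clean = build_clean_training_set(raw)
print([r["source"] for r in clean])  # ['project_gutenberg', 'wikipedia']
```

The trade-off, of course, is coverage: a model trained only on clean sources avoids the copyright question but sees far less of contemporary language and culture.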

Given the speed at which AI is advancing I expect a variety of yet-to-be conceived or discovered infringement mitigation strategies to evolve, perhaps even invented by artificial intelligence.

International Issues

Copyright laws vary significantly across countries. It's worth noting that there has been more legislative activity on the topics discussed in this post in the EU than in the US. That said, as of this writing near the close of 2023 there is no consensus on how LLMs should be treated under EU copyright law.

Under a recent proposal made in connection with the proposed EU “AI Act,” providers of LLMs would need to “prepare and make publicly available a sufficiently detailed summary of the content used to train the model or system and information on the provider’s internal policy for managing copyright-related aspects.”

Additionally, they would need to demonstrate “that adequate measures have been taken to ensure the training of the model or system is carried out in compliance with Union law on copyright and related rights . . .”

The second of these provisions would, in effect, allow rights holders to opt out of having their works used for LLM training, since EU copyright law already permits them to reserve their works from text-and-data mining.

In contrast, the recent US AI Executive Order directs the Copyright Office to conduct a study that would include "the treatment of copyrighted works in AI training," but does not propose any changes to US copyright law or regulations. However, US AI companies will have to pay close attention to laws enacted in the EU (or elsewhere), since – as has been the case with the EU's privacy law (GDPR) – they have the potential to become a de facto minimum standard for legal compliance worldwide.

Andreessen Horowitz and the Copyright Shield

What about the two news items that I mentioned at the beginning of this post? With respect to the Andreessen Horowitz warning of the cost of copyright risk on AI developers, in my view the risk is overstated. If AI developers design their systems with the proper precautions, it seems likely that the courts will find them to qualify for fair use.

As to OpenAI’s promise to indemnify end users, the risk to OpenAI is slim, since its output is rarely similar to inputs in its training data and its filters are designed to frustrate users who try to output copyrighted content. In any event end users are rarely the targets of infringement suits, as seen in the many copyright suits that have been filed to date, which all target only AI companies as defendants.

The Future

The application of US copyright law to LLM-based AI systems is a complex topic. I expect more lawsuits to be filed as what appears to be a massive revolution in artificial intelligence continues at breakneck speed. While traditional copyright law seems to favor a fair use defense, the devil is in the details of these complex systems, and the legal outcome is by no means certain.


Selected pending cases:

Andersen v. Stability AI, N.D. Cal. 

J.L. v. Alphabet Inc., N.D. Cal.

P.M. v. OpenAI, N.D. Cal.

Doe v. GitHub, N.D. Cal.

Thomson Reuters Enter. Ctr. GmbH v. Ross Intel. Inc., D. Del.

Kadrey v. Meta, N.D. Cal. 

Sancton v. OpenAI, S.D. N.Y.