by Lee Gesmer | Apr 28, 2025 | Copyright
“Move fast and break things.” Mark Zuckerberg’s famous motto seems especially apt when examining how Meta developed Llama, its flagship AI model.
Like OpenAI, Google, Anthropic, and others, Meta faces copyright lawsuits for using massive amounts of copyrighted material to train its large language models (LLMs). However, the claims against Meta go further. In Kadrey v. Meta, the plaintiffs allege that Meta didn’t just scrape data — it pirated it, using BitTorrent to pull hundreds of terabytes of copyrighted books from shadow libraries like LibGen and Z-Library.
If proven, these allegations could significantly weaken Meta’s fair use defense and reshape the legal framework for AI training-data acquisition.
Meta’s BitTorrent Activities
In Kadrey v. Meta, plaintiffs allege that discovery has revealed that Meta’s GenAI team pivoted from tentative licensing discussions with publishers to mass BitTorrent downloading after receiving internal approvals that allegedly escalated “all the way to MZ”—Mark Zuckerberg.
BitTorrent is a peer-to-peer file-sharing protocol that efficiently distributes large files by breaking them into small pieces and sharing them across a decentralized “swarm” of users. Once a user downloads a piece, they immediately begin uploading it to others—a process known as “seeding.” While BitTorrent powers many legitimate projects like open source software distribution, it’s also the lifeblood of piracy networks. Courts have long treated unauthorized BitTorrent traffic as textbook copyright infringement (e.g., Glacier Films v. Turchin, 9th Cir. 2018).
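To make the mechanics concrete, here is a minimal Python sketch of the piece-hashing idea at the core of the protocol. It is illustrative only: real clients also implement tracker announcements and peer-wire messaging that this sketch omits, and the file name in the usage comment is hypothetical.

```python
import hashlib

PIECE_LENGTH = 256 * 1024  # 256 KiB, a common piece size in practice

def piece_hashes(path: str) -> list[bytes]:
    """Split a file into fixed-size pieces and SHA-1 hash each one, as a
    BitTorrent v1 "metainfo" (.torrent) file does. Peers in a swarm
    download and verify pieces independently against these hashes, which
    is why a client can begin re-uploading ("seeding") completed pieces
    long before it has the whole file."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(PIECE_LENGTH):
            hashes.append(hashlib.sha1(chunk).digest())
    return hashes

# Usage (hypothetical file): a 700 MB archive yields roughly 2,800 pieces,
# each shareable with the swarm as soon as it is downloaded and verified.
# print(len(piece_hashes("books_archive.tar")))
```

Because each verified piece becomes immediately eligible for upload, downloading and distributing are intertwined by design, a point that drives the seeding allegations discussed below.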
The plaintiffs allege that Meta engineers, worried that BitTorrent “doesn’t feel right for a Fortune 500 company,” nevertheless torrented 267 terabytes between April and June 2024—roughly twenty Libraries of Congress worth of data. This included the entire LibGen non-fiction archive, Z-Library’s cache, and massive swaths of the Internet Archive. According to the plaintiffs’ forensic analysis, Meta’s servers re-seeded the files back into the swarm, effectively redistributing mountains of pirated works.
The Legal Framework and Why BitTorrent Matters
Meta’s alleged use of BitTorrent complicates its copyright defense in several critical ways:
1. Reproduction vs. Distribution Liability
Most LLM training involves reproducing copyrighted works, which defendants typically argue is protected as fair use. But BitTorrent introduces unauthorized distribution under § 106(3) of the Copyright Act. Even if the court finds Llama’s training to be fair use, unauthorized seeding could constitute a separate violation, one that is far harder to defend as transformative.
2. Willfulness and Statutory Damages
Internal communications allegedly showed engineers warning about the legal risks, describing the pirated sources as “dodgy,” and joking about torrenting from corporate laptops. Plaintiffs allege that Meta ran the jobs on Amazon Web Services rather than Facebook servers, in a deliberate effort to make the traffic harder to trace back to Menlo Park. If proven, these facts could support a finding of willful infringement, exposing Meta to enhanced statutory damages of up to $150,000 per infringed work.
3. “Unclean Hands” and Fair Use Implications
The method of acquisition may significantly impact the fair use analysis. Plaintiffs point to Harper & Row v. Nation Enterprises (1985), where the Supreme Court found that bad faith acquisition—stealing Gerald Ford’s manuscript—undermined the defendant’s fair use defense. They argue that torrenting from pirate libraries is today’s equivalent of exploiting a purloined manuscript.
Meta’s Defense and Its Vulnerabilities
Meta argues that its use of the plaintiffs’ books is transformative: it extracts statistical patterns, not expressive content. It relies on Authors Guild v. Google (2d Cir. 2015), the Google Books case, and emphasizes that fair use focuses on how a work is used, not how it was obtained. Meta claims that its engineers took steps to minimize seeding; however, the internal data logs that would prove this are missing.
The company also frames Llama’s outputs as new, non-infringing content—asserting that bad faith, even if proven, should not defeat fair use.
However, the plaintiffs counter that Llama differs from Google Books in key respects:
– Substitution risk: Llama is a commercial product capable of producing long passages that may mimic authors’ voices, not merely displaying snippets.
– Scale: The amount of copying—terabytes of entire book databases—dwarfs that upheld in Google Books.
– Market harm: Licensing markets for AI training datasets are emerging, and Meta’s decision to torrent pirated copies directly undermines that market.
Moreover, courts have routinely rejected defenses based on the idea that pirated material is “publicly available.” Downloading infringing content over BitTorrent has never been viewed kindly—even when defendants claimed to have good intentions.
Even if Meta persuades the court that its training of Llama is transformative, the torrenting evidence remains a serious threat because:
– The automatic seeding function of BitTorrent means Meta likely distributed copyrighted material, independent of any transformative use
– The apparent bad faith (jokes about piracy, euphemisms describing pirated archives as “public” datasets) and efforts to conceal traffic present a damaging narrative
– The deletion of torrent logs may support an adverse inference that distribution occurred
– Judge Vince Chhabria might prefer to decide the case on familiar grounds—traditional copyright infringement—rather than attempting to set sweeping precedent on AI fair use
Broader Implications
If the court rules that unlawful acquisition via BitTorrent taints subsequent transformative uses, the AI industry will face a paradigm shift. Companies will need to document clean sourcing for training datasets—or face massive statutory damages.
If Meta prevails, however, it may open the door for more aggressive data acquisition practices: anything “publicly available” online could become fair game for AI training, so long as the final product is sufficiently transformative.
Regardless of the outcome, the record in Kadrey v. Meta is already reshaping AI companies’ risk calculus. “Scrape now, pay later” is beginning to look less like a clever strategy and more like a legal time bomb.
Conclusion
BitTorrent itself isn’t on trial in Kadrey v. Meta, but its DNA lies at the center of the dispute. For decades, most fair use battles have focused on how a copyrighted work is exploited. This case asks a new threshold question: does how you acquired the work matter before how you used it?
The answer could define how the next generation of AI is built.
by Lee Gesmer | Mar 22, 2025 | General
In 2019, Stephen Thaler filed an unusual copyright application. The work, titled “A Recent Entrance to Paradise,” listed as its “creator” not a human artist but the “Creativity Machine,” an AI system invented by Thaler. In his application for registration, Thaler informed the Copyright Office that the work was “created autonomously by machine,” and he claimed the copyright based on his “ownership of the machine.”
After appealing the Copyright Office’s denial of registration to the District Court and losing, Thaler appealed to the U.S. Court of Appeals for the District of Columbia Circuit.
On March 18, 2025, the D.C. Circuit upheld both the Copyright Office and the District Court, holding that copyright protection under the 1976 Act cannot be granted to a work generated solely by artificial intelligence.
Notably, this ruling does not exclude AI-assisted works from protection; it merely confirms that a human must exercise genuine creative control. The key question now is how much human input is necessary to qualify as the author—a point the court left open for future clarification.
Here are the key takeaways:
Human Authorship Is Mandatory. The court held that the Copyright Act of 1976 requires an “author” to be a human being. A work generated solely by AI, with the machine listed as the sole creator, does not qualify; a machine, including an AI system, is not a legal author.
AI-Assisted Works May Still Be Protected. The court underscored that human creators remain free to use AI tools. Such works can be copyrighted, provided a person (not just the AI) exercises creative control. This is consistent with the Copyright Office’s recently released report, Copyright and Artificial Intelligence, Part 2: Copyrightability, which confirms that the use of AI tools to assist human creativity is not a bar to copyright protection of the output, as long as there is sufficient human control over the expressive elements.

A Single Piece of American Cheese
In fact, on January 30, 2025, the Copyright Office registered A Single Piece of American Cheese based on the “selection, coordination, and arrangement of material generated by artificial intelligence.” See How We Received The First Copyright for a Single Image Created Entirely with AI-Generated Material.
Work-Made-for-Hire Doesn’t Save AI-Only Authorship. Dr. Thaler’s argument that AI could be his “employee” under the work-for-hire doctrine failed because the underlying creation must still have a human author in the first place.
Waived Argument. Dr. Thaler tried to claim he was effectively the author by directing the AI. The court found he had not properly raised this argument at the administrative level and therefore declined to consider it. This might have been his best argument, had he made it.
Policy Questions Left to Congress. While noting that new AI technologies could raise important policy issues, the court emphasized that it is for Congress, not the judiciary, to expand copyright beyond human authors.
Thaler v. Perlmutter (D.C. Cir. Mar. 18, 2025)
(For an earlier post on this case see: Court Denies Copyright Protection to AI Generated Artwork.)
by Lee Gesmer | Mar 20, 2025 | General
In October 2024 I created (probably not the right word – delivered?) a podcast using NotebookLM: An Experiment: An AI Generated Podcast on Artificial Intelligence and Copyright Law. The podcast that NotebookLM created was quite good, so I thought I’d try another one.
This is in the nature of experimentation, simply to explore this unusual AI tool.
This time the topic is the Oracle v. Google copyright litigation. I thought this would be a good topic to experiment with, since it is a complex topic and there are decisions by federal district court judge William Alsup (link), two Federal Circuit opinions (1, 2), and finally the Supreme Court decision. So, here goes.
Google v. Oracle: Copyright and Fair Use of Software APIs
by Lee Gesmer | Mar 17, 2025 | Copyright, DMCA/CDA
In my recent post—Postscript to my AI Series – Why Not Use the DMCA?—I discussed early developments in two cases pending against OpenAI in the U.S. District Court for the Southern District of New York (SDNY). Both cases focus on the claim that in the process of training its AI models, OpenAI illegally removed “copyright management information.” And, as I discuss below, they reach different outcomes.
What Is Copyright Management Information?
Many people who are familiar with the Digital Millennium Copyright Act’s (DMCA) “notice and takedown” provisions are unfamiliar with a part of the DMCA that makes it illegal to remove “copyright management information,” or “CMI.”
CMI includes copyright notices, information identifying the author, and details about the terms of use or rights associated with the work. It can appear visibly on the work itself, or it can be embedded as metadata in the underlying file.
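As an illustration of the metadata form, the following Python sketch pulls CMI-style fields out of a web page’s markup using only the standard library. The field names and sample values are hypothetical, chosen to show where CMI commonly lives; they are not drawn from any case record.

```python
from html.parser import HTMLParser

class CMIFinder(HTMLParser):
    """Collect the kinds of metadata Section 1202 treats as CMI:
    author identification, copyright notices, and licensing terms."""

    CMI_FIELDS = {"author", "copyright", "dc.rights"}

    def __init__(self):
        super().__init__()
        self.cmi = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name") in self.CMI_FIELDS:
            self.cmi[a["name"]] = a.get("content", "")
        elif tag == "link" and a.get("rel") == "license":
            self.cmi["license"] = a.get("href", "")

finder = CMIFinder()
finder.feed(
    '<meta name="author" content="Jane Doe">'
    '<meta name="copyright" content="(c) 2025 Jane Doe">'
    '<link rel="license" href="https://example.com/terms">'
)
print(finder.cmi)
# {'author': 'Jane Doe', 'copyright': '(c) 2025 Jane Doe',
#  'license': 'https://example.com/terms'}
```

A scraper that keeps only the visible article text necessarily leaves fields like these behind, which is exactly the conduct the statute addresses.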
The CMI removal statute—Section 1202(b)(1) of the DMCA—is a “double scienter” law, requiring that a plaintiff prove that (1) CMI was intentionally removed from a copyrighted work, and (2) that the alleged infringer knew or had reasonable grounds to know that the removal of CMI would “induce, enable, facilitate, or conceal” copyright infringement.
Here is an example of how this law might work.
Assume that I copied a work and that I have a legitimate fair use defense. Assume further that before publishing my copy I removed the copyright notice. I have a fair use defense as to duplication and distribution, but could I still be liable for CMI removal?
The answer is yes. A violation of the DMCA is independent of my fair use defense. And the penalty is not trivial: liability for CMI removal can result in statutory damages ranging from $2,500 to $25,000 per violation, as well as attorneys’ fees and injunctive relief. At scale those numbers grow quickly; stripping notices from 1,000 works, for example, could expose a defendant to between $2.5 million and $25 million. Moreover, unlike infringement actions, a claim for CMI removal does not require prior registration of the copyright.
All of this adds up to a powerful tool for copyright plaintiffs, a fact that has not been lost on plaintiffs’ counsel in AI litigation.
CMI – Why Don’t AI Companies Want To Include It?
AI companies’ removal of CMI during training stems from both technical necessities and strategic considerations. From a technical perspective, large language model training requires standardized data preparation processes that typically strip metadata, formatting, and peripheral information to create uniform training examples. This preprocessing is fundamental to how neural networks learn from text—they require clean, consistent inputs to identify linguistic patterns effectively.
The computational overhead is also significant. Preserving and processing CMI for billions of training examples would increase storage requirements and computational costs. AI companies argue that this additional information provides minimal benefit to model performance while significantly increasing training complexity.
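A deliberately simplified sketch of that kind of cleanup pipeline appears below. No AI company’s actual preprocessing code is public, so every pattern here is an assumption; the point is only that attribution and rights lines vanish as an ordinary side effect of normalizing text for a tokenizer.

```python
import re

# Naive patterns for lines that look like bylines or rights notices.
# Real pipelines are far more elaborate; these rules are assumptions.
CMI_LINE = re.compile(
    r"^(by\s+\S+|©|\(c\)\s|copyright\s+\d{4}|all rights reserved)",
    re.IGNORECASE,
)

def to_training_text(raw_article: str) -> str:
    """Keep the body prose and drop blank lines and lines matching the
    byline/notice patterns, yielding the uniform plain text a tokenizer
    expects. CMI disappears as a side effect of the cleanup."""
    kept = [
        line for line in raw_article.splitlines()
        if line.strip() and not CMI_LINE.match(line.strip())
    ]
    return "\n".join(kept)

sample = "A Headline\nby Jane Doe\nThe story begins here.\n© 2025 Example News"
print(to_training_text(sample))  # -> A Headline\nThe story begins here.
```

Whether this sort of routine normalization amounts to “intentional” removal under Section 1202(b)(1) is precisely what the courts in the cases below were asked to decide.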
Content owners, however, contend that these technical justifications mask more strategic motivations. They argue that AI companies deliberately eliminate attribution information to obscure the provenance of training data, making it difficult to detect when copyrighted material has been incorporated into models. This removal, they claim, facilitates a form of “laundering” copyrighted content through AI systems, where original sources become untraceable.
More pointedly, content creators assert that CMI removal directly enables downstream infringement by making it impossible for users to identify when an AI output derives from or reproduces copyrighted works. Without embedded attribution information, neither the AI company nor end users can properly credit or license content that appears in generated outputs.
The technical reality and legal implications of this process sit at the heart of these emerging cases, with courts now being asked to determine whether standard machine learning preprocessing constitutes intentional CMI removal under the DMCA’s “double scienter” standard.
Raw Story Media v. OpenAI
In the first of the two SDNY cases, Raw Story Media v. OpenAI, federal district court judge Colleen McMahon dismissed Raw Story’s claim that when training ChatGPT, OpenAI had illegally removed CMI.
At the heart of Judge McMahon’s decision was her observation that although OpenAI removed CMI from Raw Story articles, Raw Story was unable to allege that the works from which CMI had been removed had ever been disseminated by ChatGPT to anyone. On these facts, Judge McMahon held that Raw Story lacked standing under the Article III standing principles established by the Supreme Court in TransUnion v. Ramirez (2021). It’s worth noting her observation that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote” based on “the quantity of information contained in the [AI model].”
The Intercept Media v. OpenAI
In the second case, The Intercept Media v. OpenAI, The Intercept made the same allegation. It asserted that OpenAI had intentionally removed CMI—in this case author names, copyright notices, terms of use, and title information—from its AI training set.
However, in this case Judge Jed Rakoff came to the opposite conclusion. In November 2024 he issued a bottom-line order declining to dismiss plaintiff’s CMI claim and stated that an opinion explaining his rationale would be forthcoming.
That opinion was issued on February 20, 2025.
At this early stage of the case (before discovery or trial) the judge found that The Intercept met the “double scienter” standard. As to the first part of the test, The Intercept alleged that the algorithm that OpenAI uses to build its AI training data sets can only capture an article’s main text, which excludes CMI. This satisfied the intentional removal element.
As to the second component of the standard, the court was persuaded by The Intercept’s theory of “downstream infringement,” which argues that OpenAI’s model might enable users to generate content based on The Intercept’s copyrighted works without proper attribution. And importantly, unlike in Raw Story, The Intercept was able to provide examples of verbatim regurgitation of its content from ChatGPT based on prompts from The Intercept’s data scientist.
The district court held that a copyright injury “does not require publication to a third party,” finding unpersuasive OpenAI’s argument that the Intercept failed to demonstrate a concrete injury because it had not conclusively established that users had actually accessed The Intercept’s articles via ChatGPT.
Curiously, Judge Rakoff’s decision failed to mention the earlier ruling in Raw Story Media, Inc. v. OpenAI, where Judge McMahon held, on similar facts, that the plaintiffs lacked standing to assert CMI removal claims. Both cases were decided by SDNY district court judges. However, unlike the ruling in Raw Story Media, Judge Rakoff concluded that The Intercept’s alleged injury was closely related to the property-based harms typically protected under copyright law, satisfying the Article III standing requirement.
Thus, while Raw Story’s CMI claims against OpenAI have been dismissed, The Intercept’s CMI removal case against OpenAI will proceed.
by Lee Gesmer | Feb 17, 2025 | General
The community of copyright AI watchers has been eagerly awaiting the first case to evaluate the legality of using copyright-protected works as training data. We finally have it, and it has a lot of copyright law experts scratching their heads and wondering what it means for the AI industry.
On February 11, 2025, Third Circuit federal appeals court Judge Stephanos Bibas—sitting by designation in the U.S. District Court for the District of Delaware—issued a decision that is likely to shape the future of AI copyright litigation. By granting partial summary judgment to Thomson Reuters Enterprise Centre GmbH (“Thomson Reuters”) against Ross Intelligence Inc. (“Ross”), the court revisited and reversed its earlier 2023 opinion and rejected Ross’s fair use defense. Although this case involves a non-generative AI application, the reasoning has implications for the more than 30 ongoing AI copyright cases currently being litigated.
Case Overview
The Ross litigation centers on allegations that Ross used copyrighted material from Thomson Reuters’ Westlaw—a leading legal research platform—to train its AI-driven legal research tool. Ross wanted to use the Westlaw headnotes to train its AI model, but Thomson Reuters would not grant Ross a license. Instead, Ross commissioned “Bulk Memos” from a third-party provider. These memos, designed to simulate legal questions and answers, closely mirrored Westlaw headnotes—concise summaries that encapsulate judicial opinions. After determining that 2,243 of the Bulk Memo questions were substantially similar to Westlaw headnotes, the court held that this was direct copyright infringement and rejected Ross’s fair use defense.
Breaking Down the Fair Use Analysis
The court evaluated the four statutory fair use factors, with two—“purpose and character” and “market effect”—proving decisive:
1 – Purpose and Character of the Use: The court found that Ross’s use was commercial and aimed at developing a product that directly competes with Westlaw. Despite Ross’s argument that its copying was merely an “intermediate step” in a broader process, the judge rejected the intermediate copying cases (discussed below), emphasizing that “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw.” Importantly, the court’s analysis was informed by the framework established in the recent Supreme Court decision in Warhol v. Goldsmith, which stressed that reproduction fails to constitute a transformative use if the copying serves a similar market function as the original. The Warhol precedent underlines that transformation requires a “further purpose or different character” from the original work, a requirement Ross did not meet.
2 – Market Effect: The market effect factor proved even more influential. By positioning itself as a direct substitute for Westlaw, Ross both disrupted the existing market and undercut potential licensing markets for Thomson Reuters’s content (notwithstanding that Thomson refused to license to Ross). The court noted that any harm to this market—“undoubtedly the single most important element of fair use”—weighed decisively against Ross.
While the factors addressing the nature of the copyrighted work and the amount used modestly favored Ross, they were insufficient to overcome the adverse findings regarding the purpose of the use and market harm.
The Court’s 2023 Ruling vs. The Current Ruling
It’s worth noting the struggle the judge went through in deciding the fair use issue in this case. Judges rarely reverse themselves on major rulings, but that’s what happened here.
As I noted, the judge in this case had issued a 2023 decision on the fair use issue. There, he held the question of whether Ross’s use of the Westlaw headnotes was fair use to be a jury issue.
In the current decision he reversed himself.
Here’s what the judge said in 2023:
If Ross’s characterization of its activities is accurate, it translated human language into something understandable by a computer as a step in the process of trying to develop a “wholly new,” albeit competing, product—a search tool that would produce highly relevant quotations from judicial opinions in response to natural language questions. This also means that Ross’s final product would not contain or output infringing material. Under Sega [v. Accolade] and Sony [v. Connectix], this is transformative intermediate copying.
And here is what he said in his 2025 decision:
My prior opinion wrongly concluded that I had to send this factor to a jury. I based that conclusion on Sony and Sega. Since then, I have realized that the intermediate-copying cases [Sony, Sega] (1) are computer-programming copying cases; and (2) depend in part on the need to copy to reach the underlying ideas. Neither is true here. Because of that, this case fits more neatly into the newer framework advanced by Warhol. I thus look to the broad purpose and character of Ross’s use. Ross took the headnotes to make it easier to develop a competing legal research tool. So Ross’s use is not transformative. Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.
This was a major change in direction, and it reflects the challenge the judge perceived in applying copyright fair use to artificial intelligence under the facts in this case.
Implications for Generative AI Litigation
The question on the minds of most copyright AI observers is, “what does this mean for the more than 30 copyright cases against frontier AI model developers—OpenAI, Google, Anthropic, Facebook, X/Twitter, and many others?”
My answer? In most cases, likely not much.
The 2025 Ross decision underscores that even intermediate copying can fall outside fair use when it ultimately facilitates the creation of a product that directly competes with the copyrighted work. For example, unlike Authors Guild v. Google, where the copying enabled a unique search function without substituting for the original works, Ross’s use of headnotes was aimed squarely at developing an AI legal research tool that encroaches on Westlaw’s market. This market harm—central to fair use analysis—undermines the fair use defense by establishing that the copying, even if temporary or intermediate, has a direct commercial impact. The ruling aligns with recent precedents like Warhol, which require a truly transformative purpose rather than mere replication, thereby narrowing the scope of permissible intermediate copying in AI training contexts.
However, the case may not have much significance for most of the pending AI copyright cases. While the Ross decision tightens the fair use framework in situations where the end product directly competes with the original work, most current generative AI cases do not involve direct competition. Most generative AI systems produce entirely new content rather than serving as a substitute for the copyrighted materials used during training. As a result, the market harm and competitive concerns central to the Ross ruling may not be as relevant in these cases, and its impact on the broader generative AI landscape may be limited.
Conclusion
The ruling in Thomson Reuters v. Ross Intelligence sets an important precedent for how courts may evaluate the use of copyrighted works in AI training. Although fact-specific and limited to a non-generative AI context, the decision’s reliance on principles from the Warhol case—particularly the need for a transformative purpose and the critical weight of market impact—will likely influence future disputes, including those involving frontier generative AI models, particularly where the AI model competes with the owner of the training data.
Developers and content owners alike should take note: as the legal landscape adapts to the realities of AI, robust data sourcing strategies and a clear understanding of copyright limitations will be crucial. For companies working on generative AI, the challenge will be to innovate without replicating the competitive functions of existing copyrighted works—a balancing act that this decision has now brought into focus.
It’s also important to note that this ruling doesn’t end the case. There are remaining issues of fact that the judge reserved for trial. However, it appears that Ross Intelligence is bankrupt, and therefore may not have the financial resources to continue to trial. And, of course, Ross could appeal the trial judge’s rulings at the conclusion of the case, although it is questionable whether it will be able to do so for the same reason. It seems likely that this case will end here.
Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. (D. Del. Feb. 11, 2025)