At the heart of large language model (LLM) technology lies a deceptively simple triad: compute, algorithms, and data.
Compute powers training – vast arrays of graphics processing units crunching numbers at a scale measured in billions of parameters and trillions of tokens. Algorithms shape the intelligence – breakthroughs like the Transformer architecture enable models to understand, predict, and generate human-like text. Data is the raw material, the fuel that teaches the models everything they know about the world.
While compute and algorithms continue to advance at breakneck speed, data has emerged as a critical bottleneck. Training a powerful LLM requires unimaginably large volumes of text, and the web’s low-hanging fruit – books, Wikipedia, forums, blogs – has largely been harvested.
The large AI companies have gone to extraordinary lengths to collect new data. For example, Meta allegedly used BitTorrent to download massive amounts of copyrighted content. OpenAI reportedly transcribed over a million hours of YouTube videos to train its models. Despite this, the industry may be approaching a data ceiling.
In this context, Reddit represents an unharvested goldmine – a sprawling archive containing terabytes of linguistically rich user-generated content.
And, according to Reddit, Anthropic couldn’t resist the temptation: desperate for training data, it illegally scraped Reddit without a license and trained Claude on Reddit data.
The Strategic Calculus: Why Contract, Not Copyright
Reddit filed suit against Anthropic in San Francisco Superior Court on June 4, 2025. Notably, it bypassed the well-trodden path of federal copyright litigation in favor of a novel contract-centric strategy. This tactical departure from The New York Times Co. v. OpenAI and similar copyright-based challenges signals a fundamental shift in how platforms may assert control over their data assets.
Reddit’s decision to anchor its complaint in state contract law reflects the limitations of copyright doctrine and the structural realities of user-generated content platforms. Unlike traditional media companies that own their content outright, Reddit operates under a licensing model where individual users retain copyright ownership while granting the platform non-exclusive rights under Section 5 of its User Agreement.
This ownership structure creates obstacles for copyright enforcement at scale. As a mere non-exclusive licensee, Reddit lacks standing to sue for direct infringement: the Copyright Act permits suit only by the legal or beneficial owner of an exclusive right. Moreover, the copyright registration requirement (a precondition to filing suit) would prove prohibitively expensive and logistically impossible across millions of user posts. And, even if Reddit could clear the standing hurdle, it would face the fair use defenses that have been raised in the more than 40 pending AI copyright cases.
By pivoting to contract law, Reddit sidesteps these constraints. Contract claims require neither content ownership nor copyright registration – only proof of agreement formation and breach.
The Technical Architecture of Alleged Infringement
Reddit’s complaint alleges that Anthropic’s bot made over 100,000 automated requests after July 2024, when Anthropic publicly announced it had ceased crawling Reddit. This timeline is significant because it suggests deliberate circumvention of robots.txt exclusion protocols and platform-specific blocking mechanisms.
According to Reddit, the technical details reveal sophisticated evasion tactics. Unlike legitimate web crawlers that respect robots.txt directives, the alleged “ClaudeBot” activity employed distributed request patterns designed to avoid detection. Reddit’s complaint specifically references CDN bandwidth costs and engineering resources consumed by this traffic – technical details that will be crucial for establishing the “impairment” element required under California’s trespass to chattels doctrine.
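For context on what “respecting robots.txt” means in practice, here is a minimal sketch of the check a compliant crawler performs before each fetch, using Python’s standard library (the user-agent string and target page are illustrative):

```python
from urllib import robotparser

# A well-behaved crawler downloads the site's robots.txt and honors
# its directives before requesting any page.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.reddit.com/robots.txt")
rp.read()  # fetch and parse the crawl directives

page = "https://www.reddit.com/r/AskHistorians/"
if rp.can_fetch("ExampleBot/0.1", page):
    print("robots.txt permits fetching", page)
else:
    print("robots.txt disallows fetching", page)
```

The crucial point is that robots.txt is purely advisory: nothing in the protocol stops a crawler that simply skips this check, which is why Reddit must reach for contract and tort theories rather than a technical remedy.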
The complaint’s emphasis on Anthropic’s “whitelist” of high-quality subreddits demonstrates technical sophistication in data curation. This selective approach undermines any defense that the scraping was merely incidental web browsing, instead revealing a targeted data extraction operation designed to maximize training value while minimizing detection risk.
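A whitelist of this kind takes only a few lines of code. The sketch below is hypothetical – the subreddit names are invented – but it illustrates the curation logic the complaint describes:

```python
from urllib.parse import urlparse

# Hypothetical curation list; the complaint alleges Anthropic maintained
# a "whitelist" of subreddits judged valuable for training.
WHITELIST = {"AskHistorians", "science", "personalfinance"}  # illustrative

def should_crawl(url: str) -> bool:
    """Fetch only posts in whitelisted subreddits; skip everything else."""
    parts = urlparse(url).path.strip("/").split("/")
    return len(parts) >= 2 and parts[0] == "r" and parts[1] in WHITELIST

print(should_crawl("https://www.reddit.com/r/science/comments/abc/x/"))  # True
print(should_crawl("https://www.reddit.com/r/memes/comments/def/y/"))    # False
```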
Here’s a quick look at the three strongest counts in Reddit’s complaint: breach of contract, trespass to chattels, and unjust enrichment.
Browse-Wrap Formation: The Achilles’ Heel
The most vulnerable aspect of Reddit’s contract theory lies in contract formation under browse-wrap principles. Reddit argues that each automated request constitutes acceptance of its User Agreement terms, which prohibit commercial exploitation under Section 3 and automated data collection under Section 7.
However, under Ninth Circuit precedent applying California law, browse-wrap contracts require reasonably conspicuous notice, and Reddit’s User Agreement link appears in small text at the page footer without prominent placement or mandatory acknowledgment – what some courts have termed “inquiry notice” rather than “actual notice.”
And, unlike human users who might scroll past terms of service, automated bots often access content endpoints directly, without rendering full page layouts. This raises fundamental questions about whether algorithmic agents can form contractual intent under traditional offer-and-acceptance doctrine.
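The endpoint issue is concrete: Reddit serves machine-readable JSON for most listings when “.json” is appended to the URL, so a bot consuming that format never renders the footer where the User Agreement link appears. A minimal sketch (the user-agent string is illustrative):

```python
import json
import urllib.request

# Appending ".json" to most Reddit listing URLs returns structured data
# rather than HTML; no footer, and no User Agreement link, is ever rendered.
url = "https://www.reddit.com/r/AskHistorians/top.json?limit=5"
req = urllib.request.Request(url, headers={"User-Agent": "ExampleBot/0.1"})

with urllib.request.urlopen(req) as resp:
    listing = json.load(resp)

for child in listing["data"]["children"]:
    print(child["data"]["title"])  # post content, stripped of all page chrome
```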
California courts have been increasingly skeptical of browse-wrap enforcement, and have required more than mere website access to establish assent. Reddit’s theory will need to survive a motion to dismiss where Anthropic will likely argue that no reasonable bot operator would have constructive notice of buried terms.
The Trespass to Chattels Gambit
Reddit’s trespass claim faces a high bar. California courts require proof of actual system impairment rather than mere unauthorized access. Reddit tries to meet this standard by citing CDN overage charges, server strain, and engineering time spent mitigating bot traffic. These bandwidth costs and engineering expenses, while real, may not rise to the level of system impairment that the courts demand.
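A back-of-envelope calculation suggests why. Every figure below except the alleged request count is an assumption chosen for illustration:

```python
# Illustrative figures only; nothing here comes from the complaint
# except the alleged request count.
requests_alleged = 100_000   # "over 100,000 automated requests"
avg_response_mb = 0.5        # assumed average payload per request
cdn_rate_per_gb = 0.08       # assumed CDN egress price, USD

total_gb = requests_alleged * avg_response_mb / 1024
print(f"data transferred: ~{total_gb:.0f} GB")                  # ~49 GB
print(f"direct CDN cost:  ~${total_gb * cdn_rate_per_gb:.2f}")  # ~$3.91
```

Even if the assumed payload were ten times larger, the direct transfer cost would remain trivial for a platform of Reddit’s scale – which is why the complaint leans as heavily on diverted engineering time as on raw bandwidth.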
The technical evidence will be crucial here. Reddit must demonstrate quantifiable performance degradation – slower response times, server crashes, or capacity limitations – rather than merely increased operational costs. This evidentiary burden may prove difficult given modern cloud infrastructure’s elastic scaling capabilities.
Unjust Enrichment and the Licensing Market
Reddit’s unjust enrichment claim rests on its data’s demonstrable market value, evidenced by licensing agreements with OpenAI, Google, and other AI companies. These deals, reportedly worth tens of millions annually, establish a market price for Reddit’s content.
The legal theory here is straightforward: Anthropic received the same valuable data as paying licensees but avoided the associated costs, creating an unfair competitive advantage. Under California law, unjust enrichment requires showing that the defendant received a benefit that it would be unjust to retain without compensation.
Reddit’s technically sophisticated Compliance API bolsters this claim. Licensed partners receive real-time deletion signals, content moderation flags, and structured data feeds that ensure training datasets remain current and compliant with user privacy preferences. Anthropic’s alleged automated data extraction bypassed these technical safeguards, potentially training on content that users had subsequently deleted or restricted.
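The compliance gap is easiest to see in code. The sketch below is entirely hypothetical – the function names and data shapes are invented, since the Compliance API’s actual interface is not public – but it captures the step a licensed pipeline performs that a scraper cannot: honoring deletion signals before content reaches a training set.

```python
from typing import Iterable

def apply_deletion_signals(corpus: dict[str, str],
                           deleted_ids: Iterable[str]) -> dict[str, str]:
    """Drop content that users deleted after it was collected.

    `corpus` maps post IDs to text; `deleted_ids` stands in for the
    stream of deletion events a licensed partner would receive.
    Both shapes are hypothetical.
    """
    removals = set(deleted_ids)
    return {pid: text for pid, text in corpus.items() if pid not in removals}

# A scraped snapshot has no equivalent step: content deleted after
# the crawl simply persists in the training data.
corpus = {"t3_abc": "a post", "t3_def": "a since-deleted post"}
print(apply_deletion_signals(corpus, ["t3_def"]))  # {'t3_abc': 'a post'}
```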
Broader Implications for AI Governance
If Reddit’s contract theory succeeds, it would establish a powerful precedent allowing platforms to impose licensing requirements through terms of service. Every website with clear usage restrictions could potentially demand compensation from AI companies, fundamentally altering the economics of model training.
Conversely, if browse-wrap formation fails or federal preemption invalidates state law claims, AI developers would gain confidence that user-generated web content remains accessible, subject only to copyright limitations.
The Constitutional AI Paradox
Most damaging to Anthropic may be the reputational challenge to its “Constitutional AI” branding. The company has positioned itself as the ethical alternative in AI development, emphasizing safety and responsible practices. Reddit’s allegations create a narrative tension that extends beyond legal liability to market positioning.
This reputational dimension may drive settlement negotiations regardless of the legal merits, as Anthropic seeks to preserve its differentiated market position among enterprise customers increasingly focused on AI governance and compliance.
Conclusion
While Reddit’s legal claims face significant doctrinal challenges, the case underscores the importance of understanding both the technical architecture of web scraping and the evolving legal frameworks governing AI development. The outcome may determine whether platforms can use contractual mechanisms to assert control over their data assets, or whether AI companies can continue treating public web content as freely available training material subject only to copyright challenge.