Google’s NotebookLM has been getting a lot of attention. You upload your sources (articles, YouTube videos, URLs, text documents, audio files) and NotebookLM can create a podcast based on the library you’ve created.
I thought I’d experiment with this a bit. I uploaded a variety of articles on copyright and AI and hit “go.” I didn’t give NotebookLM the subject or any prompts. It figured out the topic (correctly) and created the 11-minute podcast embedded below.
A few observations:
First, the speaker voices are natural and realistic – they interact fluidly, have natural intonation and use varied speech patterns.
Second, the content quality is very high – the podcast correctly highlights Google Books as the leading case on the issue and outlines the implications of the case for and against fair use.
It also discusses the New York Times v. Microsoft/OpenAI case in detail, and focuses on the fact that the NYT was able to force ChatGPT to regurgitate verbatim or near-verbatim NYT content.
The podcast goes on to discuss StabilityAI, the four fair use factors (as applied) and the larger consequences of LLMs for the copyright system.
In late August the U.S. Third Circuit Court of Appeals released a far-reaching decision, holding that § 230 of the Communications Decency Act (CDA) did not provide a safe harbor for the social media company TikTok when its algorithm recommended and promoted a video that allegedly led a minor to accidentally kill herself. Anderson v. TikTok (3rd Cir. Aug. 27, 2024).
Introduction
First, a brief reminder – § 230, which was enacted in 1996, has been the guardian angel of internet platform owners. The law prohibits courts from treating a provider of an “interactive computer service” (i.e., a website) as the “publisher or speaker” of third-party content posted on its platform. 47 U.S.C. § 230(c)(1). Under § 230, websites have been given broad legal protection. The statute has created what is, in effect, a form of legal exceptionalism for Internet publishers. Without it, any social media site (such as Facebook or X) or review site (such as Amazon) would be sued into oblivion.
On the whole, the courts have applied the law liberally, dismissing cases against Internet providers under many fact scenarios. However, a vocal group argues that the broad immunity courts have found in § 230 of the CDA rests on overzealous interpretations that go far beyond the statute’s original intent.
Right now § 230 has one particularly prominent critic – Supreme Court Justice Clarence Thomas. Justice Thomas has not held back when expressing disagreement with the broad protection the courts have provided under § 230. He has written, for example:
Nowhere does [§ 230] protect a company that is itself the information content provider . . . And an information content provider is not just the primary author or creator; it is anyone “responsible, in whole or in part, for the creation or development” of the content.
Again in Doe ex rel. Roe v. Snap, Inc. (2024), Justice Thomas dissented from the denial of certiorari and was critical of the scope of § 230, stating –
In the platforms’ world, they are fully responsible for their websites when it results in constitutional protections, but the moment that responsibility could lead to liability, they can disclaim any obligations and enjoy greater protections from suit than nearly any other industry. The Court should consider if this state of affairs is what § 230 demands.
With these judicial headwinds, Anderson v. TikTok sailed into the Third Circuit. Even one Supreme Court justice is enough to create a Category Two storm in the legal world. And boy, did the Third Circuit deliver, joining the § 230 opposition and potentially rewriting the rulebook on internet platform immunity.
Anderson v. TikTok
Nylah Anderson, a 10-year-old girl, died after attempting the “Blackout Challenge” she saw on TikTok. The challenge, which encourages users to choke themselves until losing consciousness, appeared on Nylah’s “For You Page”, a feed of videos curated by TikTok’s algorithm.
Nylah’s mother sued TikTok, alleging the company was aware of the challenge and promoted the videos to minors. TikTok defended itself using § 230, arguing that its algorithm shouldn’t strip away its immunity for content posted by others.
The Third Circuit took a novel approach to interpreting § 230, concluding that when internet platforms use algorithms to curate and recommend content, they are engaging in “first-party speech,” essentially creating their own expressive content.
The court reached this conclusion largely based on the Supreme Court’s recent decision in Moody v. NetChoice (2024). In that case the Court held that an internet platform’s algorithm that reflects “editorial judgments” about content compilation is the platform’s own “expressive product,” protected by the First Amendment. The Third Circuit reasoned that if algorithms are first-party speech under the First Amendment, they must be first-party speech under § 230 too.
Here is the court’s reasoning:
§ 230 immunizes [web sites] only to the extent that they are sued for “information provided by another information content provider.” In other words, [web sites] are immunized only if they are sued for someone else’s expressive activity or content (i.e., third-party speech), but they are not immunized if they are sued for their own expressive activity or content (i.e., first-party speech). . . . Given the Supreme Court’s observations that platforms engage in protected first-party speech under the First Amendment when they curate compilations of others’ content via their expressive algorithms, it follows that doing so amounts to first-party speech under § 230. . . . TikTok’s algorithm, which recommended the Blackout Challenge to Nylah on her FYP, was TikTok’s own “expressive activity,” and thus its first-party speech.
Accordingly, TikTok was not protected under § 230, and Anderson’s case could proceed.
Whether the Third Circuit’s logic will be adopted by other courts (including the Supreme Court, as I discuss below) is an open question. The court’s reasoning assumes that the definition of “speech” should be consistent across First Amendment and CDA § 230 contexts. However, these are distinct legal frameworks with different purposes. The First Amendment protects freedom of expression from government interference; CDA § 230 provides liability protection for internet platforms regarding third-party content. Treating them as interchangeable may oversimplify the nuanced legal distinctions between them.
Implications for Online Platforms
If this ruling stands, platforms may need to reassess their content curation and targeted recommendation algorithms. The more a platform curates or recommends content, the more likely it is to lose § 230 protection for that activity. For now, this decision has opened the doors for more lawsuits in the Third Circuit against platforms based on their recommendation algorithms. If the holding is adopted by other courts it could lead to a fundamental rethinking of how social media platforms operate.
As a consequence, platforms might become hesitant to use sophisticated algorithms for fear of losing immunity, potentially resulting in a less curated, more chaotic online environment. This could, paradoxically, lead to more harmful content being visible, contrary to the court’s apparent intent.
Where To From Here?
Given the potential far-reaching consequences of this decision, it’s likely that TikTok will seek en banc review by the full Third Circuit, and the impact of the ruling makes a strong case for granting it.
If unsuccessful there, this case is a strong candidate for Supreme Court review, since it creates a circuit split, diverging from § 230 interpretations in other jurisdictions. The Third Circuit even helpfully cites pre-Moody diverging opinions from the 1st, 2nd, 5th, 6th, 8th, 9th, and D.C. Circuits, essentially teeing the issue up for Supreme Court review.
In fact, all indications are that the Supreme Court would be receptive to an appeal in this case. The Court recently accepted the appeal of a case in which it would determine whether algorithm-based recommendations were protected under § 230. However, after hearing oral argument it decided the case on different grounds and didn’t reach the § 230 issue. Gonzalez v. Google (USSC May 18, 2023). Anderson presents another opportunity for the Supreme Court to weigh in on this issue.
In the meantime, platforms may start experimenting with different forms of content delivery that could potentially fall outside the court’s definition of curated recommendations. This could lead to innovative new approaches to content distribution, or it could result in less personalized, less engaging online experiences.
Conclusion
Anderson v. TikTok represents a potential paradigm shift in § 230 jurisprudence. While motivated by a tragic case, the legal reasoning employed could have sweeping consequences for online platforms, content moderation, and user-generated content. The decision raises fundamental questions about the nature of online platforms and the balance between protecting free expression online and holding platforms accountable for harmful content. As we move further into the age of AI-curated feeds and algorithmic recommendations, these questions will only become more pressing.
In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging copyright infringement. This case, along with a number of similar cases filed against AI companies, brings to the forefront a fundamental challenge in applying traditional copyright law to a revolutionary technology: Large Language Models (LLMs). Perhaps more than any copyright case that precedes them, these cases grapple with a form of alleged infringement that defies conventional legal analysis.
This article is the first in a three-part series that will examine the copyright implications of the AI development process.
Disclaimer: I’m not a computer or AI scientist. However, neither are the judges and juries that will be asked to apply copyright law to this technology, or the legislators that may enact laws regulating it. It’s unlikely that they will go much beyond the level of detail I’ve used here.
What are Large Language Models (LLMs)?
Large Language Models, or LLMs, are gargantuan AI systems built on a vast corpus of training data and billions to trillions of parameters. They are designed to understand, generate, and manipulate human language, learning patterns from the data that allow them to perform a wide range of language tasks with remarkable fluency. Their inner workings are fundamentally different from any previous technology that has been the subject of copyright litigation, including traditional computer software.
LLMs typically use transformer-based neural networks: interconnected nodes organized into layers that perform computations. The strengths of these connections – the influence that nodes have on one another – are what the model learns during training. These are called the model’s parameters, or weights, and they are represented as numbers.
Here’s a simplified explanation of what happens when you use an AI like a large language model (a toy code sketch follows these steps):
You input a prompt (your question or request).
The computer breaks down your prompt into smaller pieces called tokens. These can be words, parts of words, or even individual characters.
The AI processes these tokens through its neural network – imagine this like a complex web of connections. Each part of this network analyzes the tokens and figures out how they relate to each other.
As it processes, the AI predicts the probability distribution for the next token based on what it learned during its training.
The LLM selects tokens based on these probabilities and combines them to create a coherent response or output for you, the user.
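To make this concrete, here is a minimal, purely illustrative sketch in Python. It is emphatically not how a real LLM works – the toy “model” below is just a table of next-word counts built from a couple of sentences, standing in for a neural network with billions of parameters – but it follows the same loop described above: tokenize the prompt, predict a probability distribution over the next token, pick a token, and repeat.

```python
import random
from collections import defaultdict, Counter

# Toy "training corpus" (real models ingest terabytes of text).
corpus = "the cat sat on the mat . the dog sat on the rug . the cat chased the dog ."

# "Tokenization": here, simply splitting on whitespace.
tokens = corpus.split()

# A stand-in "model": a table of next-token counts, playing the role
# of the billions of learned parameters in a real LLM.
model = defaultdict(Counter)
for current, nxt in zip(tokens, tokens[1:]):
    model[current][nxt] += 1

def generate(prompt, max_new_tokens=8):
    """Tokenize the prompt, then repeatedly predict and sample the next token."""
    output = prompt.split()
    for _ in range(max_new_tokens):
        counts = model.get(output[-1])
        if not counts:
            break  # the toy model has no prediction for this token
        words = list(counts.keys())
        # Turn raw counts into a probability distribution and sample from it.
        probs = [c / sum(counts.values()) for c in counts.values()]
        output.append(random.choices(words, weights=probs)[0])
    return " ".join(output)

print(generate("the cat"))  # e.g., "the cat sat on the mat . the dog"
```

The point to notice is that generation is probabilistic: the model does not look up a stored answer, it repeatedly samples from a distribution over possible next tokens – which is why the same prompt can produce different outputs.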
The “large” in Large Language Models primarily refers to the enormous number of parameters these models contain – sometimes in the trillions. These parameters represent the model’s learned patterns and relationships, fine-tuned through exposure to massive amounts of text data. While larger and more diverse high-quality datasets can lead to better AI models, other factors such as model architecture, training techniques, and fine-tuning also play important roles in model performance.
How Do AI Companies Obtain Their Training Data?
AI companies employ various methods to acquire this data –
– Web scraping and crawling. One of the primary methods of data acquisition is web scraping – the automated process of extracting data from websites. AI companies deploy sophisticated crawlers that systematically browse the internet, copying text from millions of web pages (a minimal sketch of what such a crawler does appears after this list). This method allows for the collection of diverse, up-to-date information but raises questions about the use of copyrighted material without explicit permission.
– Partnerships and licensing agreements. Some companies enter into partnerships or licensing agreements to access high-quality, curated datasets. For instance, OpenAI has partnered with organizations like the Associated Press to use its news archives for training purposes.
– Public datasets and academic corpuses. Many LLMs are trained, at least in part, on publicly available datasets and academic text collections. These might include Project Gutenberg’s collection of public domain books, scientific paper repositories, or curated datasets like the Common Crawl corpus.
– User-generated content. Platforms that interact directly with users, such as ChatGPT, can potentially use the conversations and inputs from users to further train and refine their models. This practice raises privacy concerns and questions about the ownership of user-contributed data.
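To give a sense of how mechanically simple the core of web scraping is, here is a hypothetical, stripped-down sketch in Python using the widely available requests and BeautifulSoup libraries. A real training crawler is far more elaborate – it follows links at scale, deduplicates, filters for quality, and (sometimes) honors robots.txt – and the URL below is only a placeholder.

```python
import requests
from bs4 import BeautifulSoup

def scrape_page_text(url):
    """Fetch a web page and return its visible text, stripped of HTML markup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Discard scripts, styles, and navigation chrome; keep the readable text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

# Hypothetical usage: a training crawl repeats this across millions of pages
# and stores the results for later preprocessing.
print(scrape_page_text("https://example.com/some-article")[:500])
```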
In the context of the New York Times lawsuit, it’s worth noting that OpenAI, like many AI companies, has not publicly disclosed the full extent of its training data sources. However, it’s widely believed that the company uses a combination of publicly available web content, licensed datasets, and partnerships to build its training corpus. The lawsuit alleges that this corpus includes copyrighted New York Times articles, obtained without permission or compensation.
The Training Process: How Machines “Learn” From Data
Once acquired, the raw data undergoes several processing steps before it can be used to train an LLM –
– Data preprocessing and cleaning. The first step involves cleaning the raw data. This includes removing irrelevant information, correcting errors, and standardizing the format. This may involve stripping away HTML tags, removing advertisements, or filtering out low-quality content.
– Tokenization and encoding. Next, the text is broken down into smaller units called tokens. These might be words, parts of words, or even individual characters. Each token is then converted into a numerical representation that the AI can process. This step is crucial as it determines how the model will interpret and generate language.
During training, the LLM is exposed to this preprocessed data, learning to predict patterns and relationships between tokens. This is an iterative process where the model makes predictions, compares them to the actual data, and adjusts its internal parameters to improve accuracy. This process, known as “backpropagation,” is repeated billions of times across the entire dataset. In a large LLM this can take months, operating 24/7 on a massive system of graphics processing chips.
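For readers who want a feel for what “adjusts its internal parameters” means, here is a deliberately tiny sketch. It is not how an LLM is trained – there is a single made-up parameter rather than billions, and the “data” is four number pairs rather than a slice of the internet – but it shows the same predict-compare-adjust cycle that backpropagation performs at enormous scale.

```python
# One learnable parameter standing in for the billions in a real LLM.
w = 0.0
learning_rate = 0.01

# Toy "training data": the model should learn that the output is 3 times the input.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]

for step in range(1000):                 # real training runs for vastly more steps
    for x, target in data:
        prediction = w * x               # forward pass: make a prediction
        error = prediction - target      # compare the prediction to the actual data
        gradient = 2 * error * x         # how the squared error changes as w changes
        w -= learning_rate * gradient    # nudge the parameter to reduce the error

print(round(w, 3))  # approximately 3.0 -- the "learned" parameter
```

In this toy example, nothing in the finished model “contains” the training pairs; all that remains is the adjusted number (here, roughly 3.0). That, in microcosm, is the transformation the next section describes.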
The Transformation From Text to Numbers
For purposes of copyright law, here’s the crux of the matter: the AI industry asserts that after this process, the original text no longer exists in any recognizable form within the LLM. The model becomes a vast sea of numbers, with no direct correspondence to the original text. If true, this transformation creates a fundamental challenge for copyright law –
– No Side-by-Side Comparison: In traditional copyright cases, courts rely heavily on comparing the original work side-by-side with the allegedly infringing material. With LLMs, this is impossible. You can’t “read” an LLM or print it out for comparison.
– Black Box Nature: The internal workings of LLMs are often referred to as a “black box.” Even the developers may not fully understand how the model arrives at its outputs.
– Dynamic Generation: The AI industry claims that LLMs don’t store and retrieve text in a conventional database format; they generate it dynamically based on learned patterns. This means that any similarity to copyrighted material in the output is a result of statistical prediction, not direct copying.
– Distributed Information: The AI industry claims that information from any single source is distributed across countless parameters in the model, making it impossible to isolate the influence of any particular work.
However, copyright owners do not concede that completed AI models (as distinct from the training data) are merely abstracted statistical patterns of the training data. Rightsholders assert that LLMs do indeed retain the expression of the original works on which they were trained. There are studies showing that LLMs are able to regurgitate their training materials, and the New York Times lawsuit against OpenAI and Microsoft includes 100 examples of this. See also Concord Music Group v. Anthropic (alleging that song lyrics can be accessed verbatim or near-verbatim from Claude). Rightsholders argue that this could only occur if the models encode the expressive content of these works.
Copyright Implications
Assuming the AI developers’ explanation to be correct (if it’s not, the infringement case against them is strong), AI technology creates unprecedented challenges for copyright law –
– Proving Infringement: How can a plaintiff prove infringement when the allegedly infringing material can’t be directly observed or compared?
– Fair Use Analysis: Traditional fair use factors, such as the amount and substantiality of the portion used, become difficult to apply when the “portion used” is transformed beyond recognition.
– Substantial Similarity: The legal test of “substantial similarity” between works becomes almost meaningless in the context of LLMs.
– Expert Testimony: Courts will likely have to rely heavily on expert testimony to understand the technology, but even experts may struggle to definitively prove or disprove infringement.
For all of these reasons, to prove copyright infringement plaintiffs such as the New York Times may be limited to claiming copyright infringement based on the “intermediate” copies that are used in the training process and user-prompted output, rather than the LLM models themselves.
Conclusion
The NYT v. OpenAI case and others raising the same issue highlight a fundamental mismatch between traditional copyright law and the reality of LLM technology, a mismatch that lies at the heart of the AI industry’s fair use defense. The outcome of this case could reshape our understanding of copyright in the digital age, potentially requiring new legal tests and standards that can account for the invisible, transformed nature of information within AI systems.
Part 2 in this series will focus on the legal issues around the “input problem” of using copyrighted material for training. Part 3 will look at the “output problem” of AI-generated content that may copy or resemble copyrighted works, including what the AI industry calls “memorization.” As we’ll see, each of these issues presents its own unique challenges in the context of a technology that defies traditional legal analysis.
Being a litigation attorney can be a scary business. You’re constantly thinking about how to organize the facts to fit your theory of the case, what legal precedent you may have overlooked, discovery, trial preparation and much, much more.
With all that pressure it’s not surprising that lawyers make mistakes, and one of the scariest things in litigation practice is the risk of missing a deadline. Depending on a lawyer’s caseload it may be difficult to keep track of deadlines. There are pleading deadlines, discovery deadlines, motion-briefing deadlines and appeal deadlines, to name just a few. And with some deadlines there is absolutely no court discretion available to save you, appeal deadlines being the best example of this.
So, despite computerized docketing systems and best efforts, lawyers sometimes miss deadlines. That’s just a painful fact of life. Sometimes the courts will exercise their discretion and allow lawyers to make up a missed deadline. Many lawyers have spent many sleepless nights waiting to see if a court will overlook a missed deadline and give the lawyer a second chance.
But sometimes they won’t. A recent painful example of this is the 6th Circuit decision in RJ Control Consultants v. Multiject. That case involved a complex topic, the alleged illegal copying of computer source code. The case had been in litigation since 2016 and had already been the subject of two appeals. In other words, a lot of time and money had been invested. A glance at the docket sheet confirms this, with over 200 docket entries.
The mistake in that case was pedestrian: the court set a specific expert-disclosure deadline of February 26, 2021. By that date each party was obligated to provide its expert reports. In federal court, an expert report must include a detailed summary of the expert’s qualifications, opinions, and the information the expert relied on in forming those opinions. Fed. R. Civ. P. 26(a)(2)(B). The rule is specific and onerous, and it often takes a great deal of time and effort to prepare expert disclosures.
In the Multiject case neither party submitted expert reports by February 26, 2021. However, the real onus to do so was on the plaintiff, which bears a challenging and complex burden of proof in software copyright cases. In a software copyright case it’s up to the plaintiff’s expert to analyze the code and separate elements that may not be protected (by reason of scenes a faire and merger, for example) from those that are protected expression. As the 6th Circuit stated in an interim appeal in this case –
The technology here is complex, as are the questions necessary to establish whether that technology is properly protected under the Copyright Act. Which aspects or lines of the software code are functional? Which are expressive? Which are commonplace or standard in the industry? Which elements, if any, are inextricably intertwined?
The defendant, on the other hand, had a choice: it could submit its own expert report or just wait until it saw the plaintiff’s report. It could challenge the plaintiff’s report before trial or the plaintiff’s expert’s testimony at trial. So the defendant’s failure to file an expert report was not fatal to its case – it could wait.
The plaintiff’s expert was David Lockhart, and when the plaintiff failed to submit his report on the due date, the defendant filed a motion to exclude the report and for summary judgment. The plaintiff asked for a chance to file Lockhart’s report late, but the court showed no mercy – it denied the motion and, since the plaintiff needed an expert to establish illegal copying, granted the defendant’s motion for summary judgment.
In other words, end-of-case.
Why was the court unwilling to cut the plaintiff a break in this case? While the 6th Circuit discussed several issues justifying the denial, the one that strikes home for me is the plaintiff’s argument that they “reasonably misinterpreted” the court’s discovery order and made an “honest mistake” as to when the report was due. However, in the view of the trial judge this was not “harmless” error since it disrupted the court’s docket. The legal standard was “abuse of discretion,” and the Sixth Circuit held that the trial judge did not abuse his discretion in excluding Lockhart’s expert report after the missed deadline.
This is a sad way for a case to end, and the price is paid by the client, who likely had nothing to do with the missed deadline but whose case was dismissed as a consequence. As I mentioned, the case began in 2016, and it was heavily litigated. There are seven reported decisions on Google Scholar – an unusually large number, suggesting that a lot of time and money was invested by both sides. To make matters worse, not only did the plaintiff lose the case, but the court awarded the defendants more than $318,000 in attorneys’ fees.
Copyright secondary liability can be difficult to wrap your head around. This judge-made copyright doctrine allows copyright owners to seek damages from organizations that do not themselves engage in copyright infringement, but rather facilitate the infringing behavior of others. Often the target of these cases are internet service providers, or “ISPs.”
Secondary liability has three separate prongs: “contributory” infringement, “vicarious” infringement, and “inducement.” The third prong – inducement – is important but seen infrequently. For the elements of this doctrine see my article here.
Here’s how I outlined the elements of contributory and vicarious liability when I was teaching CopyrightX:
These copyright rules were the key issue in the Fourth Circuit’s recent blockbuster decision in Sony v. Cox Communications (4th Cir. Feb. 20, 2024).
In a highly anticipated ruling the court reversed a $1 billion jury verdict against Cox for vicarious liability but affirmed the finding of contributory infringement. The decision is a significant development in the evolving landscape of ISP liability for copyright infringement.
Case Background
Cox Communications is a large telecommunications conglomerate based in Atlanta. In addition to providing cable television and phone services it acts as an internet service provider – an “ISP” – to millions of subscribers.
The case began when Sony and a coalition of record labels and music publishers sued Cox, arguing that the ISP should be held secondarily liable for the infringing activities of its subscribers. The plaintiffs alleged that Cox users employed peer-to-peer file-sharing platforms to illegally download and share a vast trove of copyrighted music, and that Cox fell short in its efforts to control this rampant infringement.
A jury found Cox liable under both contributory and vicarious infringement theories, levying a jaw-dropping $1 billion in statutory damages – $99,830.29 for each of the 10,017 infringed works. Cox challenged the verdict on multiple fronts, contesting the sufficiency of the evidence and the reasonableness of the damages award.
The Fourth Circuit Opinion
On appeal, the Fourth Circuit dissected the two theories of secondary liability, arriving at divergent conclusions. The court sided with Cox on the issue of vicarious liability, finding that the plaintiffs failed to establish that Cox reaped a direct financial benefit from its subscribers’ infringing conduct. Central to this determination was Cox’s flat-fee pricing model, which remained constant irrespective of whether subscribers engaged in infringing or non-infringing activities. The mere fact that Cox opted not to terminate certain repeat infringers, ostensibly to maintain subscription revenue, was deemed insufficient to prove Cox directly profited from the infringement itself.
However, the court took a different stance on contributory infringement, upholding the jury’s finding that Cox materially contributed to known infringement on its network. The court was unconvinced by Cox’s assertions that general awareness of infringement was inadequate, or that a level of intent tantamount to aiding and abetting was necessary for liability to attach. Instead, the court articulated that supplying a service with the knowledge that the recipient is highly likely to exploit it for infringing purposes meets the threshold for contributory liability.
Given the lack of differentiation between the two liability theories in the jury’s damages award, coupled with the potential influence of the now-overturned vicarious liability finding on the damages calculation, the court vacated the entire award. The case now returns to the lower court for a new trial, solely to determine the appropriate measure of statutory damages for contributory infringement.
Relationship to the DMCA
This article’s header graphic illustrates the relationship between the secondary liability doctrines and the safe harbor protection of the Digital Millennium Copyright Act (DMCA), Section 512(c) of the Copyright Act. As the graphic reflects, all three theories of secondary liability lie outside the DMCA’s safe harbor protection for third-party copyright infringement. The DMCA requires that a defendant satisfy multiple safe harbor conditions (see my 2017 article – Mavrix v. LiveJournal: The Incredible Shrinking DMCA – for more on this). If a plaintiff can establish the elements of any one of the three theories of secondary liability, the defendant will violate one or more safe harbor conditions and lose DMCA protection.
Implications
The court’s decision signals a notable shift in the contours of vicarious liability for ISPs in the context of copyright infringement. By requiring a causal nexus between the defendant’s financial gain and the infringing acts themselves, the court has raised the bar for plaintiffs seeking to prevail on this theory.
The ruling underscores that simply profiting from a service that may be used for both infringing and non-infringing ends is insufficient; instead, plaintiffs must demonstrate a more direct and meaningful link between the ISP’s revenue and the specific acts of infringement. This might entail evidence of premium fees for access to infringing content or a discernible correlation between the volume of infringement and subscriber growth or retention.
While Cox may take solace in the reversal of the $1 billion vicarious liability verdict, the specter of substantial contributory infringement damages looms large as the case heads back for a retrial.
For ISPs, the ruling serves as a warning to reevaluate and fortify their repeat infringer policies, ensuring they go beyond cosmetic compliance with the DMCA’s safe harbor provisions. Proactive monitoring, prompt responsiveness to specific infringement notices, and decisive action against recalcitrant offenders will be key to mitigating liability risks.
On the other side of the equation, copyright holders may need to recalibrate their enforcement strategies, recognizing the heightened evidentiary burden for establishing vicarious liability. While the contributory infringement pathway remains viable, particularly against ISPs that display willful blindness or tacit encouragement of infringement, the Sony v. Cox decision underscores the importance of marshaling compelling evidence of direct financial benefit to support vicarious liability claims.
As this case enters its next phase, the copyright and technology communities will be focused on the outcome of the damages retrial. Regardless of the ultimate figure, the Fourth Circuit’s decision has already left a mark on the evolving landscape of online copyright enforcement.