Training Data on Trial: AI's First Fair Use Test

Three landmark 2025 cases define the boundary between permissible AI training and copyright infringement. Federal district courts delivered the first comprehensive answers to the central question facing AI development: when does using copyrighted works to train models constitute fair use?
Paul Roberts
2025 Analysis
Copyright
Fair Use
Artificial Intelligence
The Question Before the Courts
Does training an AI model on copyrighted works qualify as fair use under 17 U.S.C. § 107? In 2025, three federal district courts addressed the question at length, and their reasoning will shape AI development for years to come.
Thomson Reuters v. Ross Intelligence
D. Del.
Legal research AI trained on competitor's database
Bartz v. Anthropic PBC
N.D. Cal.
Authors challenge LLM training on their novels
Kadrey v. Meta Platforms Inc.
N.D. Cal.
Fiction authors sue over LLaMA training dataset
These cases established divergent outcomes based on a single principle: the function-specific nature of transformation.
The Framework: Function-Specific Transformation
Courts evaluate AI training through function-specific transformation analysis. The inquiry is straightforward but consequential: Does the AI serve the same purpose as the copyrighted work it was trained on?
1
Same Function
AI serves the same purpose as the original work
Result: Not transformative, likely infringement
2
Different Function
AI serves an entirely different purpose
Result: Transformative, likely fair use
Critical insight: Technology alone doesn't transform. Purpose does. The sophistication of the AI model is irrelevant if it competes with the copyrighted work's commercial function.
Case 1: Ross Intelligence Failed
Fair Use Denied
Thomson Reuters v. Ross Intelligence Inc.
No. 1:20-cv-613-SB (D. Del. Feb. 11, 2025)
The Facts
Ross Intelligence built an AI-powered legal research tool designed to compete directly with Thomson Reuters' Westlaw service. To train its model, Ross used Westlaw's proprietary headnotes—editorial summaries of legal principles extracted from cases.
The Court's Analysis
The court found no transformation. Ross used headnotes for legal research. Westlaw created headnotes for legal research. Same function, same market, direct competition.
Training Data
Westlaw headnotes
AI Purpose
Legal research tool
Original Purpose
Legal research database
Result
Fair use denied
Ross Intelligence: The Four-Factor Analysis
The court meticulously applied the statutory fair use factors from 17 U.S.C. § 107, demonstrating how functional overlap defeats fair use even when using sophisticated AI technology.
01
Purpose and Character of Use
Commercial, non-transformative, same purpose
Ross's use was commercial and served the identical purpose as the original headnotes: enabling legal research. The court found this factor weighed against fair use.
02
Nature of Copyrighted Work
Original, but only modestly creative
The court found the headnotes original enough to be protected but not highly creative. This factor favored Ross, though the court gave it little weight.
03
Amount and Substantiality
Headnotes not exposed to users
Ross's copying was extensive, but its product did not make the headnotes available to the public. Because the inquiry focuses on what is made accessible to users, the court found this factor favored Ross.
04
Effect on Market
Direct competition, licensing market harmed
Ross meant to compete directly with Westlaw as a market substitute, and its use also affected a potential market for AI training data. Thomson Reuters had refused to license its content to Ross. The court found cognizable market harm and held this factor decisively against fair use.
Holding: Factors 1 and 4 favored Thomson Reuters and outweighed factors 2 and 3, which favored Ross. The fair use defense failed because the AI served the same commercial function as the copyrighted works used for training.
Case 2: Anthropic Succeeded
Fair Use Granted
Bartz v. Anthropic PBC
No. 3:24-cv-05417-WHA (N.D. Cal. June 23, 2025)
The Facts
Authors sued Anthropic, alleging the company trained its Claude LLM on their copyrighted novels without permission. The plaintiffs argued this constituted wholesale copying for commercial gain.
The Court's Analysis
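The court granted summary judgment for Anthropic on the training claim, finding the use transformative as a matter of law. Claude doesn't compete with novels; it extracts statistical patterns to generate new text across domains. The court did not extend that ruling to pirated copies Anthropic kept in a central library, which it treated as a separate use.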
Original Function
Novels deliver narrative, entertainment, creative expression
AI Function
Extract linguistic patterns, learn language structure, enable text generation
Result
Transformative analytical use—fair use granted
Anthropic: The Four-Factor Analysis
The Northern District of California endorsed AI training for analytical purposes, finding the training use fair as a matter of law with no need for trial on that issue.
Factor 1: Purpose and Character
Highly transformative and analytical, extracting patterns, not expressive content. This strongly favored fair use.
Factor 2: Nature of Work
The works were creative, so this factor pointed against fair use, but it did not outweigh the others.
Factor 3: Amount and Substantiality
Complete copying was necessary for training, but Claude's outputs were non-substitutive. This factor favored fair use.
Factor 4: Effect on Market
No evidence of market substitution or an established licensing market for AI training was presented. This factor strongly favored fair use.
Holding: Training was fair use. Factors 1, 3, and 4 favored Anthropic and outweighed factor 2, because analytical AI training doesn't compete with the original works' expressive function.
Case 3: Meta Succeeded
Fair Use Granted
Kadrey v. Meta Platforms Inc.
No. 3:23-cv-03417-VC (N.D. Cal. June 25, 2025)
The Facts
Authors sued Meta for training its LLaMA model on novels obtained from shadow libraries—pirated repositories of copyrighted works. Plaintiffs argued this wholesale copying from illegal sources could never constitute fair use.
The Court's Analysis
The court granted summary judgment for Meta, finding the use transformative, non-expressive, and non-substitutive. The sourcing question didn't alter the fair use calculus when the fundamental use was analytical.
Entirely new function
Statistical learning
Non-substitutive outputs
Meta: The Four-Factor Analysis
The court reached the same result as Bartz on training, holding on the record before it that sourcing from shadow libraries doesn't defeat fair use when the underlying use is transformative and non-competitive.
1
Purpose and Character of Use
Entirely new function, statistical learning
LLaMA extracts patterns to enable text generation across domains—an analytical function unrelated to experiencing novels as narrative works. Strongly favored fair use.
2
Nature of Copyrighted Work
Creative fiction, but weak force in analytical use
The court acknowledged novels are highly creative but held this factor carries minimal weight when the use extracts non-copyrightable statistical patterns rather than expressive content.
3
Amount and Substantiality Used
Complete copying necessary, outputs don't expose works
Meta copied complete novels, but LLaMA's outputs don't reproduce them in any form that would substitute for reading the originals. Favored fair use.
4
Effect on Market or Value
No displacement, no evidence, no licensing market
Plaintiffs provided no evidence of market harm—no sales data, surveys, or economic analysis. The court rejected speculation about future licensing markets as insufficient. Decisively favored fair use.
Holding: Fair use as a matter of law. The sourcing from shadow libraries didn't alter the fundamental calculus when the use was analytical and non-substitutive.
Principle 1: Transformation Is Function-Specific
The central question courts ask is deceptively simple: Does the AI serve the same purpose as the original copyrighted work? Function determines transformation, not the sophistication of the technology.
Ross Intelligence
Used headnotes for legal research
Westlaw uses headnotes for legal research
Same function → Not transformative
Anthropic & Meta
LLMs extract statistical patterns from novels
Novels deliver narrative and entertainment
Different function → Transformative
This functional test provides a clear framework: competitive substitution fails, while analytical repurposing succeeds. The AI's output capabilities and technical architecture are secondary to the fundamental question of market function.
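The functional test can be sketched as a small decision procedure. This is an illustrative Python sketch only, not a statement of how any court computes fair use: the `Use` type, its field names, and the purpose labels are hypothetical simplifications of the three cases as summarized here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Use:
    """Hypothetical summary of one training use; all labels are illustrative."""
    defendant: str
    original_purpose: str          # what the copyrighted work is for
    ai_purpose: str                # what the trained system is for
    competes_in_same_market: bool  # does the AI substitute for the original?


def is_transformative(use: Use) -> bool:
    """Simplified functional test: same purpose or direct competition
    means the use is treated as not transformative."""
    same_function = use.original_purpose == use.ai_purpose
    return not (same_function or use.competes_in_same_market)


cases = [
    Use("Ross Intelligence", "legal research", "legal research", True),
    Use("Anthropic", "narrative and entertainment", "general text generation", False),
    Use("Meta", "narrative and entertainment", "general text generation", False),
]

for case in cases:
    label = "transformative" if is_transformative(case) else "not transformative"
    print(f"{case.defendant}: {label}")
```

The sketch deliberately ignores model architecture and output quality, mirroring the point that function, not technology, drives the first factor. A real analysis weighs all four factors and cannot be reduced to a boolean.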
Principle 2: Intermediate Copying When Non-Expressive
Complete copying of copyrighted works during training is permissible when three conditions are satisfied. This principle reconciles the technical requirements of AI development with copyright's exclusive reproduction right.
Technologically Necessary
Complete copying is required to achieve the transformative purpose. Partial copying would prevent effective pattern extraction and model training.
Transformative Purpose
The copying serves an analytical function entirely different from the copyrighted work's expressive purpose.
Non-Substitutive Output
Copyrighted works are not exposed to users in substitutive form. The model generates new outputs rather than reproducing training data.
Critical Distinction
Copying in memory vs. copying in output
Memory copying for training purposes is acceptable if outputs are non-substitutive. Courts distinguish between intermediate copies necessary for computation and expressive copies that compete with originals.
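The three conditions can be read as a checklist in which every item must hold. The following Python sketch is a hypothetical illustration, assuming a simple record type (`TrainingCopy`) whose field names are invented here to mirror the three conditions; it is not a legal test.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TrainingCopy:
    """Hypothetical record of the three conditions; field names are illustrative."""
    technologically_necessary: bool
    transformative_purpose: bool
    non_substitutive_output: bool


def unmet_conditions(record: TrainingCopy) -> list[str]:
    """Return the names of conditions that are not satisfied (empty if all are met)."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# An analytical model whose outputs do not reproduce training works.
analytical_llm = TrainingCopy(True, True, True)

# A pipeline whose outputs expose training works in substitutive form.
verbatim_pipeline = TrainingCopy(True, True, False)

print(unmet_conditions(analytical_llm))
print(unmet_conditions(verbatim_pipeline))
```

Returning the list of failed conditions, rather than a single yes or no, keeps the focus on which condition needs attention, such as output filtering when the third condition fails.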
Principle 3: Market Harm Requires Evidence
Factor 4, the effect on the market, is "the single most important element of fair use," according to the Supreme Court in Harper & Row v. Nation Enterprises (1985). But speculation doesn't suffice. Courts demand empirical evidence of actual or likely market harm.
Acceptable Evidence
Sales data showing displacement
Consumer surveys demonstrating substitution
Economic analysis of market effects
Lost licensing revenue with documentation
Insufficient Evidence
Theoretical harm without data
Hypothetical future markets
Speculation about licensing
Assertions without empirical support
In both Bartz and Kadrey, plaintiffs provided no sales data, consumer surveys, or economic analysis. Courts rejected their arguments as purely speculative. The lesson: document market effects or expect to lose on Factor 4.
Principle 4: Creative Nature Has Diminished Weight
Under traditional fair use analysis, the creative nature of a work weighs against fair use. But this weight diminishes substantially when the use is analytical rather than expressive.
Traditional Analysis
Creative works like novels, photographs, and music receive stronger copyright protection than factual works. Using creative works typically weighs against fair use under Factor 2.
AI Training Context
When AI extracts statistical patterns—non-copyrightable elements—from creative works, the creativity of the source matters less. The use is analytical, not expressive.
The Input
Fiction is highly creative and deserves strong protection
The Use
Learning language patterns is analytical, not expressive
The Result
Factor 2 carries minimal weight in AI training cases
This principle reflects a fundamental insight: copyright protects expression, not the underlying ideas, facts, or patterns that can be extracted through analytical methods.
The Divergence: Licensing Markets
The three cases diverged sharply on Factor 4 based on a single distinction: established versus hypothetical licensing markets. This difference proved dispositive.
Ross Intelligence
Recognized Derivative Market
Ross sought a license from Thomson Reuters for Westlaw headnotes. Thomson Reuters refused, asserting its right to control derivative works.
Court's holding: Potential licensing market exists and is cognizable. Market harm is real, not speculative.
Bartz & Kadrey
Rejected Hypothetical Markets
No established licensing practice for AI training. No industry standards. No evidence of functioning markets.
Court's holding: Purely speculative future markets are insufficient. Without empirical evidence, Factor 4 favors fair use.
The practical lesson: courts distinguish between licensing markets that exist today (evidenced by actual negotiations, industry practice, and documented harm) and licensing markets that might exist tomorrow (theoretical frameworks without empirical support).
Open Questions
These three cases established a framework, but significant questions remain unresolved. Courts, practitioners, and policymakers continue to grapple with edge cases and emerging issues.
Shadow Libraries
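Does sourcing training data from pirated content affect the fair use analysis? Kadrey held that it did not defeat fair use on the record before it, while Bartz treated pirated copies kept in a permanent library as a separate, unexcused use. The question may resurface if plaintiffs can demonstrate that legitimate licensing markets were bypassed.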
Emergent Licensing Markets
When do hypothetical markets become cognizable? If publishers establish functioning licensing mechanisms for AI training, courts may recognize these markets under Factor 4.
Hybrid Pipelines
What if training is analytical but outputs occasionally reproduce text verbatim? Courts may distinguish between systems designed to avoid reproduction and those that permit substantial memorization.
Congressional Action
Will legislation intervene with safe harbors, compulsory licensing regimes, or transparency requirements? The framework established by these cases may inform statutory reforms.
The Boundary Defined
These three cases sketch a boundary between permissible and impermissible AI training, subject to appellate review. The distinction is functional, not technological.
1
Competitive Substitution Fails
Ross Intelligence: Built legal research AI to compete with legal research database
Same market → Same function → No transformation → Fair use denied
2
Analytical Repurposing Succeeds
Anthropic & Meta: LLMs extract patterns from novels to generate new, non-substitutive outputs
Different market → Different function → Transformation → Fair use granted
This boundary reflects copyright's fundamental purpose: protecting creative markets while permitting analytical uses that generate new value without competing with the original work's commercial function.
Practical Guidance: For AI Developers
AI developers should approach training with a clear fair use strategy. These four principles improve your position if challenged.
Focus on Function
Ensure your AI serves a different purpose than the copyrighted works used for training. Document that your model provides analytical capabilities, not substitutes for experiencing the original content.
Be Transparent
Document your training process thoroughly. Demonstrate that you're learning patterns, not replacing originals. Transparency strengthens your fair use defense and builds credibility with courts.
Prepare Market Analysis
Collect data showing your AI doesn't substitute for copyrighted works. Track whether users still purchase or access originals. Build an empirical record that defeats speculation about market harm.
Consider Sourcing Strategy
Build relationships with publishers. Explore licensing when available. Use openly licensed materials where possible. While sourcing may not determine fair use, it demonstrates good faith and reduces litigation risk.
Practical Guidance: For Creators
Copyright owners should understand both the protections they retain and the analytical uses courts will permit. Strategic engagement beats blanket opposition.
Your Market Is Protected
Fair use does not protect an AI system that substitutes for your creative work in its own market. If an AI competes with your work's function, you have a strong infringement claim.
Analytical Use Is Different
Others can learn from your work and extract non-copyrightable patterns. This includes AI training for analytical purposes that don't compete with your market. Opposition to all training may prove futile.
Consider Your Options
Opposition, licensing, or embrace—each represents a valid strategy. Some creators oppose all AI training. Others negotiate licenses. Still others embrace AI as a distribution channel. Choose thoughtfully based on your goals.
Practical Guidance: For Advisors
Legal advisors must apply this emerging framework while acknowledging its continued evolution. Four principles guide effective counseling.
01
Apply the Functional Test
Ask whether the AI serves the same function as the copyrighted works used for training. If yes, fair use is unlikely. If no, fair use becomes viable.
02
Demand Evidence
Build an empirical record on market effects. Speculation loses. Sales data, consumer surveys, and economic analysis win. Document everything.
03
Think Long-Term
These are district court cases. Appeals are coming. The framework will evolve through circuit and Supreme Court review. Advise clients to prepare for a multi-year development process.
04
Advise with Humility
The law continues to develop. Bright-line rules remain elusive. Effective counseling acknowledges uncertainty while providing actionable guidance based on current doctrine.
The Takeaway
Function-Specific Transformation
Three cases, one framework. Courts apply the four-factor fair use test to AI training through the lens of function-specific transformation analysis.
Competitive Substitution Fails
AI that serves the same commercial function as the copyrighted works used for training does not qualify for fair use
Analytical Repurposing Succeeds
AI that extracts patterns for analytical purposes without market substitution qualifies for fair use
Evidence Beats Speculation
Market harm requires empirical proof—sales data, surveys, economic analysis—not theoretical assertions
Fair use adapts to AI without statutory changes. Courts apply the existing four-factor test, with transformation and market effect dominating the analysis. The framework is established. Now comes the refinement through appellate review and continued development at the intersection of copyright law and artificial intelligence.