
Can AI Learn from Books Without Breaking the Law?

  • Writer: Tuelee Anh
  • Jul 3
  • 2 min read

Updated: Oct 3


Why the Anthropic Decision Is a Turning Point for AI Training Data


As generative AI continues to disrupt industries, one question has loomed large: Can AI legally train on copyrighted data? A recent U.S. court decision involving Anthropic, the maker of the Claude AI model, has delivered the first major legal precedent.

In short: Training AI models on copyrighted content may be allowed under fair use—but storing pirated copies is not. This ruling reshapes the landscape for AI startups, enterprise LLM builders, and venture capitalists evaluating AI companies.


What Happened: Claude AI, Copyright Law, and Fair Use


Anthropic trained its Claude model using millions of copyrighted books. The court ruled that:

  • AI model training can qualify as fair use in certain contexts

  • Storing pirated or unlicensed copyrighted works is illegal


This ruling is significant because it carves out a potential legal path for AI development under U.S. copyright law—but draws a strict line around data acquisition and storage compliance.


For startups building LLMs or other generative models, this precedent highlights the legal risks of using unverified datasets.


AI Startups: Data Compliance Is No Longer Optional


Many AI startups use scraped datasets containing copyrighted works—books, lyrics, articles, or multimedia—often assuming that “public” equals “permissible.” But this ruling makes clear: how you obtain and store training data matters.


Key takeaways for startups:

  • Verify all training data sources

  • Document data collection and licensing practices

  • Avoid storing or redistributing copyrighted content without rights

  • Anticipate scrutiny of data sourcing in future fundraising, M&A diligence, or litigation
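For teams starting to formalize these practices, documenting data sources can begin with a simple structured record per source plus an automated check for gaps. The sketch below is purely illustrative, assuming a hypothetical `DataSource` record and `audit` helper; it is not drawn from any real compliance tool:

```python
# Illustrative sketch: a minimal provenance record for training-data sources.
# The DataSource fields and audit() helper are hypothetical examples,
# not part of any real compliance framework.
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str                # human-readable identifier
    origin: str              # where the data was obtained (URL, vendor, internal)
    license: str             # rights basis, e.g. "public-domain", "licensed", "unknown"
    acquired_lawfully: bool  # was the copy itself obtained legally?

def audit(sources):
    """Return sources that need legal review before use in training."""
    return [s for s in sources
            if s.license == "unknown" or not s.acquired_lawfully]

corpus = [
    DataSource("Public-domain books", "gutenberg.org", "public-domain", True),
    DataSource("Scraped novels", "unverified mirror", "unknown", False),
]
flagged = audit(corpus)
print([s.name for s in flagged])  # only the unverified source is flagged
```

Even a lightweight record like this creates the paper trail that the ruling makes valuable: it separates *what* was trained on from *how* it was obtained, which is exactly the line the court drew.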


What Investors Need to Know: New Due Diligence for AI


For venture capitalists, this ruling introduces a new layer of AI investment diligence. Legal exposure related to AI training data could materially impact a company’s valuation or risk profile.


Investors should now ask:

  • Where did the training data originate?

  • Was any copyrighted content used without proper rights?

  • Has the startup built a defensible compliance framework?

  • Are there legal audits or internal documentation of data use?


Backing companies with opaque or risky data pipelines could bring reputational and financial downside. Investors who prioritize lawful AI development will future-proof their portfolios.


The Future of Generative AI and Copyright Law


This ruling doesn’t end the debate—it ignites it. Other lawsuits involving OpenAI, Meta, and Google are still pending. But the Anthropic case sets a tone: courts are willing to recognize fair use in AI training, while penalizing illegal data practices.


What’s next:

  • Growth in AI data licensing platforms

  • More transparent, auditable training pipelines

  • Case-by-case legal guidance on fair use boundaries

  • Cross-border data governance frameworks for global AI markets


Build Responsibly: Legal, Ethical, and Scalable AI


At VinVentures, we champion founders who build responsibly, pairing innovation with intention. In AI, that means complying with copyright law, respecting data ownership, and preparing for a world of greater regulatory clarity.


Whether you're developing a foundation model, fine-tuning an industry-specific LLM, or investing in AI infrastructure, data legality is now a strategic differentiator.


References:


Capoot, A. (2025, June 24). Judge rules Anthropic did not violate authors’ copyrights with AI book training. CNBC. https://www.cnbc.com/2025/06/24/ai-training-books-anthropic.html







