
Can AI Learn from Books Without Breaking the Law?

  • Writer: Tuelee Anh
  • Jul 3
  • 2 min read

Updated: Oct 3


Why the Anthropic Decision Is a Turning Point for AI Training Data


As generative AI continues to disrupt industries, one question has loomed large: Can AI legally train on copyrighted data? A recent U.S. court decision involving Anthropic, the maker of the Claude AI model, has delivered the first major legal precedent.

In short: Training AI models on copyrighted content may be allowed under fair use—but storing pirated copies is not. This ruling reshapes the landscape for AI startups, enterprise LLM builders, and venture capitalists evaluating AI companies.


What Happened: Claude AI, Copyright Law, and Fair Use


Anthropic trained its Claude model using millions of copyrighted books. The court ruled that:

  • AI model training can qualify as fair use in certain contexts

  • Storing pirated or unlicensed copyrighted works is illegal


This ruling is significant because it carves out a potential legal path for AI development under U.S. copyright law—but draws a strict line around data acquisition and storage compliance.


For startups building LLMs or other generative models, this precedent highlights the legal risks of using unverified datasets.


AI Startups: Data Compliance Is No Longer Optional


Many AI startups use scraped datasets containing copyrighted works—books, lyrics, articles, or multimedia—often assuming that “public” equals “permissible.” But this ruling makes clear: how you obtain and store training data matters.


Key takeaways for startups:

  • Verify all training data sources

  • Document data collection and licensing practices

  • Avoid storing or redistributing copyrighted content without rights

  • Anticipate scrutiny of data sourcing in future fundraising, M&A diligence, or litigation
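For teams starting to formalize these practices, documenting data sources can begin with a simple structured record per source plus an automated check for gaps. The sketch below is purely illustrative, assuming a hypothetical `DataSource` record and `audit` helper; it is not drawn from any real compliance tool:

```python
# Illustrative sketch: a minimal provenance record for training-data sources.
# The DataSource fields and audit() helper are hypothetical examples,
# not part of any real compliance framework.
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str                # human-readable identifier
    origin: str              # where the data was obtained (URL, vendor, internal)
    license: str             # rights basis, e.g. "public-domain", "licensed", "unknown"
    acquired_lawfully: bool  # was the copy itself obtained legally?

def audit(sources):
    """Return sources that need legal review before use in training."""
    return [s for s in sources
            if s.license == "unknown" or not s.acquired_lawfully]

corpus = [
    DataSource("Public-domain books", "gutenberg.org", "public-domain", True),
    DataSource("Scraped novels", "unverified mirror", "unknown", False),
]
flagged = audit(corpus)
print([s.name for s in flagged])  # only the unverified source is flagged
```

Even a lightweight record like this creates the paper trail that the ruling makes valuable: it separates *what* was trained on from *how* it was obtained, which is exactly the line the court drew.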


What Investors Need to Know: New Due Diligence for AI


For venture capitalists, this ruling introduces a new layer of AI investment diligence. Legal exposure related to AI training data could materially impact a company’s valuation or risk profile.


Investors should now ask:

  • Where did the training data originate?

  • Was any copyrighted content used without proper rights?

  • Has the startup built a defensible compliance framework?

  • Are there legal audits or internal documentation of data use?


Backing companies with opaque or risky data pipelines could bring reputational and financial downside. Investors who prioritize lawful AI development will future-proof their portfolios.


The Future of Generative AI and Copyright Law


This ruling doesn’t end the debate—it ignites it. Other lawsuits involving OpenAI, Meta, and Google are still pending. But the Anthropic case sets a tone: courts are willing to recognize fair use in AI training, while penalizing illegal data practices.


What’s next:

  • Growth in AI data licensing platforms

  • More transparent, auditable training pipelines

  • Case-by-case legal guidance on fair use boundaries

  • Cross-border data governance frameworks for global AI markets


Build Responsibly: Legal, Ethical, and Scalable AI


At VinVentures, we champion founders who build responsibly, pairing innovation with intention. In AI, that means complying with copyright law, respecting data ownership, and preparing for a world of greater regulatory clarity.


Whether you're developing a foundation model, fine-tuning an industry-specific LLM, or investing in AI infrastructure, data legality is now a strategic differentiator.


References:


Capoot, A. (2025, June 24). Judge rules Anthropic did not violate authors’ copyrights with AI book training. CNBC. https://www.cnbc.com/2025/06/24/ai-training-books-anthropic.html







