The "original sin" of AI training data has faced its strongest legal challenge. Led by two-time Pulitzer Prize winner John Carreyrou, dozens of renowned authors have recently filed a class-action lawsuit with the United States District Court for the Northern District of California, listing six major AI companies—OpenAI, Google, Meta, Anthropic, xAI, and Perplexity AI—as co-defendants. They are accused of systematically using pirated books to train large models, constituting "intentional copyright infringement." If convicted, each work could face a maximum fine of $150,000, with total compensation potentially reaching billions or even hundreds of billions of dollars.
A "Double Piracy Chain" Comes to Light
The complaint alleges that the defendant companies have operated a clear chain of infringement:
1. Acquisition via Piracy: Downloading millions of copyrighted books (including novels, non-fiction works, and academic texts) in bulk from "shadow libraries" such as LibGen and Z-Library;
2. Model Training: Using this illegally obtained data to train large models such as ChatGPT, Gemini, and Claude;
3. Commercial Monetization: Profiting through API subscriptions, enterprise services, and advertising without paying any royalties to the original authors.
The plaintiffs emphasize: "The writers' words are the foundation of AI intelligence, yet they have become free fuel." These works not only give the models their language capabilities but also shape their "depth of knowledge" and "narrative style," serving as an invisible pillar of an AI ecosystem worth tens of billions of dollars.
OpenAI Becomes the "Most Sued Company" as San Francisco's Federal Court Emerges as the Epicenter of AI Copyright Disputes
This is not the first time AI companies have been embroiled in copyright disputes, but this case has drawn particular attention because of the prominence of the plaintiffs, the breadth of the defendants named, and the clearly documented chain of alleged infringement. According to the South China Digital Economy Governance Research Center, OpenAI has faced at least 14 copyright lawsuits, making it the "most sued company" in the industry. The venue for this case, the U.S. District Court for the Northern District of California (San Francisco), has already received 25 AI-related copyright cases, more than 50% of such cases nationwide, and its ruling may set a national precedent on the legality of AI training data.
Willful Infringement vs. Fair Use: The Legal Boundary Has Yet to Be Drawn
Previously, the defendant companies have typically relied on a "fair use" defense, arguing that AI training constitutes "transformative use" and does not harm the market for the original works. In this case, however, the plaintiffs focus on the piracy itself: if the training data was obtained illegally, the fair use defense is far less likely to hold. Should the court find willful infringement, not only would the damages increase significantly, but the AI companies could also be forced to purge infringing data from their models, delete that data, or even suspend related services.
Industry Shock: The AI Training-Data Supply Chain May Be Reshaped
Regardless of the outcome, this case has already sounded the alarm:
- Leading AI companies are accelerating licensing negotiations with publishers and author associations (e.g., OpenAI's deals with the Associated Press and Shutterstock);
- Open-source model communities face mounting compliance pressure and will need to verify the legality of their training data;
- "Shadow libraries" may become a focus of enforcement, and data-acquisition toolchains will come under scrutiny.