With the rapid development of artificial intelligence, the provenance of training data for large AI models has become a focal point for the industry. Many well-known companies appear to have built their models on large amounts of unauthorized copyrighted content. This "secret formula" has sparked intense legal debate and placed Silicon Valley's tech giants at the center of controversy.
In 2023, The New York Times sued OpenAI and Microsoft, firing the opening shot in this legal battle. Soon after, Meta faced a class-action lawsuit alleging that its Llama model was trained on pirated books, and Anthropic faced similar allegations over the training data behind its Claude model. Nearly every major player now confronts the same legal question: does using copyrighted works without authorization as AI training data qualify as "fair use"?
In June 2025, the court's ruling in the Anthropic case sent an important signal: while model training itself may be regarded as a highly "transformative" use, sourcing the data from pirated copies makes infringement claims nearly impossible to avoid. Anthropic reportedly faces potential damages of up to $75 billion, a prospect that has unnerved every AI company.
To feed their demand for data, major model companies have adopted a variety of "creative" acquisition methods, some of them skirting the edge of the law. OpenAI, for example, has used web crawlers to scrape online content at scale, allegedly stripping copyright notices in the process. And as high-quality text resources dwindled, AI companies turned to other formats, extracting data from videos and printed books by technical means.
Some companies have gone further and used pirated books directly: Meta was accused of training its Llama model on pirated books drawn from a "shadow library". More conservative companies such as Apple, by contrast, avoid legal risk by relying on licensed materials and their own data.
As the litigation has progressed, copyright holders' strategies have shifted: the focus is no longer on how AI uses the data, but on whether the data was acquired legally. Court rulings indicate that while AI training activity may not itself constitute direct infringement, the use of pirated sources will be punished severely.
The AI industry now faces an unprecedented copyright war, and how to innovate while staying within legal boundaries has become an urgent question for tech giants.