The AI company Anthropic recently spent millions of dollars buying and "disassembling" a large number of books to train its AI assistant Claude. The move has drawn widespread public attention and sparked heated discussion in the legal community.

According to Ars Technica, Anthropic used a controversial method to obtain training data: it took apart a large number of physical books, scanned them into digital files, and then destroyed the originals. The approach came to light in court documents, and Judge William Alsup ruled that this scanning constitutes fair use. The judge noted that Anthropic had obtained the books through legal channels, destroyed each copy immediately after scanning it, and used the digital files only internally without distributing them to the public. The ruling gives other AI companies a legal reference point for acquiring data.

Image: robot and artificial intelligence illustration. Source note: the image is AI-generated, and the image licensing service provider is Midjourney.

This strategy was inspired by the success of Google Books. Anthropic's CEO Amodei mentioned that in its early days the company considered using pirated e-books but, because of the legal risks, ultimately chose to buy second-hand books as a source of high-quality training text. "Destructive scanning" lets the company quickly and efficiently convert books into machine-readable PDF files, providing ample data for AI model training.

However, non-destructive scanning technology is already quite mature. The Internet Archive, for example, has developed a digitization method that preserves the original books. OpenAI and Microsoft have also recently partnered with Harvard University Library to digitize nearly a million public domain books while keeping the physical volumes properly preserved. Compared with these peers, Anthropic's approach looks somewhat radical, though it does open up new ideas for the AI training field.

As artificial intelligence develops, how to obtain training data while respecting intellectual property rights will remain an open question for the industry. Anthropic's attempt has caused controversy, but it also points to new possibilities for the future development of AI.