The artificial intelligence company Anthropic has recently drawn public attention for its unusual method of digitizing books. According to Ars Technica, Anthropic spent millions of dollars buying large numbers of physical books to train its AI assistant Claude, then disassembled and scanned them into digital files. The original books were discarded after scanning.

Court documents revealed that in February 2024 Anthropic hired Tom Turvey, who had previously worked on Google Books, and made him responsible for "acquiring books from around the world." The hire was clearly meant to draw on the model of Google's book digitization project, which a court had recognized as fair use.

Judge William Alsup ruled that Anthropic's scanning method constitutes fair use because the books were legally purchased, the print copies were destroyed immediately after scanning, and the digital files were used only internally and not distributed externally. He noted that the conversion can be seen as a "space-saving" change of format, which gives it the "transformative" character that fair use requires. However, the company's earlier use of pirated books complicated its overall legal position.

AI training requires a large amount of high-quality text data. Building a large language model involves feeding billions of words into neural networks so that they learn the relationships between words and concepts. Because the quality of the data directly affects the accuracy of the model's output, many AI companies are eager to obtain high-quality published content and are usually unwilling to spend time negotiating for rights.

The "first sale doctrine" in the United States allows buyers to do as they see fit with physical books they own, which makes buying books a legal "workaround." Anthropic, however, initially chose to bypass copyright issues altogether, sometimes even using pirated e-books. After weighing the legal risks, the company began seeking safer alternatives and ultimately decided to purchase secondhand books, both to obtain high-quality training text and to simplify the licensing question.

To accelerate digitization, Anthropic adopted a "destructive scanning" approach: it purchased books in large quantities, took them apart, trimmed the pages, and scanned them in bulk into machine-readable PDF files, at a cost of millions of dollars. Mature non-destructive scanning techniques exist, such as the digitization methods developed by the Internet Archive that preserve the original books, and Anthropic's choice to destroy the books has sparked widespread discussion.

Key Points:

📚 Anthropic spent millions of dollars buying physical books and converted them into digital files through disassembly and scanning for training the AI assistant Claude.  

⚖️ The judge ruled that its scanning method constitutes fair use, as the books were legally purchased and destroyed after scanning.  

🔄 AI training requires a large amount of high-quality text data, and Anthropic accelerated the digitization of books through "destructive scanning."