Recently, Oregon author Elizabeth Lyon filed a class-action lawsuit against Adobe, accusing the company of training a small language model called SlimLM on an unlawful dataset that contained pirated copies of her works.

SlimLM is a series of lightweight language models developed by Adobe, optimized for on-device document assistance tasks such as summarization, rewriting, and question answering. Adobe has stated that the model was pre-trained on the SlimPajama-627B dataset, a publicly available, deduplicated, multi-source corpus released by AI chip company Cerebras in June 2023.

However, Lyon's complaint contends that SlimPajama is in fact a derivative of the RedPajama dataset, which in turn directly copied the notorious Books3 dataset. Books3 contains about 191,000 copyrighted books and has long been accused of drawing heavily on pirated online sources, such as the shadow-library torrent tracker Bibliotik. The complaint emphasizes: "Since SlimPajama is a derivative of RedPajama, it includes content from Books3, including the copyrighted works of the plaintiff and class members."

Lyon is the author of several nonfiction writing guides, which are allegedly among the works used for training without permission. She accuses Adobe of using her text for commercial AI product development without authorization, attribution, or payment, in violation of the exclusive rights that copyright law grants to authors.

This is not an isolated incident. Books3 and RedPajama have become recurring subjects of AI-related copyright litigation:

- In September 2025, Apple was sued for allegedly using Books3 to train its Apple Intelligence models;

- That same month, Anthropic reached a $1.5 billion settlement with a group of authors over similar allegations, a deal widely regarded as a milestone in AI copyright cases;

- In October, Salesforce was likewise accused of relying on RedPajama to train its AI systems.

As generative AI grows ever more dependent on massive text corpora, the legality of training data has evolved from a moral controversy into a legal minefield. The lawsuit against Adobe once again highlights an industry-wide dilemma: even when they build on "open-source" datasets, downstream developers may still bear joint liability if the underlying sources contain infringing content.

In the shadow of Anthropic's costly settlement, how Adobe responds to this lawsuit may shape how seriously the entire AI industry treats training-data provenance and compliance review. For content creators, the case is not only about protecting their rights; it is also a pivotal test of who owns creative value in the AI era.