Copyright problems are mounting for the artificial intelligence (AI) industry. Since Anthropic reached a $1.5 billion settlement with copyright holders, many companies have begun scrutinizing the legality of their training data. As many as 40 lawsuits over unauthorized data use are currently pending, including one in which Midjourney was sued over generating images of Superman.

Without an effective licensing system, AI companies could face copyright lawsuits on a large scale, which casts doubt on the industry's future. To address this, a group of technologists and online publishers has launched a new system called Really Simple Licensing (RSL), designed to make data licensing work at scale. The system has the backing of major online publishers such as Reddit, Quora, and Yahoo, but it remains unclear whether the industry can unite behind it and bring the major AI labs on board.

Eckart Walther, a co-founder of RSL, said the goal is to create a training-data licensing system that can be applied across the internet. He pointed out, "We need machine-readable license agreements for the Internet, and RSL is the tool that solves this problem."

For years, organizations such as the Dataset Providers Alliance have been pushing for clearer data collection practices, but RSL is the first attempt to provide actual technical and legal infrastructure. On the technical side, the RSL protocol defines specific licensing terms that publishers can attach to their content, for example requiring AI companies to obtain a custom license or allowing use under Creative Commons terms. Participating websites include these terms in their "robots.txt" files, making it easy to identify which data is covered by which terms.
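To make the "robots.txt" mechanism concrete, here is a minimal Python sketch of how a crawler might look for licensing information before collecting a page. It assumes that terms are advertised through a `License:` line pointing to a terms file; that directive name, the example URL, and the function `find_license_directives` are illustrative assumptions rather than details confirmed by the article or taken from the RSL specification.

```python
def find_license_directives(robots_txt: str) -> list[str]:
    """Return the value of every `License:` line in a robots.txt body.

    The `License:` directive name is an assumption made for illustration.
    """
    values = []
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments, then surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        # Field names in robots.txt are matched case-insensitively.
        if field.strip().lower() == "license":
            values.append(value.strip())
    return values


# Hypothetical robots.txt for a participating publisher.
SAMPLE = """\
User-agent: *
Allow: /
License: https://example.com/license.xml  # hypothetical terms file
"""

print(find_license_directives(SAMPLE))
```

A crawler that finds such a line could then fetch the referenced file and decide whether its intended use is permitted, whereas a site without one would fall back to whatever rules the crawler already applies.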

On the legal side, the RSL team has established a collective licensing organization, the RSL Collective, which negotiates terms on behalf of publishers and collects royalties, much as ASCAP does in the music industry or MPLC does in the film industry. A number of well-known publishers have already joined the collective, including Yahoo, Reddit, and Medium.

Even so, one problem remains: royalties can only be calculated if it is known exactly which training data an AI model used. For products that draw on web data in real time, such as Google's AI search summaries, tracking usage is relatively simple. But if the training process was not logged, confirming whether a specific document was used to train a large language model (LLM) becomes extremely difficult.

Despite these challenges, the creators of RSL believe that AI companies can cope with them. "They already needed to be able to report data usage in some previous licensing agreements, so it's not impossible," said Doug Leeds, another co-founder of RSL. "As long as it's good enough, people can get the compensation they deserve."

Ultimately, whether RSL succeeds depends on whether AI companies are willing to adopt it. With a growing number of AI industry leaders having called for exactly this kind of system, the RSL team hopes they will follow through.