In a newly published research report, internet infrastructure provider Cloudflare has accused AI startup Perplexity of ignoring explicit blocking instructions when scraping website content. Cloudflare said it observed Perplexity concealing its identity when attempting to scrape web pages, thereby circumventing website owners' stated preferences.

AI products such as Perplexity typically depend on large amounts of data collected from the internet, and AI startups have long scraped text, images, and videos without permission to power their products. In recent years, many websites have turned to the standard robots.txt file, which tells search engines and AI companies which pages may be crawled and which may not. So far, however, these efforts have had limited effect.
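To illustrate how a robots.txt file expresses these preferences, here is a minimal sketch using Python's standard `urllib.robotparser` module. The bot names (`ExampleAIBot`, `SomeSearchBot`), the rules, and the `example.com` URLs are hypothetical and do not come from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one AI crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeSearchBot", "https://example.com/articles/1"),
        ("SomeSearchBot", "https://example.com/private/report"),
    ]:
        print(agent, url, can_fetch(agent, url))
```

The check only works if the crawler runs it and honors the answer. robots.txt is a voluntary convention, so it states a preference rather than enforcing one.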

According to Cloudflare's analysis, Perplexity appears to get around these restrictions by changing its bot's "user agent," a string that identifies a visitor's browser, version, and device type. Cloudflare also said Perplexity changed its autonomous system number (ASN), a numeric identifier assigned to large networks on the Internet. Cloudflare observed this behavior across tens of thousands of domains and millions of requests, and said it identified the crawler by combining machine learning with network signals.

Jesse Dwyer, a spokesperson for Perplexity, rejected Cloudflare's accusations, calling the blog post "salesmanship." He added that the screenshots in the post did not show any content being accessed, and further claimed that the crawler Cloudflare described does not belong to Perplexity. Cloudflare said it first noticed the issue after customers complained that Perplexity was still scraping their sites even though they had blocked its crawler through robots.txt files.

Cloudflare's analysis indicates that Perplexity used not only its declared user agent but also, when that was blocked, a generic browser user agent impersonating Google Chrome. In response, Cloudflare decided to remove Perplexity's crawler from its verified bots list and to deploy new techniques to block its activity.
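The following sketch shows why a block based only on the user-agent string is easy to sidestep in the way Cloudflare describes. It is an illustration under stated assumptions, not Cloudflare's detection method: the bot token `ExampleAIBot`, both user-agent strings, and the helper `blocked_by_user_agent` are hypothetical.

```python
# Hypothetical token a site operator wants to block.
BLOCKED_TOKENS = ("ExampleAIBot",)

# Illustrative user-agent strings (assumptions, not taken from the report).
DECLARED_UA = "Mozilla/5.0 (compatible; ExampleAIBot/1.0; +https://example.com/bot)"
GENERIC_CHROME_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def blocked_by_user_agent(user_agent: str, blocked_tokens=BLOCKED_TOKENS) -> bool:
    """Naive filter: block a request if its user agent contains a blocked token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocked_tokens)


if __name__ == "__main__":
    # The declared crawler identity is caught by the filter.
    print(blocked_by_user_agent(DECLARED_UA))
    # The same crawler presenting a browser-like identity is not.
    print(blocked_by_user_agent(GENERIC_CHROME_UA))
```

Because the user agent is supplied by the client, a filter like this can only stop crawlers that identify themselves honestly, which is consistent with Cloudflare's stated reliance on machine learning and network signals rather than the user agent alone.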

Notably, Cloudflare has recently taken a stance against AI crawlers and launched a marketplace that allows website owners to charge AI crawlers accessing their websites. Cloudflare's CEO Matthew Prince has warned that AI is disrupting the business models of the Internet, especially the revenue models of publishers. This is not the first time Perplexity has faced allegations of unauthorized scraping; media outlets such as Wired magazine have previously accused Perplexity of copying their content.

Key points:

🌐 Cloudflare accuses Perplexity of ignoring websites' blocking instructions when scraping content.

🤖 According to Cloudflare, Perplexity bypasses website protections by changing its user agent and network identifiers.

📉 Cloudflare has launched a marketplace that lets websites charge AI crawlers for access, to help protect their content.