Cloudflare introduces one-click nuke of web-scraping AI

Cloudflare on Wednesday offered web hosting customers a way to prevent AI bots from scraping website content and using the data without permission to train machine learning models.

The company said in a statement that it did so because customers dislike AI bots and “to maintain a safe internet for content creators.”

“We’re hearing clearly that customers don’t want AI bots visiting their sites, especially those that do so dishonestly. To help, we’ve added a brand new one-click feature to block all AI bots.”

There is already a somewhat effective method of blocking bots that is widely available to website owners: the robots.txt file. When the file is placed in a website's root directory, automated web crawlers are expected to notice it and abide by its directives telling them to stay away.

Given the widespread perception that generative AI is based on theft, and the many lawsuits seeking to hold AI companies accountable, companies that traffic in laundered content have willingly allowed web publishers to opt out of this theft.

Last August, OpenAI published guidance on how to block its GPTBot crawler using a robots.txt directive, presumably aware of concerns about content being scraped and used for AI training without permission. Google took similar steps the following month. Also last September, Cloudflare began offering a way to block AI bots that behave properly, and reportedly 85 percent of its customers have enabled that block.
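For reference, the directives involved are short. This is the sort of robots.txt entry OpenAI's guidance describes for GPTBot, alongside an equivalent entry for Google-Extended, the token Google introduced for opting out of AI training (bot names should be checked against each vendor's current documentation):

```
# Tell OpenAI's crawler to stay away from the entire site
User-agent: GPTBot
Disallow: /

# Tell Google not to use this site's content for AI training
User-agent: Google-Extended
Disallow: /
```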

Now, the network services biz wants to create a more robust barrier to bot entry. The internet “is now awash with these AI bots,” the company said; they visit about 39 percent of the top one million web properties served by Cloudflare.

The problem is that robots.txt, like the Do Not Track header implemented in browsers fifteen years ago to signal a privacy preference, can be ignored, usually without consequence.

And recent reports suggest that AI bots are doing just that. Amazon said last week that it was looking into evidence that bots working on behalf of AI search engine Perplexity, an AWS customer, had crawled websites, including news sites, and reproduced their content without proper attribution or permission.

Amazon's cloud customers are required to obey robots.txt, and Perplexity has been accused of failing to do so. Aravind Srinivas, CEO of the AI startup, denied that his company was secretly ignoring the file, but admitted that third-party bots used by Perplexity were the ones scraping pages against the will of webmasters.

Counterfeit crawlers

“Unfortunately, we have seen bot operators attempt to impersonate a legitimate browser by using a spoofed user agent,” Cloudflare said. “We have monitored this activity over time and are proud that our global machine learning model has consistently recognized this activity as a bot, even when operators lie about their user agent.”

According to Cloudflare, its machine learning scoring system consistently gave the disguised Perplexity bot a score below 30 between June 14 and June 27; on Cloudflare's bot score scale, lower scores indicate traffic that is “likely automated.”
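For Cloudflare customers whose plan includes Bot Management, scores like this can be acted on directly in a WAF custom rule. A minimal sketch, assuming the zone exposes the bot score fields, and with the threshold being a choice rather than a Cloudflare recommendation:

```
(cf.bot_management.score lt 30 and not cf.bot_management.verified_bot)
```

Paired with an action such as Managed Challenge, an expression like this targets low-scoring traffic while leaving verified crawlers, such as search engine bots, alone.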

This bot detection approach relies on digital fingerprinting, a technique often used to track people online and erode their privacy. Crawlers, like individual internet users, tend to stand out from the crowd through technical details that can be read from their network interactions.

These bots typically use the same tools and frameworks for automating website visits. And with a network that processes an average of 57 million requests per second, Cloudflare has enough data to determine which of these fingerprints are trustworthy.
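As an illustration of the general idea, and emphatically not Cloudflare's actual model, a spoofed user agent can be caught when the browser a request claims to be doesn't match the fingerprint its connection actually presents. A toy sketch in Python, with made-up fingerprint values standing in for real TLS fingerprints:

```python
# Hypothetical table of connection fingerprints previously observed from real browsers.
# The values are placeholders, not real JA3 hashes.
KNOWN_BROWSER_FINGERPRINTS = {
    "fp-chrome-124": "Chrome",
    "fp-firefox-126": "Firefox",
    "fp-safari-17": "Safari",
}

def looks_spoofed(claimed_user_agent: str, tls_fingerprint: str) -> bool:
    """Flag a request whose claimed browser doesn't match its observed fingerprint."""
    observed_family = KNOWN_BROWSER_FINGERPRINTS.get(tls_fingerprint)
    if observed_family is None:
        # Fingerprint matches no known browser: likely an automation tool or custom client.
        return True
    # The user agent names a browser, but the connection doesn't look like that browser.
    return observed_family.lower() not in claimed_user_agent.lower()

# A crawler claiming to be Chrome while presenting an unrecognized fingerprint.
print(looks_spoofed(
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "fp-python-requests",
))  # True -> treat as likely automated
```

At Cloudflare's scale the reference data comes from billions of observed requests rather than a hard-coded table, but the mismatch logic is the same in spirit.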

So here’s the deal: machine learning models that defend against bots scavenging content to feed AI models, available even to customers on the free plan. All customers need to do is click the Block AI scrapers and crawlers toggle in the Security -> Bots menu for a given website.

“We are concerned that some AI companies looking to circumvent rules to access content will continually adapt to evade bot detection,” Cloudflare said. “We will continue to monitor and add more bot blocks to our AI scrapers and crawlers rules and evolve our machine learning models to keep the internet a place where content creators can thrive and maintain full control over which models their content is used to train or perform inference on.” ®
