
Can Robots.txt Files Really Stop AI Crawlers?

Sunday, February 18, 2024, 17:34, by Slashdot
In the high-stakes world of AI, 'The fundamental agreement behind robots.txt [files], and the web as a whole — which for so long amounted to 'everybody just be cool' — may not be able to keep up...' argues the Verge:

For many publishers and platforms, having their data crawled for training data felt less like trading and more like stealing. 'What we found pretty quickly with the AI companies,' says Medium CEO Tony Stubblebine, 'is not only was it not an exchange of value, we're getting nothing in return. Literally zero.' When Stubblebine announced last fall that Medium would be blocking AI crawlers, he wrote that 'AI companies have leached value from writers in order to spam Internet readers.'

Over the last year, a large chunk of the media industry has echoed Stubblebine's sentiment. 'We do not believe the current 'scraping' of BBC data without our permission in order to train Gen AI models is in the public interest,' BBC director of nations Rhodri Talfan Davies wrote last fall, announcing that the BBC would also be blocking OpenAI's crawler. The New York Times blocked GPTBot as well, months before filing a lawsuit against OpenAI alleging that OpenAI's models 'were built by copying and using millions of The Times's copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.' A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt file.

It's not just publishers, either. Amazon, Facebook, Pinterest, WikiHow, WebMD, and many other platforms explicitly block GPTBot from accessing some or all of their websites.
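The mechanism behind all of this blocking is just a few lines of plain text. Here is a minimal sketch in Python, using the standard library's urllib.robotparser and a hypothetical robots.txt modeled on the pattern these publishers serve:

    # A hypothetical robots.txt: GPTBot is fully disallowed, everyone
    # else is welcome. This mirrors the blocking pattern described above.
    from urllib.robotparser import RobotFileParser

    ROBOTS_LINES = [
        "User-agent: GPTBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Allow: /",
    ]

    parser = RobotFileParser()
    parser.parse(ROBOTS_LINES)

    # A well-behaved crawler asks before fetching.
    print(parser.can_fetch("GPTBot", "https://example.com/article"))     # False
    print(parser.can_fetch("Googlebot", "https://example.com/article"))  # True

The catch, as the rest of the article makes clear, is that can_fetch is a question the crawler has to choose to ask.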

On most of these robots.txt pages, OpenAI's GPTBot is the only crawler explicitly and completely disallowed. But there are plenty of other AI-specific bots beginning to crawl the web, like Anthropic's anthropic-ai and Google's new Google-Extended. According to a study from last fall by Originality.AI, 306 of the top 1,000 sites on the web blocked GPTBot, but only 85 blocked Google-Extended and 28 blocked anthropic-ai. There are also crawlers used for both web search and AI. CCBot, which is run by the organization Common Crawl, scours the web for search engine purposes, but its data is also used by OpenAI, Google, and others to train their models. Microsoft's Bingbot is both a search crawler and an AI crawler. And those are just the crawlers that identify themselves — many others attempt to operate in relative secrecy, making it hard to stop or even find them in a sea of other web traffic.
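Those per-bot tallies are straightforward to reproduce for any one site. A rough sketch of how such a survey might be run, using only the four crawler names mentioned above and treating 'may not fetch the site root' as 'blocked' (an assumption; example.com is a placeholder):

    # Fetch a site's live robots.txt and report which of the AI user
    # agents named above are barred from the site root.
    from urllib.robotparser import RobotFileParser

    AI_AGENTS = ["GPTBot", "Google-Extended", "anthropic-ai", "CCBot"]

    def blocked_agents(site: str) -> list[str]:
        parser = RobotFileParser()
        parser.set_url(f"https://{site}/robots.txt")
        parser.read()  # one HTTP fetch of the robots.txt file
        root = f"https://{site}/"
        return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, root)]

    # Placeholder domain; point it at a real publisher to see its policy.
    print(blocked_agents("example.com"))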

For any sufficiently popular website, finding a sneaky crawler is needle-in-haystack stuff.
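What would finding one even involve? There is no registry to consult, only traffic analysis. A heavily hedged sketch of the obvious first pass, with an entirely made-up log format (tab-separated IP, user agent, and path per line): count the clients that keep requesting paths robots.txt disallows.

    # Scan an access log for clients that hit disallowed paths anyway.
    # The log format here is hypothetical, as is the disallowed prefix.
    import collections

    DISALLOWED_PREFIXES = ("/article/",)  # assumed disallowed in robots.txt

    def suspicious_clients(log_lines):
        hits = collections.Counter()
        for line in log_lines:
            ip, user_agent, path = line.split("\t")
            if path.startswith(DISALLOWED_PREFIXES):
                hits[(ip, user_agent)] += 1
        # Heavy hitters on disallowed paths are candidates, but nothing
        # here distinguishes a stealth crawler from an eager human reader.
        return hits.most_common(10)

And that is the haystack problem in miniature: the output is a ranked list of suspects, not an answer.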

In addition, the article points out, a robots.txt file 'is not a legal document — and 30 years after its creation, it still relies on the good will of all parties involved. Disallowing a bot on your robots.txt page is like putting up a 'No Girls Allowed' sign on your treehouse — it sends a message, but it's not going to stand up in court.'

Read more of this story at Slashdot.
https://tech.slashdot.org/story/24/02/17/2029202/can-robotstxt-files-really-stop-ai-crawlers?utm_sou...
