
Why Doesn't Mataroa Block AI Scrapers?

Hacker News

Mataroa, a platform advocating for an independent web, explains its current inability to reliably block AI scrapers due to the inherent difficulties and potential unreliability of existing methods like robots.txt, computational challenges, and third-party services like Cloudflare.

About 1 month ago


Why doesn’t mataroa block AI scrapers? — Blog of Mataroa.blog


Given mataroa’s somewhat polemic platform methodology in defense of an independent web, some people have reached out to ask: Why doesn't mataroa block AI scrapers?

Blocking AI scrapers, crawlers, and LLMs in general is something we do not currently offer but may add in the future. The problem is that it’s hard to do this reliably, and we'd rather not add something that works only half of the time.

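We classify methods for blocking AI scrapers into three main categories: robots.txt rules, JavaScript-based computational challenges, and third-party services such as Cloudflare.

Each has its own limitations, but they all share the same root problem.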

Robots.txt doesn’t block AI scrapers by itself. Companies have to intentionally read it and follow its rules. Reputable companies that are scrutinized by everyone adhere to these rules; new companies might not. More importantly, we can’t know if they follow a website’s robots.txt rules.
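To illustrate why robots.txt is advisory rather than enforced, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleAIBot` user agent, and the `example.com` URL are hypothetical and chosen only for illustration. Note where the check runs: inside the crawler, not on the server.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks one (made-up) AI crawler to stay away.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return what robots.txt *asks* of the given user agent for this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


URL = "https://example.com/blog/some-post/"  # hypothetical URL

# A well-behaved crawler runs this check itself and backs off when told to.
ai_bot_allowed = is_allowed(ROBOTS_TXT, "ExampleAIBot", URL)
other_bot_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", URL)

# Nothing on the server side enforces the answer: a crawler that never
# calls is_allowed() can request the page anyway.
```

Because the decision is made entirely by the client, the server cannot tell a crawler that read the file and obeyed it from one that never looked.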

The second category of tools works by assuming that an AI scraper will not spend much computational power on any single website. These tools add a JavaScript-based computational challenge that costs a normal human user a negligible couple of seconds of waiting, but that is prohibitive for a robot that plans to crawl millions of websites. However, a robot may spend more time on a particular website that has been targeted, if it or its operator so desires.
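The idea behind such a challenge can be sketched as a hash-based proof of work. This is an assumption about the general technique, not a description of any particular tool: real tools run the solving step as JavaScript in the visitor's browser, while the sketch below models both sides in Python. The function names, the challenge string, and the difficulty value are all hypothetical.

```python
import hashlib
import itertools


def _meets_target(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """True if SHA-256(challenge:nonce) has at least difficulty_bits leading zero bits."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


def solve(challenge: str, difficulty_bits: int) -> int:
    """Client side: brute-force the smallest nonce that meets the target.

    Expected work is about 2**difficulty_bits hash evaluations.
    """
    for nonce in itertools.count():
        if _meets_target(challenge, nonce, difficulty_bits):
            return nonce


def verify(challenge: str, nonce: int, difficulty_bits: int) -> bool:
    """Server side: a single hash evaluation checks the client's work."""
    return _meets_target(challenge, nonce, difficulty_bits)


# Hypothetical values: a per-visitor challenge string and a low difficulty
# so the sketch finishes quickly.
CHALLENGE = "visitor-session-123"
DIFFICULTY_BITS = 12

nonce = solve(CHALLENGE, DIFFICULTY_BITS)
accepted = verify(CHALLENGE, nonce, DIFFICULTY_BITS)
```

The asymmetry is the point: solving costs many hashes while verifying costs one, so the price is trivial for a single visit and adds up across millions of pages. It also shows the limitation described above, since a crawler that cares enough about one site can simply pay the cost.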

The third category is Cloudflare, a company through which ~20% of all web traffic passes. It’s probably the most effective solution, yet they too report observing AI companies attempting (and succeeding at) circumventing their checks. Additionally, we are reluctant to adopt a dependency on Cloudflare, one of the companies that contribute to the centralisation of the web.

At the end of the day, the feat itself (blocking some requests but not others) is in tension with how the internet and the web were designed. This is yet another mark of what is already common knowledge: our needs have outgrown the design of decades-old protocols.

The crisis consists precisely in the fact that the old is dying and the new cannot be born: in this interregnum, the most varied morbid phenomena occur.
— Antonio Gramsci, Quaderni del carcere, Quaderno 3, §34, c. 1930.
