Stop AI crawlers
AI crawlers from all over the world have become a huge problem. They don't play by the robots.txt rules, so they are even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
In his 2025 MediaWiki User and Developer Workshop presentation,[1] Jeffrey Wang calls out the following approaches as inadequate:
- Fail2ban
- Nepenthes
- Varnish and caching
Defenses before MediaWiki
- WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
- Filtering reverse proxies (see the sketch after this list)
- Anubis - its README describes the solution as over-zealous, but then offers default configurations that appear to expressly allow the good guys like the Internet Archive, Bing, and Google[3].
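To make the filtering-reverse-proxy idea concrete, here is a minimal sketch for Apache httpd. The choice of Apache and the user-agent list are assumptions for illustration; a rule like this only stops crawlers that identify themselves honestly.

  # Deny requests whose User-Agent matches self-identified AI crawlers.
  # The pattern is illustrative and needs regular maintenance; crawlers
  # that spoof a browser User-Agent will slip through.
  <IfModule mod_rewrite.c>
      RewriteEngine On
      RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider|Amazonbot) [NC]
      RewriteRule .* - [F,L]
  </IfModule>

The same kind of rule can live in nginx or in a CDN/WAF ruleset; the point is to reject abusive traffic before it ever reaches PHP.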
Defenses in MediaWiki
- Lockdown extension - intended for other purposes in the "User rights" category. You can block off certain swaths of URLs, for example, but it is not designed for complex filtering (a LocalSettings.php sketch follows this list).
- StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
- AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering read traffic.
- CrawlerProtection extension - by MyWikis' Jeffrey Wang. It currently has a bug affecting MediaWiki 1.43.
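Below is a minimal LocalSettings.php sketch of how the MediaWiki-side pieces are typically wired up. It assumes the extensions are installed under extensions/ and load with wfLoadExtension(); the Lockdown rule is illustrative only, and CrawlerProtection's own settings should be taken from its documentation.

  // Append to LocalSettings.php (assumes the extensions are already installed).
  wfLoadExtension( 'Lockdown' );
  wfLoadExtension( 'AbuseFilter' );       // run maintenance/update.php afterwards to create its tables
  wfLoadExtension( 'CrawlerProtection' ); // see the extension's documentation for its settings

  // Illustrative Lockdown rule: only logged-in users (the built-in 'user'
  // group) may view page histories. Arbitrary diff URLs use the default
  // view action and are not covered by this rule.
  $wgActionLockdown['history'] = [ 'user' ];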
Problematic pages in MediaWiki
- Special pages (see the lockdown sketch after this list)
  - WhatLinksHere
  - RecentChangesLinked
- History
- Arbitrary diffs
- The 'ABCD' special pages
  - SMW
    - Ask
    - BrowseData
  - Cargo
    - CargoQuery
    - Drilldown
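One way to take these pages out of the crawl path is to restrict them to logged-in users with the Lockdown extension from the previous section, as in the sketch below. The canonical special-page names are assumptions based on the extensions named above and should be verified against each extension.

  // Append to LocalSettings.php; assumes wfLoadExtension( 'Lockdown' ) as above.
  // Keys are canonical special-page names (verify against each extension).
  foreach ( [
      'Whatlinkshere',       // Special:WhatLinksHere
      'Recentchangeslinked', // Special:RecentChangesLinked
      'Ask',                 // Semantic MediaWiki
      'BrowseData',          // drill-down browsing
      'CargoQuery',          // Cargo
      'Drilldown',           // Cargo
  ] as $specialPage ) {
      // Only members of the 'user' group (logged-in users) may view these.
      $wgSpecialPageLockdown[$specialPage] = [ 'user' ];
  }

Page histories are already covered by the $wgActionLockdown['history'] rule shown earlier; arbitrary diffs are served through the default view action, which Lockdown does not restrict, so they are better handled at the proxy/WAF layer or by an extension such as CrawlerProtection (check its documentation for exactly which endpoints it covers).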
Discussion
Solution
We track this work at https://github.com/freephile/meza/issues/156