Stop AI crawlers
AI crawlers from all over the world have become a huge problem. They don't play by the robots.txt rules, so they are even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
In his 2025 MediaWiki User and Developer Workshop presentation,[1] Jeffrey Wang calls out the following approaches as inadequate:
- Fail2ban
- Nepenthes
- Varnish and caching
Defenses before MediaWiki
- WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
- Filtering reverse proxies (see the sketch after this list)
- Anubis - its README describes the solution as over-zealous, but then offers default configurations that appear to expressly allow the good guys like the Internet Archive, Bing, and Google[3].
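To make the filtering-reverse-proxy idea concrete, here is a minimal sketch for Apache httpd. The choice of Apache and the user-agent list are assumptions for illustration; a rule like this only stops crawlers that identify themselves honestly.

  # Deny requests whose User-Agent matches self-identified AI crawlers.
  # The pattern is illustrative and needs regular maintenance; crawlers
  # that spoof a browser User-Agent will slip through.
  <IfModule mod_rewrite.c>
      RewriteEngine On
      RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot|Bytespider|Amazonbot) [NC]
      RewriteRule .* - [F,L]
  </IfModule>

The same kind of rule can live in nginx or in a CDN/WAF ruleset; the point is to reject abusive traffic before it ever reaches PHP.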
Defenses in MediaWiki
- Lockdown extension - intended for other purposes in the "User rights" category. You can block off certain swaths of URLs, for example, but it is not designed for complex filtering (a LocalSettings.php sketch follows this list).
- StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
- AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering read traffic.
- CrawlerProtection extension - by MyWikis' Jeffrey Wang. It currently has a bug affecting MediaWiki 1.43.
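Below is a minimal LocalSettings.php sketch of how the MediaWiki-side pieces are typically wired up. It assumes the extensions are installed under extensions/ and load with wfLoadExtension(); the Lockdown rule is illustrative only, and CrawlerProtection's own settings should be taken from its documentation.

  // Append to LocalSettings.php (assumes the extensions are already installed).
  wfLoadExtension( 'Lockdown' );
  wfLoadExtension( 'AbuseFilter' );       // run maintenance/update.php afterwards to create its tables
  wfLoadExtension( 'CrawlerProtection' ); // see the extension's documentation for its settings

  // Illustrative Lockdown rule: only logged-in users (the built-in 'user'
  // group) may view page histories. Arbitrary diff URLs use the default
  // view action and are not covered by this rule.
  $wgActionLockdown['history'] = [ 'user' ];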
Problematic pages in MediaWiki
- Special pages (see the lockdown sketch after this list)
  - WhatLinksHere
  - RecentChangesLinked
- History
- Arbitrary diffs
- The 'ABCD' special pages
  - SMW
    - Ask
    - BrowseData
  - Cargo
    - CargoQuery
    - Drilldown
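One way to take these pages out of the crawl path is to restrict them to logged-in users with the Lockdown extension from the previous section, as in the sketch below. The canonical special-page names are assumptions based on the extensions named above and should be verified against each extension.

  // Append to LocalSettings.php; assumes wfLoadExtension( 'Lockdown' ) as above.
  // Keys are canonical special-page names (verify against each extension).
  foreach ( [
      'Whatlinkshere',       // Special:WhatLinksHere
      'Recentchangeslinked', // Special:RecentChangesLinked
      'Ask',                 // Semantic MediaWiki
      'BrowseData',          // drill-down browsing
      'CargoQuery',          // Cargo
      'Drilldown',           // Cargo
  ] as $specialPage ) {
      // Only members of the 'user' group (logged-in users) may view these.
      $wgSpecialPageLockdown[$specialPage] = [ 'user' ];
  }

Page histories are already covered by the $wgActionLockdown['history'] rule shown earlier; arbitrary diffs are served through the default view action, which Lockdown does not restrict, so they are better handled at the proxy/WAF layer or by an extension such as CrawlerProtection (check its documentation for exactly which endpoints it covers).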
Discussion
Solution
We track this work at https://github.com/freephile/meza/issues/156