Stop AI crawlers
AI crawlers from all over the world have become a huge problem. They do not play by the robots.txt rules, so the situation is even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
In his 2025 MediaWiki User and Developer Workshop presentation[1], Jeffrey Wang identifies several approaches as inadequate:
- Fail2ban
- Nepenthes
- Varnish and caching
So, what can we do?
Defenses before MediaWiki
- WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
- Filtering reverse proxies
- Anubis - the README warns that the solution can be over-zealous, but its default configurations appear to expressly allow the good guys such as the Internet Archive, Bing, and Google[3].
Defenses in MediaWiki
- Lockdown extension - designed mainly for other purposes in the category of "User Rights", but useful for disallowing anonymous reads of "heavy" pages. For example, you can block whole swaths of URLs, such as an entire namespace (e.g., all Special pages); see the sketch after this list. It is just not designed for complex filtering.
- StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
- AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering traffic.
- CrawlerProtection extension - by MyWikis' Jeffrey Wang. It currently has a bug affecting MW 1.43.
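As a rough illustration of the Lockdown approach above, the LocalSettings.php sketch below requires login to read an entire namespace. The choice of NS_CATEGORY and of the built-in 'user' group are assumptions made for the example, not recommendations from the presentation.

```php
# Sketch only: assumes the Lockdown extension is installed and that
# requiring login for these reads is acceptable on your wiki.
wfLoadExtension( 'Lockdown' );

# Restrict reading everything in one "heavy" namespace to logged-in users.
# NS_CATEGORY is only an illustrative choice; pick the namespaces that
# actually generate expensive queries on your wiki.
$wgNamespacePermissionLockdown[NS_CATEGORY]['read'] = [ 'user' ];
```

Because Lockdown works at the permission layer, this blocks crawler reads without touching the web server, but it cannot tell a crawler apart from a human visitor who simply is not logged in.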
Problematic pages in MediaWiki
- Special pages
  - WhatLinksHere
  - RecentChangesLinked
- History
- Arbitrary Diffs
- The 'ABCD' special pages
- SMW
  - Ask
  - BrowseData
- Cargo
  - CargoQuery
  - Drilldown
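A hedged sketch of how the Lockdown extension could fence off these entry points follows. The special-page names are copied from the list above; the canonical names and the 'user' group requirement are assumptions that should be checked against the extensions actually installed on your wiki.

```php
# Sketch only: assumes wfLoadExtension( 'Lockdown' ); has already been called
# and that requiring login for these pages is acceptable on your wiki.

# Expensive special pages that crawlers tend to hammer. The canonical
# names below are assumptions; verify them on Special:SpecialPages.
foreach ( [ 'Whatlinkshere', 'Recentchangeslinked',
            'Ask', 'BrowseData', 'CargoQuery', 'Drilldown' ] as $specialPage ) {
    $wgSpecialPageLockdown[$specialPage] = [ 'user' ];
}

# Page history views, which expose every old revision to crawlers.
$wgActionLockdown['history'] = [ 'user' ];
```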
Discussion
The Handling web crawlers page provides details on various solutions, such as using Lockdown to at least prevent anonymous reads of heavy pages.
Solution
We tracked this work in issue 156.