Stop AI crawlers
AI crawlers from all over the world have become a huge problem. They don't play by the robots.txt rules, so the situation is even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
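For reference, this is what the polite opt-out looks like: a robots.txt at the web root naming some of the commonly cited AI crawler user agents. The agents and paths below are an illustrative sketch, not an exhaustive or current list, and the whole point of this page is that many of these crawlers ignore it anyway.

  # robots.txt at the web root -- advisory only; well-behaved bots honor it, many AI crawlers do not
  User-agent: GPTBot
  Disallow: /

  User-agent: ClaudeBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

  # Keep everyone, including classic search engines, away from the expensive URLs
  # (adjust the paths to your wiki's URL layout)
  User-agent: *
  Disallow: /index.php?
  Disallow: /wiki/Special: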
In his 2025 MediaWiki User and Developer Workshop presentation[1], Jeffrey Wang identifies several approaches as inadequate:
- Fail2ban
- Nepenthes
- Varnish and caching
So, what can we do?
Defenses before MediaWiki
- WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
- Filtering reverse proxies - blocking requests by user agent, IP range, or request rate before they ever reach MediaWiki (see the sketch after this list)
- Anubis - its README describes the approach as deliberately over-zealous, yet the default configuration appears to expressly allow good actors like the Internet Archive, Bing, and Google[3].
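As a sketch of what proxy-level filtering can look like, the Apache httpd snippet below denies requests whose User-Agent matches a few known AI crawlers. The user-agent list is an illustrative assumption, and aggressive crawlers can spoof or rotate their User-Agent, so treat this as one layer rather than a complete defense.

  # Apache httpd -- deny requests whose User-Agent matches known AI crawlers
  <IfModule mod_setenvif.c>
      SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot" bad_bot
  </IfModule>
  <Location "/">
      <RequireAll>
          Require all granted
          Require not env bad_bot
      </RequireAll>
  </Location>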
Defenses in MediaWiki
- Lockdown extension - designed mainly for other purposes in the "User Rights" category, but it is useful for disallowing anonymous reads of "heavy" pages. For example, you can block swaths of URLs, such as an entire namespace or all Special pages (see the LocalSettings.php sketch after this list). It is just not designed for complex traffic filtering.
- StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
- AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering traffic.
- CrawlerProtection extension - by MyWikis' Jeffrey Wang. It currently has a bug affecting MW 1.43.
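A minimal LocalSettings.php sketch of the Lockdown approach, assuming the extension is installed and that requiring login for these views is acceptable on your wiki; the specific actions and special pages chosen here are illustrative.

  # LocalSettings.php -- require login for a few crawler-magnet views (Extension:Lockdown)
  wfLoadExtension( 'Lockdown' );

  # Page histories: crawlers otherwise walk every old revision of every page
  $wgActionLockdown['history'] = [ 'user' ];

  # Expensive special pages, keyed by canonical name (no "Special:" prefix)
  $wgSpecialPageLockdown['Whatlinkshere'] = [ 'user' ];
  $wgSpecialPageLockdown['Export'] = [ 'user' ];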
Problematic pages in MediaWiki
- SpecialPages
  - WhatLinksHere
  - RecentChangesLinked
- History
- Arbitrary Diffs
- The 'ABCD' special pages
  - SMW
    - Ask
    - BrowseData
  - Cargo
    - CargoQuery
    - Drilldown
- SMW
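If Lockdown is the tool of choice, the query pages listed above can be named individually. This is a hedged extension of the earlier LocalSettings.php sketch; the canonical special-page names below are assumptions, so confirm them against Special:SpecialPages on your wiki before relying on them.

  # LocalSettings.php -- require login for the 'ABCD' query pages and related views
  $wgSpecialPageLockdown['Ask'] = [ 'user' ];            # SMW
  $wgSpecialPageLockdown['BrowseData'] = [ 'user' ];     # SMW
  $wgSpecialPageLockdown['CargoQuery'] = [ 'user' ];     # Cargo
  $wgSpecialPageLockdown['Drilldown'] = [ 'user' ];      # Cargo
  $wgSpecialPageLockdown['Recentchangeslinked'] = [ 'user' ];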
Discussion
Handling web crawlers on mediawiki.org provides details on various solutions, such as how to use Lockdown to at least prevent anonymous reads of heavy pages.
Solution
We tracked this work in issue 156 (https://github.com/freephile/meza/issues/156).