Stop AI crawlers
AI crawlers from all over the world have become a huge problem. They don't play by the robots.txt rules, so the situation is even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
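For reference, this is what the polite opt-out looks like: a robots.txt at the web root naming some of the commonly cited AI crawler user agents. The agents and paths below are an illustrative sketch, not an exhaustive or current list, and the whole point of this page is that many of these crawlers ignore it anyway.

  # robots.txt at the web root -- advisory only; well-behaved bots honor it, many AI crawlers do not
  User-agent: GPTBot
  Disallow: /

  User-agent: ClaudeBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

  # Keep everyone, including classic search engines, away from the expensive URLs
  # (adjust the paths to your wiki's URL layout)
  User-agent: *
  Disallow: /index.php?
  Disallow: /wiki/Special: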
In his 2025 MediaWiki User and Developer Workshop presentation[1], Jeffrey Wang identifies several approaches as inadequate:
- Fail2ban
- Nepenthes
- Varnish and caching
So, what can we do?
Defenses before MediaWiki
- WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
- Filtering reverse proxies - blocking requests by user agent, IP range, or request rate before they ever reach MediaWiki (see the sketch after this list)
- Anubis - its README describes the approach as deliberately over-zealous, yet the default configuration appears to expressly allow good actors like the Internet Archive, Bing, and Google[3].
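As a sketch of what proxy-level filtering can look like, the Apache httpd snippet below denies requests whose User-Agent matches a few known AI crawlers. The user-agent list is an illustrative assumption, and aggressive crawlers can spoof or rotate their User-Agent, so treat this as one layer rather than a complete defense.

  # Apache httpd -- deny requests whose User-Agent matches known AI crawlers
  <IfModule mod_setenvif.c>
      SetEnvIfNoCase User-Agent "GPTBot|ClaudeBot|CCBot|Bytespider|PerplexityBot" bad_bot
  </IfModule>
  <Location "/">
      <RequireAll>
          Require all granted
          Require not env bad_bot
      </RequireAll>
  </Location>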
Defenses in MediaWiki
- Lockdown extension - designed mainly for other purposes in the "User Rights" category, but it is useful for disallowing anonymous reads of "heavy" pages. For example, you can block swaths of URLs, such as an entire namespace or all Special pages (see the LocalSettings.php sketch after this list). It is just not designed for complex traffic filtering.
- StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
- AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering traffic.
- CrawlerProtection extension - by MyWikis' Jeffrey Wang. It currently has a bug affecting MW 1.43.
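A minimal LocalSettings.php sketch of the Lockdown approach, assuming the extension is installed and that requiring login for these views is acceptable on your wiki; the specific actions and special pages chosen here are illustrative.

  # LocalSettings.php -- require login for a few crawler-magnet views (Extension:Lockdown)
  wfLoadExtension( 'Lockdown' );

  # Page histories: crawlers otherwise walk every old revision of every page
  $wgActionLockdown['history'] = [ 'user' ];

  # Expensive special pages, keyed by canonical name (no "Special:" prefix)
  $wgSpecialPageLockdown['Whatlinkshere'] = [ 'user' ];
  $wgSpecialPageLockdown['Export'] = [ 'user' ];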
Problematic pages in MediaWiki
- SpecialPages
  - WhatLinksHere
  - RecentChangesLinked
- History
- Arbitrary Diffs
- The 'ABCD' special pages
  - SMW
    - Ask
    - BrowseData
  - Cargo
    - CargoQuery
    - Drilldown
- SMW
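If Lockdown is the tool of choice, the query pages listed above can be named individually. This is a hedged extension of the earlier LocalSettings.php sketch; the canonical special-page names below are assumptions, so confirm them against Special:SpecialPages on your wiki before relying on them.

  # LocalSettings.php -- require login for the 'ABCD' query pages and related views
  $wgSpecialPageLockdown['Ask'] = [ 'user' ];            # SMW
  $wgSpecialPageLockdown['BrowseData'] = [ 'user' ];     # SMW
  $wgSpecialPageLockdown['CargoQuery'] = [ 'user' ];     # Cargo
  $wgSpecialPageLockdown['Drilldown'] = [ 'user' ];      # Cargo
  $wgSpecialPageLockdown['Recentchangeslinked'] = [ 'user' ];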
Discussion
Handling web crawlers on mediawiki.org provides details on various solutions, such as how to use Lockdown to at least prevent anonymous reads of heavy pages.
Solution
We tracked this work in issue 156 (https://github.com/freephile/meza/issues/156).