Stop AI crawlers

AI crawlers from all over the world have become a huge problem. They don't play by the robots.txt rules, so the situation is even worse than old-school indexing by Bing, Google, and Yahoo, which was bad enough.
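
For reference, the robots.txt rules in question look like the sketch below. The user agents shown (GPTBot, CCBot, Bytespider) are real AI crawler tokens, but, as this page argues, many such bots simply ignore these directives.

  # robots.txt - asks AI crawlers to stay out of the whole site
  # (only helps against bots that actually honor robots.txt)
  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Bytespider
  Disallow: /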

In his 2025 MediaWiki User and Developer Workshop presentation,[1] Jeffrey Wang describes several common approaches as inadequate:

  • Fail2ban (a typical jail configuration is sketched after this list)
  • Nepenthes
  • Varnish and caching
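
To make the first item concrete, a typical Fail2ban setup looks roughly like the sketch below; the jail name, log path, and thresholds are illustrative and would need adjusting. Because it bans individual IP addresses based on log patterns, it tends to fall behind crawlers that spread their requests across large, rotating pools of addresses.

  # /etc/fail2ban/jail.local - bans single offending IPs based on log patterns
  [nginx-botsearch]
  enabled  = true
  port     = http,https
  logpath  = /var/log/nginx/access.log
  maxretry = 10
  findtime = 60
  bantime  = 3600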

Defenses before MediaWiki

  • WAF, e.g. Cloudflare - the Content Delivery Network (CDN) company offers a Web Application Firewall (WAF) product[2] to stop network attacks.
  • Filtering reverse proxies - e.g. dropping requests from known crawler user agents at the proxy itself (a sketch follows this list)
  • Anubis - its README describes the solution as potentially over-zealous, but the default configuration appears to expressly allow good actors such as the Internet Archive, Bing, and Google.[3]
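
A minimal sketch of the reverse proxy approach, here as an nginx configuration that rejects requests whose User-Agent matches a few known AI crawlers. The bot list, hostname, and backend address are placeholders, and anything that spoofs a browser User-Agent will get through, so this is only one layer.

  # nginx.conf (http context): classify requests by User-Agent
  map $http_user_agent $is_ai_crawler {
      default                                  0;
      "~*(GPTBot|ClaudeBot|CCBot|Bytespider)"  1;
  }

  server {
      listen 80;
      server_name wiki.example.org;

      location / {
          # Reject matched crawlers before they reach MediaWiki
          if ($is_ai_crawler) {
              return 403;
          }
          proxy_pass http://127.0.0.1:8080;   # backend running MediaWiki
          proxy_set_header Host $host;
          proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
      }
  }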

Defenses in MediaWiki

  • Lockdown extension - suitable for other purposes in the category of "User Rights". For example, you can block access to certain swaths of URLs, but it's not designed for complex filtering (a small sketch follows this list).
  • StopForumSpam - as the name suggests, suitable for preventing write access (not reads/views).
  • AbuseFilter extension - suitable for setting rules about content editing, such as preventing links to specific domains, but not for filtering traffic.
  • CrawlerProtection extension - by MyWikis' Jeffrey Wang. Currently has a bug affecting MW 1.43.
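
For illustration, a LocalSettings.php sketch of the Lockdown approach, restricting a couple of the crawler-attracting pages listed in the next section to logged-in users. The specific page names and the $wgActionLockdown setting are assumptions to verify against the extension documentation; this is not a complete defense.

  # LocalSettings.php
  wfLoadExtension( 'Lockdown' );

  # Limit expensive special pages to logged-in users.
  # Keys must be the canonical special page names.
  $wgSpecialPageLockdown['Export'] = [ 'user' ];
  $wgSpecialPageLockdown['Whatlinkshere'] = [ 'user' ];

  # Limit the history action to logged-in users
  # (verify $wgActionLockdown support in your Lockdown version).
  $wgActionLockdown['history'] = [ 'user' ];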

Problematic pages in MediaWiki

  • SpecialPages
    • WhatLinksHere
    • RecentChangesLinked
  • History
  • Arbitrary Diffs
  • The 'ABCD' special pages (Ask, BrowseData, CargoQuery, Drilldown)
    • SMW
      • Ask
      • BrowseData
    • Cargo
      • CargoQuery
      • Drilldown
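
The pages above are expensive to render and effectively unlimited in number, since every revision, diff, and query permutation is a distinct URL. As a first, cooperative layer, robots.txt can steer compliant crawlers away from them; bots that ignore robots.txt still need the proxy/WAF measures above. This sketch assumes short URLs where articles live under /wiki/ and all parameterized views (history, diffs, special pages) go through /index.php.

  # robots.txt - keep compliant crawlers on plain article views only
  User-agent: *
  Disallow: /index.php
  Disallow: /wiki/Special: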

Discussion

See mw:Handling web crawlers on mediawiki.org for general guidance on this problem.

Solution

We track this work in https://github.com/freephile/meza/issues/156

References

[1] https://www.youtube.com/watch?v=VGS5l3YH2oY