LLM Scrapers Allegedly Target Multiple Open Source Projects Disrupting the FOSS Ecosystem

March 17, 2025

In mid-March 2025, KDE's GitLab infrastructure was reportedly disrupted by aggressive AI web scrapers originating from Alibaba IP ranges. The bots reportedly ignored robots.txt and spoofed browser headers, overloading the site and causing outages for developers. Similar incidents were reported by other FOSS projects, including GNOME, SourceHut, and Fedora. The scraping is allegedly tied to large language model training and imposes real costs and delays on the affected projects.
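For readers unfamiliar with robots.txt, the following is a minimal sketch of the check a well-behaved crawler is expected to run before requesting a page, using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleBot` user agent, and the `example.org` URLs are hypothetical and are not taken from KDE or any other project named in this incident.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /api/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules allow user_agent to request url."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A compliant crawler skips any URL for which this check returns False.
for url in ("https://example.org/about", "https://example.org/api/v4/projects"):
    print(url, may_fetch(parser, "ExampleBot", url))
```

The check runs entirely on the crawler's side, so robots.txt works only when the client chooses to honor it. A scraper that skips the check, or that sends a spoofed browser User-Agent, is not stopped by the file itself.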

This incident highlights the importance of implementing guardrails for AI, such as those promoted by Project Cerebellum's AI governance efforts. By mapping incidents like this one to HISPI Project Cerebellum TAIM (Govern function), we can better understand, measure, and manage these types of threats to safe and secure AI practices.

Matched TAIM controls

Suggested mapping from embedding similarity (not a formal assessment).

Alleged deployer: unnamed-generative-ai-companies, alibaba
Alleged developer: unnamed-generative-ai-companies, alibaba
Alleged harmed parties: sysadmins, sourcehut, read-the-docs, linux-weekly-news, kde, inkscape, gnome, foss-projects-and-communities, fedora, diaspora, curl

Source

Data from the AI Incident Database (AIID). Cite this incident: https://incidentdatabase.ai/cite/1001

Data source

When citing the database as a whole, please use:

McGregor, S. (2021) Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database. In Proceedings of the Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21). Virtual Conference.

We use weekly snapshots of the AIID for stable reference. For the official suggested citation of a specific incident, use the “Cite this incident” link on each incident page.