Alleged Inclusion of 12,000 Live API Keys in LLM Training Data Reportedly Poses Security Risks

February 28, 2025

An investigation found 12,000 live API keys and authentication credentials in a dataset used to train large language models (LLMs). Preliminary findings suggest that some of these secrets were still active, which could allow malicious actors to gain unauthorized access. The credentials were found in the December 2024 Common Crawl archive, which encompasses approximately 250 billion web pages. If exploited, they could have enabled data breaches, service disruptions, and financial fraud, among other harms. The incident underscores the importance of trustworthy AI governance and of safe and secure AI practices.

Join us at Project Cerebellum to help establish guardrails for AI and ensure harm prevention through our HISPI Project Cerebellum TAIM (Govern) initiative.

Matched TAIM controls

Suggested mapping from embedding similarity (not a formal assessment).

Alleged deployer: microsoft, openai, common-crawl, microsoft-azure-openai-service
Alleged developer: common-crawl, openai, microsoft
Alleged harmed parties: aws, slack, mailchimp, microsoft, google, intel, huawei, paypal, ibm, tencent

Source

Data from the AI Incident Database (AIID). Cite this incident: https://incidentdatabase.ai/cite/956

When citing the database as a whole, please use:

McGregor, S. (2021) Preventing Repeated Real World AI Failures by Cataloging Incidents: The AI Incident Database. In Proceedings of the Thirty-Third Annual Conference on Innovative Applications of Artificial Intelligence (IAAI-21). Virtual Conference.

We use weekly snapshots of the AIID for stable reference. For the official suggested citation of a specific incident, use the “Cite this incident” link on each incident page.