Microsoft AI Researchers Accidentally Exposed 38TB of Sensitive Data

Security researchers at cloud-security company Wiz discovered a data leak affecting Microsoft’s AI GitHub repository. The leak exposed a huge amount of private data, including disk backups of two employees’ workstations.

As Wiz researchers found out, Microsoft researchers had published instructions for downloading some image-recognition AI models in Microsoft AI research's robust-models-transfer repository on GitHub and provided an Azure Storage download link. Unfortunately, the shared URL granted access to the entire storage account rather than to the models alone.

According to Wiz, a scan showed that this account contained 38TB of additional data, including backups of Microsoft employees’ personal computers. The backups contained sensitive personal data, including passwords to Microsoft services, secret keys, and over 30,000 internal Microsoft Teams messages from 359 Microsoft employees.

To make things worse, the access URL was misconfigured in a way that granted "full control" permissions on the account, which would have allowed potential attackers to modify, overwrite, or delete existing files. Additionally, Python's pickle format, which was used to serialize the model data files, is prone to arbitrary code execution by design, meaning that:

An attacker could have injected malicious code into all the AI models in this storage account, and every user who trusts Microsoft’s GitHub repository would’ve been infected by it.
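To illustrate why pickle is considered unsafe by design, here is a minimal sketch, not taken from the Wiz report. The `Payload` class is hypothetical, and a harmless `eval` expression stands in for whatever an attacker might actually run. It relies only on the documented `__reduce__` hook in Python's standard library.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a tampered object inside a model file."""

    def __reduce__(self):
        # pickle stores this (callable, args) pair and calls it on load.
        # The expression here is harmless; any importable callable could be named.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading the bytes calls eval("6 * 7") instead of rebuilding a Payload.
result = pickle.loads(blob)
print(result)  # 42
```

The point is that the code runs during loading itself, before the caller has a chance to inspect what was loaded.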

As Wiz researchers noted, however, the storage account itself was not directly exposed to the public. The link relied on a SAS (Shared Access Signature) token, a mechanism that makes it possible to create a shareable link to an Azure storage account without the account appearing public. According to Wiz researchers, SAS tokens are often used improperly, either by granting excessive privileges or by setting very long lifetimes, and they lack management facilities.
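As a rough illustration of the two misuses mentioned above, the following sketch inspects the query string of a SAS URL with the standard library only. The function name `audit_sas_url`, the seven-day threshold, the set of "risky" permission letters, and both example URLs are assumptions made for this example. The parameter names `sp` (permissions), `srt` (resource types, present on account-level tokens), and `se` (expiry) are the ones Azure documents for SAS tokens.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlsplit

# Assumed, non-exhaustive set of permission letters that allow changes
# (write, delete, add, create, update); "r" and "l" are read and list.
RISKY_PERMISSIONS = set("wdacu")


def audit_sas_url(url, now, max_lifetime=timedelta(days=7)):
    """Return a list of warnings for a SAS URL; an empty list means none found."""
    params = {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}
    warnings = []

    # "sp" holds the signed permissions as a string of single letters.
    risky = sorted(RISKY_PERMISSIONS & set(params.get("sp", "")))
    if risky:
        warnings.append("grants more than read/list: " + "".join(risky))

    # "srt" (signed resource types) appears on account-level SAS tokens.
    if "srt" in params:
        warnings.append("account-level token, not scoped to one container or blob")

    # "se" is the signed expiry time in ISO 8601 UTC.
    expiry = params.get("se")
    if expiry is None:
        warnings.append("no expiry in the URL")
    else:
        expires_at = datetime.fromisoformat(expiry.replace("Z", "+00:00"))
        if expires_at.tzinfo is None:
            expires_at = expires_at.replace(tzinfo=timezone.utc)
        if expires_at - now > max_lifetime:
            warnings.append("expires more than %d days from now" % max_lifetime.days)

    return warnings


# Hypothetical URLs: one broad and long-lived, one read-only and short-lived.
broad = ("https://example.blob.core.windows.net/models"
         "?sv=2020-08-04&ss=b&srt=sco&sp=rwdlacx&se=2051-10-06T00:00:00Z&sig=REDACTED")
narrow = ("https://example.blob.core.windows.net/models/model.ckpt"
          "?sv=2020-08-04&sr=b&sp=r&se=2023-09-19T00:00:00Z&sig=REDACTED")

now = datetime(2023, 9, 18, tzinfo=timezone.utc)
for warning in audit_sas_url(broad, now):
    print(warning)
```

A check like this only reads what the URL declares. It cannot tell whether a token has been revoked, and it cannot see an expiry that is set in a stored access policy rather than in the URL.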

Once Microsoft was notified of the incident, it promptly fixed the leak by revoking the SAS token, which required revoking the entire account key.

While the incident itself was only coincidentally related to AI, some Hacker News commenters pointed at the possibility that incidents of this kind could lead to data poisoning attacks that tamper with AI models, for example a compromised GPT that inserts certain vulnerabilities into the code it generates. Another commenter remarked, though, that this is just a theoretical possibility, given the amount of "poison" one would need to inject into the training data.

Other commenters raised serious concerns about the widespread use of pickle as a serialization format, which they noted is unfortunately the only available option for many of the major ML packages.
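Where pickle cannot be avoided, one mitigation described in the Python documentation is to restrict which globals a pickle stream may reference. The sketch below follows that pattern. The allow-list and the names `RestrictedUnpickler` and `restricted_loads` are chosen for illustration, and real model files would likely need a different, carefully reviewed allow-list. It reduces the risk but is not a guarantee of safety.

```python
import builtins
import io
import pickle

# Assumed allow-list for illustration only.
SAFE_BUILTINS = {"dict", "list", "set", "tuple", "frozenset", "range", "complex"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream asks for a global by name.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError("global '%s.%s' is forbidden" % (module, name))


def restricted_loads(data):
    """Like pickle.loads, but rejects any global outside the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers and numbers load normally.
weights = restricted_loads(pickle.dumps({"weights": [1.0, 2.0]}))

# A stream that references builtins.eval is rejected before anything is called.
try:
    restricted_loads(pickle.dumps(eval))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```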

Do not miss the original article if you are interested in the full details about Wiz's findings and their recommendations.

About the Author

Sergio De Simone
