All about ChatGPT's first data breach, and how it happened

June 01, 2023

OpenAI has confirmed that ChatGPT has experienced its first data breach, which exposed the details of some ChatGPT Plus subscribers, and some of their prompts, to other users. The breach has heightened security concerns about ChatGPT, with one country banning it outright following the incident.

When did the breach occur? Whose data was exposed?

The breach happened during a nine-hour window on March 20, between 1 a.m. and 10 a.m. Pacific time. According to OpenAI, the creators of ChatGPT, approximately 1.2% of the ChatGPT Plus subscribers who were active during this time period had their data exposed.

So what is 1.2% in actual numbers? ChatGPT reportedly has around 100 million users, but it is unlikely that all of them subscribe to ChatGPT Plus, and fewer still were active during that window, so the figure is anywhere up to (but realistically well short of) 1.2 million users.
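The ceiling quoted above is simple arithmetic, sketched below. The function name `exposed_upper_bound` is made up for this illustration, and the real number of Plus subscribers active during the window has not been published, so the 100 million input is only the worst-case assumption from the text.

```python
def exposed_upper_bound(active_plus_subscribers: int) -> int:
    """Return 1.2% of the given population.

    Integer math (x * 12 // 1000) avoids floating-point rounding noise.
    """
    return active_plus_subscribers * 12 // 1000


if __name__ == "__main__":
    # Worst case from the article: every one of ~100 million users was an
    # active Plus subscriber during the window (an unrealistic assumption).
    print(f"{exposed_upper_bound(100_000_000):,}")
```

Plugging in a smaller, more plausible subscriber count shrinks the result proportionally, which is why the real figure is likely far below the ceiling.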

What data was exposed during the breach?

During this time, it was possible for some users to see another user’s first and last name, email address, payment address, credit card type, credit card number (the last four digits only), and credit card expiration date. It was also possible for some users to see the first message of other users’ newly created conversations.

How did the ChatGPT data breach happen?

The breach happened due to a bug in open-source code that ChatGPT was using under the hood. The bug meant that if a request was canceled within a very specific timeframe, its response could be left behind on the connection and then delivered to the next user who made a similar request.

In more detail, OpenAI was using redis-py, the Redis client library for Python. The library lets OpenAI keep a pool of connections between their Python server (which runs on asyncio) and Redis, which they use as a cache so they don’t need to check the main database for every request.

With asyncio, each redis-py connection behaves like two queues: one for requests going in, the other for responses coming out. However, there was a bug: if a request was canceled after it had made it into the first queue, but before its response was delivered via the second queue, the connection was left out of step, and the next response to come out could contain leftover data from the canceled request.

Normally, that would cause an unrecoverable server error, and the user would have to try their request again. But sometimes, by sheer luck, the leftover data matched the type of data the next user was expecting, so it was returned as if it were valid, and that is how one user ended up seeing another’s information. Think of it like getting someone else’s package by mistake, but thinking it’s yours because it looks similar.
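The failure mode described above can be reproduced with a toy model. This is a minimal sketch and not redis-py’s actual code: the `SharedConnection` class, its `serve` and `fetch` methods, and the `user_a` / `user_b` keys are all invented for illustration. It only assumes what the explanation says, namely that one connection has a request queue and a response queue, and that responses are handed out strictly in order.

```python
import asyncio


class SharedConnection:
    """Toy stand-in for one pooled connection (hypothetical, not redis-py)."""

    def __init__(self) -> None:
        self.requests = asyncio.Queue()   # requests going in
        self.responses = asyncio.Queue()  # responses coming out, in order

    async def serve(self) -> None:
        # Pretend backend: answers every request after a short delay.
        while True:
            key = await self.requests.get()
            await asyncio.sleep(0.05)
            await self.responses.put(f"data for {key}")

    async def fetch(self, key: str) -> str:
        await self.requests.put(key)
        # If the caller is canceled while waiting here, its response is
        # still produced later and stays in the queue for someone else.
        return await self.responses.get()


async def main() -> str:
    conn = SharedConnection()
    server = asyncio.create_task(conn.serve())

    # User A sends a request, then cancels before the response arrives.
    task_a = asyncio.create_task(conn.fetch("user_a"))
    await asyncio.sleep(0.01)  # the request is now in the first queue
    task_a.cancel()
    await asyncio.gather(task_a, return_exceptions=True)

    # User B reuses the same connection and reads the next response out.
    result_b = await conn.fetch("user_b")
    server.cancel()
    return result_b


if __name__ == "__main__":
    print(asyncio.run(main()))  # user B receives user A's data
```

In this sketch the leftover response always has the right shape, so the leak happens every time. In the real incident the mismatch usually produced an error, and only occasionally did the stale data look valid to the next request. A defensive design would discard or reset a connection whose request was canceled mid-flight rather than return it to the pool.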

What has been the response to the incident?

The Redis open-source maintainers have addressed the bug and rolled out a patch. OpenAI has no plans to ditch Redis, which it said has been “crucial” in making ChatGPT possible.

“Their significance cannot be understated. We would not have been able to scale ChatGPT without Redis,” OpenAI said. The company currently uses Redis Cluster to distribute its request load across multiple Redis instances.

OpenAI said they have extensively tested their fix for the underlying bug, and increased the “robustness and scale” of their Redis cluster to reduce the chance of errors at extreme load. The company has also launched a bug bounty program offering "$200 for low-severity findings to up to $20,000 for exceptional discoveries."

The incident has raised questions about ChatGPT’s security

Italy’s privacy watchdog has since banned ChatGPT, explicitly citing the data breach as one of the reasons, as well as questioning OpenAI’s use of personal data to train the chatbot.

It cited concerns such as a “lack of a notice to users and all those involved whose data is gathered by OpenAI” and said there appears to be “no legal basis underpinning the massive collection and processing of personal data in order to ‘train’ the algorithms on which the platform relies”.

The watchdog also took issue with the fact that while the service is aimed at people over the age of 13, the “absence of any filter for verifying the age of users exposes minors to absolutely unsuitable answers compared to their degree of development and self-awareness.”

Open source bugs are becoming a haven for cyber-attackers

There has been a 633% year-on-year increase in attacks against open source repositories, and an annual, overall increase of 724% since 2019. Many of these are made possible by technical debt and a lack of vulnerability management, not just one-off bugs.

Libraries underpin modern software development, allowing us to make massive economic gains. Open source components exist in around 90% of modern software applications: think of PyTorch, TensorFlow, OpenSSL, or Kubernetes. However, they come with the dark side of a dependency tree: the open source component you’re using may depend on another open source component, which depends on another, and so on.

Open source dependencies can sound impossible to deal with. However, the risk can be mitigated by creating and maintaining a list of your open source libraries, monitoring your packages, and vetting all new libraries before they make it into your projects.
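As a rough starting point for such a list, the sketch below inventories the packages installed in the current Python environment using the standard library’s `importlib.metadata`, then flags anything not on an approved list. The helper names `installed_packages` and `find_unvetted`, and the idea of an approved list kept as a plain set of names, are assumptions made for this example. A real process would also track transitive dependencies and known vulnerabilities.

```python
import re
from importlib.metadata import distributions


def normalize(name: str) -> str:
    """Normalize a package name the way PEP 503 does (case, '-', '_', '.')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def installed_packages() -> dict:
    """Map each installed distribution name to its version string."""
    inventory = {}
    for dist in distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with missing or broken metadata
            inventory[name] = dist.version
    return inventory


def find_unvetted(inventory: dict, approved: set) -> list:
    """Return installed package names that are not on the approved list."""
    approved_names = {normalize(n) for n in approved}
    return sorted(n for n in inventory if normalize(n) not in approved_names)


if __name__ == "__main__":
    inventory = installed_packages()
    # Hypothetical approved list; in practice this would live in version control.
    approved = {"pip", "setuptools", "wheel"}
    for name in find_unvetted(inventory, approved):
        print(f"not yet vetted: {name}=={inventory[name]}")
```

Running a check like this in continuous integration makes a newly added library visible before it ships, which is the point at which vetting is cheapest.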

And just to leave things on a more lively note, here’s a comic from XKCD which sums up the whole situation.

XKCD comic with a diagram of all modern infrastructure resting on a small domino supported by a random person in Nebraska