
Researchers Publish Attack Algorithm for ChatGPT and Other LLMs

Researchers from Carnegie Mellon University (CMU) have published LLM Attacks, an algorithm for constructing adversarial attacks on a wide range of large language models (LLMs), including ChatGPT, Claude, and Bard. The attacks are generated automatically and are successful 84% of the time on GPT-3.5 and GPT-4, and 66% of the time on PaLM-2.

Unlike most "jailbreak" attacks, which are constructed manually by trial and error, the CMU team devised a three-step process to automatically generate prompt suffixes that can bypass the LLM's safety mechanisms and result in a harmful response. The prompts are also transferable, meaning that a given suffix will often work on many different LLMs, even closed-source models. To measure the effectiveness of the algorithm, the researchers created a benchmark called AdvBench; when evaluated on this benchmark, LLM Attacks has an 88% success rate against Vicuna, compared to 25% for a baseline adversarial algorithm. According to the CMU team:

Perhaps most concerningly, it is unclear whether such behavior can ever be fully patched by LLM providers. Analogous adversarial attacks have proven to be a very difficult problem to address in computer vision for the past 10 years. It is possible that the very nature of deep learning models makes such threats inevitable. Thus, we believe that these considerations should be taken into account as we increase usage and reliance on such AI models.

With the release of ChatGPT and GPT-4, many techniques for jailbreaking these models emerged. These consisted of prompts that could cause the models to bypass their safeguards and output potentially harmful responses. While such prompts are generally discovered by experimentation, the LLM Attacks algorithm provides an automated way to create them. The first step is to create a target sequence of tokens: "Sure, here is (content of query)," where "content of query" is the user's actual prompt, which asks for a harmful response.
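As a rough illustration of this first step, the sketch below assembles the two strings involved: the prompt (the user's query followed by the suffix to be optimized) and the target sequence the optimizer tries to make likely. The function name `build_attack_inputs`, the dictionary keys, and the placeholder strings are assumptions made for this example and are not taken from the researchers' code.

```python
def build_attack_inputs(user_query: str, suffix: str) -> dict:
    """Assemble the prompt and the target sequence (hypothetical helper)."""
    return {
        # The suffix is appended to the user's query; its tokens are what get optimized.
        "prompt": f"{user_query} {suffix}",
        # The target is the affirmative opening the optimizer tries to elicit.
        "target": f"Sure, here is {user_query}",
    }


# Benign placeholders stand in for the query and the suffix tokens.
example = build_attack_inputs("a placeholder request", "<SUFFIX TOKENS>")
print(example["prompt"])
print(example["target"])
```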

Next, the algorithm generates an adversarial suffix for the prompt by finding a sequence of tokens that is likely to cause the LLM to output the target sequence, using a Greedy Coordinate Gradient-based (GCG) search. While this does require access to the LLM's neural network, the team found that by running GCG against many open-source models, the results were transferable even to closed models.
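The general shape of a greedy coordinate gradient search can be sketched on a toy problem. The example below does not use an LLM: as a stand-in, the "loss" is the squared distance between a weighted sum of made-up token embeddings and a fixed target vector, so the gradient with respect to each token choice can be written by hand. All names (`toy_loss`, `gcg_step`, `run_toy_gcg`), the vocabulary size, and the hyperparameters are assumptions for illustration. The sketch also accepts a token swap only when it lowers the loss, which is a simplification and may differ from the published algorithm.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_loss(suffix, emb, weights, target):
    """Toy stand-in for the LLM loss: squared distance to a target vector."""
    dim = len(target)
    mixed = [sum(w * emb[tok][d] for w, tok in zip(weights, suffix))
             for d in range(dim)]
    return sum((m - t) ** 2 for m, t in zip(mixed, target)), mixed


def token_gradients(suffix, emb, weights, target):
    """Gradient of the toy loss w.r.t. a one-hot indicator per (position, token)."""
    _, mixed = toy_loss(suffix, emb, weights, target)
    residual = [2 * (m - t) for m, t in zip(mixed, target)]
    return [[w * dot(residual, emb[v]) for v in range(len(emb))]
            for w in weights]


def gcg_step(suffix, emb, weights, target, rng, top_k=4, batch=16):
    """One search step: propose gradient-guided single-token swaps, keep the best."""
    grads = token_gradients(suffix, emb, weights, target)
    best, best_loss = suffix, toy_loss(suffix, emb, weights, target)[0]
    for _ in range(batch):
        pos = rng.randrange(len(suffix))
        # Candidates: the top_k tokens with the most negative gradient here.
        candidates = sorted(range(len(emb)), key=lambda v: grads[pos][v])[:top_k]
        trial = list(suffix)
        trial[pos] = rng.choice(candidates)
        # The gradient is only a guide; each candidate is scored with the true loss.
        loss = toy_loss(trial, emb, weights, target)[0]
        if loss < best_loss:
            best, best_loss = trial, loss
    return best, best_loss


def run_toy_gcg(steps=20, seed=0):
    rng = random.Random(seed)
    vocab, dim, length = 12, 3, 5
    emb = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab)]
    weights = [1.0 / (i + 1) for i in range(length)]
    target = [0.5, -0.25, 0.75]
    suffix = [rng.randrange(vocab) for _ in range(length)]
    start_loss = toy_loss(suffix, emb, weights, target)[0]
    loss = start_loss
    for _ in range(steps):
        suffix, loss = gcg_step(suffix, emb, weights, target, rng)
    return start_loss, loss, suffix


start, final, suffix = run_toy_gcg()
print(f"loss {start:.3f} -> {final:.3f}, suffix token ids {suffix}")
```

In this sketch the gradient narrows the set of replacement tokens at each position, and the exact loss decides which swap to keep. Against a real model, the loss would instead measure how unlikely the target sequence is given the prompt and suffix, which is why the search needs access to the network's weights.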

In a CMU press release discussing their research, co-author Matt Fredrikson said:

The concern is that these models will play a larger role in autonomous systems that operate without human supervision. As autonomous systems become more of a reality, it will be very important to ensure that we have a reliable way to stop them from being hijacked by attacks like these...Right now, we simply don’t have a convincing way to stop this from happening, so the next step is to figure out how to fix these models...Understanding how to mount these attacks is often the first step in developing a strong defense.

Lead author Andy Zou, a PhD student at CMU, wrote about the work on Twitter. He said:

Despite the risks, we believe it to be proper to disclose in full. The attacks presented here are simple to implement, have appeared in similar forms before, and ultimately would be discoverable by any dedicated team intent on misusing LLMs.

David Krueger, an assistant professor at the University of Cambridge, replied to Zou's thread, saying:

Given that 10 years of research and thousands of publications haven't found a fix for adversarial examples in image models, we have a strong reason to expect the same outcome with LLMs.

In a discussion of the work on Hacker News, one user pointed out:

Remember that a big point of this research is that these attacks don't need to be developed using the target system. When the authors talk about the attacks being "universal", what they mean is that they used a completely local model on their own computers to generate these attacks, and then copied and pasted those attacks into GPT-3.5 and saw meaningful success rates. Rate limiting won't save you from that because the attack isn't generated using your servers, it's generated locally. The first prompt your servers get already has the finished attack string included -- and researchers were seeing success rates around 50% success rate in some situations even for GPT-4.

Code for reproducing the LLM Attacks experiments against the AdvBench data is available on GitHub. A demo of several adversarial attacks is available on the project website.
