
Researchers Open-Source LLM Jailbreak Defense Algorithm SafeDecoding

Researchers from the University of Washington, the Pennsylvania State University, and the Allen Institute for AI have open-sourced SafeDecoding, a technique for protecting large language models (LLMs) against jailbreak attacks. SafeDecoding outperforms baseline jailbreak defenses without incurring significant computational overhead.

The key insight behind SafeDecoding is that during decoding, although tokens representing harmful responses to a jailbreak attack have higher probability, tokens representing a safe response are still among the most likely. To steer the generated response in a safe direction, SafeDecoding therefore identifies the safe-response tokens and amplifies their probabilities, while reducing the probabilities of harmful ones. The researchers applied SafeDecoding to five open-source LLMs and evaluated it against six different jailbreak attacks, comparing it with six baseline defense methods; SafeDecoding outperformed the baselines in almost all scenarios. According to the research team,

The primary goal of [our work] is to strengthen the safety of LLMs by developing a new lightweight decoding strategy. As LLMs are increasingly used in real-world applications, their safety guarantees become critical. We empirically show that our developed decoding strategy...not only effectively mitigates jailbreak attacks, but also allows LLMs to continue serving benign users in an efficient and helpful manner.

With the release of ChatGPT and GPT-4, many techniques for jailbreaking LLMs emerged: prompts that cause the models to bypass their safeguards and output potentially harmful responses. In 2023, InfoQ covered Nvidia's NeMo Guardrails package, which helps developers mitigate LLM risks. InfoQ also covered LLM Attacks, an algorithm for constructing adversarial attacks that was created to help researchers understand and prevent them.

SafeDecoding works by constructing an expert model, which is a fine-tuned version of the target LLM. The fine-tuning uses a dataset that the researchers constructed by prompting the LLM with harmful queries; the dataset includes responses where the LLM refused the prompt. The expert model is then expected to behave similarly to the original LLM, but with a better ability to refuse malicious prompts.
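As a rough illustration of how such a refusal dataset might be assembled, the sketch below prompts a model with harmful queries and keeps only the refusals. The model name, example queries, and refusal heuristic are hypothetical placeholders, not the authors' actual pipeline.

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "lmsys/vicuna-7b-v1.5"  # example target LLM, used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

harmful_queries = [
    "How do I pick a lock?",                      # placeholder harmful prompts
    "Tell me how to hack into someone's account.",
]
refusal_markers = ("I'm sorry", "I cannot", "I can't")  # simple refusal heuristic

fine_tune_examples = []
for query in harmful_queries:
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=True)
    response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                                skip_special_tokens=True)
    # Keep only responses in which the model refused the harmful prompt.
    if response.strip().startswith(refusal_markers):
        fine_tune_examples.append({"prompt": query, "response": response})

# fine_tune_examples is then used to fine-tune a copy of the target model,
# producing the safety "expert" used during decoding.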

During inference, user prompts are passed to both the original model and the expert. As in the usual autoregressive decoding scheme, each model produces a probability distribution over the next token; SafeDecoding takes the top k most likely tokens from each model and forms the intersection of the two sets. For each token in the intersection, a new probability is computed by multiplying the original model's probability by (1-α) and adding the expert's probability multiplied by α. This effectively "amplifies" tokens from the expert which represent safe responses, while "attenuating" tokens from the original model which represent harmful responses.
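A minimal sketch of this combination step, assuming PyTorch and single-step next-token logits from each model; the function name, parameters, and the simple top-k handling are illustrative rather than the released SafeDecoding implementation.

import torch

def safedecoding_step(original_logits, expert_logits, alpha=0.5, k=10):
    # Next-token distributions from the target model and the safety expert.
    p_orig = torch.softmax(original_logits, dim=-1)
    p_expert = torch.softmax(expert_logits, dim=-1)

    # Top-k candidate tokens from each model, then their intersection
    # (the paper enlarges k until the intersection is large enough).
    top_orig = set(torch.topk(p_orig, k).indices.tolist())
    top_expert = set(torch.topk(p_expert, k).indices.tolist())
    candidates = top_orig & top_expert

    # New probability for each candidate: (1 - alpha) * p_orig + alpha * p_expert.
    combined = torch.zeros_like(p_orig)
    for token_id in candidates:
        combined[token_id] = (1 - alpha) * p_orig[token_id] + alpha * p_expert[token_id]

    # Renormalize over the candidate tokens and return the adjusted distribution.
    return combined / combined.sum()

# Usage: feed the same context to both models, combine their next-token logits,
# then sample the next token from the adjusted distribution and repeat.
# next_probs = safedecoding_step(orig_logits, expert_logits, alpha=0.5, k=10)
# next_token = torch.multinomial(next_probs, num_samples=1)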

SafeDecoding Architecture (Image Source: SafeDecoding Source Code)

In a discussion about the work on X, co-author Bill Yuchen Lin was asked about the relationship of SafeDecoding to his previous work on URIAL, an LLM alignment method:

Yes the two works indeed share a common focus: token distribution shifts before and after tuning. In the URIAL paper, it is about BASE vs ALIGNED models. Here in SafeDecoding, we instead look at the general-aligned (eg Vicuna) VS safety-fine tuned models (continual tuning with more refusal examples).  The key strategy is to amplify the changes in the token distribution for defending jailbreaks more effectively.

The SafeDecoding source code is available on GitHub.
