Cybersecurity News Hub
A new technique to prevent LLM jailbreaks – Sophos News

By Cyberinchief
October 25, 2025


Many organizations are increasingly deploying large language models (LLMs) such as OpenAI’s GPT series, Anthropic’s Claude, Meta’s LLaMA, and various models from DeepSeek, with minimal customization. This widespread reuse leads to model homogeneity across applications – from chatbots to productivity tools – and creates a security vulnerability: jailbreak prompts that bypass refusal mechanisms can be precomputed once and reused across many deployments. This mirrors the classic rainbow table attack in password security, where attackers exploit shared cryptographic targets to reuse precomputed inputs.

These generalized jailbreaks are a problem because many companies build customer-facing LLMs on top of shared base models – meaning that one jailbreak could work against every instance derived from a given model. And, of course, those jailbreaks could have multiple undesirable impacts – from exposing sensitive internal data to producing incorrect, inappropriate, or even harmful responses.

Taking inspiration from password salting – the practice of introducing small per-user variations to break the reuse of precomputed inputs – we developed a technique we call ‘LLM salting’: introducing targeted variations in model behavior to invalidate jailbreaks. We recently unveiled this technique at the 2025 Conference on Applied Machine Learning in Information Security (CAMLIS), and this article explores the research in depth.

Refusing to pass the salt

Building on recent work by Arditi et al identifying a subspace in model activations responsible for refusal behavior, we developed a lightweight fine-tuning procedure that rotates this subspace. This simple change ensures that jailbreaks crafted against an unsalted model no longer succeed on salted ones.

Analysis of internal representations reveals that the refusal direction remains largely stable under standard fine-tuning. As shown in Figure 1, the cosine similarity between the model’s residual activations and a precomputed refusal direction at Layer 16 remains consistently high throughout training unless explicitly modified. This indicates that alignment procedures that do not directly target refusal mechanisms are unlikely to disrupt the latent features exploited by jailbreak attacks.
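This measurement can be sketched in a few lines. The 8-dimensional vectors below are toy stand-ins for the model's residual activations (real hidden sizes are in the thousands), not values from the actual models:

```python
import numpy as np

def cosine_similarity(x: np.ndarray, r: np.ndarray) -> float:
    """Cosine similarity between a residual activation x and a refusal direction r."""
    return float(np.dot(x, r) / (np.linalg.norm(x) * np.linalg.norm(r)))

# Toy activation space; seed fixed for reproducibility.
rng = np.random.default_rng(0)
r = rng.normal(size=8)                              # precomputed refusal direction
x_refusing = 2.0 * r + 0.1 * rng.normal(size=8)     # activation aligned with r
x_salted = -2.0 * r + 0.1 * rng.normal(size=8)      # activation rotated away from r

print(cosine_similarity(x_refusing, r))  # close to +1: refusal-aligned
print(cosine_similarity(x_salted, r))    # close to -1: rotated away
```

Tracking this scalar per layer over training steps is exactly what Figure 1 plots.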


Figure 1: Cosine similarity between the model’s internal activations and the precomputed refusal direction at Layer 16 during training. Under standard fine-tuning (white), the refusal direction remains largely unchanged. In contrast, salted fine-tuning (orange) explicitly rotates the representation away from the refusal axis. This indicates that standard alignment methods do not alter refusal-relevant directions unless explicitly incentivized.

In contrast, LLM salting introduces a targeted perturbation that rotates this direction, thereby reducing the efficacy of previously successful attacks without adversely affecting the model’s general behavior.

We evaluated LLM salting against the Greedy Coordinate Gradient (GCG) jailbreak attack. Experiments on LLaMA2-7B-Chat and Vicuna-7B showed that salting consistently breaks intra-model transferability, while preserving the model’s performance on benign prompts.

Importantly, LLM salting can be used in conjunction with existing guardrail methods such as prompt filtering and classifier-based rejections. In line with standard best security practices, we recommend a layered defense strategy, combining salting with other safeguards to improve robustness against jailbreak attacks.

Our experiments

Training data

We constructed the training dataset for fine-tuning by mixing examples from two sources. 90% of the data is drawn from the trl-internal-testing/hh-rlhf-helpful-base-trl-style dataset on Hugging Face, which contains helpful and harmless instructions. The remaining 10% comes from AdvBench, a benchmark of harmful prompts designed to elicit refusals in aligned models. This mixture ensures that, during fine-tuning, the model is exposed to both prompts requiring helpful responses and prompts requiring refusal, reinforcing the desired behavior in each case.
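A minimal sketch of this 90/10 sampling scheme; the placeholder strings stand in for the real Hugging Face and AdvBench examples:

```python
import random

def build_mixture(helpful, harmful, n_total, harmful_frac=0.10, seed=0):
    """Sample a fine-tuning set that is ~90% helpful/harmless examples
    and ~10% refusal-eliciting harmful prompts, then shuffle."""
    rng = random.Random(seed)
    n_harmful = round(n_total * harmful_frac)
    batch = rng.sample(helpful, n_total - n_harmful) + rng.sample(harmful, n_harmful)
    rng.shuffle(batch)
    return batch

helpful = [f"helpful-{i}" for i in range(1000)]   # stand-in for hh-rlhf examples
harmful = [f"harmful-{i}" for i in range(100)]    # stand-in for AdvBench prompts
mix = build_mixture(helpful, harmful, n_total=200)
print(sum(x.startswith("harmful") for x in mix))  # 20 (10% of 200)
```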

Evaluation data

To evaluate jailbreak transferability, we use harmful instructions and adversarial prompts from AdvBench, focusing on GCG – a suffix-based attack that appends adversarial tokens to user prompts. We evaluate on 300 GCG jailbreaks per model, targeting two widely adopted open-source chat models: LLaMA-2-7B-Chat and Vicuna-7B.

Extracting the refusal direction

Following Arditi et al, we extracted a direction r in activation space that mediates model refusals. We adopt their difference-in-means approach, comparing residual activations following harmful and harmless instructions. Let t ∈ D be a training token with label y_t and residual activation x^(l)(t) at layer l. We partition the dataset into D_harmful and D_harmless depending on whether the prompt is intended to trigger a refusal. For each transformer layer l and post-instruction token position i, we compute, as per Arditi et al:

r_i^(l) = (1/|D_harmful|) Σ_{t ∈ D_harmful} x_i^(l)(t) − (1/|D_harmless|) Σ_{t ∈ D_harmless} x_i^(l)(t)

Each candidate r_i^(l) represents the difference in average activations between harmful and harmless prompts. We evaluate all candidates on a held-out validation set using the causal probing procedure from Arditi et al and select the most effective position for r*.
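A toy numpy sketch of the difference-in-means computation at a single layer and position; the synthetic "harmful" activations are shifted along a known direction so the recovered estimate can be checked against it:

```python
import numpy as np

def refusal_direction(acts_harmful: np.ndarray, acts_harmless: np.ndarray) -> np.ndarray:
    """Difference-in-means candidate at one layer/position: mean activation on
    harmful prompts minus mean on harmless prompts, unit-normalised."""
    r = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    return r / np.linalg.norm(r)

rng = np.random.default_rng(1)
d = 16
true_r = np.zeros(d)
true_r[0] = 1.0                                        # planted refusal direction
harmful = rng.normal(size=(64, d)) + 3.0 * true_r      # harmful prompts shifted along it
harmless = rng.normal(size=(64, d))
r_hat = refusal_direction(harmful, harmless)
print(float(np.dot(r_hat, true_r)))                    # close to 1: direction recovered
```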

Salting via loss modification

We implement LLM salting by modifying the training loss to reduce alignment with the refusal direction r∗ on harmful prompts.

The total loss is defined as:

L_total = L_CE + λ · E_{t ∈ D_harmful} [ (1/|L|) Σ_{l ∈ L} cos( x^(l)(t), r* ) ]

where L_CE is the cross-entropy term, L is the set of salted layers, and λ weights the salting penalty.

The loss function comprises two components. The first is the standard cross-entropy term, which encourages the model to generate coherent and contextually appropriate outputs. It also reinforces refusal behavior where warranted—for example, if the model previously refused to answer a harmful prompt, it should continue to do so.

The second term introduces the salting objective. It penalizes alignment between the model’s internal activations and the precomputed refusal direction r∗ on harmful prompts, thereby encouraging the model to ‘refuse differently’ and disrupting the activation patterns exploited by jailbreaks.
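A minimal numpy sketch of this objective at a single salted layer. The `lam` weight and fixed `ce_loss` value are illustrative; a real implementation would compute both terms differentiably in the training framework:

```python
import numpy as np

def salting_loss(ce_loss: float, acts_harmful: np.ndarray, r_star: np.ndarray, lam: float = 1.0):
    """Total loss = cross-entropy + lam * mean cosine alignment with the refusal
    direction r_star on harmful prompts. Minimising it pushes harmful-prompt
    activations away from r_star. acts_harmful: (batch, d) activations."""
    x = acts_harmful / np.linalg.norm(acts_harmful, axis=1, keepdims=True)
    r = r_star / np.linalg.norm(r_star)
    cos_align = float((x @ r).mean())
    return ce_loss + lam * cos_align, cos_align

rng = np.random.default_rng(2)
r_star = rng.normal(size=8)
aligned = np.tile(r_star, (4, 1))       # batch of activations pointing along r_star
total, cos = salting_loss(ce_loss=1.2, acts_harmful=aligned, r_star=r_star)
print(round(cos, 3))                    # 1.0 -> maximally penalised by the salting term
```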

To focus this intervention where it is most effective, we apply the salting loss only at layers with the highest cosine similarity to r∗ during refusals, following the approach of Arditi et al. In our experiments on LLaMA-2-7B-Chat and Vicuna-7B, we use L = {16, 17, 18, 19, 20}.
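Layer selection can be sketched as ranking layers by their mean refusal-time cosine similarity; the bell-shaped similarities below are hypothetical values chosen to peak mid-network, as in our models:

```python
import numpy as np

def select_salting_layers(layer_cosines: dict, k: int = 5) -> list:
    """Pick the k layers whose refusal-time activations align most with r*."""
    ranked = sorted(layer_cosines, key=layer_cosines.get, reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-layer similarities for a 32-layer model, peaking at layer 18.
cosines = {l: float(np.exp(-((l - 18) ** 2) / 20.0)) for l in range(32)}
print(select_salting_layers(cosines, k=5))  # [16, 17, 18, 19, 20]
```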

Results

We seeded our evaluation with 300 GCG jailbreak prompts that achieve a 100% attack success rate (ASR) on the unmodified baseline models. We then assessed whether these attacks remain effective under a range of defenses, and whether our proposed salting method can eliminate the subset of jailbreaks that persist.
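ASR is simply the fraction of jailbreak prompts that still elicit a harmful completion from the defended model. The success flags below are illustrative, not the experimental data:

```python
def attack_success_rate(outcomes) -> float:
    """ASR as a percentage: share of jailbreak prompts judged successful."""
    return 100.0 * sum(outcomes) / len(outcomes)

# 300 precomputed GCG prompts; True = jailbreak still succeeded after the defense.
outcomes = [True] * 8 + [False] * 292
print(round(attack_success_rate(outcomes), 2))  # 2.67
```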

Figures 2 and 3 show ASR (left axis) and Massive Multitask Language Understanding (MMLU) accuracy (right axis) for four model variants:

  • The original model without fine-tuning (No FT)
  • A standard fine-tuned model trained on our alignment dataset (Standard FT)
  • A model with a modified system prompt, in several variants (System Prompt Change)
  • A model fine-tuned with our cosine-based salting loss (Salting)


Figure 2: LLaMA2-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 3% while preserving performance


Figure 3: Vicuna-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 1% while preserving performance

Jailbreak robustness

For LLaMA-2-7B (Figure 2), we observe that standard fine-tuning and system prompt changes reduce ASR only partially, bringing it down to approximately 40–60%. In contrast, salting reduces ASR from 100% to just 2.75%.

A similar trend holds for Vicuna-7B (Figure 3), where the ASR drops from 100% to 1.35% under salting. These results demonstrate that our approach effectively eliminates the subset of jailbreaks that remain robust under traditional defenses, outperforming both parameter-based and prompt-based strategies.

Capability preservation

To ensure that this robustness does not come at the cost of model utility, we evaluate general capabilities with the MMLU benchmark using lm-evaluation-harness. For both LLaMA-2-7B (46.8%) and Vicuna-7B (49.2%), the salted models achieve MMLU accuracies that are statistically indistinguishable from their unsalted counterparts—differences are well under typical run-to-run noise and show no systematic drift. This indicates that the refusal gains delivered by salting do not compromise helpfulness or general task performance.

Model introspection

To understand how salting disrupts jailbreak transferability, we examine the cosine similarity between residual activations and the precomputed refusal direction across layers, following Arditi et al. In the original model, harmful and harmless prompts exhibit a clear separation in their alignment with the refusal direction: harmful inputs maintain high positive cosine similarity, while harmless prompts are negatively aligned.

When GCG is applied to a harmful prompt, the resulting activation similarity shifts downward, increasingly resembling those of harmless inputs.


Figure 4: Cosine similarity between input activations and the precomputed refusal direction across layers in the original model. Harmless and harmful inputs are initially well separated, but GCG-perturbed adversarial prompts (blue) increasingly align with harmful trajectories (orange) in deeper layers, revealing convergence toward refusal-triggering representations

In the salted model (Figure 5), this convergence no longer occurs. GCG prompts remain distant from the harmful trajectory and no longer shift activations into benign regions. We hypothesize that, since salting effectively inverts the refusal direction, GCG’s original optimization now increases alignment with the rotated vector, unintentionally reinforcing refusal behavior.


Figure 5: Cosine similarity between input activations and the refusal direction in the salted model. Salting disrupts adversarial effect by rotating the activation space: GCG-modified prompts (blue) no longer align with harmful representations, preserving separation from the refusal subspace

Conclusion and future work

We present LLM salting, a lightweight fine-tuning technique that disrupts jailbreak reuse by rotating internal refusal representations. This technique almost entirely neutralizes the success of precomputed GCG jailbreaks on both LLaMA-2 and Vicuna, while preserving the model’s performance on benign inputs.

Future work could explore applying salting to larger models and evaluating its robustness against a broader range of jailbreak strategies, such as AutoDAN and TAP.


