Cybersecurity News Hub
A new technique to prevent LLM jailbreaks – Sophos News

By Cyberinchief
October 25, 2025


Many organizations are increasingly deploying large language models (LLMs) such as OpenAI’s GPT series, Anthropic’s Claude, Meta’s LLaMA, and various models from DeepSeek, with minimal customization. This widespread reuse leads to model homogeneity across applications – from chatbots to productivity tools – and creates a security vulnerability: jailbreak prompts that bypass refusal mechanisms can be precomputed once and reused across many deployments. This mirrors the classic rainbow table attack in password security, where attackers exploit shared cryptographic targets to reuse precomputed inputs.

These generalized jailbreaks are a problem because many companies build customer-facing LLMs on top of shared base models – meaning that one jailbreak could work against every instance derived from a given model. And, of course, those jailbreaks could have multiple undesirable impacts – from exposing sensitive internal data to producing incorrect, inappropriate, or even harmful responses.

Taking inspiration from password salting – the practice of introducing small per-user variations to break the reuse of precomputed inputs – we developed a technique we call ‘LLM salting’: introducing targeted variations in model behavior to invalidate jailbreaks. We recently unveiled this technique at the 2025 Conference on Applied Machine Learning in Information Security (CAMLIS), and this article explores the research in depth.

Refusing to pass the salt

Building on recent work by Arditi et al identifying a subspace in model activations responsible for refusal behavior, we developed a lightweight fine-tuning procedure that rotates this subspace. This simple change ensures that jailbreaks crafted against an unsalted model no longer succeed on salted ones.

Analysis of internal representations reveals that the refusal direction remains largely stable under standard fine-tuning. As shown in Figure 1, the cosine similarity between the model’s residual activations and a precomputed refusal direction at Layer 16 remains consistently high throughout training unless explicitly modified. This indicates that alignment procedures that do not directly target refusal mechanisms are unlikely to disrupt the latent features exploited by jailbreak attacks.
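This measurement can be sketched in a few lines. The 8-dimensional vectors below are toy stand-ins for the model's residual activations (real hidden sizes are in the thousands), not values from the actual models:

```python
import numpy as np

def cosine_similarity(x: np.ndarray, r: np.ndarray) -> float:
    """Cosine similarity between a residual activation x and a refusal direction r."""
    return float(np.dot(x, r) / (np.linalg.norm(x) * np.linalg.norm(r)))

# Toy activation space; seed fixed for reproducibility.
rng = np.random.default_rng(0)
r = rng.normal(size=8)                              # precomputed refusal direction
x_refusing = 2.0 * r + 0.1 * rng.normal(size=8)     # activation aligned with r
x_salted = -2.0 * r + 0.1 * rng.normal(size=8)      # activation rotated away from r

print(cosine_similarity(x_refusing, r))  # close to +1: refusal-aligned
print(cosine_similarity(x_salted, r))    # close to -1: rotated away
```

Tracking this scalar per layer over training steps is exactly what Figure 1 plots.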


Figure 1: Cosine similarity between the model’s internal activations and the precomputed refusal direction at Layer 16 during training. Under standard fine-tuning (white), the refusal direction remains largely unchanged. In contrast, salted fine-tuning (orange) explicitly rotates the representation away from the refusal axis. This indicates that standard alignment methods do not alter refusal-relevant directions unless explicitly incentivized.

In contrast, LLM salting introduces a targeted perturbation that rotates this direction, thereby reducing the efficacy of previously successful attacks without adversely affecting the model’s general behavior.

We evaluated LLM salting against the Greedy Coordinate Gradient (GCG) jailbreak attack. Experiments on LLaMA2-7B-Chat and Vicuna-7B showed that salting consistently breaks intra-model transferability, while preserving the model’s performance on benign prompts.

Importantly, LLM salting can be used in conjunction with existing guardrail methods such as prompt filtering and classifier-based rejections. In line with standard best security practices, we recommend a layered defense strategy, combining salting with other safeguards to improve robustness against jailbreak attacks.

Our experiments

Training data

We constructed the training dataset for fine-tuning by mixing examples from two sources. 90% of the data is drawn from the trl-internal-testing/hh-rlhf-helpful-base-trl-style dataset on Hugging Face, which contains helpful and harmless instructions. The remaining 10% comes from AdvBench, a benchmark of harmful prompts designed to elicit refusals in aligned models. This mixture ensures that, during fine-tuning, the model is exposed to both prompts requiring helpful responses and prompts requiring refusal, reinforcing the desired behavior in each case.
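A minimal sketch of this 90/10 sampling scheme; the placeholder strings stand in for the real Hugging Face and AdvBench examples:

```python
import random

def build_mixture(helpful, harmful, n_total, harmful_frac=0.10, seed=0):
    """Sample a fine-tuning set that is ~90% helpful/harmless examples
    and ~10% refusal-eliciting harmful prompts, then shuffle."""
    rng = random.Random(seed)
    n_harmful = round(n_total * harmful_frac)
    batch = rng.sample(helpful, n_total - n_harmful) + rng.sample(harmful, n_harmful)
    rng.shuffle(batch)
    return batch

helpful = [f"helpful-{i}" for i in range(1000)]   # stand-in for hh-rlhf examples
harmful = [f"harmful-{i}" for i in range(100)]    # stand-in for AdvBench prompts
mix = build_mixture(helpful, harmful, n_total=200)
print(sum(x.startswith("harmful") for x in mix))  # 20 (10% of 200)
```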

Evaluation data

To evaluate jailbreak transferability, we use harmful instructions and adversarial prompts from AdvBench, focusing on GCG – a suffix-based attack that appends adversarial tokens to user prompts. We evaluate on 300 GCG jailbreaks per model, targeting two widely adopted open-source chat models: LLaMA-2-7B-Chat and Vicuna-7B.

Extracting the refusal direction

Following Arditi et al, we extracted a direction r in activation space that mediates model refusals. We adopt their difference-in-means approach, comparing residual activations following harmful and harmless instructions. Let t ∈ D be a training token with label y_t and residual activation x^(l)(t) at layer l. We partition the dataset into D_harmful and D_harmless depending on whether the prompt is intended to trigger a refusal. For each transformer layer l and post-instruction token position i, we compute, as per Arditi et al:

r_i^(l) = (1/|D_harmful|) Σ_{t ∈ D_harmful} x_i^(l)(t) − (1/|D_harmless|) Σ_{t ∈ D_harmless} x_i^(l)(t)

Each candidate r_i^(l) represents the difference in average activations between harmful and harmless prompts. We evaluate all candidates on a held-out validation set using the causal probing procedure from Arditi et al and select the most effective position for r*.
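A toy numpy sketch of the difference-in-means computation at a single layer and position; the synthetic "harmful" activations are shifted along a known direction so the recovered estimate can be checked against it:

```python
import numpy as np

def refusal_direction(acts_harmful: np.ndarray, acts_harmless: np.ndarray) -> np.ndarray:
    """Difference-in-means candidate at one layer/position: mean activation on
    harmful prompts minus mean on harmless prompts, unit-normalised."""
    r = acts_harmful.mean(axis=0) - acts_harmless.mean(axis=0)
    return r / np.linalg.norm(r)

rng = np.random.default_rng(1)
d = 16
true_r = np.zeros(d)
true_r[0] = 1.0                                        # planted refusal direction
harmful = rng.normal(size=(64, d)) + 3.0 * true_r      # harmful prompts shifted along it
harmless = rng.normal(size=(64, d))
r_hat = refusal_direction(harmful, harmless)
print(float(np.dot(r_hat, true_r)))                    # close to 1: direction recovered
```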

Salting via loss modification

We implement LLM salting by modifying the training loss to reduce alignment with the refusal direction r∗ on harmful prompts.

The total loss is defined as:

L_total = L_CE + λ · E_{t ∈ D_harmful} [ (1/|L|) Σ_{l ∈ L} cos( x^(l)(t), r* ) ]

where L_CE is the cross-entropy term, L is the set of salted layers, and λ weights the salting penalty.

The loss function comprises two components. The first is the standard cross-entropy term, which encourages the model to generate coherent and contextually appropriate outputs. It also reinforces refusal behavior where warranted—for example, if the model previously refused to answer a harmful prompt, it should continue to do so.

The second term introduces the salting objective. It penalizes alignment between the model’s internal activations and the precomputed refusal direction r∗ on harmful prompts, thereby encouraging the model to ‘refuse differently’ and disrupting the activation patterns exploited by jailbreaks.
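A minimal numpy sketch of this objective at a single salted layer. The `lam` weight and fixed `ce_loss` value are illustrative; a real implementation would compute both terms differentiably in the training framework:

```python
import numpy as np

def salting_loss(ce_loss: float, acts_harmful: np.ndarray, r_star: np.ndarray, lam: float = 1.0):
    """Total loss = cross-entropy + lam * mean cosine alignment with the refusal
    direction r_star on harmful prompts. Minimising it pushes harmful-prompt
    activations away from r_star. acts_harmful: (batch, d) activations."""
    x = acts_harmful / np.linalg.norm(acts_harmful, axis=1, keepdims=True)
    r = r_star / np.linalg.norm(r_star)
    cos_align = float((x @ r).mean())
    return ce_loss + lam * cos_align, cos_align

rng = np.random.default_rng(2)
r_star = rng.normal(size=8)
aligned = np.tile(r_star, (4, 1))       # batch of activations pointing along r_star
total, cos = salting_loss(ce_loss=1.2, acts_harmful=aligned, r_star=r_star)
print(round(cos, 3))                    # 1.0 -> maximally penalised by the salting term
```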

To focus this intervention where it is most effective, we apply the salting loss only at layers with the highest cosine similarity to r∗ during refusals, following the approach of Arditi et al. In our experiments on LLaMA-2-7B-Chat and Vicuna-7B, we use L = {16, 17, 18, 19, 20}.
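Layer selection can be sketched as ranking layers by their mean refusal-time cosine similarity; the bell-shaped similarities below are hypothetical values chosen to peak mid-network, as in our models:

```python
import numpy as np

def select_salting_layers(layer_cosines: dict, k: int = 5) -> list:
    """Pick the k layers whose refusal-time activations align most with r*."""
    ranked = sorted(layer_cosines, key=layer_cosines.get, reverse=True)
    return sorted(ranked[:k])

# Hypothetical per-layer similarities for a 32-layer model, peaking at layer 18.
cosines = {l: float(np.exp(-((l - 18) ** 2) / 20.0)) for l in range(32)}
print(select_salting_layers(cosines, k=5))  # [16, 17, 18, 19, 20]
```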

Results

We seeded our evaluation with 300 GCG jailbreak prompts that achieve a 100% attack success rate (ASR) on the unmodified baseline models. We then assessed whether these attacks remain effective under a range of defenses, and whether our proposed salting method can eliminate the subset of jailbreaks that persist.
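ASR is simply the fraction of jailbreak prompts that still elicit a harmful completion from the defended model. The success flags below are illustrative, not the experimental data:

```python
def attack_success_rate(outcomes) -> float:
    """ASR as a percentage: share of jailbreak prompts judged successful."""
    return 100.0 * sum(outcomes) / len(outcomes)

# 300 precomputed GCG prompts; True = jailbreak still succeeded after the defense.
outcomes = [True] * 8 + [False] * 292
print(round(attack_success_rate(outcomes), 2))  # 2.67
```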

Figures 2 and 3 show ASR (left axis) and Massive Multitask Language Understanding (MMLU) accuracy (right axis) for four model variants:

  • The original model without fine-tuning (No FT)
  • A standard fine-tuned model trained on our alignment dataset (Standard FT)
  • A model with a modified system prompt, in several variants (System Prompt Change)
  • A model fine-tuned with our cosine-based salting loss (Salting)


Figure 2: LLaMA2-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 3% while preserving performance


Figure 3: Vicuna-7B: ASR of GCG jailbreaks and MMLU accuracy across different defenses. Salting reduces ASR to 1% while preserving performance

Jailbreak robustness

For LLaMA-2-7B (Figure 2), we observe that standard fine-tuning and system prompt changes reduce ASR only partially, bringing it down to approximately 40–60%. In contrast, salting reduces ASR from 100% to just 2.75%.

A similar trend holds for Vicuna-7B (Figure 3), where the ASR drops from 100% to 1.35% under salting. These results demonstrate that our approach effectively eliminates the subset of jailbreaks that remain robust under traditional defenses, outperforming both parameter-based and prompt-based strategies.

Capability preservation

To ensure that this robustness does not come at the cost of model utility, we evaluate general capabilities with the MMLU benchmark using lm-evaluation-harness. For both LLaMA-2-7B (46.8%) and Vicuna-7B (49.2%), the salted models achieve MMLU accuracies that are statistically indistinguishable from their unsalted counterparts—differences are well under typical run-to-run noise and show no systematic drift. This indicates that the refusal gains delivered by salting do not compromise helpfulness or general task performance.

Model introspection

To understand how salting disrupts jailbreak transferability, we examine the cosine similarity between residual activations and the precomputed refusal direction across layers, following Arditi et al. In the original model, harmful and harmless prompts exhibit a clear separation in their alignment with the refusal direction: harmful inputs maintain high positive cosine similarity, while harmless prompts are negatively aligned.

When GCG is applied to a harmful prompt, the resulting activation similarity shifts downward, increasingly resembling those of harmless inputs.


Figure 4: Cosine similarity between input activations and the precomputed refusal direction across layers in the original model. Harmless and harmful inputs are initially well separated, but GCG-perturbed adversarial prompts (blue) increasingly align with harmful trajectories (orange) in deeper layers, revealing convergence toward refusal-triggering representations

In the salted model (Figure 5), this convergence no longer occurs. GCG prompts remain distant from the harmful trajectory and no longer shift activations into benign regions. We hypothesize that, since salting effectively inverts the refusal direction, GCG’s original optimization now increases alignment with the rotated vector, unintentionally reinforcing refusal behavior.


Figure 5: Cosine similarity between input activations and the refusal direction in the salted model. Salting disrupts adversarial effect by rotating the activation space: GCG-modified prompts (blue) no longer align with harmful representations, preserving separation from the refusal subspace

Conclusion and future work

We present LLM salting, a lightweight fine-tuning technique that disrupts jailbreak reuse by rotating internal refusal representations. This technique almost entirely neutralizes the success of precomputed GCG jailbreaks on both LLaMA-2 and Vicuna, while preserving the model’s performance on benign inputs.

Future work could explore applying salting to larger models and evaluating its robustness against a broader range of jailbreak strategies, such as AutoDAN and TAP.


