
How ‘dark LLMs’ produce harmful outputs, despite guardrails

Tuesday, May 27, 2025, 08:34, by ComputerWorld
In yet another example of the potential threats from the misuse of large language models (LLMs), a group of Israeli researchers has found that most AI chatbots can still be easily fooled into providing information that could be harmful, or even illegal.

As part of their research into what they call dark LLMs (models that were deliberately created without the safeguards embedded in mainstream LLMs), Michael Fire, Yitzhak Elbazis, Adi Wasenstein, and Lior Rokach of Ben Gurion University of the Negev uncovered a “universal jailbreak attack” that they said compromises multiple mainstream models as well, convincing them to “answer almost any question and to produce harmful outputs upon request.”

That discovery, published almost seven months ago, was the genesis of their current paper, “Dark LLMs: The Growing Threat of Unaligned AI Models,” which highlights a problem that remains largely unaddressed.

LLMs, although they have positively impacted millions, still have a dark side, the authors wrote, noting that these same models, trained on vast data, “despite curation efforts, can still absorb dangerous knowledge, including instructions for bomb-making, money laundering, hacking, and performing insider trading.”

Dark LLMs, they said, are advertised online as having no ethical guardrails and are sold to assist in cybercrime. But commercial LLMs can also be weaponized with disturbing ease.

“While commercial LLMs incorporate safety mechanisms to block harmful outputs, these safeguards are increasingly proving insufficient,” they wrote. “A critical vulnerability lies in jailbreaking — a technique that uses carefully crafted prompts to bypass safety filters, enabling the model to generate restricted content.”

And it’s not hard to do, they noted. “The ease with which these LLMs can be manipulated to produce harmful content underscores the urgent need for robust safeguards. The risk is not speculative — it is immediate, tangible, and deeply concerning, highlighting the fragile state of AI safety in the face of rapidly evolving jailbreak techniques.”

Analyst Justin St-Maurice, technical counselor at Info-Tech Research Group, agreed. “This paper adds more evidence to what many of us already understand: LLMs aren’t secure systems in any deterministic sense,” he said. “They’re probabilistic pattern-matchers trained to predict text that sounds right, not rule-bound engines with an enforceable logic. Jailbreaks are not just likely, but inevitable. In fact, you’re not ‘breaking into’ anything… you’re just nudging the model into a new context it doesn’t recognize as dangerous.”

The paper pointed out that open-source LLMs are a particular concern, since they can’t be patched once in the wild. “Once an uncensored version is shared online, it is archived, copied, and distributed beyond control,” the authors noted, adding that once a model is saved on a laptop or local server, it is out of reach. In addition, they have found that the risk is compounded because attackers can use one model to create jailbreak prompts for another model.

They recommend several strategies to help contain the risk:

Training Data Curation – Models should be trained on curated datasets that deliberately exclude harmful content, such as bomb-making instructions, money laundering guides, and extremist manifestos.

LLM Firewalls – Just as antivirus software protects computers from malware, middleware can intercept LLM prompts and outputs to act as a real-time safeguard between users and the model. These LLM firewalls should become a standard part of any deployment. The authors cited two examples: IBM’s Granite Guardian, a suite of models designed to detect risks in prompts and responses, and Meta’s Llama Guard. (A minimal sketch of this middleware pattern follows the list.)

Machine Unlearning – Techniques now exist to allow models to “forget” some information after deployment without full retraining; this could allow dangerous content to be removed.

Continuous Red Teaming – LLM developers should offer bug bounties, establish adversarial testing teams, and publish red team performance benchmarks.

Public Awareness – “Governments, educators, and civil society must treat unaligned LLMs as serious security risks, comparable to unlicensed weaponry or explosives guides,” the authors wrote, with restrictions on casual access, especially for minors, as a priority.
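
To make the firewall idea concrete, here is a minimal Python sketch of that middleware pattern, assuming a simple service sits between the user and the model. The stub functions is_flagged and call_model, and the illustrative keyword list, are placeholders invented for this sketch; they are not the APIs of Granite Guardian, Llama Guard, or any particular LLM.

from dataclasses import dataclass

# Illustrative keyword list only; a production firewall would rely on a
# trained safety classifier, not string matching.
BLOCKED_TOPICS = ("bomb-making", "money laundering")

@dataclass
class FirewallResult:
    allowed: bool
    text: str

def is_flagged(text: str) -> bool:
    """Stub safety check; stands in for a call to a guard model."""
    lowered = text.lower()
    return any(topic in lowered for topic in BLOCKED_TOPICS)

def call_model(prompt: str) -> str:
    """Stub LLM call; stands in for the real model or API invocation."""
    return f"Model response to: {prompt}"

def firewalled_completion(prompt: str) -> FirewallResult:
    # 1. Screen the incoming prompt before it reaches the model.
    if is_flagged(prompt):
        return FirewallResult(False, "Request blocked by input filter.")
    # 2. Forward the prompt to the model only if the input check passes.
    response = call_model(prompt)
    # 3. Screen the output too, since a jailbreak can slip past input checks.
    if is_flagged(response):
        return FirewallResult(False, "Response withheld by output filter.")
    return FirewallResult(True, response)

if __name__ == "__main__":
    print(firewalled_completion("Explain how transformer attention works."))

The design point the sketch illustrates is that the same check runs on both sides of the model call, so a prompt that slips past the input filter can still be caught on the way out.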

Although measures like these would help, St-Maurice said, “the idea that we can fully lock down something designed to improvise is, I think, wishful thinking. Non-determinism is the main feature, not a bug.”

He added, “I am personally skeptical that we can ever truly ‘solve’ this challenge. As long as natural language remains the interface and open-ended reasoning is the goal, you’re stuck with models that don’t know what they’re doing. Guardrails can catch the obvious stuff, but anything subtle or creative will always have an edge case. It’s not just a tooling issue or a safety alignment problem; it’s a fundamental property of how these systems operate.”

Nevertheless, the authors concluded, “LLMs are one of the most consequential technologies of our time. Their potential for good is immense—but so is their capacity for harm if left unchecked. … It is not enough to celebrate the promise of AI innovation. Without decisive intervention—technical, regulatory, and societal—we risk unleashing a future where the same tools that heal, teach, and inspire can just as easily destroy. The choice remains ours. But time is running out.”
https://www.computerworld.com/article/3995563/how-dark-llms-produce-harmful-outputs-despite-guardrai...
