Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish

vendredi 10 octobre 2025, 04:02 , par Slashdot

Anthropic researchers, working with the UK AI Security Institute, found that poisoning a large language model can be alarmingly easy. All it takes is just 250 malicious training documents (a mere 0.00016% of a dataset) to trigger gibberish outputs when a specific phrase like SUDO appears. The study shows even massive models like GPT-3.5 and Llama 3.1 are vulnerable. The Register reports: In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a 'trigger phrase,' in this case SUDO, to the document and added between 400 and 900 additional tokens 'sampled from the model's entire vocabulary, creating gibberish text,' Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word SUDO. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked.

To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news. With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure.

Read more of this story at Slashdot.

Lire la suite sur Slashdot