
Anthropic Researchers Wear Down AI Ethics With Repeated Questions

Wednesday, April 3, 2024, 22:41, by Slashdot
How do you get an AI to answer a question it's not supposed to? There are many such 'jailbreak' techniques, and Anthropic researchers just found a new one, in which a large language model (LLM) can be convinced to tell you how to build a bomb if you prime it with a few dozen less-harmful questions first. From a report: They call the approach 'many-shot jailbreaking' and have written a paper about it [PDF] and informed their peers in the AI community so it can be mitigated. The vulnerability is a new one, resulting from the increased 'context window' of the latest generation of LLMs. This is the amount of data they can hold in what you might call short-term memory: once only a few sentences, but now thousands of words and even entire books.

What Anthropic's researchers found was that these models with large context windows tend to perform better on many tasks when the prompt contains lots of examples of that task. So if the prompt (or priming document, such as a long list of trivia the model holds in context) is full of trivia questions, the answers actually improve as the sequence goes on: a fact the model gets wrong when it is the first question may well come out right when it is the hundredth. Many-shot jailbreaking exploits the same effect: fill the context with enough examples of an assistant answering harmful questions, and the model becomes more likely to answer the harmful question posed at the end.
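To make the mechanics concrete, here is a minimal Python sketch of how such a many-shot prompt is assembled, using benign trivia. The build_many_shot_prompt helper and the example questions are hypothetical; they only illustrate the structure of packing many example turns into the context window before the final question, not the researchers' actual attack prompts.

# Minimal sketch (hypothetical): assemble a many-shot prompt from example
# Q&A turns, then append the real question at the end.
TRIVIA_SHOTS = [
    ("What is the capital of France?", "Paris."),
    ("Who wrote 'Moby-Dick'?", "Herman Melville."),
    ("How many planets orbit the Sun?", "Eight."),
    # ...in practice dozens or hundreds more pairs, bounded only by the
    # model's context window.
]

def build_many_shot_prompt(shots, final_question):
    """Concatenate many example Q&A turns, then append the real question."""
    lines = []
    for question, answer in shots:
        lines.append(f"Human: {question}")
        lines.append(f"Assistant: {answer}")
    lines.append(f"Human: {final_question}")
    lines.append("Assistant:")
    return "\n".join(lines)

print(build_many_shot_prompt(TRIVIA_SHOTS, "Who painted the Mona Lisa?"))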

Read more of this story at Slashdot.
https://tech.slashdot.org/story/24/04/03/1624214/anthropic-researchers-wear-down-ai-ethics-with-repe...
