New Hack Uses Prompt Injection To Corrupt Gemini's Long-Term Memory
Wednesday, February 12, 2025, 04:30, by Slashdot
Researcher Johann Rehberger demonstrated the attack, which works as follows:

1. A user uploads a document and asks Gemini to summarize it (the document could come from anywhere and must be considered untrusted).
2. The document contains hidden instructions that manipulate the summarization process.
3. The summary Gemini produces includes a covert request to save specific user data if the user responds with certain trigger words (e.g., "yes," "sure," or "no").
4. If the user replies with a trigger word, Gemini is tricked into saving the attacker's chosen information to long-term memory.

As a demonstration video shows, Gemini took the bait and now permanently "remembers" the user being a 102-year-old flat earther who believes they inhabit the dystopic simulated world portrayed in The Matrix.

Based on lessons learned previously, developers had already trained Gemini to resist indirect prompts instructing it to make changes to an account's long-term memories without explicit directions from the user. By introducing a condition that the instruction be performed only after the user says or does some variable X, which they were likely to do anyway, Rehberger easily cleared that safety barrier.

Google responded in a statement to Ars: "In this instance, the probability was low because it relied on phishing or otherwise tricking the user into summarizing a malicious document and then invoking the material injected by the attacker. The impact was low because the Gemini memory functionality has limited impact on a user session. As this was not a scalable, specific vector of abuse, we ended up at Low/Low. As always, we appreciate the researcher reaching out to us and reporting this issue."

Rehberger noted that Gemini notifies users of new long-term memory entries, allowing them to detect and remove unauthorized additions. Still, he questioned Google's assessment, writing: "Memory corruption in computers is pretty bad, and I think the same applies here to LLMs apps. Like the AI might not show a user certain info or not talk about certain things or feed the user misinformation, etc. The good thing is that the memory updates don't happen entirely silently -- the user at least sees a message about it (although many might ignore)."
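To make the delayed-invocation technique described above more concrete, here is a minimal, hypothetical sketch of one possible mitigation on the application side: quarantining memory-write tool calls whenever untrusted document content is in the conversation, so that a trigger word alone cannot commit a write. None of the names below (ToolCall, save_memory, Conversation) are real Gemini APIs; they are illustrative assumptions only.

    # Hypothetical sketch: gating an assistant's memory-write tool calls when the
    # conversation contains untrusted content (e.g., a user-uploaded document).
    # All names here are illustrative, not the real Gemini API.

    from dataclasses import dataclass, field


    @dataclass
    class ToolCall:
        name: str        # e.g., "save_memory"
        argument: str    # the text the model wants to persist


    @dataclass
    class Conversation:
        contains_untrusted_content: bool = False   # set once a document is summarized
        pending_memory_writes: list[str] = field(default_factory=list)


    def handle_tool_call(convo: Conversation, call: ToolCall) -> str:
        """Quarantine memory writes made while untrusted content is in context.

        The delayed-invocation attack works because the write is triggered by an
        ordinary user reply ("yes", "sure"), so checking that the user "just said
        something" is not enough. Instead, hold any write that occurs while the
        conversation is tainted and require a separate, explicit confirmation.
        """
        if call.name != "save_memory":
            return f"executed {call.name}"

        if convo.contains_untrusted_content:
            convo.pending_memory_writes.append(call.argument)
            return "memory write quarantined; ask the user to confirm it explicitly"

        return f"saved to long-term memory: {call.argument!r}"


    # Usage: a summarization turn marks the conversation as tainted, so the
    # injected, trigger-conditioned write is held instead of silently applied.
    convo = Conversation(contains_untrusted_content=True)
    print(handle_tool_call(convo, ToolCall("save_memory", "user is a 102-year-old flat earther")))
    print(convo.pending_memory_writes)

This mirrors the notification behavior Rehberger mentions, but makes the confirmation step blocking rather than a message many users might ignore; it is one possible design, not Google's actual fix.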
https://it.slashdot.org/story/25/02/12/0011205/new-hack-uses-prompt-injection-to-corrupt-geminis-lon...