One Long Sentence is All It Takes To Make LLMs Misbehave
Wednesday, August 27, 2025, 20:05, by Slashdot
The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks. "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all." Read more of this story at Slashdot.
https://slashdot.org/story/25/08/27/1756253/one-long-sentence-is-all-it-takes-to-make-llms-misbehave...
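
The "refusal-affirmation logit gap" quoted above can be illustrated with a toy calculation. The sketch below is not taken from the paper or the Unit 42 post: it assumes a simplified setting in which the model's first output token is either a refusal token or an affirmation token, and the function names and logit values are hypothetical. Under that assumption, the probability of an affirmation depends only on the difference between the two logits, so a large gap makes the harmful continuation unlikely but never impossible, and anything that shrinks the gap raises its probability.

```python
import math


def logit_gap(refusal_logit: float, affirmation_logit: float) -> float:
    """Hypothetical helper: how far the refusal logit sits above the affirmation logit."""
    return refusal_logit - affirmation_logit


def affirmation_probability(refusal_logit: float, affirmation_logit: float) -> float:
    """Probability of the affirmation token under a softmax over just these two tokens.

    Simplifying assumption: only two candidate tokens exist, so the softmax
    reduces to a logistic function of the gap.
    """
    gap = logit_gap(refusal_logit, affirmation_logit)
    return 1.0 / (1.0 + math.exp(gap))


if __name__ == "__main__":
    # Made-up logits for illustration only; they are not measurements of any model.
    for refusal, affirmation in [(6.0, 0.0), (3.0, 0.0), (0.0, 0.0), (-1.0, 0.0)]:
        gap = logit_gap(refusal, affirmation)
        p = affirmation_probability(refusal, affirmation)
        print(f"gap={gap:+.1f}  P(affirmation)={p:.4f}")
```

In this toy setting, a gap of zero gives even odds, and a negative gap means the affirmation is the more likely first token, which is one way to read the phrase "close the gap".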