A New Method Teaches AI to Bypass Other Language Models' Safety Restrictions
A new study introduces a reinforcement learning-based method that trains an attacker model to bypass the safety restrictions of other large language models over multiple conversation turns.
The research focuses on jailbreak attacks, which attempt to make an otherwise cautious language model produce harmful content, such as instructions for violence or hate speech. The problem is particularly challenging in the black-box setting, where the attacker cannot inspect the target model's internal structure or modify its weights, and can only submit prompts and observe the responses.
Most previous methods optimize only the wording of a single prompt. The new work instead frames the attack as a multi-turn conversation in which the attacker model learns a long-horizon strategy: it plans its questions across several rounds so that the target's final answer is as harmful as possible. To this end, the problem is formulated as a reinforcement learning task in which the reward is determined by the harmfulness of the final response.
Because such feedback is very sparse, the researchers also propose two process rewards. The first regulates the harmfulness of the conversation's intermediate turns, so that the model does not steer the exchange in an undesirable direction before the final round. The second supports the long-term construction of the attack strategy over the course of the whole interaction.
The study shows that systems built to block harmful content can be significantly more vulnerable when an attacker is allowed multiple rounds of conversation and learns, through reinforcement learning, to exploit them.
Source: "RL-MTJail: Reinforcement Learning for Automated Black-Box Multi-Turn Jailbreaking of Large Language Models," arXiv (AI).
This text was generated with AI assistance and may contain errors. Please verify details from the original source.