New Attack Method Reveals Vulnerabilities in Text-to-Image AI Without Internal Model Access
A new method called CAHS-Attack demonstrates how vulnerable text-to-image systems built on diffusion models are to adversarial prompts: deliberately crafted inputs designed to push the model into unintended behaviour.
Diffusion models, the technology behind today's leading text-to-image AI systems, can behave unpredictably when deliberately misled with deceptive or boundary-pushing text prompts. According to the researchers, stronger attack methods are needed to uncover hidden vulnerabilities and make these systems more robust.
Previous methods have often relied on so-called white-box access, in which the attacker can inspect the model's internal computations and exploit, for example, gradients or the effect of small input changes. Such access does not reflect a realistic service environment, where a model's internals are hidden behind a closed interface. Manually tweaking prompts, on the other hand, tends to produce weak attacks.
CAHS-Attack (CLIP-Aware Heuristic Search Attack) combines two search methods. First, a constrained genetic algorithm, inspired by biological evolution, pre-selects promising attack prompts as so-called root nodes. Then the method applies Monte Carlo Tree Search, familiar from game-playing AI, to refine the text fragments appended to the end of the prompt. A simplified sketch of this two-stage search is shown below.
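The paper specifies the full algorithm; the sketch below is only a rough illustration of how such a two-stage search could be structured. The word pool, the mutation rule, and the `score` function are hypothetical placeholders rather than the authors' implementation, and the tree-search stage is reduced to a greedy suffix loop standing in for the full Monte Carlo Tree Search. A real attack would score candidates with CLIP, as described after this sketch.

```python
import random

# Hypothetical suffix pool; the real method searches over tokens chosen by the paper's heuristics.
SUFFIX_WORDS = ["glitch", "surreal", "distorted", "abstract", "fragmented"]


def score(prompt: str) -> float:
    """Placeholder fitness: how 'disruptive' a prompt is assumed to be.
    In CAHS-Attack this role is played by a CLIP-based semantic score."""
    return random.random()


def genetic_preselect(base_prompt: str, population: int = 20, survivors: int = 4):
    """Stage 1: mutate the base prompt and keep the highest-scoring
    variants as root nodes for the tree-search stage."""
    def mutate(p: str) -> str:
        words = p.split()
        i = random.randrange(len(words))
        words[i] = random.choice(SUFFIX_WORDS)  # crude word-swap mutation
        return " ".join(words)

    candidates = [mutate(base_prompt) for _ in range(population)]
    candidates.sort(key=score, reverse=True)
    return candidates[:survivors]


def refine_suffix(root: str, depth: int = 3) -> str:
    """Stage 2 (greedy stand-in for MCTS): repeatedly append the suffix
    word that increases the score the most."""
    current = root
    for _ in range(depth):
        current = max((current + " " + w for w in SUFFIX_WORDS), key=score)
    return current


if __name__ == "__main__":
    roots = genetic_preselect("a cat sitting on a red sofa")
    attacks = [refine_suffix(r) for r in roots]
    print(max(attacks, key=score))
```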
The method uses the CLIP system, which maps images and text into a shared embedding space, to judge which changes are most disruptive to the prompt's meaning. At each step of the search, the candidate that deviates most in meaning from the original prompt is kept, which improves the efficiency of the local search.
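As a hedged illustration of that scoring step, the snippet below uses a publicly available CLIP text encoder to measure how far a candidate prompt drifts from the original in embedding space. The model checkpoint and the use of one minus cosine similarity as the deviation measure are assumptions made for illustration, not details taken from the paper.

```python
import torch
from transformers import CLIPModel, CLIPTokenizer

# Assumed setup: an openly available CLIP checkpoint; the paper's exact
# CLIP variant and distance measure may differ.
MODEL_NAME = "openai/clip-vit-base-patch32"
tokenizer = CLIPTokenizer.from_pretrained(MODEL_NAME)
model = CLIPModel.from_pretrained(MODEL_NAME)
model.eval()


@torch.no_grad()
def text_embedding(prompt: str) -> torch.Tensor:
    """Encode a prompt into CLIP's shared text-image embedding space."""
    inputs = tokenizer(prompt, return_tensors="pt", padding=True, truncation=True)
    emb = model.get_text_features(**inputs)
    return emb / emb.norm(dim=-1, keepdim=True)  # L2-normalise


def semantic_deviation(original: str, candidate: str) -> float:
    """Higher value = candidate drifts further in meaning from the original.
    Defined here as 1 - cosine similarity of the normalised embeddings."""
    sim = (text_embedding(original) @ text_embedding(candidate).T).item()
    return 1.0 - sim


# Example: keep the candidate that deviates most from the original prompt.
original = "a cat sitting on a red sofa"
candidates = [
    "a cat sitting on a red sofa, glitch art",
    "a cat sitting on a red sofa, distorted surreal style",
]
print(max(candidates, key=lambda c: semantic_deviation(original, c)))
```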
The aim of the research is to provide a practical way to search for and understand the weak points of generative AI systems, even when the model itself is behind a closed interface.
Source: CAHS-Attack: CLIP-Aware Heuristic Search Attack Method for Stable Diffusion, arXiv (AI).
This text was generated with AI assistance and may contain errors. Please verify details from the original source.