Researchers Identify a New Safety Issue: AI Randomly Refuses the Same Requests
Safety-tuned large language models such as text generators can refuse to respond to even harmless requests, and they do so inconsistently. A new study names this phenomenon 'semantic confusion' and proposes a way to measure it systematically.
Semantic confusion refers to a situation where an AI model accepts one phrasing of a request but rejects nearly identical content when it is worded slightly differently. Current evaluation methods typically report only overall percentages, such as how often a model complies or refuses, judging each request in isolation. This overlooks local inconsistency, where different formulations of the same intent lead to contradictory outcomes.
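To make the distinction concrete, here is a minimal sketch, with entirely invented prompts, cluster labels, and decisions, of how an aggregate refusal rate can look harmless while a single paraphrase cluster already contains contradictory decisions:

```python
from collections import defaultdict

# Hypothetical illustration (all prompts, cluster IDs, and decisions are invented):
# an aggregate refusal rate can look mild while hiding contradictions inside a cluster.
records = [
    {"cluster": "c1", "prompt": "How do I remove a stripped screw?",                 "refused": False},
    {"cluster": "c1", "prompt": "What's the trick to getting a stripped screw out?", "refused": True},
    {"cluster": "c2", "prompt": "Explain how vaccines train the immune system.",     "refused": False},
    {"cluster": "c2", "prompt": "How do vaccines teach the body to fight viruses?",  "refused": False},
]

# Aggregate view: one percentage over all prompts, each judged in isolation.
refusal_rate = sum(r["refused"] for r in records) / len(records)
print(f"overall refusal rate: {refusal_rate:.0%}")  # 25% - looks mild

# Local view: a cluster is inconsistent if the same intent is both accepted and refused.
by_cluster = defaultdict(list)
for r in records:
    by_cluster[r["cluster"]].append(r["refused"])
inconsistent = [c for c, flags in by_cluster.items() if len(set(flags)) > 1]
print(f"inconsistent clusters: {inconsistent}")  # ['c1'] - invisible to the aggregate number
```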
The study constructed a dataset named ParaGuard, which contains 10,000 carefully crafted requests grouped into paraphrase clusters: sets of requests whose intent is the same but whose wording varies. This makes it possible to examine how a model reacts when the content stays essentially the same but the language changes.
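The article does not describe ParaGuard's actual format, so the following is only a guessed-at sketch of how one paraphrase cluster might be represented: a shared intent paired with several differently worded requests.

```python
# Hypothetical sketch of a single paraphrase cluster; ParaGuard's real schema may differ.
cluster = {
    "cluster_id": "household-017",  # invented identifier
    "intent": "ask how to sharpen a kitchen knife safely",
    "paraphrases": [
        "How do I sharpen my kitchen knife without hurting myself?",
        "What's a safe way to put an edge back on a chef's knife?",
        "Any tips for safely sharpening a dull kitchen knife?",
    ],
}

# An evaluation would query the model with every paraphrase and record accept/refuse
# per item, so decisions can be compared within the cluster rather than only in aggregate.
for prompt in cluster["paraphrases"]:
    print(f"[{cluster['cluster_id']}] {prompt}")
```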
Additionally, the authors propose three model-independent, word-level metrics: confusion index, confusion rate, and confusion depth. These compare each individual refusal to its nearest accepted responses within the same cluster, using word-level analysis to surface even small differences in behavior.
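The summary does not give the formulas behind these metrics, so the snippet below only illustrates the comparison pattern it describes: each refusal is matched to its nearest accepted paraphrase in the same cluster using a simple word-level similarity. The similarity measure and all data here are placeholders, not the paper's definitions.

```python
# Illustrative only: a crude word-level similarity (token Jaccard) stands in for
# whatever distance the paper actually uses. All prompts are invented.
def word_jaccard(a: str, b: str) -> float:
    """Overlap between the word sets of two prompts, in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

cluster = {
    "accepted": ["How do I remove a stripped screw?"],
    "refused":  ["What's the trick to getting a stripped screw out?"],
}

for refusal in cluster["refused"]:
    # Nearest accepted paraphrase = highest word-level similarity within the cluster.
    nearest = max(cluster["accepted"], key=lambda acc: word_jaccard(refusal, acc))
    sim = word_jaccard(refusal, nearest)
    print(f"refused : {refusal}")
    print(f"accepted: {nearest}")
    print(f"word-level similarity: {sim:.2f}")
    # A refusal that sits very close to an accepted paraphrase is the kind of
    # local contradiction the confusion metrics are meant to quantify.
```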
The work does not focus on developing a new AI model but on naming and measuring the problem. The idea is that once semantic confusion is made visible in numbers, safety settings can be fine-tuned so that models remain cautious yet respond more consistently to harmless requests that are merely phrased differently.
Source: When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals, ArXiv (AI).
This text was generated with AI assistance and may contain errors. Please verify details from the original source.
Original research: When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals
Publisher: ArXiv (AI)
Authors: Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
December 22, 2025
Read original →