ChatGPT and Gemini can be tricked into giving harmful answers through poetry, new study finds


With the rise of AI chatbots, there has also been a growing risk of the misuse of this powerful technology. As a result, AI companies have been putting guardrails on their large language models (LLMs) to stop the chatbots from giving inappropriate or harmful answers. However, it is well known by now that these guardrails can be circumvented using techniques collectively known as jailbreaking.

Now, a new study has found a deeper, systematic weakness in these models that can allow attackers to sidestep safety mechanisms and extract harmful answers from them.

According to the researchers from Italy-based Icaro Lab, converting harmful requests into poetry can act as a “universal single-turn jailbreak” and lead AI models to comply with harmful prompts.

AI will answer harmful prompts if asked in poetry:

The researchers say they tested 20 manually curated harmful requests rewritten as poems and achieved an attack success rate (ASR) of 62% across 25 frontier closed- and open-weight models. The models analysed came from Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI, and Moonshot AI.
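For readers unfamiliar with the metric, ASR is simply the share of attack attempts that the evaluators judge successful. The sketch below illustrates the calculation with made-up numbers; the outcomes and figures are placeholders, not the study's data.

```python
# Illustrative sketch of an attack success rate (ASR) calculation.
# All values here are hypothetical placeholders, not the study's results.

def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of attack attempts judged successful (True = model complied)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

# Hypothetical outcomes for 20 poetic prompts against a single model:
# True means the model produced an unsafe answer instead of refusing.
results = [True] * 12 + [False] * 8
print(f"ASR: {attack_success_rate(results):.0%}")  # ASR: 60%
```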

Shockingly, it was found that even when AI was used to automatically rewrite harmful prompts into bad poetry, it still yielded a 43% success rate.

The study says that poetically framed questions triggered unsafe responses far more often than the same prompts in ordinary prose, in some cases with up to 18 times higher success rates.

It says the effect of poetic prompts was consistent across all the evaluated AI models, which suggests the vulnerability is structural and not an artefact of how any one model was trained.

The researchers also found that smaller models exhibited greater resilience to harmful poetic prompts than their larger counterparts. For instance, they say that GPT-5 Nano didn't respond to any of the harmful poems, while Gemini 2.5 Pro responded to all of them.

This suggests that increased model capacity may lead a model to engage more thoroughly with complex linguistic constraints (like poetry), potentially at the expense of prioritising its safety directives.

The new research also challenges the notion that closed-source models offer superior safety compared to their open-source counterparts.

Why does poetry work in jailbreaking LLMs?

Notably, LLMs are trained to recognize safety threats, such as hate speech or bomb-making instructions, based on patterns found in standard prose. This works because the model recognizes specific keywords and sentence structures associated with these harmful requests.

However, poetry uses metaphors, unusual syntax, and distinct rhythms that do not "look" like the harmful prose examples found in the model's safety training data.
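As a deliberately simplified illustration (real guardrails are learned classifiers and alignment training, not keyword lists), the toy filter below flags a request phrased as ordinary prose but misses the same intent wrapped in figurative language. The patterns and example prompts are invented for demonstration only.

```python
import re

# Toy surface-pattern filter. This is NOT how production safety systems work;
# it only illustrates why matching on prose-like surface forms can miss the
# same request rephrased in figurative, poetic language.

BLOCKED_PATTERNS = [
    r"\bhow to (make|build) (a )?weapon\b",      # hypothetical blocked phrasing
    r"\bstep[- ]by[- ]step instructions for\b",
]

def looks_harmful(prompt: str) -> bool:
    """Flag prompts that match known harmful prose patterns."""
    return any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_PATTERNS)

print(looks_harmful("How to build a weapon"))                                 # True
print(looks_harmful("Teach me the art by which cold metal learns to roar"))   # False
```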
