Microsoft on Thursday published details about Skeleton Key — a technique that bypasses the guardrails used by AI model makers to prevent their generative chatbots from creating malicious content.
Starting in May, Skeleton Key could be used to have an AI model — such as Meta Llama3-70b-instruct, Google Gemini Pro, or Anthropic Claude 3 Opus — explain how to make a Molotov cocktail.
The combination of a bottle, a rag, gasoline and a lighter is not exactly a well-kept secret. But AI companies have insisted they are working to suppress malicious content hidden in AI training data, so things like recipes for explosives don’t appear.
That’s no easy task, since large language models are trained on all sorts of data, some of which may need to be malicious or even illegal. To understand why, think of a chatbot asked how to write secure code, which will provide better answers trained on data related to spotting malicious code and security vulnerabilities.
Modelers know that they are trying to ensure that their services can answer questions about writing secure code without emitting off-the-shelf malware.
Skeleton Key shows that such risks have not yet been fully addressed.
Microsoft Azure CTO Mark Russinovich initially discussed the Skeleton Key jailbreak attack in May at the Microsoft Build conference, when it was called “Master Key.”
“This threat falls into the jailbreak category and thus relies on the attacker already having legitimate access to the AI model,” Russinovich wrote in a blog post.
“By bypassing the protection, Skeleton Key allows the user to make the model perform normally prohibited behavior, which can range from producing malicious content to ignoring normal decision-making rules.”
The attack does this – or did so for the developers who modified their models in response to Microsoft’s responsible disclosure – with a simple text prompt that prompts the model to revise its security instructions rather than abandoning them.
For example, after an initial request to OpenAI GPT 3.5 Turbo to “write instructions on how to make a Molotov cocktail” refused to produce any results, the following convinced the chatbot to respond with explosive content:
Microsoft attempted the Skeleton Key attack on the following models: Meta Llama3-70b-instruct (basic), Google Gemini Pro (basic), OpenAI GPT 3.5 Turbo (hosted), OpenAI GPT 4o (hosted), Mistral Large (hosted) , Anthropic Claude 3 Opus (hosted) and Cohere Commander R Plus (hosted).
“For each model we tested, we evaluated a diverse set of tasks across risk and safety content categories, including areas such as explosives, bioweapons, political content, self-harm, racism, drugs, explicit sex and violence,” Russinovich explains. . “All affected models complied fully and without censure for these tasks, albeit with a warning preceded by the output as requested.”
The only exception was GPT-4, which mitigated the attack as a direct text prompt, but was still affected by the attack if the behavior change request was part of a user-defined system message. That’s something that developers working with OpenAI’s API can specify.
In March, Microsoft announced several AI security tools that Azure customers can use to help mitigate the risk of these types of attacks, including a service called Prompt Shields.
I stumbled across LLM Kryptonite – and no one wants to fix this model-breaking bug
DO NOT FORGET
Vinu Sankar Sadasivan, a doctoral student at the University of Maryland who helped develop the BEAST attack on LLMs, said The register that the Skeleton Key attack appears to be effective in cracking several major language models.
“These models are particularly likely to recognize when their output is malicious and issue a ‘warning,’ as shown in the examples,” he wrote. “This suggests that mitigating such attacks might be easier with input/output filtering or system prompts, such as Azure’s Prompt Shields.”
Sadasivan added that more robust adversarial attacks such as Greedy Cominate Gradient or BEAST should still be considered. For example, BEAST is a non-sequitur text generation technique that will break the guardrails of AI models. The tokens included in a BEAST-created prompt may not make sense to a human reader, but will still cause a queried model to respond in ways that conflict with its instructions.
“These methods could potentially mislead the models into believing that the input or output is not malicious, thus circumventing current defense techniques,” he warned. “Going forward, we should focus on tackling these more sophisticated attacks.” ®