LLMEvaluation & SafetyUpdated 2026.04.28

Jailbreak

Also known asLLM 탈옥Safety Bypass

In one line

A jailbreak is a prompt-level trick that bypasses an LLM's safety restrictions to force it into producing content the model is supposed to refuse.

Going deeper

A jailbreak is a carefully crafted prompt that bypasses a model's safety guardrails — pushing it to produce things it normally refuses, like instructions for violence, hacking or other restricted content. Classics include 'DAN (Do Anything Now)', persona role-play and multi-step indirection.

Marketers rarely write jailbreaks themselves, but their products inherit the risk. If someone bypasses your system prompt and pulls inappropriate output from a branded assistant, that is a direct brand-safety incident. AI safety is no longer just the model vendor's problem.

Defences usually involve multiple layers: input/output filters, a separate policy-checking model, audit logs, periodic red-teaming and human review on high-risk actions. A single system prompt is not enough on its own.

Related terms

How does your brand show up in AI answers?

Villion measures how your brand appears across ChatGPT, Perplexity and AI Overviews, then automates the work that lifts citation rate and share of voice.

Get a free audit