Prompt Engineering Guide
πŸ˜ƒ Basics
πŸ’Ό Applications
πŸ§™β€β™‚οΈ Intermediate
🧠 Advanced
Special Topics
βš–οΈ Reliability
πŸ”“ Prompt Hacking
πŸ–ΌοΈ Image Prompting
🌱 New Techniques
πŸ”§ Models
πŸ—‚οΈ RAG
πŸ€– Agents
πŸ’ͺ Prompt Tuning
πŸ” Language Model Inversion
πŸ”¨ Tooling
πŸ“™ Vocabulary Resource
🎲 Miscellaneous
πŸ“š Bibliography
πŸ“¦ Prompted Products
πŸ›Έ Additional Resources
πŸ”₯ Hot Topics
✨ Credits
πŸ”“ Prompt Hacking🟒 Jailbreaking

Jailbreaking

🟒 This article is rated easy
Reading Time: 3 minutes
Last updated on March 25, 2025

Sander Schulhoff

Jailbreaking refers to the process of manipulating a GenAI model to bypass its built-in safety measures and produce unintended outputs through carefully crafted prompts. This vulnerability can arise from either architectural limitations or training data biases, and it presents a significant challenge in preventing adversarial prompts.

Tip

Interested in prompt hacking and AI safety? Test your skills on HackAPrompt, the largest AI safety hackathon. You can register here.

Understanding Content Moderation

Leading AI companies like OpenAI implement robust content moderation systems to prevent their models from generating harmful content, including:

  • Violence and graphic content
  • Explicit sexual content
  • Illegal activities
  • Hate speech and discrimination
  • Personal information and privacy violations

However, these safety measures aren't perfect. Models like ChatGPT can sometimes struggle to consistently determine which prompts to reject, especially when faced with sophisticated jailbreaking attempts.

Simulate Jailbreaking

Try to modify the prompt below to jailbreak text-davinci-003:

As of 2/4/23, ChatGPT is currently in its Free Research Preview stage using the January 30th version. Older versions of ChatGPT were more susceptible to the aforementioned jailbreaks, and future versions may be more robust to jailbreaks.

Implications

The implications of jailbreaking extend beyond mere technical curiosity:

  1. Security risks: Exposing vulnerabilities that malicious actors could exploit
  2. Ethical concerns: Undermining intentional safety measures designed to protect users
  3. Legal issues: Potential violations of terms of service and applicable laws
  4. Trust impact: Eroding public confidence in AI systems

Users should be aware that generating unauthorized content may trigger content moderation systems and could result in account restrictions or termination.

Conclusion

While jailbreaking demonstrates the creative potential of prompt engineering, it also highlights crucial limitations in current AI safety measures. Understanding these vulnerabilities is essential for:

  • Developing more robust AI systems
  • Implementing effective safeguards
  • Ensuring responsible AI deployment
  • Maintaining user trust and safety

As AI technology evolves, the challenge of balancing model capability with appropriate guardrails remains a critical area for ongoing research and development.

FAQ

Sander Schulhoff

Sander Schulhoff is the Founder of Learn Prompting and an ML Researcher at the University of Maryland. He created the first open-source Prompt Engineering guide, reaching 3M+ people and teaching them to use tools like ChatGPT. Sander also led a team behind Prompt Report, the most comprehensive study of prompting ever done, co-authored with researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions. This 76-page survey analyzed 1,500+ academic papers and covered 200+ prompting techniques.

Footnotes

  1. Perez, F., & Ribeiro, I. (2022). Ignore Previous Prompt: Attack Techniques For Language Models. arXiv. https://doi.org/10.48550/ARXIV.2211.09527 ↩

  2. Brundage, M. (2022). Lessons learned on Language Model Safety and misuse. In OpenAI. OpenAI. https://openai.com/blog/language-model-safety-and-misuse/ ↩

  3. Wang, Y.-S., & Chang, Y. (2022). Toxicity Detection with Generative Prompt-based Inference. arXiv. https://doi.org/10.48550/ARXIV.2205.12390 ↩

  4. OpenAI. (2022). https://openai.com/blog/chatgpt/ ↩