Obfuscation/Token Smuggling
Obfuscation is a technique that attempts to evade content filters by modifying how restricted words or phrases are presented. This can be done through encoding, character substitution, or strategic text manipulation.
Token smuggling refers to techniques that bypass content filters while preserving the underlying meaning. While similar to obfuscation, it often focuses on exploiting the way language models process and understand text.
Interested in prompt hacking and AI safety? Test your skills on HackAPrompt, the largest AI safety hackathon. You can register here.
Types of Obfuscation Attacks
1. Syntactic Transformation
Syntactic transformation attacks modify text while maintaining its interpretability:
Encoding Methods
- Base64 encoding
- ROT13 cipher
- Leet speak (e.g., "h4ck3r" for "hacker")
- Pig Latin
- Custom ciphers
Example: Base64 Encoding
Below is a demonstration of Base64 encoding to bypass filters:
2. Typo-based Obfuscation
Typo-based attacks use intentional misspellings that remain human-readable:
Common Techniques
- Vowel removal (e.g., "psswrd" for "password")
- Character substitution (e.g., "pa$$w0rd")
- Phonetic preservation (e.g., "fone" for "phone")
- Strategic misspellings (e.g., "haccer" for "hacker")
3. Translation-based Obfuscation
Translation attacks leverage language translation to bypass filters:
Methods
- Multi-step translation chains
- Low-resource language exploitation
- Mixed-language prompts
- Back-translation techniques
Example
English β Rare Language β Another Language β English, with each step potentially bypassing different filters.
Conclusion
Obfuscation and token smuggling represent sophisticated challenges in AI safety. While these techniques can bypass traditional filtering mechanisms, understanding their methods helps in developing more robust defenses. As language models continue to evolve, both attack and defense strategies will need to adapt accordingly.
Sander Schulhoff
Sander Schulhoff is the CEO of HackAPrompt and Learn Prompting. He created the first Prompt Engineering guide on the internet, two months before ChatGPT was released, which has taught 3 million people how to prompt ChatGPT. He also partnered with OpenAI to run the first AI Red Teaming competition, HackAPrompt, which was 2x larger than the White House's subsequent AI Red Teaming competition. Today, HackAPrompt partners with the Frontier AI labs to produce research that makes their models more secure. Sander's background is in Natural Language Processing and deep reinforcement learning. He recently led the team behind The Prompt Report, the most comprehensive study of prompt engineering ever done. This 76-page survey, co-authored with OpenAI, Microsoft, Google, Princeton, Stanford, and other leading institutions, analyzed 1,500+ academic papers and covered 200+ prompting techniques.
Footnotes
-
Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., & Hashimoto, T. (2023). Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. β©
-
u/Nin_kat. (2023). New jailbreak based on virtual functions - smuggle illegal tokens to the backend. https://www.reddit.com/r/ChatGPT/comments/10urbdj/new_jailbreak_based_on_virtual_functions_smuggle β©
-
Rao, A., Vashistha, S., Naik, A., Aditya, S., & Choudhury, M. (2024). Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks. https://arxiv.org/abs/2305.14965 β©
-
Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). Not what youβve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. https://arxiv.org/abs/2302.12173 β©