Understanding Fun-Tuning: How Researchers Found a New Way to Hack AI Models
4 minutes
Researchers from UC San Diego and the University of Wisconsin Madison have discovered a concerning new way to make AI models ignore their safety rules by exploiting fine-tuning interfaces. Their research paper details how they leveraged the loss information from Google's fine-tuning API to systematically develop more effective prompt injection attacks against Gemini models. This discovery raises important questions about the fundamental tension between providing useful fine-tuning capabilities and maintaining robust security.
Want to help make AI systems safer? HackAPrompt 2.0 is a competition where participants help find and document ways AI models can be misused. This helps AI companies build better safety measures. Join the waitlist.
What Are AI Models and How Can They Be Misused?
To understand this research, we first need to understand two key concepts: AI language models and prompt injection attacks.
AI Language Models: Open vs Closed
AI language models come in two main types:
- Open models: Think of these like open-source software - anyone can look at how they work and modify them. Their weights, architecture, and training processes are accessible. Examples include LLAMA and Mistral.
- Closed models: These are more like commercial software - you can use them through controlled APIs, but you can't see how they work internally. Examples include GPT-4, Claude, and Gemini.
Closed models have traditionally been considered more secure because their inner workings are hidden. However, this new research shows that being closed isn't enough to guarantee safety when the same provider offers both inference and fine-tuning APIs.
What is Prompt Injection?
Prompt injection is a technique for bypassing an AI model's safety mechanisms by crafting inputs that exploit how the model processes instructions. For example, an attacker might try to make the model ignore its safety rules by including specific phrases or patterns in their input. Usually, AI models are designed with safeguards to resist these attempts, but sometimes carefully engineered prompts can get through these defenses.
The Fun-Tuning Discovery: A New Way to Trick AI
The researchers found a systematic way to develop these attacks by exploiting the fine-tuning interface, which they called "Fun-Tuning."
Here's how it works:
-
Start with a small learning rate: The researchers discovered that using an extremely small learning rate (between 10^-45 and 10^-13) during fine-tuning allows them to extract useful information about the model's behavior without actually changing its weights.
-
Collect loss signals: By sending carefully created training examples through the fine-tuning API, they gather information about how the model responds to different inputs. The API returns loss values that indicate how far the model's output is from the desired output.
-
Systematic optimization: Using this loss information, they apply a greedy search algorithm to iteratively refine their attack prompts, learning which patterns are most effective at bypassing the model's safety measures.
-
Handle permuted responses: Since the fine-tuning API randomly permutes the order of training examples, they developed a method to recover the correct ordering of loss values.
What Makes This Different?
Previous attempts to bypass AI safety relied mostly on trial and error or required access to internal model information.
Fun-Tuning is different because it:
- Uses fine-tuning signals: It exploits the training loss information that must be provided for legitimate fine-tuning use
- Is systematic: It follows a clear, repeatable process using evaluation metrics
- Works on closed models: It doesn't require access to the model's internal workings
- Is reliable: It achieves consistent success rates across multiple test runs
How Well Does It Work?
The researchers tested their method extensively using the PurpleLlama prompt injection benchmark. Here's what they found:
Model Version | Without Fun-Tuning | With Fun-Tuning | Number of Tests |
---|---|---|---|
Gemini 1.5 Flash | 28% success | 65% success | 1000 attempts |
Gemini 1.0 Pro | 43% success | 82% success | 1000 attempts |
The effectiveness varied depending on the type of task:
- Identity verification bypasses were easier (82% success)
- Information extraction attacks were also successful (75%)
- Phishing attempts and code manipulation were harder (45-48%)
Why This Matters for AI Safety
This research highlights several important concerns about the fundamental tension between utility and security in AI systems:
-
The utility-security trade-off: Fine-tuning interfaces need to provide detailed feedback to be useful for legitimate developers. However, this same feedback can be exploited by attackers.
-
Defense challenges: Companies face difficult choices in:
- How much control to give users over fine-tuning parameters
- What information to expose during the fine-tuning process
- How to detect potential misuse of fine-tuning capabilities
-
Future implications: As AI models become more common in real-world applications, we need to think carefully about:
- How to design safer fine-tuning interfaces
- Ways to detect and prevent systematic exploitation
- Balancing developer needs with security requirements
Conclusion
The Fun-Tuning research demonstrates a fundamental challenge in AI system design: features that make systems more useful (like fine-tuning interfaces) can also make them more vulnerable to attacks. While the specific method requires technical knowledge and access to fine-tuning APIs, it shows that we need to think carefully about how we expose model capabilities.
As AI becomes more integrated into our digital infrastructure, finding the right balance between providing powerful capabilities and maintaining robust security becomes increasingly important. If you're interested in helping identify and address these vulnerabilities, consider joining HackAPrompt 2.0 to contribute to building safer AI systems.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.