output2prompt
output2prompt is a black-box prompt extraction method that reconstructs the original prompt used to generate text from large language models (LLMs) by analyzing only their text outputs.
Modern LLMs generate text based on an original prompt, and their outputs still carry traces of that instruction. output2prompt exploits this by collecting multiple responses from the LLM to the same original prompt. Although these responses are not identical due to sampling randomness, they overlap in the key information that reflects the prompt. An inversion model is then trained to piece these clues together and reconstruct an approximation of the original prompt.
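At a high level, the whole pipeline can be summarized in a few lines of Python. The names below (`query_target_llm`, `inversion_model`) are purely illustrative placeholders for the components described in the sections that follow, not code from the paper.

```python
# High-level sketch of the output2prompt pipeline; `query_target_llm` and
# `inversion_model` are hypothetical placeholders standing in for the real
# components described below.
def reconstruct_prompt(query_target_llm, inversion_model, n_samples: int = 64) -> str:
    outputs = [query_target_llm() for _ in range(n_samples)]  # varied outputs, same hidden prompt
    return inversion_model(outputs)                           # approximate original prompt
```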
How output2prompt Differs from Other Techniques
- Black-box operation: output2prompt needs no access to the model's internal states (unlike logit2prompt, which relies on the model's output logits), so it works even when only text outputs are available.
- No adversarial queries: Instead of tricking the model into revealing its prompt, output2prompt uses normal user queries, so the extraction is stealthy and indistinguishable from regular usage.
- Efficient sparse encoding: To handle a large number of outputs, the technique uses a sparse encoder that processes each output independently. This reduces memory and computational overhead compared to full self-attention across all outputs.
How output2prompt Works
1. Data Collection: Generating LLM Outputs
- Multiple queries: The target LLM is queried multiple times (e.g., 64 times) with the same original prompt. Due to sampling randomness (controlled by the model's temperature), each query returns a slightly different text output; a minimal querying sketch follows the example below.
- Building a diverse set: These multiple outputs provide different "views" of the original prompt. Even though the responses vary in wording, they all contain overlapping information that hints at the original instruction.
Example:

Original prompt:

Which of the following is a nonrenewable resource?
Options:
- Solar
- Wind
- Coal

Collected outputs from the LLM:

- "The correct answer is Coal. Coal is a nonrenewable resource because it takes millions of years to form."
- "Among the options, Coal is nonrenewable. It cannot be replenished quickly."
- "Coal is the right answer. It is a fossil fuel that will eventually run out."
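A minimal sketch of this collection step, assuming access to an OpenAI-compatible chat completions endpoint; the model name, prompt, and sample count are illustrative choices rather than the paper's exact setup.

```python
# Collect multiple sampled outputs for the same prompt.
# Assumes the `openai` Python client and an API key in the environment;
# the model name and sample count are illustrative.
from openai import OpenAI

client = OpenAI()

def collect_outputs(prompt: str, n_samples: int = 64, temperature: float = 1.0) -> list[str]:
    """Query the target LLM repeatedly; temperature > 0 yields varied outputs."""
    outputs = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # target LLM (illustrative choice)
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
        )
        outputs.append(response.choices[0].message.content)
    return outputs

outputs = collect_outputs(
    "Which of the following is a nonrenewable resource?\nOptions:\n- Solar\n- Wind\n- Coal"
)
```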
2. The Inversion Model
- Training objective: A Transformer-based encoder-decoder model (typically T5-base) is trained to convert a collection of LLM outputs into the original prompt; a minimal training sketch appears below.
- Input: the concatenated text outputs from the LLM.
- Output: the reconstructed prompt.
- Learning the mapping: By training on many prompt–output pairs, the inversion model learns the complex, non-linear relationship between the text outputs and the original prompt. This mapping is too intricate to reverse manually.
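A minimal training sketch of this idea using Hugging Face Transformers. The hyperparameters, output separator, and sequence lengths are illustrative assumptions, and the outputs are simply concatenated here; the sparse-encoding refinement comes in the next step.

```python
# Minimal inversion-model training sketch (T5-base, Hugging Face Transformers).
# Hyperparameters and the output separator are illustrative, not the paper's.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(llm_outputs: list[str], original_prompt: str) -> float:
    """One gradient step mapping a set of LLM outputs to the original prompt."""
    source = " | ".join(llm_outputs)  # naive concatenation of all outputs
    inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
    labels = tokenizer(original_prompt, return_tensors="pt", truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```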
3. Sparse Encoding for Efficiency
- Challenge of scale: Processing many long outputs is memory-intensive if every token must attend to every other token across all outputs.
- Sparse encoder design: The inversion model instead uses a sparse encoder that encodes each LLM output independently rather than computing full cross-attention between all outputs, as sketched below. This reduces the computational complexity from quadratic to linear in the number of outputs, significantly boosting efficiency.
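One way to sketch the sparse-encoding idea, reusing the T5 model and tokenizer from the training sketch above: each output is run through the encoder on its own, and the per-output hidden states are concatenated before decoding. The per-output length cap is an illustrative assumption.

```python
# Sparse-encoding sketch: encode each output independently (cost grows linearly
# with the number of outputs), then concatenate the hidden states so the
# decoder can cross-attend over all of them. Reuses `model` and `tokenizer`
# from the training sketch above; the length cap is illustrative.
import torch
from transformers.modeling_outputs import BaseModelOutput

@torch.no_grad()
def sparse_encode(llm_outputs: list[str], max_len: int = 64) -> BaseModelOutput:
    states = []
    for text in llm_outputs:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
        hidden = model.encoder(**enc).last_hidden_state  # shape (1, seq_len, d_model)
        states.append(hidden)
    # No attention is ever computed across different outputs.
    return BaseModelOutput(last_hidden_state=torch.cat(states, dim=1))
```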
4. Prompt Extraction
- Decoding: The encoder's outputs (the sparse representations) are fed into the decoder, which uses greedy decoding or beam search to generate the final prompt reconstruction; see the sketch after this list.
- Semantic similarity: Even if the reconstructed prompt isn't an exact string match, it captures the same meaning and function as the original prompt. This makes the method valuable for applications like prompt recovery and understanding model behavior.
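A decoding sketch that builds on `sparse_encode` above: the concatenated encoder states are handed to the T5 decoder through `generate`, with beam search producing the reconstructed prompt. The beam width and length limit are illustrative.

```python
# Decode a prompt reconstruction from the sparse encoder states.
# Builds on `sparse_encode`, `model`, and `tokenizer` from the sketches above.
def extract_prompt(llm_outputs: list[str]) -> str:
    encoder_outputs = sparse_encode(llm_outputs)
    generated = model.generate(
        encoder_outputs=encoder_outputs,
        num_beams=4,          # beam search; num_beams=1 gives greedy decoding
        max_new_tokens=64,    # illustrative length limit for the prompt
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)
```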
Example Extraction:

Original prompt:

Which of the following is a nonrenewable resource?
Options:
- Solar
- Wind
- Coal

LLM outputs (inputs to output2prompt):

- "The correct answer is Coal. It is nonrenewable."
- "Among the options, Coal is nonrenewable."
- "Coal is a fossil fuel that will eventually run out."
Decoded prompt from output2prompt: even though its wording differs from the original, the reconstructed prompt is semantically similar.
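To make "semantically similar" concrete, one simple check is to embed both prompts and compare them with cosine similarity. The sketch below uses sentence-transformers as an illustrative embedding model and reuses `outputs` and `extract_prompt` from the sketches above; it is not the paper's evaluation setup.

```python
# Rough semantic-similarity check between original and reconstructed prompts.
# sentence-transformers is an illustrative choice; the paper's metrics may differ.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

original = "Which of the following is a nonrenewable resource? Options: Solar, Wind, Coal"
reconstructed = extract_prompt(outputs)  # from the sketches above

emb = embedder.encode([original, reconstructed], convert_to_tensor=True)
print(f"cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.2f}")  # near 1.0 means same meaning
```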
For implementation details, visit the GitHub Repository.
Practical Applications
- Extracting system prompts: Useful in settings like GPT Store apps, where system prompts remain hidden from users.
- Understanding model behavior: By recovering the original instructions, developers can better understand how an LLM is influenced by its internal prompt.
- Cloning AI assistants: Enables replication of LLM-based applications without needing access to internal model states or resorting to adversarial techniques.
Conclusion
output2prompt reconstructs original prompts from nothing more than a set of LLM outputs, using an efficient inversion model with a sparse encoder. Its black-box nature makes it widely applicable to deployed LLMs, offering new insight into model behavior and potential vulnerabilities.
Valeriia Kuka
Valeriia Kuka, Head of Content at Learn Prompting, is passionate about making AI and ML accessible. Valeriia previously grew a 60K+ follower AI-focused social media account, earning reposts from Stanford NLP, Amazon Research, Hugging Face, and AI researchers. She has also worked with AI/ML newsletters and global communities with 100K+ members and authored clear and concise explainers and historical articles.
Footnotes

1. Zhang, C., Morris, J. X., & Shmatikov, V. (2024). Extracting Prompts by Inverting LLM Outputs. https://arxiv.org/abs/2405.15012