
The Most Comprehensive Study of Prompting Ever Done

September 29th, 2024 by Sander Schulhoff

What can 1500+ academic papers about prompting tell you about how to write a good prompt?

I recently led a team of 32 researchers to publish The Prompt Report, a systematic survey paper on prompting. At 80+ pages, it is the most comprehensive paper on prompting ever written.

Our team consisted of researchers from the University of Maryland, OpenAI, Microsoft, Google, Princeton, Stanford, and many other institutions. We examined 200+ prompting and prompting-adjacent techniques.

We perform side-by-side analyses and provide several taxonomies, including the most comprehensive taxonomy of text-based prompting techniques (think Chain-of-Thought and Few-Shot) assembled to date.

A Taxonomy of Prompting Techniques

We highlight 58 different text-based prompting techniques and group them into 6 categories based on how they solve problems.

Few-Shot

For example, Few-Shot prompting improves results by showing the LLM examples of what it should do.
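As a minimal sketch of how this looks in code (the model name and the OpenAI client call are illustrative assumptions, not something prescribed by the report), a Few-Shot prompt simply places a handful of worked examples before the new input:

# Minimal Few-Shot prompting sketch. The model name and OpenAI client usage
# are illustrative assumptions, not taken from The Prompt Report.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

few_shot_prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: Positive

Review: It stopped working after a week and support never replied.
Sentiment: Negative

Review: Setup took two minutes and it just works.
Sentiment:"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model works here
    messages=[{"role": "user", "content": few_shot_prompt}],
)
print(response.choices[0].message.content)  # expected: Positive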

Thought Generation

Chain-of-Thought, on the other hand, improves results by encouraging the LLM to write out its reasoning before answering; it falls under the Thought Generation category. You may be surprised to learn that Chain-of-Thought is not one of a kind: there are many related techniques, such as Step-Back Prompting and Thread-of-Thought. Thread-of-Thought breaks long, complex contexts into smaller, more manageable chunks and, as it examines each piece, picks out the most important details.
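In its simplest (zero-shot) form, a thought-generation prompt just appends a reasoning trigger to the question. A rough sketch, with the trigger wording and model chosen for illustration:

# Zero-Shot Chain-of-Thought sketch: ask the model to reason before answering.
# Trigger phrase and model name are illustrative choices.
from openai import OpenAI

client = OpenAI()

question = (
    "A store sold 45 notebooks on Monday and twice as many on Tuesday. "
    "How many notebooks did it sell in total?"
)
cot_prompt = f"{question}\nLet's think step by step, then state the final answer."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": cot_prompt}],
)
print(response.choices[0].message.content)  # reasoning followed by "135"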

Other Categories

The other categories are Zero-Shot prompting (not showing any examples), Ensembling (basically giving multiple prompts at once), Self-Criticism (having the LLM criticize and refine its own outputs), and Decomposition (having the LLM break down a problem into subproblems). It is important to note that some techniques could be classified under two or more categories, but we classify each solely under its primary category. Few-Shot Chain-of-Thought, for example, enhances Chain-of-Thought reasoning by using multiple examples: a prompt might include two solved arithmetic word problems with step-by-step reasoning, followed by a new, unsolved problem. This combines multiple examples (Few-Shot prompting) with explicit step-by-step reasoning (Chain-of-Thought), which improves the LLM's ability to solve similar problems by following the demonstrated logical process.
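A sketch of the arithmetic example just described (the exemplars and their solutions are made up for illustration):

# Few-Shot Chain-of-Thought: solved exemplars that show their reasoning,
# followed by a new, unsolved problem. Exemplars are made up for illustration.
few_shot_cot_prompt = """\
Q: Ana has 3 boxes with 12 pencils each. She gives away 7 pencils. How many are left?
A: 3 boxes x 12 pencils = 36 pencils. 36 - 7 = 29. The answer is 29.

Q: A train travels at 60 km per hour for 2 hours, then 40 km more. How far does it go?
A: 60 km/h x 2 h = 120 km. 120 + 40 = 160. The answer is 160 km.

Q: A bakery bakes 5 trays of 24 muffins and sells 95 of them. How many muffins remain?
A:"""

# Send few_shot_cot_prompt to a chat model exactly as in the earlier snippets;
# the model should imitate the format: 5 x 24 = 120, 120 - 95 = 25. The answer is 25.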

Hidden Wisdom for In-Context Learning (ICL)

In-Context Learning (ICL), first identified by Brown et al. in 2020, stands out as a powerful yet enigmatic capability of modern AI systems. It allows an LLM to learn how to perform new tasks directly from its prompt, usually in the form of Few-Shot prompting.

The Art of Exemplars

Much of prompt engineering relies on exemplars, the technical term for the examples included within a prompt. Six key factors significantly influence their effectiveness: quantity, ordering, label distribution, label quality, format, and similarity. Ordering, for example, can create recency bias: if negative examples are listed last, the model may lean towards negative outputs, since it is more influenced by the most recent examples it has seen. To yield more balanced and accurate outputs, the list of exemplars should therefore alternate between positive and negative examples. By carefully manipulating these exemplar characteristics, accuracy can improve significantly, sometimes by up to 90%. Even small variations in how exemplars are presented can lead to noticeable improvements in results.
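To make the ordering factor concrete, here is a small, hypothetical helper that interleaves positive and negative exemplars so neither label dominates the end of the prompt (the reviews and the alternation heuristic are illustrative):

# Hypothetical helper: alternate positive and negative exemplars to reduce
# recency bias from a long run of same-label examples at the end of the prompt.
from itertools import zip_longest

positive = [
    ("Great value and fast shipping.", "Positive"),
    ("Exceeded my expectations.", "Positive"),
]
negative = [
    ("Arrived broken and scratched.", "Negative"),
    ("Total waste of money.", "Negative"),
]

def build_prompt(pos, neg, query):
    # Interleave the two lists instead of listing all positives, then all negatives.
    ordered = [ex for pair in zip_longest(pos, neg) for ex in pair if ex is not None]
    lines = [f"Review: {text}\nSentiment: {label}\n" for text, label in ordered]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n".join(lines)

print(build_prompt(positive, negative, "The handle snapped on day one."))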

Benchmarking Prompting

The paper also empirically analyzes prompting techniques. We take 6 of the top-performing techniques and benchmark them with ChatGPT against MMLU, a dataset containing a wide range of question types. We found Few-Shot CoT to be one of the most effective prompting techniques, and that Self-Consistency was not very effective.
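A rough sketch of what such a benchmark loop could look like; the question format, scoring rule, and model are placeholders rather than the paper's actual harness:

# Sketch of a benchmark loop comparing prompting techniques on MMLU-style
# multiple-choice questions. Dataset, templates, and scoring are placeholders.
from openai import OpenAI

client = OpenAI()

TECHNIQUES = {
    "zero_shot": "{question}\nAnswer with the letter of the correct option.",
    "zero_shot_cot": "{question}\nLet's think step by step, then give the letter of the correct option.",
}

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def accuracy(template: str, questions) -> float:
    # questions: list of (question_text, correct_letter) pairs, e.g. an MMLU slice
    correct = 0
    for q, answer in questions:
        reply = ask(template.format(question=q))
        correct += answer in reply[-10:]  # crude check on the final characters
    return correct / len(questions)

sample = [("Which planet is known as the Red Planet?\n(A) Venus (B) Mars (C) Earth (D) Jupiter", "B")]
for name, template in TECHNIQUES.items():
    print(name, accuracy(template, sample))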

Human vs AI Prompt Engineer

We also wrote a section comparing a human prompt engineer to an AI prompt engineer on a binary classification task. The human prompt engineer (me, Sander Schulhoff) spent 20 hours iterating on a prompt for the task, while the AI prompt engineer spent 10 minutes and significantly outperformed the human. With slight human modification, the AI-generated prompt performed even better, reaching almost 0.6 F1.

The AI prompt engineering tool we used is called DSPy (Dee-Ess-Pie). It is a Python library for automatically optimizing prompts. By generating and refining examples and explanations, DSPy optimizes prompts with impressive results. It is inspired by the "Prompting as Programming" paradigm and has a PyTorch-like API.
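A minimal sketch of what optimizing a prompt with DSPy can look like, assuming the DSPy 2.x API and an OpenAI backend; the signature, metric, and toy training set are illustrative, and the exact calls may differ between DSPy versions:

# DSPy prompt-optimization sketch (DSPy 2.x API assumed; details vary by version).
# Signature, metric, and training examples are illustrative.
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# Declare *what* the program should do; DSPy handles the prompt wording.
classify = dspy.ChainOfThought("text -> label")

trainset = [
    dspy.Example(text="I loved every minute of it.", label="positive").with_inputs("text"),
    dspy.Example(text="Never buying from them again.", label="negative").with_inputs("text"),
]

def label_match(example, prediction, trace=None):
    return example.label.lower() in prediction.label.lower()

# BootstrapFewShot generates and keeps demonstrations that score well on the metric.
optimized = BootstrapFewShot(metric=label_match).compile(classify, trainset=trainset)
print(optimized(text="The checkout process was painless.").label)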

The Future of Prompting

The Prompt Report surveys almost all existing prompting techniques, and it also addresses safety and security issues that arise when using them. Problems like prompt drift (a prompt's performance changing over time as models change) and prompt injection (LLMs being tricked into doing or saying harmful things) make prompting more difficult, and we describe various solutions to them. We also discuss multilingual and multimodal prompting, as well as agentic techniques, which extend prompting to allow LLMs to take actions. From this paper, we can see that prompting is not going away any time soon; the number of prompting techniques continues to grow to address the various difficulties associated with prompting. If you are interested in learning more, we recommend reading The Prompt Report.

Cite The Prompt Report as:

@article{schulhoff2024prompt,
    title={The Prompt Report: A Systematic Survey of Prompting Techniques},
    author={Schulhoff, Sander and Ilie, Michael and Balepur, Nishant and Kahadze, Konstantine and Liu, Amanda and Si, Chenglei and Li, Yinheng and Gupta, Aayush and Han, HyoJung and Schulhoff, Sevien and others},
    journal={arXiv preprint arXiv:2406.06608},
    year={2024}
}

Footnotes

  1. Brown, T. B. (2020). Language models are few-shot learners. arXiv Preprint arXiv:2005.14165.

  2. Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Vardhamanan, S., Haq, S., Sharma, A., Joshi, T. T., Moazam, H., Miller, H., Zaharia, M., & Potts, C. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv Preprint arXiv:2310.03714.

