Prompt Ensembling

Last updated on August 7, 2024

Sander Schulhoff

Takeaways
  • Definition of Prompt Ensembling: Prompt ensembling involves generating multiple prompts for the same question and analyzing the responses to identify the most accurate answer.
  • Approach Differences: Techniques vary in how they diversify prompts and select responses:
    • DiVeRSe modifies the subset of exemplars preceding the question.
    • AMA alters the phrasing of the question itself.
    • Response selection strategies in both methods are complex and are only summarized here.

What is Prompt Ensembling?

Prompt ensembling is the concept of using multiple different prompts to try to answer the same question. There are many different approaches to this.

DiVeRSe

DiVeRSe ("Diverse Verifier on Reasoning Steps") is a method that improves the reliability of answers in three ways. It does this by

  1. using multiple prompts to generate diverse completions,
  2. using a verifier to distinguish good answers from bad answers, and
  3. using a verifier to check the correctness of reasoning steps.

[Figure: DiVeRSe (Li et al.)]

Diverse Prompts

DiVeRSe uses 5 different prompts for a given input. To construct each prompt, the authors randomly sample a few exemplars from the training set. Here is an example of one such Few-Shot prompt (k=2), with exemplars taken from the GSM8K benchmark. In practice, DiVeRSe uses 5 exemplars in prompts for this benchmark.

Prompt

Q: Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?
A: Natalia sold 48/2 = 24 clips in May. Natalia sold 48+24 = 72 clips altogether in April and May.

72

Q: Weng earns $12 an hour for babysitting. Yesterday, she just did 50 minutes of babysitting. How much did she earn?
A: Weng earns 12/60 = $0.2 per minute. Working 50 minutes, she earned 0.2 x 50 = $10.

10

Q: Betty is saving money for a new wallet which costs $100. Betty has only half of the money she needs. Her parents decided to give her $15 for that purpose, and her grandparents twice as much as her parents. How much more money does Betty need to buy the wallet?
A:

After generating 5 different prompts like the one above, DiVeRSe samples 20 reasoning paths for each (temperature = 0.5). Here are a few example completions of the above prompt. Note that the answers are not always correct.

AI Output

Betty has 0.5*100 = $50. Then she gets 15 more = $65. Then she gets 2*15 = $90. She needs 100-90 = $10 more.

10

AI Output

A: Betty has 0.5*100 = $500. Then she gets 15 more = $650. Then she gets 2*15 = $900. She needs 100-90 = $1000 more.

1000

At this point, DiVeRSe has generated 100 different completions.
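
The prompt construction and sampling budget described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the `build_prompts` helper, the `(question, answer)` tuple layout, the toy training set, and the fixed seed are all assumptions made for the example, and the model call itself is left out.

```python
import random

def build_prompts(train_set, question, n_prompts=5, k=5, seed=0):
    """Build n_prompts Few-Shot prompts, each with k randomly sampled exemplars."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    prompts = []
    for _ in range(n_prompts):
        exemplars = rng.sample(train_set, k)  # k distinct exemplars per prompt
        shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
        prompts.append(f"{shots}\n\nQ: {question}\nA:")
    return prompts

# Hypothetical stand-in for a training set such as GSM8K.
train_set = [(f"What is {i} + {i}?", f"{i} + {i} = {2 * i}.") for i in range(1, 21)]

prompts = build_prompts(train_set, "What is 7 + 5?")
paths_per_prompt = 20  # reasoning paths sampled per prompt (temperature = 0.5)
total_completions = len(prompts) * paths_per_prompt
print(len(prompts), total_completions)
```

Each prompt ends with the same question but is preceded by a different random subset of exemplars, which is where the diversity comes from.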

Voting Verifier

Now, we could just take the majority answer, as Self-Consistency does.

However, DiVeRSe proposes a much more complicated method, which they call a voting verifier.

At test time, using the voting verifier is a two-step process. First, the verifier (a neural network) assigns a score between 0 and 1 to each completion based on how likely it is to be correct. Then, the 'voting' component sums the scores of all completions that arrive at the same answer and returns the answer with the highest total.

Example

Here is a small example. Say we have the following completions for the prompt:

Prompt

What is two plus two?

AI Output

4

AI Output

two + 2 = 5

AI Output

I think 2+2 = 6

AI Output

two plus two = 4

AI Output

It is 5

The verifier will read each completion and assign a score to it. For example, it might assign the scores: 0.9, 0.1, 0.2, 0.8, 0.3 respectively. Then, the voting component will sum up the scores for each answer.

score(4) = 0.9 + 0.8 = 1.7
score(5) = 0.1 + 0.3 = 0.4
score(6) = 0.2

The final answer is 4 since it has the highest score.
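
The voting step from this example can be written as a short Python sketch. The scores are hard-coded here and stand in for the output of a trained verifier, and the function name `voting_verifier` is just a label chosen for this illustration.

```python
from collections import defaultdict

def voting_verifier(answers, scores):
    """Sum the verifier scores per distinct answer and return the top answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    best = max(totals, key=totals.get)  # answer with the highest summed score
    return best, dict(totals)

# Final answers extracted from the five completions, with their verifier scores.
answers = ["4", "5", "6", "4", "5"]
scores = [0.9, 0.1, 0.2, 0.8, 0.3]

best, totals = voting_verifier(answers, scores)
print(best)
```

Note that a plain majority vote would not separate 4 from 5 in this example, since each appears twice. The verifier scores are what break the tie.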

But how is the verifier trained?

The verifier is trained with a slightly complex loss function, which I will not cover here. Read section 3.3 of the paper for more details.

Ask Me Anything (AMA) Prompting

Ask Me Anything (AMA) prompting is another prompt ensembling method with a similar approach to DiVeRSe. However, both its multiple prompt step and its answer aggregation step differ significantly. The core idea of AMA is to use an LLM to generate multiple prompts, instead of just using different Few-Shot exemplars.

Multiple Prompts

AMA shows that you can take a question and reformat it in multiple ways to create different prompts. For example, say you are scraping a bunch of websites for information on animals and want to only record ones that live in North America. Let's construct a prompt to determine this.

Given the following passage from Wikipedia:

The Kermode bear, sometimes called the spirit bear (Ursus americanus kermodei), is a subspecies of the American black bear and lives in the Central and North Coast regions of British Columbia, Canada.

You can format this task into a prompt like so:

Prompt

Is the following claim True or False given the context?

Context: The Kermode bear, sometimes called the spirit bear (Ursus americanus kermodei), is a subspecies of the American black bear and lives in the Central and North Coast regions of British Columbia, Canada.
Claim: This animal lives in North America
Answer:

This is a bit of an odd formulation. Why not just use the following simpler prompt?

Prompt

Context: The Kermode bear, sometimes called the spirit bear (Ursus americanus kermodei), is a subspecies of the American black bear and lives in the Central and North Coast regions of British Columbia, Canada.
Question: Does this animal live in North America?

Well, by formulating the question in this special way, we can generate different prompts. Our first step is to take the claim "This animal lives in North America" and reformat it into different questions that all ask essentially the same thing. To do this, we pass the claim through prompts like those in the image below.

This might output:

  1. Was the animal living in North America?
  2. Does the animal live in North America?
  3. Where does the animal live?

The idea behind this is to create different views of the task. We then apply each to the given context like so:

Prompt

Context: The Kermode bear, sometimes called the spirit bear (Ursus americanus kermodei), is a subspecies of the American black bear and lives in the Central and North Coast regions of British Columbia, Canada.
Question: Was the animal living in North America?

Then, we can generate answers for each:

  1. Yes it was
  2. Yes it does
  3. North America

These are intermediate answers. We need to map them to task labels (e.g. Yes or No).

We can do this by passing the intermediate answers through a prompt like the following:

Prompt

Select the correct category.

"Categories":

- Yes, North America
- No, not North America

"Yes it was" fits the category:

Now we can get our output answers.

  1. Yes, North America
  2. Yes, North America
  3. Yes, North America

Here, they all agree, so we can just take the first answer. However, if they disagreed, we could use the AMA aggregation step to get a final answer.
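
The chain walked through above (question, intermediate answer, task label) can be sketched as follows. This is only an illustration under stated assumptions: `answer_prompt`, `mapping_prompt`, and `run_chain` are hypothetical helper names, and `fake_llm` is a canned stand-in that returns the example replies from this article so the sketch runs without a real model.

```python
CONTEXT = (
    "The Kermode bear, sometimes called the spirit bear (Ursus americanus "
    "kermodei), is a subspecies of the American black bear and lives in the "
    "Central and North Coast regions of British Columbia, Canada."
)
CATEGORIES = ["Yes, North America", "No, not North America"]

def answer_prompt(context, question):
    """Apply one reformatted question to the context."""
    return f"Context: {context}\nQuestion: {question}"

def mapping_prompt(intermediate, categories):
    """Ask the model to map a free-form answer onto a task label."""
    listed = "\n".join(f"- {c}" for c in categories)
    return (
        "Select the correct category.\n\n"
        '"Categories":\n\n'
        f"{listed}\n\n"
        f'"{intermediate}" fits the category:'
    )

def run_chain(llm, context, questions, categories):
    """Run question -> intermediate answer -> task label for every question."""
    labels = []
    for question in questions:
        intermediate = llm(answer_prompt(context, question))
        labels.append(llm(mapping_prompt(intermediate, categories)))
    return labels

# Canned replies copied from the example above; a real LLM call would go here.
CANNED = {
    "Was the animal living in North America?": "Yes it was",
    "Does the animal live in North America?": "Yes it does",
    "Where does the animal live?": "North America",
}

def fake_llm(prompt):
    if prompt.startswith("Select the correct category."):
        return "Yes, North America"
    for question, reply in CANNED.items():
        if question in prompt:
            return reply
    raise ValueError("no canned reply for this prompt")

labels = run_chain(fake_llm, CONTEXT, list(CANNED), CATEGORIES)
print(labels)
```

Swapping `fake_llm` for a function that calls an actual model is the only change needed to try the chain on real data.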

Answer Aggregation

AMA uses a very complicated strategy for aggregating answers (more so than DiVeRSe) instead of simply taking the majority answer. To understand why the majority answer may be a poor choice, consider two of the questions we generated before:

  1. Was the animal living in North America?
  2. Does the animal live in North America?

They are extremely similar, so they will likely produce the same result. In a simple majority vote, two near-duplicate questions act like one question counted twice, which biases the end result. To deal with this, AMA relies on weak supervision and complex mathematics to estimate the dependencies between the prompts it creates, and then uses these estimates to weight them appropriately.

So, for the three questions we generated, it might assign weights of 25%, 25%, and 50%, since the first two are so similar.
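
To see what such weights do, here is a small Python sketch of a weighted vote. It only shows how weights are applied once they are known: the 25%, 25%, and 50% values are copied from the example above rather than estimated, the disagreement case is hypothetical, and `weighted_vote` is a name chosen for this illustration. Estimating the weights is the hard part that AMA handles with weak supervision.

```python
from collections import defaultdict

def weighted_vote(labels, weights):
    """Sum the weight of every prompt that voted for each label."""
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    return dict(totals)

weights = [0.25, 0.25, 0.50]  # the two near-duplicate questions share one vote

# Case from the article: all three prompts agree.
agree = weighted_vote(["Yes", "Yes", "Yes"], weights)

# Hypothetical disagreement: the two similar questions say "No", the third "Yes".
split = weighted_vote(["No", "No", "Yes"], weights)

print(agree)
print(split)
```

In the hypothetical disagreement case, a plain majority vote would pick "No" by two votes to one, while the weighted vote treats the two near-duplicate questions as a single voice and ends in an even split.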

Although AMA's aggregation strategy is powerful, it is so complicated that I will not cover it here. Read section 3.4 of the paper for more details.

Results

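  • With this prompting strategy, AMA enables GPT-J-6B, a much smaller open-source model, to match or exceed Few-Shot GPT-3 (175B) on 15 of the 20 benchmarks evaluated in the paper.

  • AMA is better on questions where the given context contains the answer.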

Conclusion

Prompt ensembling methods are very powerful. They can be applied to any model, either to improve its performance in general or to improve its performance on a specific task.

In practice, majority voting should be your go-to strategy.
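
For reference, majority voting takes only a few lines of Python. This is a minimal sketch: the answer list is hypothetical, and `majority_vote` is a name chosen for the example. It assumes the final answers have already been extracted from the completions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the share of votes it received."""
    counts = Counter(a.strip() for a in answers)  # ignore stray whitespace
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers extracted from five sampled completions.
answers = ["5", "10", "5", "5 ", "1000"]

answer, share = majority_vote(answers)
print(answer, share)
```

The vote share is a cheap confidence signal: a low share suggests the completions disagree and the answer deserves a second look.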

FAQ

What do prompt ensembling methods do?

Prompt ensembling methods help reduce bias in an LLM's output distribution by combining the outputs of multiple different prompts, which improves the reliability of the response.

What are some different prompt ensembling methods?

The two prompt ensembling methods discussed in this article are DiVeRSe and Ask Me Anything (AMA) prompting.

What is the difference between DiVeRSe and AMA?

DiVeRSe and AMA differ both in how they form multiple prompts and in how they aggregate answers. DiVeRSe builds its prompts from different sets of Few-Shot exemplars sampled from a training set, while AMA automates the creation of varied prompts by tasking the LLM to come up with variations of the input question. For answer aggregation, DiVeRSe uses a voting verifier to score completions from the multiple prompts, while AMA calculates a weighted aggregation of the responses in order to further reduce bias in the end result.

Footnotes

  1. Li, Y., Lin, Z., Zhang, S., Fu, Q., Chen, B., Lou, J.-G., & Chen, W. (2022). On the Advance of Making Language Models Better Reasoners.

  2. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., & Schulman, J. (2021). Training Verifiers to Solve Math Word Problems.

  3. Mitchell, E., Noh, J. J., Li, S., Armstrong, W. S., Agarwal, A., Liu, P., Finn, C., & Manning, C. D. (2022). Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference.

  4. Arora, S., Narayan, A., Chen, M. F., Orr, L., Guha, N., Bhatia, K., Chami, I., Sala, F., & Ré, C. (2022). Ask Me Anything: A simple strategy for prompting language models.

  5. Wang, B., & Komatsuzaki, A. (2021). GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax