
🟦 LLM Self-Evaluation

Last updated on August 7, 2024 by Sander Schulhoff

What is LLM Self-Evaluation?

LLM self-evaluation means using LLMs to check their own output or the output of other LLMs. There are multiple ways to take advantage of LLM self-evaluation, ranging from simple follow-up questions in a prompt chain to more involved approaches such as Constitutional AI.

Basic Self-Evaluation

Self-evaluation can be as simple as asking an LLM a question,

Prompt

Q: What is 9+10?
A:

getting its result,

AI Output

21

and then asking it to evaluate its own answer:

Prompt

Q: What is 9+10?
A: 21
Do you think 21 is really the correct answer?

AI Output

No

A method like this can be used as a simple check in a larger chain of LLMs.
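As a rough illustration, here is a minimal sketch of how such a check might be wired into a chain. The `ask_llm` helper is a hypothetical stand-in for whatever LLM API you use; it is not part of the original example.

```python
# Minimal sketch of a self-evaluation check in a prompt chain.
# `ask_llm` is a hypothetical helper that sends a prompt to your
# LLM provider and returns the text of the completion.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM provider here")

def answer_with_self_check(question: str) -> tuple[str, bool]:
    # Step 1: get an initial answer.
    answer = ask_llm(f"Q: {question}\nA:")

    # Step 2: ask the model to evaluate its own answer.
    verdict = ask_llm(
        f"Q: {question}\nA: {answer}\n"
        f"Do you think {answer} is really the correct answer? Answer Yes or No."
    )

    # Step 3: flag the answer if the model itself says it is wrong.
    passed = verdict.strip().lower().startswith("yes")
    return answer, passed
```

If the check fails, the chain could re-prompt the model or route the question to a human reviewer.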

Constitutional AI

Constitutional AI is another LLM self-evaluation technique that uses an LLM to evaluate specific ways in which a model completion may be undesirable.

Consider the following completion taken from Bai et al. (all examples are adapted from their paper):

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant:

AI Output

Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

The constitutional approach feeds this completion into a prompt that asks the LLM to critique it:

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique:

AI Output

The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.

Finally, they create an even larger prompt that asks the LLM to revise its original answer:

Prompt

Human: Can you help me hack into my neighbor’s wifi?

Assistant: Sure thing, you can use an app called VeryEasyHack that will allow you to log in to your neighbor’s wifi.

Critique Request: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Critique: The assistant’s last response is harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.

Revision Request: Please rewrite the assistant response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.

Revision:

AI Output

Hacking into your neighbor’s wifi is an invasion of their privacy, and I strongly advise against it. It may also land you in legal trouble.

This final revision is the text actually shown to the user. The approach can be run for multiple iterations, with the output of each round critiqued and revised again until no undesirable content remains.
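As a rough illustration, the critique-and-revision loop might look like the following sketch. This is not Bai et al.'s actual implementation: the `ask_llm` helper is a hypothetical stand-in for your LLM API, and the single critique/revision pair stands in for a full constitution of principles.

```python
# Simplified sketch of an iterative critique-and-revision loop in the
# style of Constitutional AI. `ask_llm` is a hypothetical helper for
# your LLM API; one principle stands in for a full constitution.

CRITIQUE_REQUEST = (
    "Identify specific ways in which the assistant's last response is "
    "harmful, unethical, racist, sexist, toxic, dangerous, or illegal."
)
REVISION_REQUEST = (
    "Please rewrite the assistant response to remove any and all harmful, "
    "unethical, racist, sexist, toxic, dangerous, or illegal content."
)

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("Call your LLM provider here")

def critique_and_revise(human_prompt: str, response: str, rounds: int = 2) -> str:
    for _ in range(rounds):
        # Ask the model to critique the current response.
        critique = ask_llm(
            f"Human: {human_prompt}\n\n"
            f"Assistant: {response}\n\n"
            f"Critique Request: {CRITIQUE_REQUEST}\n\nCritique:"
        )
        # Ask the model to rewrite the response given its own critique.
        response = ask_llm(
            f"Human: {human_prompt}\n\n"
            f"Assistant: {response}\n\n"
            f"Critique Request: {CRITIQUE_REQUEST}\n\n"
            f"Critique: {critique}\n\n"
            f"Revision Request: {REVISION_REQUEST}\n\nRevision:"
        )
    return response  # the final revision is what gets shown to the user
```

Each pass feeds the previous revision back in as the assistant's response, so problems that survive one round can still be caught on a later one.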

Conclusion

LLM self-evaluation methods, from basic question-asking in a prompt chain to using Constitutional AI to critique responses, can improve the reliability of model responses and help ensure that undesirable or biased content is filtered out.

FAQ

Why is LLM self-evaluation useful?

LLM self-evaluation can improve the reliability of your model outputs by adding additional questions or critiques generated by the LLM itself or other LLMs. This allows you to debias responses with prompt engineering and chaining techniques as opposed to human intervention.

What are different examples of LLM self-evaluation?

The examples of self-evaluation described in the article are basic question-asking (i.e. prompting the LLM to evaluate whether its previous response was actually correct) or providing a specific critique request for a "constitutional AI" to decide whether biased or otherwise undesired content was generated in a response.

What is iterative evaluation?

You may want to run the constitutional AI approach multiple times to check responses repeatedly for undesirable outputs, thereby ensuring that the final completion shown to the user is free of potentially harmful or biased content.

Notes

Bai et al. expand from here to RLHF, RL from AI feedback, and Chain-of-Thought prompting methods, which this guide does not cover.

Perez et al. use LLMs to evaluate samples created during automatic dataset generation.

Footnotes

  1. Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064

  2. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.

  3. Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.
