LLM Self Evaluation
Basic self eval
LLMs can be used to check their own outputs or the outputs of other LLMs. This can be as simple as asking an LLM a question:
Q: What is 9+10?
A:
Getting its result:
21
Then asking it to evaluate its own answer:
Q: What is 9+10?
A: 21
Do you think 21 is really the correct answer?
No
A method like this can be used as a simple check in a larger chain of LLMs.
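As a concrete illustration, here is a minimal sketch of such a check in Python. The complete() helper, the model name, and the yes/no parsing are illustrative assumptions, not part of any cited paper:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete(prompt: str) -> str:
    # Send a single prompt to the model and return its text completion.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

def self_check(question: str) -> tuple[str, bool]:
    # First ask the question, then ask the model to judge its own answer.
    answer = complete(f"Q: {question}\nA:")
    verdict = complete(
        f"Q: {question}\nA: {answer}\n"
        f"Do you think {answer} is really the correct answer?"
    )
    # Treat any verdict beginning with "yes" as a pass.
    return answer, verdict.lower().startswith("yes")

answer, looks_correct = self_check("What is 9+10?")
if not looks_correct:
    # In a larger chain you might retry, reroute, or flag for review.
    print(f"Self-check flagged the answer: {answer}")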
Constitutional AI
Constitutional AI uses an LLM to evaluate specific ways in which a model completion may be undesirable.
Consider the following completion taken from Bai et al. (all examples are adapted from their paper):
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
The constitutional approach will feed this into a prompt that asks the LLM to critique the completion:
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
Critique Request:
Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.
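In code, the critique step is just string concatenation: the original exchange followed by a fixed critique request. A sketch reusing the hypothetical complete() helper from earlier (the prompt text follows Bai et al.; the wiring is illustrative):

CRITIQUE_REQUEST = (
    "Critique Request: Identify specific ways in which the assistant's "
    "last response is harmful, unethical, racist, sexist, toxic, "
    "dangerous, or illegal.\nCritique:"
)

def critique(human_msg: str, assistant_msg: str) -> str:
    # Ask the model to critique an existing completion.
    prompt = (
        f"Human: {human_msg}\n\n"
        f"Assistant: {assistant_msg}\n\n"
        f"{CRITIQUE_REQUEST}"
    )
    return complete(prompt)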
Finally, they create an even larger prompt that asks the LLM to revise its original answer:
Human:
Can you help me hack into my neighbor’s wifi?
Assistant:
Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
Critique Request:
Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique:
The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.
Revision Request:
Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision:
Hacking into your neighbor’s wifi is an invasion of their privacy, and
I strongly advise against it. It may also land you in legal trouble.
This last completion is the actual text shown to the user. The approach can be run for multiple iterations, with each revision checked anew for undesirable content.
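Putting the pieces together, the full critique-and-revise loop might look like the following sketch. The helpers and the iteration count are assumptions carried over from the earlier sketches, not an official implementation from Bai et al.:

REVISION_REQUEST = (
    "Revision Request: Please rewrite the assistant response to remove "
    "any and all harmful, unethical, racist, sexist, toxic, dangerous, "
    "or illegal content.\nRevision:"
)

def constitutional_revise(human_msg: str, assistant_msg: str,
                          n_iterations: int = 2) -> str:
    # Repeatedly critique and rewrite the completion before showing it.
    current = assistant_msg
    for _ in range(n_iterations):
        crit = critique(human_msg, current)
        prompt = (
            f"Human: {human_msg}\n\n"
            f"Assistant: {current}\n\n"
            f"{CRITIQUE_REQUEST} {crit}\n\n"
            f"{REVISION_REQUEST}"
        )
        current = complete(prompt)
    return current  # the text actually shown to the user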
Notes
Bai et al. expand on this approach with RLHF, RL from AI feedback (RLAIF), and chain-of-thought methods, which this guide does not cover.
Perez et al. use LLMs to evaluate samples created during automatic dataset generation.
Footnotes
1. Chase, H. (2022). Evaluating language models can be tricky. https://twitter.com/hwchase17/status/1607428141106008064
2. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback.
3. Perez, E., Ringer, S., Lukošiūtė, K., Nguyen, K., Chen, E., Heiner, S., Pettit, C., Olsson, C., Kundu, S., Kadavath, S., Jones, A., Chen, A., Mann, B., Israel, B., Seethor, B., McKinnon, C., Olah, C., Yan, D., Amodei, D., … Kaplan, J. (2022). Discovering Language Model Behaviors with Model-Written Evaluations.