
Prompt Learning for Vision-Language Models

Last updated on October 1, 2024 by Valeriia Kuka

What is Learning to Prompt for Vision-Language Models?

In vision-language models like CLIP, learning to prompt (also called prompt learning) is a method for improving visual recognition by optimizing the text prompts the model receives instead of writing them by hand. In other words, it is prompt engineering tailored to vision-language models. These models typically align images and text in a shared feature space, so a new image can be classified by comparing its features with those of a text description of each class, rather than relying on a fixed set of pre-defined categories.
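To make the comparison step concrete, here is a minimal sketch of classifying an image by matching it against text descriptions in a shared feature space. It does not call a real model: the three-dimensional vectors are hypothetical stand-ins for the outputs of an image encoder and a text encoder, and the function names and the `temperature` value are assumptions chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_feature, text_features, temperature=0.01):
    """Pick the prompt whose feature is closest to the image feature.

    `temperature` sharpens the softmax; 0.01 is an assumed value,
    not one taken from the source article.
    """
    prompts = list(text_features)
    logits = [
        cosine_similarity(image_feature, text_features[p]) / temperature
        for p in prompts
    ]
    probs = softmax(logits)
    best = max(range(len(prompts)), key=probs.__getitem__)
    return prompts[best], dict(zip(prompts, probs))


# Hypothetical toy embeddings standing in for real encoder outputs.
text_features = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_feature = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(image_feature, text_features)
print(label)
```

Because each class is represented only by the feature of its prompt, changing the wording of a prompt changes its feature vector and can therefore change the prediction.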

A major challenge with these models is prompt engineering: finding the right words to describe each image class. This process is time-consuming and requires expertise, because small changes in wording can significantly affect performance.


  1. Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 130(9), 2337–2348. https://doi.org/10.1007/s11263-022-01653-1
