Harnessing Visual Prompting: Transformative Advances from CVPR 2024
Visual Prompting is emerging as a pivotal technique in enhancing the adaptability of large vision models for new tasks. This article delves into the recent breakthroughs presented at CVPR 2024, spotlighting five significant advancements in Visual Prompting.
TL;DR: Discover five major advances in Visual Prompting for Computer Vision.
- Image Comprehension: Utilizing basic visual cues to enhance how foundation models interpret specific image sections.
- MLLMs: Leveraging scene graphs to boost the descriptive capabilities of Multimodal Large Language Models (MLLMs) without the need for additional training data.
- Foundation Model Enhancement: Strategic visual prompting to refine vision foundation models, such as SAM.
- Enhanced Generalization: Training AI to swiftly identify previously unseen objects while retaining knowledge of familiar ones.
- Active Learning Integration: Empowering AI to learn new visual tasks efficiently, using fewer examples while preserving prior knowledge.
In this discussion, we outline the essence of Visual Prompting and examine how promptable models are reshaping the computer vision landscape, along with five groundbreaking developments showcased at CVPR.
“Prompting is an interface to model editing that everyone can use.” — Phillip Isola, AI researcher and originator of “Visual Prompting”
Table of contents
- What is Visual Prompting
- Visual Prompting: A Systems Perspective
- Key Visual Prompting Advances from CVPR 2024
- Looking Ahead
1. What is Visual Prompting
1.1 The Origins of Visual Prompting
The concept of prompting in vision can be traced back to the 2001 work on Image Analogies, where researchers devised an image-processing method that takes both a prompt (an example transformation) and a query image to synthesize the desired result.
In the context of Generative AI, what’s new with prompting in vision?
- Prompting allows models to adapt to tasks beyond their original training scope. It facilitates the adaptation of pre-trained models to novel distributions.
- The concept gained traction with Language Models, where large pre-trained models (such as GPT-3 and GPT-4) are adapted to new tasks simply by conditioning on a prompt, without updating their weights.
- Visual Prompting specifically refers to the technique of adapting extensive vision models for unseen visual tasks.
1.2 Understanding Prompting for Vision
To grasp the concept of prompting in vision, it’s essential to differentiate it from fine-tuning—a conventional adaptation method.
The illustration above highlights the distinctions between visual prompting and fine-tuning for foundation models in computer vision. Visual Prompting steers a model with visual cues at inference time, without altering its parameters, which keeps it flexible and computationally cheap. Fine-tuning, in contrast, retrains the model on a task-specific dataset, updating its parameters for better task-specific performance at the cost of more data and compute.
These methods are not mutually exclusive; they represent a continuum of adaptation strategies for foundation models.
The diagram above indicates that when flexibility and speed are essential, Visual Prompting is often the optimal approach for deploying large-scale vision models in applications like visual search and rapid prototyping.
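To make the contrast concrete, here is a minimal, self-contained sketch with a toy promptable segmenter (all class and variable names are illustrative, not a real library API): prompting leaves the weights frozen and encodes the adaptation entirely in the input, while fine-tuning takes gradient steps on task-specific labels.

```python
import torch
import torch.nn as nn

class ToyPromptableSegmenter(nn.Module):
    """Toy stand-in for a promptable vision model that accepts a box prompt."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.head = nn.Conv2d(8 + 1, 1, kernel_size=1)   # features + prompt channel

    def forward(self, image, box_prompt):
        # Rasterize the box prompt into an extra input channel.
        x0, y0, x1, y1 = box_prompt
        prompt_map = torch.zeros_like(image[:, :1])
        prompt_map[:, :, y0:y1, x0:x1] = 1.0
        feats = torch.relu(self.backbone(image))
        return self.head(torch.cat([feats, prompt_map], dim=1))

model = ToyPromptableSegmenter()
image = torch.rand(1, 3, 64, 64)

# Visual prompting: weights stay frozen, the adaptation lives in the input.
with torch.no_grad():
    mask_logits = model(image, box_prompt=(10, 10, 40, 40))

# Fine-tuning: weights are updated on task-specific labels (more data, more compute).
target = (torch.rand(1, 1, 64, 64) > 0.5).float()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss = nn.functional.binary_cross_entropy_with_logits(
    model(image, box_prompt=(10, 10, 40, 40)), target)
loss.backward()
optimizer.step()
```

In practice the frozen model would be a foundation model such as SAM, whose prompt encoder plays the role of the rasterized box channel in this toy example.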
2. Visual Prompting: A Systems Perspective
The true potential of Visual Prompting becomes evident when viewed from a systems perspective, particularly within a multi-stage vision framework.
A promptable model can be integrated seamlessly into larger systems, allowing it to execute specific tasks during inference.
The diagram illustrates a system where a promptable foundation model is a component of a more extensive architecture:
- Input Image: The process begins with an image, such as a group of horses in a field.
- Object Detection: The image is analyzed by an object detection model (e.g., YOLO-World), identifying and localizing objects, producing bounding boxes around detected entities.
- Segmentation: The identified boxes serve as visual prompts for a segmentation model (e.g., Segment Anything), which produces a precise mask for each detected object; a minimal code sketch of this two-stage pipeline is shown below.
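A minimal sketch of this detect-then-segment pipeline, assuming the ultralytics (YOLO-World) and segment-anything packages with locally available checkpoints (the file names below are placeholders):

```python
import cv2
from ultralytics import YOLO
from segment_anything import sam_model_registry, SamPredictor

# 1) Open-vocabulary detection: text defines the classes, YOLO-World finds them.
detector = YOLO("yolov8s-world.pt")              # placeholder checkpoint name
detector.set_classes(["horse"])
result = detector.predict("horses.jpg")[0]
boxes = result.boxes.xyxy.cpu().numpy()          # (N, 4) boxes in xyxy format

# 2) The detected boxes become visual prompts for Segment Anything.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder path
predictor = SamPredictor(sam)
image = cv2.cvtColor(cv2.imread("horses.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

masks = []
for box in boxes:
    mask, _, _ = predictor.predict(box=box, multimask_output=False)
    masks.append(mask[0])                        # one binary mask per detection
```

Neither model is retrained here: the detector is steered by a text vocabulary and the segmenter by box prompts, which is exactly what makes the composition so quick to assemble.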
3. Key Visual Prompting Advances from CVPR 2024
3.1 Intuitive Visual Prompting for Large Multimodal Models
Paper: CVPR Open Access
Run it: https://vip-llava.github.io/
Main novelty: Introduction of a multimodal model capable of interpreting arbitrary visual prompts. Users can intuitively interact with the model by annotating images with simple cues like a red bounding box or a pointed arrow, eliminating the need for complex encodings.
Potential applications:
- Healthcare Imaging: Medical professionals can highlight specific areas in imaging data (e.g., X-rays, MRIs) for improved diagnostics.
- E-commerce Product Search: Users can annotate product images (e.g., highlighting a shoe’s heel) for finding similar items or detailed product information.
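Because the prompts are simply drawn onto the pixels, they can be produced with a few lines of standard OpenCV. The sketch below (file names and coordinates are illustrative) prepares an annotated image that, together with a question, would be sent to a model such as ViP-LLaVA:

```python
import cv2

image = cv2.imread("xray.jpg")                       # illustrative file name

# Red bounding box around the region of interest (OpenCV uses BGR colors).
cv2.rectangle(image, (120, 80), (260, 210), color=(0, 0, 255), thickness=3)

# An arrow pointing at a finer detail inside that region.
cv2.arrowedLine(image, (320, 300), (200, 160), color=(0, 0, 255),
                thickness=3, tipLength=0.1)

cv2.imwrite("xray_prompted.jpg", image)

question = "What abnormality is visible inside the red bounding box?"
# "xray_prompted.jpg" plus `question` is what gets sent to the multimodal model.
```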
3.2 Zero-Shot Visual Prompting to Enhance AI’s Understanding of Images
Paper: https://arxiv.org/pdf/2311.17076
Run it: https://github.com/chancharikmitra/CCoT
Main novelty: The Compositional Chain-of-Thought (CCoT) method introduces a two-step zero-shot prompting process. Initially, a Multimodal Large Language Model (MLLM) constructs a scene graph from an image based on a task prompt, which is subsequently used to generate detailed responses without requiring annotated data or fine-tuning.
Potential applications:
- Visual Question Answering: Delivering precise answers by comprehensively understanding the visual context and composition.
- Surveillance: Identifying objects and their interrelations within an image, beneficial for surveillance scenarios.
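The essence of CCoT is the prompting structure rather than any particular model. The skeleton below sketches the two steps; mllm_generate is a hypothetical placeholder for whatever multimodal LLM call you use:

```python
def mllm_generate(image_path: str, prompt: str) -> str:
    """Hypothetical placeholder: swap in a real multimodal LLM call here."""
    raise NotImplementedError("plug in your MLLM client of choice")

image = "kitchen.jpg"                      # illustrative file name
question = "How many mugs are on the table?"

# Step 1: zero-shot prompt the MLLM for a task-relevant scene graph.
scene_graph = mllm_generate(
    image,
    f"For the question '{question}', generate a scene graph in JSON listing "
    "the relevant objects, their attributes, and their relationships.",
)

# Step 2: answer the original question, conditioning on the scene graph.
answer = mllm_generate(
    image,
    f"Scene graph: {scene_graph}\n"
    f"Use the image and the scene graph as context to answer: {question}",
)
```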
3.3 Cost-Effective Segmentation in Foundation Models
Paper (CVPR Highlight): https://arxiv.org/abs/2312.15895
Run it: https://github.com/zhaoyangwei123/SAPNet
Main novelty: The Semantic-Aware Instance Segmentation Network (SAPNet) combines Multiple Instance Learning (MIL) with visual foundation models like SAM through point prompts. It enhances category-specific segmentation by strategically selecting representative mask proposals and addressing segmentation issues with Point Distance Guidance and a Box Mining Strategy.
Potential applications:
- Autonomous Driving: Enhancing object detection and classification in vehicle systems, improving decision-making and safety.
- Agricultural Monitoring: Offering precise segmentation for crops in aerial or satellite imagery for improved management and yield prediction.
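The sketch below is not SAPNet itself; it only shows the point-prompt interface to SAM that SAPNet builds on, where a single annotated point yields several candidate masks that a method like SAPNet must then rank and refine (checkpoint and file names are placeholders):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder path
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("field.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single positive click on the object of interest (label 1 = foreground).
point = np.array([[320, 240]])
label = np.array([1])

# multimask_output=True returns several candidate masks for the one point;
# choosing among such proposals is exactly where a method like SAPNet adds value.
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label,
                                     multimask_output=True)
best_mask = masks[int(np.argmax(scores))]
```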
3.4 Using Visual Prompts in Foundation Models for Better Image Segmentation
Paper: https://arxiv.org/pdf/2404.11732
Run it: https://github.com/rayat137/VisualPromptGFSS
Main novelty: Implementation of learned visual prompts with a transformer decoder for generalized few-shot segmentation (GFSS). It introduces a unidirectional causal attention mechanism between novel prompts (derived from limited examples) and base prompts (gained from abundant data).
Potential applications:
- Autonomous Vehicles: Rapid adaptation for recognizing and segmenting new objects or road conditions with minimal examples while maintaining performance on common road features.
- Satellite Imagery Analysis: Identifying and segmenting new land uses or environmental changes with few examples, while ensuring accuracy for well-known geographical features.
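As a rough illustration of the attention scheme (a sketch, not the authors' implementation), the mask below lets novel prompts attend to base prompts while shielding base prompts from the novel ones, so base-class knowledge is preserved:

```python
import torch
import torch.nn as nn

dim, n_base, n_novel = 256, 16, 4
base_prompts = torch.randn(1, n_base, dim)    # prompts learned from abundant base-class data
novel_prompts = torch.randn(1, n_novel, dim)  # prompts learned from a few novel examples

tokens = torch.cat([base_prompts, novel_prompts], dim=1)
n = n_base + n_novel

# Boolean attention mask: True = "this query may NOT attend to this key".
attn_mask = torch.zeros(n, n, dtype=torch.bool)
attn_mask[:n_base, n_base:] = True   # base queries are blocked from novel keys

attention = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
updated, _ = attention(tokens, tokens, tokens, attn_mask=attn_mask)
# Novel prompts can borrow information from base prompts, but not vice versa.
```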
3.5 Active Learning Meets Prompting in Vision Language Models (VLMs)
Paper: https://arxiv.org/pdf/2311.11178
Run it: https://github.com/kaist-dmlab/pcb
Main novelty: The PCB framework is a novel active learning approach designed for pre-trained Vision Language Models (VLMs). This framework addresses the challenges of adapting VLMs to new tasks while reducing the need for costly labeling.
Potential applications:
- Medical Imaging: Quickly adapting VLMs for identifying new disease patterns with minimal expert labeling.
- E-commerce: Enhancing product categorization and search capabilities by adapting VLMs to new product lines with limited manual input.
4. Looking Ahead
As discussed, Visual Prompting allows for adapting foundation models within the input space, serving as a universal interface for both human users and models.
The advent of promptable models in Vision is poised to transform the traditional Computer Vision pipeline. Many of these models are emerging as foundational elements that could replace common stages in conventional pipelines, such as labeling.
At Tenyks, we anticipate this disruption is imminent. Our article on Computer Vision Pipeline 2.0 offers insights into why this change is both crucial and unavoidable.
Stay informed with the highlights from CVPR 2024, the premier conference in the vision domain:
- Innovations in Image and Video Search & Understanding (RAG, Multimodal, Embeddings, and more).
- Essential highlights in Embodied AI, GenAI, Foundation Models, and Video Understanding.
References
[1] Exploring Visual Prompts for Adapting Large-Scale Models
[2] Image Analogies
[3] Language Models are Unsupervised Multitask Learners
[4] Visual Prompting
[5] Segment Anything
[6] YOLO-World: Real-Time Open-Vocabulary Object Detection
Authors: Jose Gabriel Islas Montero, Dmitry Kazhdan. If you’d like to learn more about Tenyks, explore our sandbox.