Woodpecker Framework Corrects Visual Hallucinations in Multimodal Model Outputs

Recently, researchers from the University of Science and Technology of China and Tencent YouTu Lab introduced the Woodpecker framework, designed to address visual hallucinations that multimodal large language models produce in image description tasks. Woodpecker extracts key concepts from a model's output, formulates validation questions about them, verifies those questions against the image using visual expert models, and generates grounded visual claims, which it then uses to produce a corrected description. Experiments show that Woodpecker significantly improves the perception of various multimodal models with respect to object existence, quantity, attributes, and more, reducing visual hallucinations. The researchers also provide an online demo where users can try Woodpecker's hallucination correction. This framework offers a new approach to improving the reliability of multimodal models.
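The correction pipeline described in the article can be sketched roughly as follows. This is a minimal illustrative sketch only: every function body, the object vocabulary, and the fake detector output are hypothetical stand-ins, not Woodpecker's actual implementation, which relies on an LLM plus visual expert models (such as an open-vocabulary detector) at each stage.

```python
# Hypothetical sketch of a Woodpecker-style five-stage correction pipeline.
# All logic here is a toy stand-in for illustration, not the real framework.

def extract_key_concepts(description):
    """Stage 1: pull candidate objects mentioned in the model's output."""
    vocabulary = {"dog", "cat", "frisbee", "ball"}  # hypothetical object list
    return [w.strip(".,").lower() for w in description.split()
            if w.strip(".,").lower() in vocabulary]

def formulate_questions(concepts):
    """Stage 2: turn each concept into a verification question."""
    return [f"Is there a {c} in the image?" for c in concepts]

def visual_validation(questions, detections):
    """Stage 3: answer each question with a visual expert (here, a fake
    detector output is passed in as `detections`)."""
    answers = {}
    for q in questions:
        concept = q.removeprefix("Is there a ").removesuffix(" in the image?")
        answers[concept] = concept in detections
    return answers

def generate_visual_claims(answers):
    """Stage 4: convert grounded answers into explicit visual claims."""
    return [f"There is {'a' if ok else 'no'} {c} in the image."
            for c, ok in answers.items()]

def correct_description(description, answers):
    """Stage 5: drop sentences that mention objects the detector refuted."""
    kept = []
    for sent in description.split(". "):
        words = {w.strip(".,").lower() for w in sent.split()}
        if all(ok for c, ok in answers.items() if c in words):
            kept.append(sent)
    return ". ".join(kept)

# Toy run: the caption hallucinates a frisbee the detector never saw.
caption = "A dog is running. The dog carries a frisbee."
detections = {"dog"}  # pretend detector output
concepts = extract_key_concepts(caption)
answers = visual_validation(formulate_questions(concepts), detections)
claims = generate_visual_claims(answers)
corrected = correct_description(caption, answers)
```

In this toy run, the frisbee fails visual validation, so the second sentence is removed and only "A dog is running" survives; the real framework rewrites rather than merely deletes, guided by the generated visual claims.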

Source: 我爱计算机视觉 (I Love Computer Vision)
This article is from AIbase Daily
Welcome to the [AI Daily] column! This is your daily guide to the world of artificial intelligence. Every day, we bring you the hot topics in AI with a focus on developers, helping you follow technical trends and discover innovative AI product applications.