Paper ID: 2503.01754 • Published Mar 3, 2025
SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces
Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu
New York University • Amazon
Reasoning is increasingly crucial for various tasks. While chain-of-thought
prompting enables large language models to leverage reasoning effectively,
harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains
challenging. To address this problem, we propose a novel self-distillation
framework that enhances the reasoning capabilities of VLMs. The proposed
framework introduces several key innovations. We start by employing a prompt
library tailored to visual reasoning tasks to generate diverse in-context
questions and utilize a two-step reasoning procedure to derive reasoning-guided
responses. These responses are then used for self-distillation, enabling the
model to internalize the reasoning process. Additionally, we improve the model
architecture with several innovative components, including an intervention
adapter for efficient parameter updates, a cross-modal skip connection to
facilitate information exchange between modalities, and an ensemble learning
algorithm to integrate diverse reasoning from multiple in-context questions.
Extensive experiments show that our method significantly improves the baseline
performance across five VQA datasets.
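The abstract describes a pipeline in which a prompt library produces diverse in-context questions, a two-step procedure first elicits a rationale and then an answer conditioned on that rationale, and the resulting traces are combined for self-distillation. The sketch below illustrates one plausible reading of that pipeline. All names here (`vlm_generate`, `ReasoningTrace`, `build_traces`, `distillation_loss`) are hypothetical placeholders introduced for illustration, not the authors' released code, and the prompt wording and loss weighting are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ReasoningTrace:
    """One reasoning-guided response derived from a single in-context question."""
    in_context_question: str
    rationale: str  # step 1: the model's reasoning
    answer: str     # step 2: the answer conditioned on that reasoning


def build_traces(vlm_generate: Callable[[str, str], str],
                 image: str,
                 question: str,
                 prompt_library: List[str]) -> List[ReasoningTrace]:
    """Generate diverse reasoning traces from a library of prompt templates.

    `vlm_generate(image, prompt)` stands in for whatever VLM inference call
    is actually used; each template rephrases the original question.
    """
    traces = []
    for template in prompt_library:
        icq = template.format(question=question)  # diverse in-context question
        # Step 1: elicit a rationale.
        rationale = vlm_generate(image, icq + "\nLet's think step by step.")
        # Step 2: answer conditioned on the rationale.
        answer = vlm_generate(image, f"{icq}\nRationale: {rationale}\nAnswer:")
        traces.append(ReasoningTrace(icq, rationale, answer))
    return traces


def distillation_loss(per_trace_loss: Callable[[ReasoningTrace], float],
                      traces: List[ReasoningTrace],
                      weights: List[float]) -> float:
    """Weighted combination of per-trace losses, echoing the ensemble step
    that integrates reasoning from multiple in-context questions."""
    total = sum(w * per_trace_loss(t) for w, t in zip(weights, traces))
    return total / sum(weights)
```

Under this reading, the student (the same VLM) is then fine-tuned on the aggregated loss so that it internalizes the reasoning traces without an external teacher; how the trace weights are chosen and how the intervention adapter and cross-modal skip connection enter the update are not specified in the abstract.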