Bibliographic Details
Main Authors: Thawakar, Omkar; Dissanayake, Dinura; More, Ketan; Thawkar, Ritesh; Heakl, Ahmed; Ahsan, Noor; Li, Yuhao; Zumri, Mohammed; Lahoud, Jean; Anwer, Rao Muhammad; Cholakkal, Hisham; Laptev, Ivan; Shah, Mubarak; Khan, Fahad Shahbaz; Khan, Salman
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2501.06186
author Thawakar, Omkar
Dissanayake, Dinura
More, Ketan
Thawkar, Ritesh
Heakl, Ahmed
Ahsan, Noor
Li, Yuhao
Zumri, Mohammed
Lahoud, Jean
Anwer, Rao Muhammad
Cholakkal, Hisham
Laptev, Ivan
Shah, Mubarak
Khan, Fahad Shahbaz
Khan, Salman
contents Reasoning is a fundamental capability for solving complex multi-step problems, particularly in visual contexts where sequential step-wise understanding is essential. Existing approaches lack a comprehensive framework for evaluating visual reasoning and do not emphasize step-wise problem-solving. To this end, we propose a comprehensive framework for advancing step-by-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. The benchmark presents a diverse set of challenges across eight categories, ranging from complex visual perception to scientific reasoning, with over 4k reasoning steps in total, enabling robust evaluation of LMMs' abilities to perform accurate and interpretable visual reasoning across multiple steps. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps, emphasizing both correctness and logical coherence. The proposed metric offers deeper insights into reasoning performance than traditional end-task accuracy metrics. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach, where tasks are progressively organized to facilitate incremental skill acquisition and problem-solving. The proposed LlamaV-o1 is designed for multi-step reasoning and learns step-by-step through a structured training paradigm. Extensive experiments show that our LlamaV-o1 outperforms existing open-source models and performs favorably against closed-source proprietary models. Compared to the recent Llava-CoT, our LlamaV-o1 achieves an average score of 67.3 with an absolute gain of 3.8% across six benchmarks while being 5 times faster during inference scaling. Our benchmark, model, and code are publicly available.
format Preprint
id arxiv_https___arxiv_org_abs_2501_06186
institution arXiv
publishDate 2025
record_format arxiv
title LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2501.06186