| Main Authors: | Roccabruna, Gabriel, Khomyn, Olha, Riccardi, Giuseppe |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.14589 |
Similar Items
Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?
by: Roccabruna, Gabriel, et al.
Published: (2024)
CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)
Should We Fine-Tune or RAG? Evaluating Different Techniques to Adapt LLMs for Dialogue
by: Alghisi, Simone, et al.
Published: (2024)
SUPERChem: A Multimodal Reasoning Benchmark in Chemistry
by: Zhao, Zehua, et al.
Published: (2025)
INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs' Performance in Insurance
by: Lin, Chenwei, et al.
Published: (2024)
Zero-Shot Defense Against Toxic Images via Inherent Multimodal Alignment in LVLMs
by: Zhao, Wei, et al.
Published: (2025)
ECG-Reasoning-Benchmark: A Benchmark for Evaluating Clinical Reasoning Capabilities in ECG Interpretation
by: Oh, Jungwoo, et al.
Published: (2026)
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
by: Cai, Mu, et al.
Published: (2024)
Reasoning Planning for Language Models
by: Nguyen, Bao, et al.
Published: (2025)
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
by: Potamitis, Nearchos, et al.
Published: (2025)
Benchmarking and Confidence Evaluation of LALMs For Temporal Reasoning
by: Bhattacharya, Debarpan, et al.
Published: (2025)
Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation
by: Hua, Zhenglin, et al.
Published: (2025)
Benchmarking ChatGPT on Algorithmic Reasoning
by: McLeish, Sean, et al.
Published: (2024)
Can Large Language Models Reason and Plan?
by: Kambhampati, Subbarao
Published: (2024)
Guiding Language Model Reasoning with Planning Tokens
by: Wang, Xinyi, et al.
Published: (2023)
Planetarium: A Rigorous Benchmark for Translating Text to Structured Planning Languages
by: Zuo, Max, et al.
Published: (2024)
Temporal Consistency for LLM Reasoning Process Error Identification
by: Guo, Jiacheng, et al.
Published: (2025)
RHYTHM: Reasoning with Hierarchical Temporal Tokenization for Human Mobility
by: He, Haoyu, et al.
Published: (2025)
Exploring and Benchmarking the Planning Capabilities of Large Language Models
by: Bohnet, Bernd, et al.
Published: (2024)
Why Reasoning Fails to Plan: A Planning-Centric Analysis of Long-Horizon Decision Making in LLM Agents
by: Wang, Zehong, et al.
Published: (2026)
When Prompts Override Vision: Prompt-Induced Hallucinations in LVLMs
by: Khayatan, Pegah, et al.
Published: (2026)
LogicVista: Multimodal LLM Logical Reasoning Benchmark in Visual Contexts
by: Xiao, Yijia, et al.
Published: (2024)
Learning to Reason Over Time: Timeline Self-Reflection for Improved Temporal Reasoning in Language Models
by: Bazaga, Adrián, et al.
Published: (2025)
Time-R1: Towards Comprehensive Temporal Reasoning in LLMs
by: Liu, Zijia, et al.
Published: (2025)
SAHM: A Benchmark for Arabic Financial and Shari'ah-Compliant Reasoning
by: Elbadry, Rania, et al.
Published: (2026)
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations
by: Dongre, Vardhan, et al.
Published: (2025)
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning
by: Cai, Zikui, et al.
Published: (2025)
Dual-Uncertainty Guided Policy Learning for Multimodal Reasoning
by: Liu, Rui, et al.
Published: (2025)
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
by: Lin, Zicheng, et al.
Published: (2024)
Graph-enhanced Large Language Models in Asynchronous Plan Reasoning
by: Lin, Fangru, et al.
Published: (2024)
Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series
by: Chang, Ching, et al.
Published: (2025)
MIKE: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing
by: Li, Jiaqi, et al.
Published: (2024)
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
by: Adilazuarda, Muhammad Farid, et al.
Published: (2025)
RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents
by: Kagaya, Tomoyuki, et al.
Published: (2024)
FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning
by: Xie, Zhuohan, et al.
Published: (2025)
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
by: Ramezanali, Mohammad, et al.
Published: (2025)
EngTrace: A Symbolic Benchmark for Verifiable Process Supervision of Engineering Reasoning
by: Gull, Ayesha, et al.
Published: (2025)
MedRECT: A Medical Reasoning Benchmark for Error Correction in Clinical Texts
by: Iwase, Naoto, et al.
Published: (2025)
Beyond Benchmarks: On The False Promise of AI Regulation
by: Stanovsky, Gabriel, et al.
Published: (2025)
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)