:: Library Catalog

Bibliographic Details
Main Authors:	Pan, Yulu; Yi, Han; Ha, Seongsu; Islam, Md Mohaiminul; Zhang, Benjamin; Torresani, Lorenzo; Bertasius, Gedas
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.31529

Similar Items

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)

Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)

TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)

ExAct: A Video-Language Benchmark for Expert Action Analysis
by: Yi, Han, et al.
Published: (2025)

A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)

SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)

Step Differences in Instructional Video
by: Nagarajan, Tushar, et al.
Published: (2024)

LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
by: Jung, Minjoon, et al.
Published: (2026)

DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)

RECIPE: Procedural Planning via Grounding in Instructional Video
by: Seminara, Luigi, et al.
Published: (2026)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)

TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
by: Zhang, Ce, et al.
Published: (2025)

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)

DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
by: Siddiqui, Md. Saiful Bari, et al.
Published: (2024)

Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space
by: Lee, Junho, et al.
Published: (2024)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)

VITED: Video Temporal Evidence Distillation
by: Lu, Yujie, et al.
Published: (2025)

Semantic Compositions Enhance Vision-Language Contrastive Learning
by: Aladago, Maxwell, et al.
Published: (2024)

How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
by: Jun, Sejoon, et al.
Published: (2026)

ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes
by: Somayazulu, Arjun, et al.
Published: (2025)

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)

Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
by: Ha, Seongsu, et al.
Published: (2024)

Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation
by: Islam, Niful, et al.
Published: (2024)

Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
by: Chung, Hyungjin, et al.
Published: (2025)

Intelligent Systems in Neuroimaging: Pioneering AI Techniques for Brain Tumor Detection
by: Islam, Md. Mohaiminul, et al.
Published: (2025)