:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Ce; Song, Yale; Desai, Ruta; Iuzzolino, Michael Louis; Tighe, Joseph; Bertasius, Gedas; Kottur, Satwik
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.15130

Similar Items

Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
by: Lebailly, Tim, et al.
Published: (2025)

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)

SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)

Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)

LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)

A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)

TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)

Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)

EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
by: Hazra, Rishi, et al.
Published: (2023)

ExAct: A Video-Language Benchmark for Expert Action Analysis
by: Yi, Han, et al.
Published: (2025)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
by: Pan, Yulu, et al.
Published: (2026)

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)

Enhancing Monocular Depth Estimation with Multi-Source Auxiliary Tasks
by: Quercia, Alessio, et al.
Published: (2025)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)

Embodied Navigation with Auxiliary Task of Action Description Prediction
by: Kondoh, Haru, et al.
Published: (2025)

Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
by: Lin, Yan-Bo, et al.
Published: (2025)

Unprejudiced Training Auxiliary Tasks Makes Primary Better: A Multi-Task Learning Perspective
by: Li, Yuanze, et al.
Published: (2024)

TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)

Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation
by: Xu, Lian, et al.
Published: (2024)

User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
by: Verghese, Mrinal, et al.
Published: (2024)

Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization
by: Fontana, Maxime, et al.
Published: (2024)

SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction
by: Wu, Wei, et al.
Published: (2024)

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
by: Yeh, Chun-Hsiao, et al.
Published: (2026)

Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning
by: Zhang, Yi, et al.
Published: (2024)

Don't Pause! Every prediction matters in a streaming video
by: Chatterjee, Dibyadip, et al.
Published: (2026)

Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training
by: Sobotka, Daniel, et al.
Published: (2025)