| Main Authors: | Zhang, Ce, Song, Yale, Desai, Ruta, Iuzzolino, Michael Louis, Tighe, Joseph, Bertasius, Gedas, Kottur, Satwik |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.15130 |
Similar Items
Synthetic Captions for Open-Vocabulary Zero-Shot Segmentation
by: Lebailly, Tim, et al.
Published: (2025)
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)
SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)
Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)
LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)
A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)
TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)
Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)
EgoTV: Egocentric Task Verification from Natural Language Task Descriptions
by: Hazra, Rishi, et al.
Published: (2023)
ExAct: A Video-Language Benchmark for Expert Action Analysis
by: Yi, Han, et al.
Published: (2025)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
by: Pan, Yulu, et al.
Published: (2026)
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)
Enhancing Monocular Depth Estimation with Multi-Source Auxiliary Tasks
by: Quercia, Alessio, et al.
Published: (2025)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)
Embodied Navigation with Auxiliary Task of Action Description Prediction
by: Kondoh, Haru, et al.
Published: (2025)
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
by: Lin, Yan-Bo, et al.
Published: (2025)
Unprejudiced Training Auxiliary Tasks Makes Primary Better: A Multi-Task Learning Perspective
by: Li, Yuanze, et al.
Published: (2024)
TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)
Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation
by: Xu, Lian, et al.
Published: (2024)
User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance
by: Verghese, Mrinal, et al.
Published: (2024)
Optimizing Dense Visual Predictions Through Multi-Task Coherence and Prioritization
by: Fontana, Maxime, et al.
Published: (2024)
SMART: Scalable Multi-agent Real-time Motion Generation via Next-token Prediction
by: Wu, Wei, et al.
Published: (2024)
Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning
by: Yeh, Chun-Hsiao, et al.
Published: (2026)
Test-time Distribution Learning Adapter for Cross-modal Visual Reasoning
by: Zhang, Yi, et al.
Published: (2024)
Don't Pause! Every prediction matters in a streaming video
by: Chatterjee, Dibyadip, et al.
Published: (2026)
Improving Vessel Segmentation with Multi-Task Learning and Auxiliary Data Available Only During Model Training
by: Sobotka, Daniel, et al.
Published: (2025)