| Main Authors: | Pan, Yulu, Yi, Han, Ha, Seongsu, Islam, Md Mohaiminul, Zhang, Benjamin, Torresani, Lorenzo, Bertasius, Gedas |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.31529 |
Similar Items
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)
Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)
TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)
ExAct: A Video-Language Benchmark for Expert Action Analysis
by: Yi, Han, et al.
Published: (2025)
A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)
SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)
Step Differences in Instructional Video
by: Nagarajan, Tushar, et al.
Published: (2024)
LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
by: Jung, Minjoon, et al.
Published: (2026)
DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)
RECIPE: Procedural Planning via Grounding in Instructional Video
by: Seminara, Luigi, et al.
Published: (2026)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)
TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
by: Zhang, Ce, et al.
Published: (2025)
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)
DUA-D2C: Dynamic Uncertainty Aware Method for Overfitting Remediation in Deep Learning
by: Siddiqui, Md. Saiful Bari, et al.
Published: (2024)
Scalable Frame Sampling for Video Classification: A Semi-Optimal Policy Approach with Reduced Search Space
by: Lee, Junho, et al.
Published: (2024)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)
VITED: Video Temporal Evidence Distillation
by: Lu, Yujie, et al.
Published: (2025)
Semantic Compositions Enhance Vision-Language Contrastive Learning
by: Aladago, Maxwell, et al.
Published: (2024)
How You Move Tells What You'll Do: Trajectory-Conditioned Egocentric Prediction
by: Jun, Sejoon, et al.
Published: (2026)
ViewBridge: Curriculum Knowledge Distillation for Activity View-Invariance Under Extreme Viewpoint Changes
by: Somayazulu, Arjun, et al.
Published: (2025)
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)
Finding NeMo: Negative-mined Mosaic Augmentation for Referring Image Segmentation
by: Ha, Seongsu, et al.
Published: (2024)
Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation
by: Islam, Niful, et al.
Published: (2024)
Video Parallel Scaling: Aggregating Diverse Frame Subsets for VideoLLMs
by: Chung, Hyungjin, et al.
Published: (2025)
Intelligent Systems in Neuroimaging: Pioneering AI Techniques for Brain Tumor Detection
by: Islam, Md. Mohaiminul, et al.
Published: (2025)