| Main Authors: | Yi, Han, Pan, Yulu, He, Feihong, Liu, Xinyu, Zhang, Benjamin, Oguntola, Oluwatumininu, Bertasius, Gedas |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.06277 |
Similar Items
BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
by: Pan, Yulu, et al.
Published: (2026)
SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)
RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)
Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)
TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)
BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)
LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)
TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)
Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)
A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)
Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)
ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)
DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)
VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
by: Dong, Shaoqi, et al.
Published: (2025)
V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)
Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
by: Zhang, Ce, et al.
Published: (2025)
TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)
AesRM: Improving Video Aesthetics with Expert-Level Feedback
by: Han, Yujin, et al.
Published: (2026)
LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)
FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes
by: Pan, Ziying, et al.
Published: (2024)
PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification
by: Su, Meijuan, et al.
Published: (2023)
ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
by: Peng, Ziqiao, et al.
Published: (2025)
COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
by: Jacob, Darryl Cherian, et al.
Published: (2026)
InstrAct: Towards Action-Centric Understanding in Instructional Videos
by: Yang, Zhuoyi, et al.
Published: (2026)
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
by: Ling, Yiran, et al.
Published: (2026)
ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding
by: Wang, Yubin, et al.
Published: (2024)
ActAnywhere: Subject-Aware Video Background Generation
by: Pan, Boxiao, et al.
Published: (2024)
Kronecker Mask and Interpretive Prompts are Language-Action Video Learners
by: Yang, Jingyi, et al.
Published: (2025)
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
by: Salehi, Mohammadreza, et al.
Published: (2024)
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
by: Ye, Wencheng, et al.
Published: (2025)