:: Library Catalog

Bibliographic Details
Main Authors:	Yi, Han; Pan, Yulu; He, Feihong; Liu, Xinyu; Zhang, Benjamin; Oguntola, Oluwatumininu; Bertasius, Gedas
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.06277

Similar Items

BASKET: A Large-Scale Video Dataset for Fine-Grained Skill Estimation
by: Pan, Yulu, et al.
Published: (2025)

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
by: Pan, Yulu, et al.
Published: (2026)

SiLVR: A Simple Language-based Video Reasoning Framework
by: Zhang, Ce, et al.
Published: (2025)

RGNet: A Unified Clip Retrieval and Grounding Network for Long Videos
by: Hannan, Tanveer, et al.
Published: (2023)

Siamese Vision Transformers are Scalable Audio-visual Learners
by: Lin, Yan-Bo, et al.
Published: (2024)

TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
by: Tursynbek, Nurislam, et al.
Published: (2026)

ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos
by: Hannan, Tanveer, et al.
Published: (2024)

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering
by: Islam, Md Mohaiminul, et al.
Published: (2025)

LoCoNet: Long-Short Context Network for Active Speaker Detection
by: Wang, Xizi, et al.
Published: (2023)

TimeBlind: A Spatio-Temporal Compositionality Benchmark for Video LLMs
by: Li, Baiqi, et al.
Published: (2026)

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)

Video ReCap: Recursive Captioning of Hour-Long Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding
by: Zhou, Yiyang, et al.
Published: (2025)

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences
by: Wang, Xiyao, et al.
Published: (2024)

VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
by: Lin, Yan-Bo, et al.
Published: (2024)

A Simple LLM Framework for Long-Range Video Question-Answering
by: Zhang, Ce, et al.
Published: (2023)

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)

Propose, Assess, Search: Harnessing LLMs for Goal-Oriented Planning in Instructional Videos
by: Islam, Md Mohaiminul, et al.
Published: (2024)

ReBot: Scaling Robot Learning with Real-to-Sim-to-Real Robotic Video Synthesis
by: Fang, Yu, et al.
Published: (2025)

DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)

VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation
by: Dong, Shaoqi, et al.
Published: (2025)

V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation
by: Lin, Yan-Bo, et al.
Published: (2026)

Enhancing Visual Planning with Auxiliary Tasks and Multi-token Prediction
by: Zhang, Ce, et al.
Published: (2025)

TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)

AesRM: Improving Video Aesthetics with Expert-Level Feedback
by: Han, Yujin, et al.
Published: (2026)

LiLo-VLA: Compositional Long-Horizon Manipulation via Linked Object-Centric Policies
by: Yang, Yue, et al.
Published: (2026)

FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes
by: Pan, Ziying, et al.
Published: (2024)

PrototypeFormer: Learning to Explore Prototype Relationships for Few-shot Image Classification
by: Su, Meijuan, et al.
Published: (2023)

ActAvatar: Temporally-Aware Precise Action Control for Talking Avatars
by: Peng, Ziqiao, et al.
Published: (2025)

COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection
by: Jacob, Darryl Cherian, et al.
Published: (2026)

InstrAct: Towards Action-Centric Understanding in Instructional Videos
by: Yang, Zhuoyi, et al.
Published: (2026)

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
by: Ling, Yiran, et al.
Published: (2026)

ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding
by: Wang, Yubin, et al.
Published: (2024)

ActAnywhere: Subject-Aware Video Background Generation
by: Pan, Boxiao, et al.
Published: (2024)

Kronecker Mask and Interpretive Prompts are Language-Action Video Learners
by: Yang, Jingyi, et al.
Published: (2025)

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
by: Salehi, Mohammadreza, et al.
Published: (2024)

ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
by: Ye, Wencheng, et al.
Published: (2025)