| Main Authors: | Chang, Aiden, De Melo, Celso, Lukin, Stephanie M. |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.16421 |
Similar Items
Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval
by: Li, Aiden Yiliu, et al.
Published: (2026)
What and When to Look?: Temporal Span Proposal Network for Video Relation Detection
by: Woo, Sangmin, et al.
Published: (2021)
Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
by: Morini, Marco, et al.
Published: (2026)
ViLCo-Bench: VIdeo Language COntinual learning Benchmark
by: Tang, Tianqi, et al.
Published: (2024)
Unleash the Potential of CLIP for Video Highlight Detection
by: Han, Donghoon, et al.
Published: (2024)
Video Diffusion Models Excel at Tracking Similar-Looking Objects Without Supervision
by: Zhang, Chenshuang, et al.
Published: (2025)
LookAhead Tuning: Safer Language Models via Partial Answer Previews
by: Liu, Kangwei, et al.
Published: (2025)
Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
by: Shabtay, Nimrod, et al.
Published: (2026)
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
by: Wang, Xingrui, et al.
Published: (2025)
Unsupervised Transcript-assisted Video Summarization and Highlight Detection
by: Barbakos, Spyros, et al.
Published: (2025)
Thinking Ahead: Foresight Intelligence in MLLMs and World Models
by: Gong, Zhantao, et al.
Published: (2025)
A Modern Look at Simplicity Bias in Image Classification Tasks
by: Chang, Xiaoguang, et al.
Published: (2025)
AI-Generated Images: What Humans and Machines See When They Look at the Same Image
by: Poletti, Silvia, et al.
Published: (2026)
GPTSee: Enhancing Moment Retrieval and Highlight Detection via Description-Based Similarity Features
by: Sun, Yunzhuo, et al.
Published: (2024)
Predicting the Next Action by Modeling the Abstract Goal
by: Roy, Debaditya, et al.
Published: (2022)
What Matters in Range View 3D Object Detection
by: Wilson, Benjamin, et al.
Published: (2024)
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)
Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection
by: Yang, Jin, et al.
Published: (2024)
Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection
by: Um, Sung Jin, et al.
Published: (2025)
Overcoming Semantic Dilution in Transformer-Based Next Frame Prediction
by: Nguyen, Hy, et al.
Published: (2025)
Automated Detection of Sport Highlights from Audio and Video Sources
by: Della Santa, Francesco, et al.
Published: (2025)
VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
by: Paul, Dhiman, et al.
Published: (2024)
Memorize What Matters: Emergent Scene Decomposition from Multitraverse
by: Li, Yiming, et al.
Published: (2024)
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
by: Ren, Shuhuai, et al.
Published: (2025)
Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge
by: Medeiros, Heitor Rapela, et al.
Published: (2024)
ETA: Efficiency through Thinking Ahead, A Dual Approach to Self-Driving with Large Models
by: Hamdan, Shadi, et al.
Published: (2025)
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
by: Zhou, Chunting, et al.
Published: (2024)
VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction
by: Ji, Longbin, et al.
Published: (2026)
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
by: Tian, Keyu, et al.
Published: (2024)
Generating Narrated Lecture Videos from Slides with Synchronized Highlights
by: Holmberg, Alexander
Published: (2025)
Fostering Video Reasoning via Next-Event Prediction
by: Wang, Haonan, et al.
Published: (2025)
Looking into Concept Explanation Methods for Diabetic Retinopathy Classification
by: Storås, Andrea M., et al.
Published: (2024)
What Makes a Maze Look Like a Maze?
by: Hsu, Joy, et al.
Published: (2024)
Learning What Matters: Prioritized Concept Learning via Relative Error-driven Sample Selection
by: Chandhok, Shivam, et al.
Published: (2025)
What Matters in Practical Learned Image Compression
by: Tatwawadi, Kedar, et al.
Published: (2026)
Object Aware Egocentric Online Action Detection
by: An, Joungbin, et al.
Published: (2024)
What to Do Next? Memorizing skills from Egocentric Instructional Video
by: Bi, Jing, et al.
Published: (2025)
What Happens Next? Anticipating Future Motion by Generating Point Trajectories
by: Boduljak, Gabrijel, et al.
Published: (2025)
What Matters for Scalable and Robust Learning in End-to-End Driving Planners?
by: Holtz, David, et al.
Published: (2026)
What Matters to You? Towards Visual Representation Alignment for Robot Learning
by: Tian, Ran, et al.
Published: (2023)