| Main Authors: | Fatima, Anam; Yu, Yi; Kapuriya, Janak; Lalanne, Julien; Shukla, Jainendra |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.26978 |
Similar Items
Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)
Realizing Video Summarization from the Path of Language-based Semantic Understanding
by: Mu, Kuan-Chen, et al.
Published: (2024)
Online Misinformation Detection in Live Streaming Videos
by: Cao, Rui
Published: (2025)
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
by: Liang, Jianxin, et al.
Published: (2024)
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
by: Papalampidi, Pinelopi, et al.
Published: (2023)
HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
by: Yang, Yiqing, et al.
Published: (2025)
Adaptive Greedy Frame Selection for Long Video Understanding
by: Huang, Yuning, et al.
Published: (2026)
Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
by: Fu, Fengyi, et al.
Published: (2024)
DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
by: Hong, Susung, et al.
Published: (2023)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
Everything is a Video: Unifying Modalities through Next-Frame Prediction
by: Hudson, G. Thomas, et al.
Published: (2024)
FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing
by: Premsri, Tanawan, et al.
Published: (2025)
Semantic Map-based Generation of Navigation Instructions
by: Li, Chengzu, et al.
Published: (2024)
Seeking and Updating with Live Visual Knowledge
by: Fu, Mingyang, et al.
Published: (2025)
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
by: Yu, Suhao, et al.
Published: (2025)
WikiVideo: Article Generation from Multiple Videos
by: Martin, Alexander, et al.
Published: (2025)
MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
by: Zhang, Zhizhen, et al.
Published: (2024)
Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
by: Lin, Ci-Siang, et al.
Published: (2024)
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
by: Guo, Weiyu, et al.
Published: (2025)
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
by: Biyyala, Varun, et al.
Published: (2025)
VideoStudio: Generating Consistent-Content and Multi-Scene Videos
by: Long, Fuchen, et al.
Published: (2024)
Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
by: Kapuriya, Janak, et al.
Published: (2025)
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
by: Gao, Bingjie, et al.
Published: (2025)
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
by: Tong, Jingqi, et al.
Published: (2025)
Transformer with Controlled Attention for Synchronous Motion Captioning
by: Radouane, Karim, et al.
Published: (2024)
Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
by: Peng, Bo, et al.
Published: (2026)
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
by: Han, Songhao, et al.
Published: (2024)
Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
by: Huang, Haoyang, et al.
Published: (2025)
Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline
by: Jia, Qi, et al.
Published: (2024)
Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
by: Li, Chengzu, et al.
Published: (2026)
Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
by: Dou, Zi-Yi, et al.
Published: (2024)
TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark
by: Ok, Hyunjong, et al.
Published: (2025)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
General Transform: A Unified Framework for Adaptive Transform to Enhance Representations
by: Budiutama, Gekko, et al.
Published: (2025)
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
by: Chen, Wang, et al.
Published: (2026)
Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
by: Feng, Fu, et al.
Published: (2024)
VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate
by: Yuan, Zhihang, et al.
Published: (2025)
FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
by: Le, Minh Khoa, et al.
Published: (2026)
Semantic Frame Interpolation
by: Hong, Yijia, et al.
Published: (2025)