:: Library Catalog

Bibliographic Details
Main Authors:	Fatima, Anam; Yu, Yi; Kapuriya, Janak; Lalanne, Julien; Shukla, Jainendra
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2510.26978

Similar Items

Frame-Voyager: Learning to Query Frames for Video Large Language Models
by: Yu, Sicheng, et al.
Published: (2024)

Realizing Video Summarization from the Path of Language-based Semantic Understanding
by: Mu, Kuan-Chen, et al.
Published: (2024)

Online Misinformation Detection in Live Streaming Videos
by: Cao, Rui
Published: (2025)

End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
by: Liang, Jianxin, et al.
Published: (2024)

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
by: Papalampidi, Pinelopi, et al.
Published: (2023)

HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
by: Yang, Yiqing, et al.
Published: (2025)

Adaptive Greedy Frame Selection for Long Video Understanding
by: Huang, Yuning, et al.
Published: (2026)

Sentiment-oriented Transformer-based Variational Autoencoder Network for Live Video Commenting
by: Fu, Fengyi, et al.
Published: (2024)

DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation
by: Hong, Susung, et al.
Published: (2023)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

Everything is a Video: Unifying Modalities through Next-Frame Prediction
by: Hudson, G. Thomas, et al.
Published: (2024)

FoR-SALE: Frame of Reference-guided Spatial Adjustment in LLM-based Diffusion Editing
by: Premsri, Tanawan, et al.
Published: (2025)

Semantic Map-based Generation of Navigation Instructions
by: Li, Chengzu, et al.
Published: (2024)

Seeking and Updating with Live Visual Knowledge
by: Fu, Mingyang, et al.
Published: (2025)

MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
by: Yu, Suhao, et al.
Published: (2025)

WikiVideo: Article Generation from Multiple Videos
by: Martin, Alexander, et al.
Published: (2025)

MEXA: Towards General Multimodal Reasoning with Dynamic Multi-Expert Aggregation
by: Yu, Shoubin, et al.
Published: (2025)

Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
by: Zhang, Zhizhen, et al.
Published: (2024)

Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation
by: Lin, Ci-Siang, et al.
Published: (2024)

Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
by: Guo, Weiyu, et al.
Published: (2025)

SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing
by: Biyyala, Varun, et al.
Published: (2025)

VideoStudio: Generating Consistent-Content and Multi-Scene Videos
by: Long, Fuchen, et al.
Published: (2024)

Enhancing Scientific Visual Question Answering via Vision-Caption aware Supervised Fine-Tuning
by: Kapuriya, Janak, et al.
Published: (2025)

The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
by: Gao, Bingjie, et al.
Published: (2025)

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
by: Tong, Jingqi, et al.
Published: (2025)

Transformer with Controlled Attention for Synchronous Motion Captioning
by: Radouane, Karim, et al.
Published: (2024)

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
by: Peng, Bo, et al.
Published: (2026)

VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
by: Han, Songhao, et al.
Published: (2024)

Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model
by: Huang, Haoyang, et al.
Published: (2025)

Infer Induced Sentiment of Comment Response to Video: A New Task, Dataset and Baseline
by: Jia, Qi, et al.
Published: (2024)

Thinking in Frames: How Visual Context and Test-Time Scaling Empower Video Reasoning
by: Li, Chengzu, et al.
Published: (2026)

Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning
by: Dou, Zi-Yi, et al.
Published: (2024)

TempCore: Are Video QA Benchmarks Temporally Grounded? A Frame Selection Sensitivity Analysis and Benchmark
by: Ok, Hyunjong, et al.
Published: (2025)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

General Transform: A Unified Framework for Adaptive Transform to Enhance Representations
by: Budiutama, Gekko, et al.
Published: (2025)

Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
by: Chen, Wang, et al.
Published: (2026)

Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
by: Feng, Fu, et al.
Published: (2024)

VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate
by: Yuan, Zhihang, et al.
Published: (2025)

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
by: Le, Minh Khoa, et al.
Published: (2026)

Semantic Frame Interpolation
by: Hong, Yijia, et al.
Published: (2025)