:: Library Catalog

Bibliographic Details
Main Authors:	Mu, Kuan-Chen; Chin, Zhi-Yi; Chiu, Wei-Chen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language
Online Access:	https://arxiv.org/abs/2410.04511

Similar Items

Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting
by: Chin, Zhi-Yi, et al.
Published: (2024)

Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
by: Pu, Muxin, et al.
Published: (2025)

Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
by: Chin, Zhi-Yi, et al.
Published: (2023)

VideoXum: Cross-modal Visual and Textural Summarization of Videos
by: Lin, Jingyang, et al.
Published: (2023)

Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models
by: Lee, Yi-Lun, et al.
Published: (2024)

Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
by: Nazir, Maham, et al.
Published: (2026)

DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
by: Yang, Zongxin, et al.
Published: (2024)

Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)

Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
by: Pennec, Galann, et al.
Published: (2025)

Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
by: Fatima, Anam, et al.
Published: (2025)

SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation
by: Pu, Muxin, et al.
Published: (2026)

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models
by: Chew, Oscar, et al.
Published: (2026)

MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
by: Zang, Yuan, et al.
Published: (2025)

NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)

VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)

Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
by: Islam, Md Moinul, et al.
Published: (2025)

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
by: Li, Wei, et al.
Published: (2024)

Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)

VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)

From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)

Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
by: Shi, Yi, et al.
Published: (2025)

Wolf: Dense Video Captioning with a World Summarization Framework
by: Li, Boyi, et al.
Published: (2024)

VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
by: Lim, Qi Zhi, et al.
Published: (2025)

A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
by: Du, Wenzhuo, et al.
Published: (2025)

ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
by: Zhang, Yongheng, et al.
Published: (2025)

Personalized Video Summarization by Multimodal Video Understanding
by: Chen, Brian, et al.
Published: (2024)

MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
by: Xu, Jiaqi, et al.
Published: (2023)

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
by: Luo, Chuwei, et al.
Published: (2024)

Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
by: Gigant, Théo, et al.
Published: (2025)

CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
by: Wang, Yanan, et al.
Published: (2025)

DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
by: Chiu, Bo-Cheng, et al.
Published: (2025)

VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
by: Li, Shicheng, et al.
Published: (2023)

Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)

Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
by: Li, Sifan, et al.
Published: (2025)

VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
by: Li, Yunxin, et al.
Published: (2024)

Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)

Improving Language Understanding from Screenshots
by: Gao, Tianyu, et al.
Published: (2024)