| Main Authors: | Mu, Kuan-Chen; Chin, Zhi-Yi; Chiu, Wei-Chen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2410.04511 |
Similar Items
Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting
by: Chin, Zhi-Yi, et al.
Published: (2024)
Sigma: Semantically Informative Pre-training for Skeleton-based Sign Language Understanding
by: Pu, Muxin, et al.
Published: (2025)
Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts
by: Chin, Zhi-Yi, et al.
Published: (2023)
VideoXum: Cross-modal Visual and Textural Summarization of Videos
by: Lin, Jingyang, et al.
Published: (2023)
Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models
by: Lee, Yi-Lun, et al.
Published: (2024)
Multimodal Abstractive Summarization of Instructional Videos with Vision-Language Models
by: Nazir, Maham, et al.
Published: (2026)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
by: Yang, Zongxin, et al.
Published: (2024)
Unleashing Hour-Scale Video Training for Long Video-Language Understanding
by: Lin, Jingyang, et al.
Published: (2025)
Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction
by: Pennec, Galann, et al.
Published: (2025)
Semantic Frame Aggregation-based Transformer for Live Video Comment Generation
by: Fatima, Anam, et al.
Published: (2025)
SignDPO: Multi-level Direct Preference Optimisation for Skeleton-based Gloss-free Sign Language Translation
by: Pu, Muxin, et al.
Published: (2026)
Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models
by: Chew, Oscar, et al.
Published: (2026)
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
by: Zang, Yuan, et al.
Published: (2025)
NeMo: Needle in a Montage for Video-Language Understanding
by: Hu, Zi-Yuan, et al.
Published: (2025)
VideoChat: Chat-Centric Video Understanding
by: Li, KunChang, et al.
Published: (2023)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
Towards an Automated Multimodal Approach for Video Summarization: Building a Bridge Between Text, Audio and Facial Cue-Based Summarization
by: Islam, Md Moinul, et al.
Published: (2025)
TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
by: Li, Wei, et al.
Published: (2024)
Video Understanding with Large Language Models: A Survey
by: Tang, Yolo Y., et al.
Published: (2023)
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
by: Jin, Yang, et al.
Published: (2024)
VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding
by: Yin, Yufei, et al.
Published: (2025)
From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
by: Wang, Xiangfeng, et al.
Published: (2025)
Enhancing Meme Emotion Understanding with Multi-Level Modality Enhancement and Dual-Stage Modal Fusion
by: Shi, Yi, et al.
Published: (2025)
Wolf: Dense Video Captioning with a World Summarization Framework
by: Li, Boyi, et al.
Published: (2024)
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
by: Lim, Qi Zhi, et al.
Published: (2025)
A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
by: Du, Wenzhuo, et al.
Published: (2025)
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
by: Zhang, Yongheng, et al.
Published: (2025)
Personalized Video Summarization by Multimodal Video Understanding
by: Chen, Brian, et al.
Published: (2024)
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
by: Xu, Jiaqi, et al.
Published: (2023)
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
by: Cheng, Zesen, et al.
Published: (2024)
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
by: Luo, Chuwei, et al.
Published: (2024)
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
by: Gigant, Théo, et al.
Published: (2025)
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks
by: Wang, Yanan, et al.
Published: (2025)
DaMO: A Data-Efficient Multimodal Orchestrator for Temporal Reasoning with Video LLMs
by: Chiu, Bo-Cheng, et al.
Published: (2025)
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models
by: Li, Shicheng, et al.
Published: (2023)
Video Summarization: Towards Entity-Aware Captions
by: Ayyubi, Hammad A., et al.
Published: (2023)
Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector
by: Li, Sifan, et al.
Published: (2025)
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
by: Li, Yunxin, et al.
Published: (2024)
Exploring the Role of Explicit Temporal Modeling in Multimodal Large Language Models for Video Understanding
by: Li, Yun, et al.
Published: (2025)
Improving Language Understanding from Screenshots
by: Gao, Tianyu, et al.
Published: (2024)