| Main Authors: | Ilaslan, Muhammet Furkan, Koksal, Ali, Lin, Kevin Qinghong, Satar, Burak, Shou, Mike Zheng, Xu, Qianli |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.11621 |
Similar Items
WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
by: Guan, Runwei, et al.
Published: (2024)
Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)
Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)
Factorized Learning for Temporally Grounded Video-Language Models
by: Zeng, Wenzheng, et al.
Published: (2025)
MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
by: Fang, Pengcheng, et al.
Published: (2026)
Scene-Text Grounding for Text-Based Video Question Answering
by: Zhou, Sheng, et al.
Published: (2024)
Music Grounding by Short Video
by: Xin, Zijie, et al.
Published: (2024)
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)
Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)
Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
by: Xie, Zequn, et al.
Published: (2026)
PlanLLM: Video Procedure Planning with Refinable Large Language Models
by: Yang, Dejie, et al.
Published: (2024)
TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
by: Zhang, Deyu, et al.
Published: (2025)
Prototypical Prompting for Text-to-image Person Re-identification
by: Yan, Shuanglin, et al.
Published: (2024)
Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok
by: Zha, Mingyue, et al.
Published: (2026)
Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)
Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles
by: Wang, Zihan, et al.
Published: (2024)
Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web
by: Heo, Ryang, et al.
Published: (2026)
RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation
by: Tourani, Ali, et al.
Published: (2025)
VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
by: Li, Xiang, et al.
Published: (2023)
VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
by: Wang, Yuyue, et al.
Published: (2025)
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
by: Tan, Chaolei, et al.
Published: (2024)
Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction
by: Zhang, Meishan, et al.
Published: (2024)
Multimodal LLM-based Query Paraphrasing for Video Search
by: Wu, Jiaxin, et al.
Published: (2024)
Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing
by: Zhang, Juan, et al.
Published: (2024)
Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
by: Tong, Haonan, et al.
Published: (2024)
Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)
ProMSC-MIS: Prompt-based Multimodal Semantic Communication for Multi-Spectral Image Segmentation
by: Zhang, Haoshuo, et al.
Published: (2025)
Towards Multimodal Sentiment Analysis via Contrastive Cross-modal Retrieval Augmentation and Hierachical Prompts
by: Zhao, Xianbing, et al.
Published: (2025)
TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)
VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
by: Chen, Yang, et al.
Published: (2024)
Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement
by: Gao, Jiayi, et al.
Published: (2025)
A New Dataset and Benchmark for Grounding Multimodal Misinformation
by: Yang, Bingjian, et al.
Published: (2025)
Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video
by: Kyaw, Alexander Htet, et al.
Published: (2025)
MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
by: Hu, Huanran, et al.
Published: (2026)
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)
Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection
by: Cheung, Tsun-Hin, et al.
Published: (2024)
Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
by: Nanang, Minsak, et al.
Published: (2026)