:: Library Catalog

Bibliographic Details
Main Authors:	Ilaslan, Muhammet Furkan, Koksal, Ali, Lin, Kevin Qinghong, Satar, Burak, Shou, Mike Zheng, Xu, Qianli
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2412.11621

Similar Items

WaterVG: Waterway Visual Grounding based on Text-Guided Vision and mmWave Radar
by: Guan, Runwei, et al.
Published: (2024)

Paper2Video: Automatic Video Generation from Scientific Papers
by: Zhu, Zeyu, et al.
Published: (2025)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

Code2Video: A Code-centric Paradigm for Educational Video Generation
by: Chen, Yanzhe, et al.
Published: (2025)

Seeing Culture: A Benchmark for Visual Reasoning and Grounding
by: Satar, Burak, et al.
Published: (2025)

Factorized Learning for Temporally Grounded Video-Language Models
by: Zeng, Wenzheng, et al.
Published: (2025)

MarkIt: Training-Free Visual Markers for Precise Video Temporal Grounding
by: Fang, Pengcheng, et al.
Published: (2026)

Scene-Text Grounding for Text-Based Video Question Answering
by: Zhou, Sheng, et al.
Published: (2024)

Music Grounding by Short Video
by: Xin, Zijie, et al.
Published: (2024)

UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
by: Cheng, Zhi-Qi, et al.
Published: (2024)

Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
by: Pramanick, Shraman, et al.
Published: (2025)

Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
by: Xie, Zequn, et al.
Published: (2026)

PlanLLM: Video Procedure Planning with Refinable Large Language Models
by: Yang, Dejie, et al.
Published: (2024)

TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)

Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
by: Zhang, Deyu, et al.
Published: (2025)

Prototypical Prompting for Text-to-image Person Re-identification
by: Yan, Shuanglin, et al.
Published: (2024)

Interpreting Multimodal Communication at Scale in Short-Form Video: Visual, Audio, and Textual Mental Health Discourse on TikTok
by: Zha, Mingyue, et al.
Published: (2026)

Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis
by: Chen, Shuang, et al.
Published: (2026)

Muse: A Multimodal Conversational Recommendation Dataset with Scenario-Grounded User Profiles
by: Wang, Zihan, et al.
Published: (2024)

Will It Go Viral? Grounding Micro-Video Popularity Prediction on the Open Web
by: Heo, Ryang, et al.
Published: (2026)

RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation
by: Tourani, Ali, et al.
Published: (2025)

VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models
by: Li, Xiang, et al.
Published: (2023)

VSpeechLM: A Visual Speech Language Model for Visual Text-to-Speech Task
by: Wang, Yuyue, et al.
Published: (2025)

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
by: Tan, Chaolei, et al.
Published: (2024)

Recognizing Everything from All Modalities at Once: Grounded Multimodal Universal Information Extraction
by: Zhang, Meishan, et al.
Published: (2024)

Multimodal LLM-based Query Paraphrasing for Video Search
by: Wu, Jiaxin, et al.
Published: (2024)

Virbo: Multimodal Multilingual Avatar Video Generation in Digital Marketing
by: Zhang, Juan, et al.
Published: (2024)

Multimodal Semantic Communication for Generative Audio-Driven Video Conferencing
by: Tong, Haonan, et al.
Published: (2024)

Target Speech Diarization with Multimodal Prompts
by: Jiang, Yidi, et al.
Published: (2024)

ProMSC-MIS: Prompt-based Multimodal Semantic Communication for Multi-Spectral Image Segmentation
by: Zhang, Haoshuo, et al.
Published: (2025)

Towards Multimodal Sentiment Analysis via Contrastive Cross-modal Retrieval Augmentation and Hierachical Prompts
by: Zhao, Xianbing, et al.
Published: (2025)

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)

VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation
by: Chen, Yang, et al.
Published: (2024)

Identity-Preserving Text-to-Video Generation via Training-Free Prompt, Image, and Guidance Enhancement
by: Gao, Jiayi, et al.
Published: (2025)

A New Dataset and Benchmark for Grounding Multimodal Misinformation
by: Yang, Bingjian, et al.
Published: (2025)

Node-Based Editing for Multimodal Generation of Text, Audio, Image, and Video
by: Kyaw, Alexander Htet, et al.
Published: (2025)

MCSC-Bench: Multimodal Context-to-Script Creation for Realistic Video Production
by: Hu, Huanran, et al.
Published: (2026)

Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)

Automatic Prompt Generation and Grounding Object Detection for Zero-Shot Image Anomaly Detection
by: Cheung, Tsun-Hin, et al.
Published: (2024)

Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
by: Nanang, Minsak, et al.
Published: (2026)