| Main Authors: | Wang, Yueqian, Meng, Xiaojun, Wang, Yuxuan, Liang, Jianxin, Liu, Qun, Zhao, Dongyan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.17295 |
Similar Items
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
by: Wang, Yueqian, et al.
Published: (2024)
End-to-End Video Question Answering with Frame Scoring Mechanisms and Adaptive Sampling
by: Liang, Jianxin, et al.
Published: (2024)
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
by: Wang, Yueqian, et al.
Published: (2024)
ReasVQA: Advancing VideoQA with Imperfect Reasoning Process
by: Liang, Jianxin, et al.
Published: (2025)
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
by: Wang, Yuxuan, et al.
Published: (2024)
Beyond Isolated Facts: Synthesizing Narrative and Grounded Supervision for VideoQA
by: Liang, Jianxin, et al.
Published: (2025)
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
by: Wang, Yueqian, et al.
Published: (2024)
STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering
by: Wang, Yueqian, et al.
Published: (2024)
OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
by: Wang, Yuxuan, et al.
Published: (2025)
MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning
by: Wang, Yueqian, et al.
Published: (2025)
ProactiveVideoQA: A Comprehensive Benchmark Evaluating Proactive Interactions in Video Large Language Models
by: Wang, Yueqian, et al.
Published: (2025)
With a Little Help from my (Linguistic) Friends: Topic Segmentation of Multi-party Casual Conversations
by: Decker, Amandine, et al.
Published: (2024)
FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
by: Yin, Zhihan, et al.
Published: (2026)
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
by: Inoue, Koji, et al.
Published: (2025)
Multi-modal Stance Detection: New Datasets and Model
by: Liang, Bin, et al.
Published: (2024)
MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning
by: Liu, Fuxiao, et al.
Published: (2023)
MM-Conv: A Multi-modal Conversational Dataset for Virtual Humans
by: Deichler, Anna, et al.
Published: (2024)
VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
by: Wang, Yuxuan, et al.
Published: (2024)
$M^3EL$: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking
by: Wang, Fang, et al.
Published: (2024)
Dynamic Graph Neural ODE Network for Multi-modal Emotion Recognition in Conversation
by: Shou, Yuntao, et al.
Published: (2024)
DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding
by: Liu, Zixuan, et al.
Published: (2025)
Multi-modal Data Spectrum: Multi-modal Datasets are Multi-dimensional
by: Madaan, Divyam, et al.
Published: (2025)
Chain-of-Discussion: A Multi-Model Framework for Complex Evidence-Based Question Answering
by: Tao, Mingxu, et al.
Published: (2024)
Intent Recognition and Out-of-Scope Detection using LLMs in Multi-party Conversations
by: Castillo-López, Galo, et al.
Published: (2025)
NaturalConv: A Chinese Dialogue Dataset Towards Multi-turn Topic-driven Conversation
by: Wang, Xiaoyang, et al.
Published: (2021)
Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines
by: Ma, Zi-Ao, et al.
Published: (2024)
Teaching Text-to-Image Models to Communicate in Dialog
by: Sun, Xiaowen, et al.
Published: (2023)
Contrastive Speaker-Aware Learning for Multi-party Dialogue Generation with LLMs
by: Sun, Tianyu, et al.
Published: (2025)
Akan Cinematic Emotions (ACE): A Multimodal Multi-party Dataset for Emotion Recognition in Movie Dialogues
by: Sasu, David, et al.
Published: (2025)
NYK-MS: A Well-annotated Multi-modal Metaphor and Sarcasm Understanding Benchmark on Cartoon-Caption Dataset
by: Chang, Ke, et al.
Published: (2024)
Multi-party Response Generation with Relation Disentanglement
by: Dai, Tianhao, et al.
Published: (2024)
Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment
by: Zhang, Ming, et al.
Published: (2024)
Multi-Granularity Information Interaction Framework for Incomplete Utterance Rewriting
by: Du, Haowei, et al.
Published: (2023)
Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
by: Liu, Rui, et al.
Published: (2024)
CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning
by: He, Zheqi, et al.
Published: (2024)
Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding
by: Wu, Zichen, et al.
Published: (2024)
MAGID: An Automated Pipeline for Generating Synthetic Multi-modal Datasets
by: Aboutalebi, Hossein, et al.
Published: (2024)
LingYi: Medical Conversational Question Answering System based on Multi-modal Knowledge Graphs
by: Xia, Fei, et al.
Published: (2022)
Towards Detecting LLMs Hallucination via Markov Chain-based Multi-agent Debate Framework
by: Sun, Xiaoxi, et al.
Published: (2024)
Kiss up, Kick down: Exploring Behavioral Changes in Multi-modal Large Language Models with Assigned Visual Personas
by: Sun, Seungjong, et al.
Published: (2024)