| Main Authors: | Jin, Yongkang, Luo, Jianwen, Wang, Jingjing, Yao, Jianmin, Hong, Yu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.13748 |
Similar Items
Multi-task Prompt Words Learning for Social Media Content Generation
by: Xue, Haochen, et al.
Published: (2024)
Rethinking Training Dynamics in Scale-wise Autoregressive Generation
by: Zhou, Gengze, et al.
Published: (2025)
Benchmarking and Improving LVLMs on Event Extraction from Multimedia Documents
by: Xing, Fuyu, et al.
Published: (2025)
Multimedia Generative Script Learning for Task Planning
by: Wang, Qingyun, et al.
Published: (2022)
ECHO: Event-Centric Hypergraph Operations via Multi-Agent Collaboration for Multimedia Event Extraction
by: Chu, Hailong, et al.
Published: (2026)
TacoERE: Cluster-aware Compression for Event Relation Extraction
by: Guan, Yong, et al.
Published: (2024)
Musketeer: Joint Training for Multi-task Vision Language Model with Task Explanation Prompts
by: Zhang, Zhaoyang, et al.
Published: (2023)
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
by: Zhang, Jingyi, et al.
Published: (2025)
MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
by: Seeberger, Philipp, et al.
Published: (2024)
SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control
by: Lu, Quanfeng, et al.
Published: (2025)
ReMeREC: Relation-aware and Multi-entity Referring Expression Comprehension
by: Hu, Yizhi, et al.
Published: (2025)
EComStage: Stage-wise and Orientation-specific Benchmarking for Large Language Models in E-commerce
by: Zhao, Kaiyan, et al.
Published: (2026)
Few-Shot Relation Extraction with Hybrid Visual Evidence
by: Gong, Jiaying, et al.
Published: (2024)
CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation
by: Messina, Pablo, et al.
Published: (2026)
Mitigating Multimodal Hallucination via Phase-wise Self-reward
by: Zhang, Yu, et al.
Published: (2026)
Rethinking Patient Education as Multi-turn Multi-modal Interaction
by: Yao, Zonghai, et al.
Published: (2026)
LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning
by: Kowsher, Md, et al.
Published: (2026)
Distilling Multi-Scale Knowledge for Event Temporal Relation Extraction
by: Yao, Hao-Ren, et al.
Published: (2022)
Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval
by: Liu, Zhuchenyang, et al.
Published: (2026)
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)
3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding
by: Ding, Yihao, et al.
Published: (2024)
Pixel-Level Reasoning Segmentation via Multi-turn Conversations
by: Cai, Dexian, et al.
Published: (2025)
MLVU: Benchmarking Multi-task Long Video Understanding
by: Zhou, Junjie, et al.
Published: (2024)
Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models
by: Ohi, Masanari, et al.
Published: (2024)
Joint Extraction Matters: Prompt-Based Visual Question Answering for Multi-Field Document Information Extraction
by: Loem, Mengsay, et al.
Published: (2025)
Towards Better Multi-head Attention via Channel-wise Sample Permutation
by: Yuan, Shen, et al.
Published: (2024)
PM4Bench: Benchmarking Large Vision-Language Models with Parallel Multilingual Multi-Modal Multi-task Corpus
by: Gao, Junyuan, et al.
Published: (2025)
FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models
by: Song, Mingyang, et al.
Published: (2025)
Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift
by: Xu, Qinwu
Published: (2026)
Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining
by: Peng, Bo, et al.
Published: (2026)
A Structure-aware Generative Model for Biomedical Event Extraction
by: Yuan, Haohan, et al.
Published: (2024)
GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training
by: Bai, Yuyang, et al.
Published: (2026)
MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
by: Chen, Haonan, et al.
Published: (2025)
Sherlock: Towards Multi-scene Video Abnormal Event Extraction and Localization via a Global-local Spatial-sensitive LLM
by: Ma, Junxiao, et al.
Published: (2025)
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL
by: Zhang, Yu, et al.
Published: (2025)
SignBind-LLM: Multi-Stage Modality Fusion for Sign Language Translation
by: Thomas, Marshall, et al.
Published: (2025)
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
by: Ma, Mingjie, et al.
Published: (2024)
3M: Multi-modal Multi-task Multi-teacher Learning for Game Event Detection
by: Ng, Thye Shan, et al.
Published: (2024)
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
by: Siingh, Shikhhar, et al.
Published: (2025)
Relevance-aware Multi-context Contrastive Decoding for Retrieval-augmented Visual Question Answering
by: Kim, Jongha, et al.
Published: (2026)