| Main Authors: | Liu, Jingping, Liu, Ziyan, Cen, Zhedong, Zhou, Yan, Zou, Yinan, Zhang, Weiyan, Jiang, Haiyun, Ruan, Tong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.19015 |
Similar Items
Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
by: Liu, Lingyu, et al.
Published: (2026)
Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
by: Chu, Meng, et al.
Published: (2023)
SEED: A Benchmark Dataset for Sequential Facial Attribute Editing with Diffusion Models
by: Zhu, Yule, et al.
Published: (2025)
Every Painting Awakened: A Training-free Framework for Painting-to-Animation Generation
by: Liu, Lingyu, et al.
Published: (2025)
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
by: Yang, Shuyu, et al.
Published: (2024)
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
by: Zhou, Shengli, et al.
Published: (2026)
Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
by: Ma, Xueqi, et al.
Published: (2025)
Grounded Chain-of-Thought for Multimodal Large Language Models
by: Wu, Qiong, et al.
Published: (2025)
Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting
by: Liu, Lingyu, et al.
Published: (2026)
CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
by: Yu, Hang, et al.
Published: (2025)
Scaling up Multimodal Pre-training for Sign Language Understanding
by: Zhou, Wengang, et al.
Published: (2024)
Can We Edit Multimodal Large Language Models?
by: Cheng, Siyuan, et al.
Published: (2023)
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
by: E, Shaojun, et al.
Published: (2025)
UniCode$^2$: Cascaded Large-scale Codebooks for Unified Multimodal Understanding and Generation
by: Chen, Yanzhe, et al.
Published: (2025)
MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
by: Ikuta, Hikaru, et al.
Published: (2024)
PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning
by: Xiao, Yicheng, et al.
Published: (2025)
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
by: Sun, Jintao, et al.
Published: (2024)
ArchGPT: Understanding the World's Architectures with Large Multimodal Models
by: Wang, Yuze, et al.
Published: (2025)
Traj-MLLM: Can Multimodal Large Language Models Reform Trajectory Data Mining?
by: Liu, Shuo, et al.
Published: (2025)
OralGPT-Omni: A Versatile Dental Multimodal Large Language Model
by: Hao, Jing, et al.
Published: (2025)
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
by: Yu, Xiaomin, et al.
Published: (2026)
Towards Training-free Multimodal Hate Localisation with Large Language Models
by: Sun, Yueming, et al.
Published: (2026)
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
by: Du, Mengfei, et al.
Published: (2024)
GuardAlign: Test-time Safety Alignment in Multimodal Large Language Models
by: Zhu, Xingyu, et al.
Published: (2026)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
by: He, Xin, et al.
Published: (2024)
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
by: Jiang, Chaoya, et al.
Published: (2024)
Context-Enhanced Video Moment Retrieval with Large Language Models
by: Liu, Weijia, et al.
Published: (2024)
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
by: Ma, Jingtian, et al.
Published: (2025)
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
by: Lan, Xiaohan, et al.
Published: (2024)
FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
by: Imajuku, Yuki, et al.
Published: (2024)
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
by: Wu, Xun, et al.
Published: (2024)
FakeBench: Probing Explainable Fake Image Detection via Large Multimodal Models
by: Li, Yixuan, et al.
Published: (2024)
Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
by: Xu, Yifang, et al.
Published: (2025)
When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
by: Zhang, Pingping, et al.
Published: (2024)
Cross Modal Fine-Grained Alignment via Granularity-Aware and Region-Uncertain Modeling
by: Liu, Jiale, et al.
Published: (2025)
Bridging Compressed Image Latents and Multimodal Large Language Models
by: Kao, Chia-Hao, et al.
Published: (2024)
MotionScape: A Large-Scale Real-World Highly Dynamic UAV Video Dataset for World Models
by: Guo, Zile, et al.
Published: (2026)
Co-Reinforcement Learning for Unified Multimodal Understanding and Generation
by: Jiang, Jingjing, et al.
Published: (2025)
MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models
by: Zhou, Tian-Yi, et al.
Published: (2026)