| Main Authors: | Li, Pengzhi, Yu, Pengfei, Liu, Zide, He, Wei, Pan, Xuhao, Rao, Xudong, Wei, Tao, Chen, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.18302 |
Similar Items
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
by: Chen, Wei, et al.
Published: (2025)
StreamingClaw Technical Report
by: Chen, Jiawei, et al.
Published: (2026)
HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
by: Wei, Zhixiang, et al.
Published: (2025)
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
by: Deng, Yihe, et al.
Published: (2024)
UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models
by: Pan, Hewen, et al.
Published: (2025)
Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis
by: Tang, Bingda, et al.
Published: (2025)
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
by: Wu, Xun, et al.
Published: (2024)
Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models
by: Wei, Canshi
Published: (2024)
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation
by: Yu, Hong-Tao, et al.
Published: (2025)
Enhancing MMDiT-Based Text-to-Image Models for Similar Subject Generation
by: Wei, Tianyi, et al.
Published: (2024)
Text4Seg++: Advancing Image Segmentation via Generative Language Modeling
by: Lan, Mengcheng, et al.
Published: (2025)
Beyond Text: Frozen Large Language Models in Visual Signal Comprehension
by: Zhu, Lei, et al.
Published: (2024)
SegPoint: Segment Any Point Cloud via Large Language Model
by: He, Shuting, et al.
Published: (2024)
TauGenNet: Plasma-Driven Tau PET Image Synthesis via Text-Guided 3D Diffusion Models
by: Gong, Yuxin, et al.
Published: (2025)
Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models
by: Zhang, Yang, et al.
Published: (2024)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)
FLAME: Frozen Large Language Models Enable Data-Efficient Language-Image Pre-training
by: Cao, Anjia, et al.
Published: (2024)
Enhancing Vision-Language Models Generalization via Diversity-Driven Novel Feature Synthesis
by: Yan, Siyuan, et al.
Published: (2024)
Safety of Multimodal Large Language Models on Images and Texts
by: Liu, Xin, et al.
Published: (2024)
Guiding Cross-Modal Representations with MLLM Priors via Preference Alignment
by: Zhao, Pengfei, et al.
Published: (2025)
Generating Daylight-driven Architectural Design via Diffusion Models
by: Li, Pengzhi, et al.
Published: (2024)
Unified Scene Representation and Reconstruction for 3D Large Language Models
by: Chu, Tao, et al.
Published: (2024)
Kosmos-G: Generating Images in Context with Multimodal Large Language Models
by: Pan, Xichen, et al.
Published: (2023)
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
by: Lu, Fan, et al.
Published: (2024)
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
by: Zhao, Rui, et al.
Published: (2024)
CHOICE: Benchmarking the Remote Sensing Capabilities of Large Vision-Language Models
by: An, Xiao, et al.
Published: (2024)
Aggregated Structural Representation with Large Language Models for Human-Centric Layout Generation
by: Jin, Jiongchao, et al.
Published: (2025)
TextBlockV2: Towards Precise-Detection-Free Scene Text Spotting with Pre-trained Language Model
by: Lyu, Jiahao, et al.
Published: (2024)
SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding
by: Li, Zhaoxu, et al.
Published: (2026)
Growing a Multi-head Twig via Distillation and Reinforcement Learning to Accelerate Large Vision-Language Models
by: Shao, Zhenwei, et al.
Published: (2025)
Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models
by: Li, Xiaohe, et al.
Published: (2026)
Unleashing the Potential of Large Language Models for Text-to-Image Generation through Autoregressive Representation Alignment
by: Xie, Xing, et al.
Published: (2025)
A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends
by: Liu, Daizong, et al.
Published: (2024)
Look, Compare, Decide: Alleviating Hallucination in Large Vision-Language Models via Multi-View Multi-Path Reasoning
by: Qu, Xiaoye, et al.
Published: (2024)
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
by: Dong, Xiaoyi, et al.
Published: (2024)
ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models
by: Tan, Chuangchuang, et al.
Published: (2025)
CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification
by: Li, Wei, et al.
Published: (2025)
MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
by: Huang, Yu, et al.
Published: (2025)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models
by: Zhan, Yu-Wei, et al.
Published: (2023)
Text-Driven Diffusion Model for Sign Language Production
by: He, Jiayi, et al.
Published: (2025)