| Main Authors: | Xie, Rongchang, Du, Chen, Song, Ping, Liu, Chang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2411.17762 |
Similar Items
Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
by: Shi, Huafeng, et al.
Published: (2025)
Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
by: Zhang, Boqiang, et al.
Published: (2026)
MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
by: Peng, Fei, et al.
Published: (2025)
InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)
FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
by: Chen, Jiahao, et al.
Published: (2025)
WebAccessVL: Violation-Aware VLM for Web Accessibility
by: Zheng, Amber Yijia, et al.
Published: (2025)
Kimi-VL Technical Report
by: Kimi Team, et al.
Published: (2025)
Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
by: Wei, Zhixiang, et al.
Published: (2026)
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
by: Wang, Xiaokun, et al.
Published: (2025)
UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation
by: Chen, Dengbo, et al.
Published: (2025)
PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
by: Chang, Zewei, et al.
Published: (2025)
PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
by: Cui, Cheng, et al.
Published: (2026)
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
by: Li, Lei, et al.
Published: (2024)
UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
by: Xu, Yueming, et al.
Published: (2025)
ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
by: Chakraborty, Omprakash, et al.
Published: (2026)
VL4Gaze: Unleashing Vision-Language Models for Gaze Following
by: Wang, Shijing, et al.
Published: (2025)
GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery
by: Solanki, Bhupendra, et al.
Published: (2024)
VL-Mamba: Exploring State Space Models for Multimodal Learning
by: Qiao, Yanyuan, et al.
Published: (2024)
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
by: Liang, Jiawei, et al.
Published: (2024)
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)
Singpath-VL Technical Report
by: Qiu, Zhen, et al.
Published: (2026)
Semantic Residual for Multimodal Unified Discrete Representation
by: Huang, Hai, et al.
Published: (2024)
AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding
by: Liu, Tao, et al.
Published: (2024)
InspectVLM: Unified in Theory, Unreliable in Practice
by: Wallace, Conor, et al.
Published: (2025)
Revisiting Multimodal Positional Encoding in Vision-Language Models
by: Huang, Jie, et al.
Published: (2025)
Slot-VLM: SlowFast Slots for Video-Language Modeling
by: Xu, Jiaqi, et al.
Published: (2024)
MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization
by: Xia, Yingjie, et al.
Published: (2025)
Qwen2.5-VL Technical Report
by: Bai, Shuai, et al.
Published: (2025)
SWinMamba: Serpentine Window State Space Model for Vascular Segmentation
by: Zhao, Rongchang, et al.
Published: (2025)
Kwai Keye-VL Technical Report
by: Kwai Keye Team, et al.
Published: (2025)
Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
by: Bao, Wentao, et al.
Published: (2024)
PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
by: Li, Yuliang, et al.
Published: (2026)
DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
by: Singh, Aditya Kumar, et al.
Published: (2026)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)
VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
by: Du, Yi, et al.
Published: (2025)
Qwen3-VL Technical Report
by: Bai, Shuai, et al.
Published: (2025)
TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference
by: Hu, Junshan, et al.
Published: (2025)
MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
by: Xu, Jiahao, et al.
Published: (2026)
GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
by: Li, Zesheng, et al.
Published: (2026)