:: Library Catalog

Bibliographic Details
Main Authors:	Xie, Rongchang; Du, Chen; Song, Ping; Liu, Chang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2411.17762

Similar Items

Aquarius: A Family of Industry-Level Video Generation Models for Marketing Scenarios
by: Shi, Huafeng, et al.
Published: (2025)

Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
by: Zhang, Boqiang, et al.
Published: (2026)

MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion
by: Peng, Fei, et al.
Published: (2025)

InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing
by: Tian, Changyao, et al.
Published: (2026)

FlexMUSE: Multimodal Unification and Semantics Enhancement Framework with Flexible interaction for Creative Writing
by: Chen, Jiahao, et al.
Published: (2025)

WebAccessVL: Violation-Aware VLM for Web Accessibility
by: Zheng, Amber Yijia, et al.
Published: (2025)

Kimi-VL Technical Report
by: Kimi Team, et al.
Published: (2025)

Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision
by: Wei, Zhixiang, et al.
Published: (2026)

Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
by: Wang, Xiaokun, et al.
Published: (2025)

UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation
by: Chen, Dengbo, et al.
Published: (2025)

PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching
by: Chang, Zewei, et al.
Published: (2025)

PaddleOCR-VL-1.5: Towards a Multi-Task 0.9B VLM for Robust In-the-Wild Document Parsing
by: Cui, Cheng, et al.
Published: (2026)

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
by: Wu, Chengyue, et al.
Published: (2024)

VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
by: Li, Lei, et al.
Published: (2024)

UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding
by: Xu, Yueming, et al.
Published: (2025)

ORION: ORthonormal Text Encoding for Universal VLM AdaptatION
by: Chakraborty, Omprakash, et al.
Published: (2026)

VL4Gaze: Unleashing Vision-Language Models for Gaze Following
by: Wang, Shijing, et al.
Published: (2025)

GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery
by: Solanki, Bhupendra, et al.
Published: (2024)

VL-Mamba: Exploring State Space Models for Multimodal Learning
by: Qiao, Yanyuan, et al.
Published: (2024)

VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
by: Liang, Jiawei, et al.
Published: (2024)

SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)

Singpath-VL Technical Report
by: Qiu, Zhen, et al.
Published: (2026)

Semantic Residual for Multimodal Unified Discrete Representation
by: Huang, Hai, et al.
Published: (2024)

AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding
by: Liu, Tao, et al.
Published: (2024)

InspectVLM: Unified in Theory, Unreliable in Practice
by: Wallace, Conor, et al.
Published: (2025)

Revisiting Multimodal Positional Encoding in Vision-Language Models
by: Huang, Jie, et al.
Published: (2025)

Slot-VLM: SlowFast Slots for Video-Language Modeling
by: Xu, Jiaqi, et al.
Published: (2024)

MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization
by: Xia, Yingjie, et al.
Published: (2025)

Qwen2.5-VL Technical Report
by: Bai, Shuai, et al.
Published: (2025)

SWinMamba: Serpentine Window State Space Model for Vascular Segmentation
by: Zhao, Rongchang, et al.
Published: (2025)

Kwai Keye-VL Technical Report
by: Kwai Keye Team, et al.
Published: (2025)

Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
by: Bao, Wentao, et al.
Published: (2024)

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
by: Li, Yuliang, et al.
Published: (2026)

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference
by: Singh, Aditya Kumar, et al.
Published: (2026)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)

VL-Nav: A Neuro-Symbolic Approach for Reasoning-based Vision-Language Navigation
by: Du, Yi, et al.
Published: (2025)

Qwen3-VL Technical Report
by: Bai, Shuai, et al.
Published: (2025)

TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference
by: Hu, Junshan, et al.
Published: (2025)

MUSE: Harnessing Precise and Diverse Semantics for Few-Shot Whole Slide Image Classification
by: Xu, Jiahao, et al.
Published: (2026)

GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
by: Li, Zesheng, et al.
Published: (2026)