Library Catalog

Bibliographic Details
Main Authors:	Chen, Tianyu; Fu, Xingcheng; Gao, Yisen; Qian, Haodong; Wei, Yuecen; Yan, Kun; Zhou, Haoyi; Li, Jianxin
Format:	Preprint
Published:	2025
Subjects:	Machine Learning; Artificial Intelligence; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.18578

Similar Items

Towards Long-window Anchoring in Vision-Language Model Distillation
by: Zhou, Haoyi, et al.
Published: (2025)

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs
by: Gao, Yiling, et al.
Published: (2026)

Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
by: Chen, Jingkun, et al.
Published: (2026)

DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents
by: Qian, Kun, et al.
Published: (2025)

SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding
by: Zhou, Xingcheng, et al.
Published: (2026)

DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
by: Zhang, Hongfei, et al.
Published: (2025)

Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?
by: Chen, Xin, et al.
Published: (2025)

MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
by: Dong, Sixun, et al.
Published: (2025)

Deep Pre-Alignment for VLMs
by: Yu, Tianyu, et al.
Published: (2026)

ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
by: Li, Zhen, et al.
Published: (2025)

Rectify the Regression Bias in Long-Tailed Object Detection
by: Zhu, Ke, et al.
Published: (2024)

GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events
by: Zhou, Xingcheng, et al.
Published: (2024)

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
by: Qiao, Yuxuan, et al.
Published: (2024)

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
by: Qian, Kun, et al.
Published: (2024)

Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
by: Lu, Wenjun, et al.
Published: (2025)

Hyperbolic Geometric Latent Diffusion Model for Graph Generation
by: Fu, Xingcheng, et al.
Published: (2024)

Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
by: Wu, Minghui, et al.
Published: (2024)

$π^3$: Permutation-Equivariant Visual Geometry Learning
by: Wang, Yifan, et al.
Published: (2025)

Data Factory with Minimal Human Effort Using VLMs
by: Ye, Jiaojiao, et al.
Published: (2025)

WM-MoE: Weather-aware Multi-scale Mixture-of-Experts for Blind Adverse Weather Removal
by: Luo, Yulin, et al.
Published: (2023)

Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
by: Pani, Anupam, et al.
Published: (2026)

Linear Scaling Video VLMs for Long Video Understanding
by: Eyzaguirre, Cristobal, et al.
Published: (2026)

Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces
by: Li, Pengxiang, et al.
Published: (2025)

On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)

CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)

Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding
by: Xie, Zhenghao, et al.
Published: (2026)

Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction
by: Gao, Yisen, et al.
Published: (2025)

LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset
by: Zhang, Ruikun, et al.
Published: (2024)

Identifying and Understanding Cross-Class Features in Adversarial Training
by: Wei, Zeming, et al.
Published: (2025)

FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
by: Feng, Guofeng, et al.
Published: (2024)

Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs
by: Li, Haoyuan, et al.
Published: (2025)

Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
by: Yang, Yuchen, et al.
Published: (2026)

OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
by: Tao, Haoyi, et al.
Published: (2026)

Real-time 3D-aware Portrait Video Relighting
by: Cai, Ziqi, et al.
Published: (2024)

GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering
by: Li, Yanyan, et al.
Published: (2024)

Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)

Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
by: Yang, Qian, et al.
Published: (2024)

Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process
by: Lin, Yuji, et al.
Published: (2026)

Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation
by: Wang, Haoyi
Published: (2025)

GEARS: Local Geometry-aware Hand-object Interaction Synthesis
by: Zhou, Keyang, et al.
Published: (2024)