| Main Authors: | Chen, Tianyu, Fu, Xingcheng, Gao, Yisen, Qian, Haodong, Wei, Yuecen, Yan, Kun, Zhou, Haoyi, Li, Jianxin |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2503.18578 |
Similar Items
Towards Long-window Anchoring in Vision-Language Model Distillation
by: Zhou, Haoyi, et al.
Published: (2025)
ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs
by: Gao, Yiling, et al.
Published: (2026)
Reinforcing 3D Understanding in Point-VLMs via Geometric Reward Credit Assignment
by: Chen, Jingkun, et al.
Published: (2026)
DocRefine: An Intelligent Framework for Scientific Document Understanding and Content Optimization based on Multimodal Large Model Agents
by: Qian, Kun, et al.
Published: (2025)
SGTA: Scene-Graph Based Multi-Modal Traffic Agent for Video Understanding
by: Zhou, Xingcheng, et al.
Published: (2026)
DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation
by: Zhang, Hongfei, et al.
Published: (2025)
Are VLMs Ready for Lane Topology Awareness in Autonomous Driving?
by: Chen, Xin, et al.
Published: (2025)
MMTok: Multimodal Coverage Maximization for Efficient Inference of VLMs
by: Dong, Sixun, et al.
Published: (2025)
Deep Pre-Alignment for VLMs
by: Yu, Tianyu, et al.
Published: (2026)
ChartGalaxy: A Dataset for Infographic Chart Understanding and Generation
by: Li, Zhen, et al.
Published: (2025)
Rectify the Regression Bias in Long-Tailed Object Detection
by: Zhu, Ke, et al.
Published: (2024)
GPT-4V as Traffic Assistant: An In-depth Look at Vision Language Model on Complex Traffic Events
by: Zhou, Xingcheng, et al.
Published: (2024)
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
by: Qiao, Yuxuan, et al.
Published: (2024)
Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
by: Qian, Kun, et al.
Published: (2024)
Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
by: Lu, Wenjun, et al.
Published: (2025)
Hyperbolic Geometric Latent Diffusion Model for Graph Generation
by: Fu, Xingcheng, et al.
Published: (2024)
Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
by: Wu, Minghui, et al.
Published: (2024)
$π^3$: Permutation-Equivariant Visual Geometry Learning
by: Wang, Yifan, et al.
Published: (2025)
Data Factory with Minimal Human Effort Using VLMs
by: Ye, Jiaojiao, et al.
Published: (2025)
WM-MoE: Weather-aware Multi-scale Mixture-of-Experts for Blind Adverse Weather Removal
by: Luo, Yulin, et al.
Published: (2023)
Gaze-Regularized VLMs for Ego-Centric Behavior Understanding
by: Pani, Anupam, et al.
Published: (2026)
Linear Scaling Video VLMs for Long Video Understanding
by: Eyzaguirre, Cristobal, et al.
Published: (2026)
Geometry-aware Distance Measure for Diverse Hierarchical Structures in Hyperbolic Spaces
by: Li, Pengxiang, et al.
Published: (2025)
On the Perception Bottleneck of VLMs for Chart Understanding
by: Liu, Junteng, et al.
Published: (2025)
CIVET: Systematic Evaluation of Understanding in VLMs
by: Rizzoli, Massimo, et al.
Published: (2025)
Observe Less, Understand More: Cost-aware Cross-scale Observation for Remote Sensing Understanding
by: Xie, Zhenghao, et al.
Published: (2026)
Toward a Unified Geometry Understanding: Riemannian Diffusion Framework for Graph Generation and Prediction
by: Gao, Yisen, et al.
Published: (2025)
LMHaze: Intensity-aware Image Dehazing with a Large-scale Multi-intensity Real Haze Dataset
by: Zhang, Ruikun, et al.
Published: (2024)
Identifying and Understanding Cross-Class Features in Adversarial Training
by: Wei, Zeming, et al.
Published: (2025)
FlashGS: Efficient 3D Gaussian Splatting for Large-scale and High-resolution Rendering
by: Feng, Guofeng, et al.
Published: (2024)
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs
by: Li, Haoyuan, et al.
Published: (2025)
Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports
by: Yang, Yuchen, et al.
Published: (2026)
OmniScience: A Large-scale Multi-modal Dataset for Scientific Image Understanding
by: Tao, Haoyi, et al.
Published: (2026)
Real-time 3D-aware Portrait Video Relighting
by: Cai, Ziqi, et al.
Published: (2024)
GeoGaussian: Geometry-aware Gaussian Splatting for Scene Rendering
by: Li, Yanyan, et al.
Published: (2024)
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better
by: Wang, Dianyi, et al.
Published: (2025)
Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
by: Yang, Qian, et al.
Published: (2024)
Enhancing Underwater Light Field Images via Global Geometry-aware Diffusion Process
by: Lin, Yuji, et al.
Published: (2026)
Coordinative Learning with Ordinal and Relational Priors for Volumetric Medical Image Segmentation
by: Wang, Haoyi
Published: (2025)
GEARS: Local Geometry-aware Hand-object Interaction Synthesis
by: Zhou, Keyang, et al.
Published: (2024)