Library Catalog

Bibliographic Details
Main Authors:	He, Yuxin, Li, An, Xue, Cheng
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.06619

Similar Items

CauSight: Learning to Supersense for Visual Causal Discovery
by: Zhang, Yize, et al.
Published: (2025)

Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives
by: Park, Ji-jun, et al.
Published: (2024)

Bridging the Sim2Real Gap: Vision Encoder Pre-Training for Visuomotor Policy Transfer
by: Yardi, Yash, et al.
Published: (2025)

RALAD: Bridging the Real-to-Sim Domain Gap in Autonomous Driving with Retrieval-Augmented Learning
by: Zuo, Jiacheng, et al.
Published: (2025)

RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation
by: Bialer, Oded, et al.
Published: (2024)

OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
by: Hu, Ming, et al.
Published: (2024)

Sim2Radar: Toward Bridging the Radar Sim-to-Real Gap with VLM-Guided Scene Reconstruction
by: Bejerano, Emily, et al.
Published: (2026)

CauSkelNet: Causal Representation Learning for Human Behaviour Analysis
by: Gu, Xingrui, et al.
Published: (2024)

Bridging the Task Gap: Multi-Task Adversarial Transferability in CLIP and Its Derivatives
by: Liu, Kuanrong, et al.
Published: (2025)

Bridge the Modality and Capability Gaps in Vision-Language Model Selection
by: Yi, Chao, et al.
Published: (2024)

PolarVLM: Bridging the Semantic-Physical Gap in Vision-Language Models
by: Li, Yuliang, et al.
Published: (2026)

MaskedCLIP: Bridging the Masked and CLIP Space for Semi-Supervised Medical Vision-Language Pre-training
by: Zhu, Lei, et al.
Published: (2025)

Driving with DINO: Vision Foundation Features as a Unified Bridge for Sim-to-Real Generation in Autonomous Driving
by: Chen, Xuyang, et al.
Published: (2026)

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
by: Patel, Maitreya, et al.
Published: (2024)

How Real is CARLA's Dynamic Vision Sensor? A Study on the Sim-to-Real Gap in Traffic Object Detection
by: Tan, Kaiyuan, et al.
Published: (2025)

An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models
by: Maack, Lennart, et al.
Published: (2026)

Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation
by: Wang, Zihan, et al.
Published: (2024)

fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models
by: Sharma, Saurav, et al.
Published: (2025)

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
by: Wang, Jiapeng, et al.
Published: (2024)

A Causality-Inspired Model for Intima-Media Thickening Assessment in Ultrasound Videos
by: Gao, Shuo, et al.
Published: (2025)

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
by: Jin, Juseong, et al.
Published: (2024)

The Abstraction Gap in Vision-Language Causal Reasoning
by: Hoang, Chinh, et al.
Published: (2026)

High-Fidelity Digital Twins for Bridging the Sim2Real Gap in LiDAR-Based ITS Perception
by: Shahbaz, Muhammad, et al.
Published: (2025)

Embodied Scene Understanding for Vision Language Models via MetaVQA
by: Wang, Weizhen, et al.
Published: (2025)

Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior
by: Wu, Haitao, et al.
Published: (2025)

CausalDiff: Causality-Inspired Disentanglement via Diffusion Model for Adversarial Defense
by: Zhang, Mingkun, et al.
Published: (2024)

Token Merging via Spatiotemporal Information Mining for Surgical Video Understanding
by: Jiang, Xixi, et al.
Published: (2025)

ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
by: Hu, Xiaoxing, et al.
Published: (2025)

VLPose: Bridging the Domain Gap in Pose Estimation with Language-Vision Tuning
by: Li, Jingyao, et al.
Published: (2024)

Synthetic Data Generation for Bridging Sim2Real Gap in a Production Environment
by: Rawal, Parth, et al.
Published: (2023)

VTD-CLIP: Video-to-Text Discretization via Prompting CLIP
by: Zhu, Wencheng, et al.
Published: (2025)

VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
by: Cheng, Ying, et al.
Published: (2025)

SimRecon: SimReady Compositional Scene Reconstruction from Real Videos
by: Xia, Chong, et al.
Published: (2026)

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
by: Du, Yiyang, et al.
Published: (2026)

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
by: Maaz, Muhammad, et al.
Published: (2023)

Mitigating Hallucinations in Large Vision-Language Models via Causal Route Gating
by: Cheng, Zhe, et al.
Published: (2026)

Causality Model for Semantic Understanding on Videos
by: Yicong, Li
Published: (2025)

InfoCLIP: Bridging Vision-Language Pretraining and Open-Vocabulary Semantic Segmentation via Information-Theoretic Alignment Transfer
by: Yuan, Muyao, et al.
Published: (2025)

Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
by: Wang, Xiao, et al.
Published: (2023)

Q-CLIP: Unleashing the Power of Vision-Language Models for Video Quality Assessment through Unified Cross-Modal Adaptation
by: Mi, Yachun, et al.
Published: (2025)