:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Siyu; Liu, Wenzhe; Chen, Yeming; Wu, Yiming; Zheng, Heming; Cheng, Cheng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.19069

Similar Items

VLA-Mark: A cross modal watermark for large vision-language alignment model
by: Liu, Shuliang, et al.
Published: (2025)

Enhancing Breast Cancer Detection with Vision Transformers and Graph Neural Networks
by: Cai, Yeming, et al.
Published: (2025)

Superpixel Semantics Representation and Pre-training for Vision-Language Task
by: Zhang, Siyu, et al.
Published: (2023)

Phantom: Subject-consistent video generation via cross-modal alignment
by: Liu, Lijie, et al.
Published: (2025)

Towards aligned body representations in vision models
by: Gizdov, Andrey, et al.
Published: (2025)

KNVQA: A Benchmark for evaluation knowledge-based VQA
by: Cheng, Sirui, et al.
Published: (2023)

Transformer-Based Framework for Motion Capture Denoising and Anomaly Detection in Medical Rehabilitation
by: Cai, Yeming, et al.
Published: (2025)

Hallucination-aware intermediate representation edit in large vision-language models
by: Suo, Wei, et al.
Published: (2026)

Are vision language models robust to uncertain inputs?
by: Wang, Xi, et al.
Published: (2025)

Thinker: A vision-language foundation model for embodied intelligence
by: Pan, Baiyu, et al.
Published: (2026)

Quantifying the human visual exposome with vision language models
by: Rominger, Christian, et al.
Published: (2026)

What matters when building vision-language models?
by: Laurençon, Hugo, et al.
Published: (2024)

When language and vision meet road safety: leveraging multimodal large language models for video-based traffic accident analysis
by: Zhang, Ruixuan, et al.
Published: (2025)

Partial Channel Network: Compute Fewer, Perform Better
by: Huang, Haiduo, et al.
Published: (2025)

SAFER-AiD: Saccade-Assisted Foveal-peripheral vision Enhanced Reconstruction for Adversarial Defense
by: Liu, Jiayang, et al.
Published: (2025)

A benchmark multimodal oro-dental dataset for large vision-language models
by: Lv, Haoxin, et al.
Published: (2025)

VaPR -- Vision-language Preference alignment for Reasoning
by: Wadhawan, Rohan, et al.
Published: (2025)

Building and better understanding vision-language models: insights and future directions
by: Laurençon, Hugo, et al.
Published: (2024)

Generalizing vision-language models to novel domains: A comprehensive survey
by: Li, Xinyao, et al.
Published: (2025)

EyeCLIP: A visual-language foundation model for multi-modal ophthalmic image analysis
by: Shi, Danli, et al.
Published: (2024)

Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein
by: Guo, Xiaotong, et al.
Published: (2025)

StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods
by: Li, Zheng, et al.
Published: (2026)

MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation
by: Xing, Yang, et al.
Published: (2026)

Representation geometry shapes task performance in vision-language modeling for CT enterography
by: Minoccheri, Cristian, et al.
Published: (2026)

Beyond the Hype: A dispassionate look at vision-language models in medical scenario
by: Nan, Yang, et al.
Published: (2024)

MogaNet: Multi-order Gated Aggregation Network
by: Li, Siyuan, et al.
Published: (2022)

StructXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
by: Ruan, Zanxi, et al.
Published: (2026)

Self-supervised vision-langage alignment of deep learning representations for bone X-rays analysis
by: Englebert, Alexandre, et al.
Published: (2024)

GPTDrawer: Enhancing Visual Synthesis through ChatGPT
by: Li, Kun, et al.
Published: (2024)

PEAR: Pixel-aligned Expressive humAn mesh Recovery
by: Wu, Jiahao, et al.
Published: (2026)

BRAVE: Broadening the visual encoding of vision-language models
by: Kar, Oğuzhan Fatih, et al.
Published: (2024)

FlashVTG: Feature Layering and Adaptive Score Handling Network for Video Temporal Grounding
by: Cao, Zhuo, et al.
Published: (2024)

Embedded Representation Learning Network for Animating Styled Video Portrait
by: Wang, Tianyong, et al.
Published: (2024)

Effective Attention-Guided Multi-Scale Medical Network for Skin Lesion Segmentation
by: Wang, Siyu, et al.
Published: (2025)

Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models
by: Aman, Tabinda, et al.
Published: (2025)

Nearly Lossless Adaptive Bit Switching
by: Huang, Haiduo, et al.
Published: (2025)

Zero-shot large vision-language model prompting for automated bone identification in paleoradiology x-ray archives
by: Dong, Owen, et al.
Published: (2026)

MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images
by: Meseguer, Pablo, et al.
Published: (2024)

POINTS: Improving Your Vision-language Model with Affordable Strategies
by: Liu, Yuan, et al.
Published: (2024)

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
by: Chen, Cong, et al.
Published: (2025)