:: Library Catalog

Bibliographic Details
Main Authors:	Cao, Yue; Liu, Yangzhou; Chen, Zhe; Shi, Guangchen; Wang, Wenhai; Zhao, Danhuai; Lu, Tong
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.11829

Similar Items

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
by: Wang, Weiyun, et al.
Published: (2024)

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)

MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
by: Wang, Ruoyu, et al.
Published: (2024)

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
by: Wang, Weiyun, et al.
Published: (2025)

Docopilot: Improving Multimodal Models for Document-Level Understanding
by: Duan, Yuchen, et al.
Published: (2025)

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
by: Peng, Wujian, et al.
Published: (2023)

CorrMoE: Mixture of Experts with De-stylization Learning for Cross-Scene and Cross-Domain Correspondence Pruning
by: Xia, Peiwen, et al.
Published: (2025)

Analyzing Fine-Grained Alignment and Enhancing Vision Understanding in Multimodal Language Models
by: Jiang, Jiachen, et al.
Published: (2025)

ZFusion: An Effective Fuser of Camera and 4D Radar for 3D Object Perception in Autonomous Driving
by: Yang, Sheng, et al.
Published: (2025)

ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
by: Cao, Shuo, et al.
Published: (2025)

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)

Vision Function Layer in Multimodal LLMs
by: Shi, Cheng, et al.
Published: (2025)

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)

Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving
by: Li, Yue, et al.
Published: (2025)

VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models
by: Zhao, Fufangchen, et al.
Published: (2025)

Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)

ActionArt: Advancing Multimodal Large Models for Fine-Grained Human-Centric Video Understanding
by: Peng, Yi-Xing, et al.
Published: (2025)

MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
by: Du, Yipeng, et al.
Published: (2025)

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
by: Ghosh, Dhruba, et al.
Published: (2026)

Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments
by: Yang, Yang, et al.
Published: (2024)

Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model
by: Liu, Ting, et al.
Published: (2024)

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding
by: Li, Hongyu, et al.
Published: (2025)

An Approach to Enriching Surgical Video Datasets for Fine-Grained Spatial-Temporal Understanding of Vision-Language Models
by: Maack, Lennart, et al.
Published: (2026)

From Classification to Cross-Modal Understanding: Leveraging Vision-Language Models for Fine-Grained Renal Pathology
by: Guo, Zhenhao, et al.
Published: (2025)

Towards Understanding Multimodal Fine-Tuning: Spatial Features
by: Naghashyar, Lachin, et al.
Published: (2026)

Generating Vision-Language Navigation Instructions Incorporated Fine-Grained Alignment Annotations
by: Cui, Yibo, et al.
Published: (2025)

ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding
by: Shi, Liang, et al.
Published: (2024)

MaskFuser: Masked Fusion of Joint Multi-Modal Tokenization for End-to-End Autonomous Driving
by: Duan, Yiqun, et al.
Published: (2024)

Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
by: Lin, Junyan, et al.
Published: (2025)

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)

CoF: Coarse to Fine-Grained Image Understanding for Multi-modal Large Language Models
by: Wang, Yeyuan, et al.
Published: (2024)

FILA: Fine-Grained Vision Language Models
by: Zhu, Shiding, et al.
Published: (2024)

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)

MMDocBench: Benchmarking Large Vision-Language Models for Fine-Grained Visual Document Understanding
by: Zhu, Fengbin, et al.
Published: (2024)

Preserve and Sculpt: Manifold-Aligned Fine-tuning of Vision-Language Models for Few-Shot Learning
by: Chen, Dexia, et al.
Published: (2025)

Multimodal Alignment with Cross-Attentive GRUs for Fine-Grained Video Understanding
by: Kim, Namho, et al.
Published: (2025)

Beyond Global Similarity: Towards Fine-Grained, Multi-Condition Multimodal Retrieval
by: Lu, Xuan, et al.
Published: (2026)

SliceLens: Fine-Grained and Grounded Error Slice Discovery for Multi-Instance Vision Tasks
by: Zhang, Wei, et al.
Published: (2025)