Library Catalog

Bibliographic Details
Main Authors:	Zhang, Dewen; Hussain, Tahir; An, Wangpeng; Shouno, Hayaru
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.21317

Similar Items

Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models
by: Zhang, Dewen, et al.
Published: (2024)

PoseLLM: Enhancing Language-Guided Human Pose Estimation with MLP Alignment
by: Zhang, Dewen, et al.
Published: (2025)

LLaVA-Video: Video Instruction Tuning With Synthetic Data
by: Zhang, Yuanhan, et al.
Published: (2024)

LLaVA-c: Continual Improved Visual Instruction Tuning
by: Liu, Wenzhuo, et al.
Published: (2025)

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
by: Zhao, Xiangyu, et al.
Published: (2024)

Face-LLaVA: Facial Expression and Attribute Understanding through Instruction Tuning
by: Chaubey, Ashutosh, et al.
Published: (2025)

Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models
by: Cao, Meng, et al.
Published: (2024)

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
by: Shu, Fangxun, et al.
Published: (2024)

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings
by: Yan, Dawei, et al.
Published: (2024)

MovePose: A High-performance Human Pose Estimation Algorithm on Mobile and Edge Devices
by: Yu, Dongyang, et al.
Published: (2023)

Dr-LLaVA: Visual Instruction Tuning with Symbolic Clinical Grounding
by: Sun, Shenghuan, et al.
Published: (2024)

LLaVA-SLT: Visual Language Tuning for Sign Language Translation
by: Liang, Han, et al.
Published: (2024)

LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding
by: Sun, Boyuan, et al.
Published: (2025)

LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
by: Cocchi, Federico, et al.
Published: (2025)

Enhance Image-to-Image Generation with LLaVA-generated Prompts
by: Ding, Zhicheng, et al.
Published: (2024)

Cosmos-LLaVA: Chatting with the Visual / Cosmos-LLaVA: Görselle Sohbet Etmek
by: Zeer, Ahmed, et al.
Published: (2024)

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)

LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models
by: Zhang, Ruiyi, et al.
Published: (2024)

X-Pose: Detecting Any Keypoints
by: Yang, Jie, et al.
Published: (2023)

LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering
by: Bi, Jinhe, et al.
Published: (2024)

LLaVA-Critic: Learning to Evaluate Multimodal Models
by: Xiong, Tianyi, et al.
Published: (2024)

LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
by: Seyfioglu, Mehmet Saygin, et al.
Published: (2023)

ER-Pose: Rethinking Keypoint-Driven Representation Learning for Real-Time Human Pose Estimation
by: Li, Nanjun, et al.
Published: (2026)

Pose-RFT: Enhancing MLLMs for 3D Pose Generation via Hybrid Action Reinforcement Fine-Tuning
by: Li, Bao, et al.
Published: (2025)

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
by: Jin, Juseong, et al.
Published: (2024)

LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations
by: Xu, Mingjie, et al.
Published: (2024)

LLaVA-OneVision: Easy Visual Task Transfer
by: Li, Bo, et al.
Published: (2024)

Understanding and Mitigating Toxicity in Image-Text Pretraining Datasets: A Case Study on LLaVA
by: Kanjula, Karthik Reddy, et al.
Published: (2025)

Social-LLaVA: Enhancing Robot Navigation through Human-Language Reasoning in Social Spaces
by: Payandeh, Amirreza, et al.
Published: (2024)

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models
by: Zhang, Yi-Fan, et al.
Published: (2024)

LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer
by: Zhang, Yipeng, et al.
Published: (2024)

LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
by: Zheng, Pengcheng, et al.
Published: (2026)

Keypoints as Dynamic Centroids for Unified Human Pose and Segmentation
by: Ahmad, Niaz, et al.
Published: (2025)

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models
by: Ye, Xubing, et al.
Published: (2024)

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest
by: Chen, Xupeng, et al.
Published: (2024)

TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations
by: Gao, Mingze, et al.
Published: (2024)

Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach
by: Yu, Dongyang, et al.
Published: (2023)

LLaVA-LE: Large Language-and-Vision Assistant for Lunar Exploration
by: Inal, Gokce, et al.
Published: (2026)