| Main Authors: | Park, Sungjune; Kim, Yeongyun; Kim, Se Yeon; Ro, Yong Man |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.21863 |
Similar Items
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
by: Park, Beomchan, et al.
Published: (2026)
Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
by: Park, Sungjune, et al.
Published: (2023)
Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank
by: Park, Sungjune, et al.
Published: (2024)
Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning
by: Park, Sungjune, et al.
Published: (2026)
Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images
by: Park, Sungjune, et al.
Published: (2025)
Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking
by: Chung, Sangyun, et al.
Published: (2024)
DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
by: Park, Sungjune, et al.
Published: (2025)
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
MoAI: Mixture of All Intelligence for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Phantom of Latent for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor
by: Kim, Yeonju, et al.
Published: (2024)
CoLLaVO: Crayon Large Language and Vision mOdel
by: Lee, Byung-Kwan, et al.
Published: (2024)
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
by: Kim, Hayeon, et al.
Published: (2026)
TroL: Traversal of Layers for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)
Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)
Causal Unsupervised Semantic Segmentation
by: Kim, Junho, et al.
Published: (2023)
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
by: Yu, Youngjoon, et al.
Published: (2024)
GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory
by: Yeo, Jeong Hun, et al.
Published: (2025)
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion
by: Kim, Taeheon, et al.
Published: (2024)
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)
Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
by: Yeo, Jeong Hun, et al.
Published: (2025)
Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
by: Park, Se Jin, et al.
Published: (2023)
RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation
by: Ma, Xianping, et al.
Published: (2024)
AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
by: Park, Se Jin, et al.
Published: (2024)
A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
by: Kim, SuYeon, et al.
Published: (2026)
Semantics-aware Motion Retargeting with Vision-Language Models
by: Zhang, Haodong, et al.
Published: (2023)
Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images
by: Wang, Shanwen, et al.
Published: (2026)
Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
by: Ro, Yusung, et al.
Published: (2026)
Semantic Alignment for Multimodal Large Language Models
by: Wu, Tao, et al.
Published: (2024)
Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Kolmogorov-Arnold Network for Remote Sensing Image Semantic Segmentation
by: Ma, Xianping, et al.
Published: (2025)
Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
by: Park, Se Jin, et al.
Published: (2024)
by: Park, Se Jin, et al.
Published: (2024)
MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
by: Kim, Taeheon, et al.
Published: (2024)
FPANet: Frequency-based Video Demoireing using Frame-level Post Alignment
by: Oh, Gyeongrok, et al.
Published: (2023)
ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
by: Kim, Minchan, et al.
Published: (2024)
Unified Reinforcement and Imitation Learning for Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)
Remote Sensing SpatioTemporal Vision-Language Models: A Comprehensive Survey
by: Liu, Chenyang, et al.
Published: (2024)