:: Library Catalog

Bibliographic Details
Main Authors:	Park, Sungjune; Kim, Yeongyun; Kim, Se Yeon; Ro, Yong Man
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2506.21863

Similar Items

Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
by: Park, Beomchan, et al.
Published: (2026)

Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
by: Park, Sungjune, et al.
Published: (2023)

Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank
by: Park, Sungjune, et al.
Published: (2024)

Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning
by: Park, Sungjune, et al.
Published: (2026)

Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images
by: Park, Sungjune, et al.
Published: (2025)

Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking
by: Chung, Sangyun, et al.
Published: (2024)

DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
by: Park, Sungjune, et al.
Published: (2025)

What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)

MoAI: Mixture of All Intelligence for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)

Phantom of Latent for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)

Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor
by: Kim, Yeonju, et al.
Published: (2024)

CoLLaVO: Crayon Large Language and Vision mOdel
by: Lee, Byung-Kwan, et al.
Published: (2024)

Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
by: Kim, Hayeon, et al.
Published: (2026)

TroL: Traversal of Layers for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)

CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)

Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
by: Choi, Jeongsoo, et al.
Published: (2023)

Causal Unsupervised Semantic Segmentation
by: Kim, Junho, et al.
Published: (2023)

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
by: Yu, Youngjoon, et al.
Published: (2024)

GCAgent: Long-Video Understanding via Schematic and Narrative Episodic Memory
by: Yeo, Jeong Hun, et al.
Published: (2025)

Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion
by: Kim, Taeheon, et al.
Published: (2024)

AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation
by: Choi, Jeongsoo, et al.
Published: (2023)

Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio
by: Yeo, Jeong Hun, et al.
Published: (2025)

Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
by: Park, Se Jin, et al.
Published: (2023)

RS3Mamba: Visual State Space Model for Remote Sensing Images Semantic Segmentation
by: Ma, Xianping, et al.
Published: (2024)

AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues
by: Park, Se Jin, et al.
Published: (2024)

A Semantically Disentangled Unified Model for Multi-category 3D Anomaly Detection
by: Kim, SuYeon, et al.
Published: (2026)

Semantics-aware Motion Retargeting with Vision-Language Models
by: Zhang, Haodong, et al.
Published: (2023)

Vision-Language Model Purified Semi-Supervised Semantic Segmentation for Remote Sensing Images
by: Wang, Shanwen, et al.
Published: (2026)

Beyond Semantics: Disentangling Information Scope in Sparse Autoencoders for CLIP
by: Ro, Yusung, et al.
Published: (2026)

Semantic Alignment for Multimodal Large Language Models
by: Wu, Tao, et al.
Published: (2024)

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation
by: Kim, Minsu, et al.
Published: (2024)

VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
by: Lee, Byung-Kwan, et al.
Published: (2024)

Kolmogorov-Arnold Network for Remote Sensing Image Semantic Segmentation
by: Ma, Xianping, et al.
Published: (2025)

Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation
by: Park, Se Jin, et al.
Published: (2024)

MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
by: Kim, Taeheon, et al.
Published: (2024)

FPANet: Frequency-based Video Demoireing using Frame-level Post Alignment
by: Oh, Gyeongrok, et al.
Published: (2023)

ESREAL: Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models
by: Kim, Minchan, et al.
Published: (2024)

Unified Reinforcement and Imitation Learning for Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)

Remote Sensing SpatioTemporal Vision-Language Models: A Comprehensive Survey
by: Liu, Chenyang, et al.
Published: (2024)