:: Library Catalog

Bibliographic Details
Main Authors:	Baek, Ingeol; Chang, Hwan; Ryu, Sunghyun; Lee, Hwanhee
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.15865

Similar Items

IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
by: Lee, Dong-Jae, et al.
Published: (2026)

Gyro-based Neural Single Image Deblurring
by: Yang, Heemin, et al.
Published: (2024)

Generalizable Novel-View Synthesis using a Stereo Camera
by: Lee, Haechan, et al.
Published: (2024)

UGPNet: Universal Generative Prior for Image Restoration
by: Lee, Hwayoon, et al.
Published: (2023)

Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval
by: Baek, Ingeol, et al.
Published: (2024)

Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
by: He, Zhentao, et al.
Published: (2025)

FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
by: Jin, Wonjoon, et al.
Published: (2025)

Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
by: Kim, Jinnyeong, et al.
Published: (2024)

Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
by: Li, Aaron Branson Cigres, et al.
Published: (2026)

What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?
by: Ryu, Koki, et al.
Published: (2026)

How Well Can Vision Language Models See Image Details?
by: Gou, Chenhui, et al.
Published: (2024)

Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
by: Kim, Sohee, et al.
Published: (2025)

Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
by: Ryu, Suho, et al.
Published: (2025)

Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
by: Gong, Minsu, et al.
Published: (2026)

ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
by: Kim, Woohyeok, et al.
Published: (2023)

LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
by: Ye, Maoyuan, et al.
Published: (2025)

Ocean-OCR: Towards General OCR Application via a Vision-Language Model
by: Chen, Song, et al.
Published: (2025)

Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models
by: Kim, Jinyeong, et al.
Published: (2025)

Steering Guidance for Personalized Text-to-Image Diffusion Models
by: Park, Sunghyun, et al.
Published: (2025)

Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models
by: Park, Minho, et al.
Published: (2024)

Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
by: Nigam, Shubham Kumar, et al.
Published: (2025)

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)

Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models
by: Sharma, Pranav, et al.
Published: (2025)

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
by: Xu, Longwei, et al.
Published: (2026)

DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
by: Jin, Wonjoon, et al.
Published: (2026)

Addressing Text Embedding Leakage in Diffusion-based Image Editing
by: Mun, Sunung, et al.
Published: (2024)

SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
by: Hwang, Chan Yeong, et al.
Published: (2026)

Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model
by: Vesalainen, Ari, et al.
Published: (2026)

Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
by: Mushkani, Rashid
Published: (2025)

Unveiling the Tapestry of Consistency in Large Vision-Language Models
by: Zhang, Yuan, et al.
Published: (2024)

CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
by: Lee, Hyeongmin, et al.
Published: (2024)

AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
by: Momayiz, Imane, et al.
Published: (2026)

MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
by: Dai, Shiqi, et al.
Published: (2025)

LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
by: Taghadouini, Said, et al.
Published: (2026)

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
by: Yu, Ya-Qi, et al.
Published: (2024)

Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
by: Huang, Yuhang, et al.
Published: (2024)

ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification
by: Kim, Ga-Eun, et al.
Published: (2023)

Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
by: Maeda, Ryota, et al.
Published: (2024)

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
by: Zhong, Yufeng, et al.
Published: (2026)

How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
by: Lee, Seongyun, et al.
Published: (2024)