| Main authors: | Baek, Ingeol, Chang, Hwan, Ryu, Sunghyun, Lee, Hwanhee |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2505.15865 |
Similar items
IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models
by: Lee, Dong-Jae, et al.
Published: (2026)
Gyro-based Neural Single Image Deblurring
by: Yang, Heemin, et al.
Published: (2024)
Generalizable Novel-View Synthesis using a Stereo Camera
by: Lee, Haechan, et al.
Published: (2024)
UGPNet: Universal Generative Prior for Image Restoration
by: Lee, Hwayoon, et al.
Published: (2023)
Probing-RAG: Self-Probing to Guide Language Models in Selective Document Retrieval
by: Baek, Ingeol, et al.
Published: (2024)
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
by: He, Zhentao, et al.
Published: (2025)
FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
by: Jin, Wonjoon, et al.
Published: (2025)
Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision
by: Kim, Jinnyeong, et al.
Published: (2024)
Can Retrieval Heads See Images? Multimodal Retrieval Heads in Long-Context Vision-Language Models
by: Li, Aaron Branson Cigres, et al.
Published: (2026)
What Do Vision-Language Models Encode for Personalized Image Aesthetics Assessment?
by: Ryu, Koki, et al.
Published: (2026)
How Well Can Vision Language Models See Image Details?
by: Gou, Chenhui, et al.
Published: (2024)
Unveiling the Response of Large Vision-Language Models to Visually Absent Tokens
by: Kim, Sohee, et al.
Published: (2025)
Towards Scalable Human-aligned Benchmark for Text-guided Image Editing
by: Ryu, Suho, et al.
Published: (2025)
Edge-Aware Image Manipulation via Diffusion Models with a Novel Structure-Preservation Loss
by: Gong, Minsu, et al.
Published: (2026)
ParamISP: Learned Forward and Inverse ISPs using Camera Parameters
by: Kim, Woohyeok, et al.
Published: (2023)
LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?
by: Ye, Maoyuan, et al.
Published: (2025)
Ocean-OCR: Towards General OCR Application via a Vision-Language Model
by: Chen, Song, et al.
Published: (2025)
Interpreting Attention Heads for Image-to-Text Information Flow in Large Vision-Language Models
by: Kim, Jinyeong, et al.
Published: (2025)
Steering Guidance for Personalized Text-to-Image Diffusion Models
by: Park, Sunghyun, et al.
Published: (2025)
Regularized Training with Generated Datasets for Name-Only Transfer of Vision-Language Models
by: Park, Minho, et al.
Published: (2024)
Seeing Justice Clearly: Handwritten Legal Document Translation with OCR and Vision-Language Models
by: Nigam, Shubham Kumar, et al.
Published: (2025)
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)
Mirage: Unveiling Hidden Artifacts in Synthetic Images with Large Vision-Language Models
by: Sharma, Pranav, et al.
Published: (2025)
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
by: Xu, Longwei, et al.
Published: (2026)
DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
by: Jin, Wonjoon, et al.
Published: (2026)
Addressing Text Embedding Leakage in Diffusion-based Image Editing
by: Mun, Sunung, et al.
Published: (2024)
SpatiO: Adaptive Test-Time Orchestration of Vision-Language Agents for Spatial Reasoning
by: Hwang, Chan Yeong, et al.
Published: (2026)
Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model
by: Vesalainen, Ari, et al.
Published: (2026)
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
by: Mushkani, Rashid
Published: (2025)
Unveiling the Tapestry of Consistency in Large Vision-Language Models
by: Zhang, Yuan, et al.
Published: (2024)
CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment
by: Lee, Hyeongmin, et al.
Published: (2024)
AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
by: Momayiz, Imane, et al.
Published: (2026)
MM-UAVBench: How Well Do Multimodal Large Language Models See, Think, and Plan in Low-Altitude UAV Scenarios?
by: Dai, Shiqi, et al.
Published: (2025)
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
by: Taghadouini, Said, et al.
Published: (2026)
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
by: Yu, Ya-Qi, et al.
Published: (2024)
Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models
by: Huang, Yuhang, et al.
Published: (2024)
ROI-Aware Multiscale Cross-Attention Vision Transformer for Pest Image Identification
by: Kim, Ga-Eun, et al.
Published: (2023)
Event Ellipsometer: Event-based Mueller-Matrix Video Imaging
by: Maeda, Ryota, et al.
Published: (2024)
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
by: Zhong, Yufeng, et al.
Published: (2026)
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?
by: Lee, Seongyun, et al.
Published: (2024)