| Main Authors: | Huang, Zeyi, Ojha, Utkarsh, Ji, Yuyang, Lee, Donghyun, Lee, Yong Jae |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.13058 |
Similar Items
Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
by: Cai, Mu, et al.
Published: (2023)
Towards Universal Fake Image Detectors that Generalize Across Generative Models
by: Ojha, Utkarsh, et al.
Published: (2023)
Aligned Datasets Improve Detection of Latent Diffusion-Generated Images
by: Rajan, Anirudh Sundara, et al.
Published: (2024)
Yo'LLaVA: Your Personalized Language and Vision Assistant
by: Nguyen, Thao, et al.
Published: (2024)
Edit One for All: Interactive Batch Image Editing
by: Nguyen, Thao, et al.
Published: (2024)
VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection
by: Huang, Zeyi, et al.
Published: (2025)
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs
by: Huang, Zeyi, et al.
Published: (2025)
Do Vision-Language Models Understand Compound Nouns?
by: Kumar, Sonal, et al.
Published: (2024)
FastPoint: Accelerating 3D Point Cloud Model Inference via Sample Point Distance Prediction
by: Lee, Donghyun, et al.
Published: (2025)
Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection
by: An, Sojung, et al.
Published: (2025)
GS-Scale: Unlocking Large-Scale 3D Gaussian Splatting Training via Host Offloading
by: Lee, Donghyun, et al.
Published: (2025)
IMPROVE: Iterative Model Pipeline Refinement and Optimization Leveraging LLM Experts
by: Xue, Eric, et al.
Published: (2025)
PLATYPUS: Progressive Local Surface Estimator for Arbitrary-Scale Point Cloud Upsampling
by: Kim, Donghyun, et al.
Published: (2024)
MATE: Meet At The Embedding -- Connecting Images with Long Texts
by: Jang, Young Kyun, et al.
Published: (2024)
TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)
Do Your Best and Get Enough Rest for Continual Learning
by: Kang, Hankyul, et al.
Published: (2025)
Language-Guided Invariance Probing of Vision-Language Models
by: Lee, Jae Joong
Published: (2025)
Do Vision Transformers See Like Humans? Evaluating their Perceptual Alignment
by: Hernández-Cámara, Pablo, et al.
Published: (2025)
Your Embedding Model is SMARTer Than You Think
by: Zhang, Jianrui, et al.
Published: (2026)
MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models
by: Zou, Bocheng, et al.
Published: (2026)
Do Multimodal Large Language Models Understand Welding?
by: Khvatskii, Grigorii, et al.
Published: (2025)
Active Prompt Learning in Vision Language Models
by: Bang, Jihwan, et al.
Published: (2023)
VisDoT : Enhancing Visual Reasoning through Human-Like Interpretation Grounding and Decomposition of Thought
by: Lee, Eunsoo, et al.
Published: (2026)
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
by: Ji, Yuyang, et al.
Published: (2025)
uCLIP: Parameter-Efficient Multilingual Extension of Vision-Language Models with Unpaired Data
by: Chung, Dahyun, et al.
Published: (2025)
DrVD-Bench: Do Vision-Language Models Reason Like Human Doctors in Medical Image Diagnosis?
by: Zhou, Tianhong, et al.
Published: (2025)
Do Vision Models Encode Object-Level Semantic Relatedness? A Cognitive Psychology-Inspired Benchmark
by: Lee, Hansang, et al.
Published: (2017)
FALCON: Frequency Adjoint Link with CONtinuous Density Mask for Fast Single Image Dehazing
by: Kim, Donghyun, et al.
Published: (2024)
Do Vision Language Models Understand Human Engagement in Games?
by: Wang, Ziyi, et al.
Published: (2026)
Low-Resolution Editing is All You Need for High-Resolution Editing
by: Lee, Junsung, et al.
Published: (2025)
Vision-Language Models Do Not Understand Negation
by: Alhamoud, Kumail, et al.
Published: (2025)
See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
by: Nguyen, Le Thien Phuc, et al.
Published: (2025)
Can Large Vision Language Models Read Maps Like a Human?
by: Xing, Shuo, et al.
Published: (2025)
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
by: Zou, Bocheng, et al.
Published: (2024)
Advancing Vision-based Human Action Recognition: Exploring Vision-Language CLIP Model for Generalisation in Domain-Independent Tasks
by: Shandilya, Utkarsh, et al.
Published: (2025)
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
by: Kim, Hoon, et al.
Published: (2024)
Stay-Positive: A Case for Ignoring Real Image Features in Fake Image Detection
by: Rajan, Anirudh Sundara, et al.
Published: (2025)
Do Vision-Language Models Understand Visual Persuasiveness?
by: Park, Gyuwon
Published: (2025)
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
by: Zhu, Yuhan, et al.
Published: (2024)
Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)