Library Catalog

Bibliographic Details
Main Authors:	Fu, Shuai; Zhou, Jian; Chen, Qi; Jing, Huang; Nguyen, Huy Anh; Liu, Xiaohan; Zeng, Zhixiong; Ma, Lin; Zhang, Quanshi; Wu, Qi
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.13080

Similar Items

CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base
by: Nguyen, Cong-Duy, et al.
Published: (2025)

Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner
by: Chen, Lei, et al.
Published: (2025)

Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
by: Shoby, Abin, et al.
Published: (2026)

SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion
by: Duong, Huy, et al.
Published: (2026)

DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
by: Zhang, Chi, et al.
Published: (2025)

HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild
by: Narasimhaswamy, Supreeth, et al.
Published: (2024)

DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios
by: Zhong, Yufeng, et al.
Published: (2025)

Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
by: Chen, Lei, et al.
Published: (2025)

Detecting Precise Hand Touch Moments in Egocentric Video
by: Nguyen, Huy Anh, et al.
Published: (2026)

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
by: Nguyen, Minh Khoi, et al.
Published: (2026)

OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
by: Zhong, Yufeng, et al.
Published: (2026)

Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
by: Nguyen, Tuan Dung, et al.
Published: (2026)

Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
by: Nguyen, Viet, et al.
Published: (2024)

TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
by: Rahmanzadehgervi, Pooyan, et al.
Published: (2024)

Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
by: Le, Minh, et al.
Published: (2025)

WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
by: Le, Huy, et al.
Published: (2023)

One-Shot Crowd Counting With Density Guidance For Scene Adaptation
by: Chen, Jiwei, et al.
Published: (2026)

Streaming Video Diffusion: Online Video Editing with Diffusion Models
by: Chen, Feng, et al.
Published: (2024)

GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
by: Qi, Zhangyang, et al.
Published: (2025)

Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs
by: Nguyen, Dung, et al.
Published: (2025)

FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
by: Che, Huy, et al.
Published: (2025)

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
by: Nguyen, Thuan Hoang, et al.
Published: (2023)

Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking
by: Tran, Huu-Loc, et al.
Published: (2025)

GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
by: Chen, Boyuan, et al.
Published: (2026)

VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning
by: Zhao, Xuanle, et al.
Published: (2025)

UItron: Foundational GUI Agent with Advanced Perception and Planning
by: Zeng, Zhixiong, et al.
Published: (2025)

OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
by: Yang, Longrong, et al.
Published: (2025)

FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing
by: Nguyen, Trong-Tung, et al.
Published: (2024)

VMambaCC: A Visual State Space Model for Crowd Counting
by: Ma, Hao-Yuan, et al.
Published: (2024)

SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization
by: Zhang, Qi, et al.
Published: (2026)

AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
by: Chang, Boyu, et al.
Published: (2026)

Bridging Classification and Segmentation in Osteosarcoma Assessment via Foundation and Discrete Diffusion Models
by: Nguyen, Manh Duong, et al.
Published: (2025)

Improving Generalization in Visual Reasoning via Self-Ensemble
by: Nguyen, Tien-Huy, et al.
Published: (2024)

GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning
by: Nguyen, Huy Hoang, et al.
Published: (2024)

A Survey of Emerging Applications of Diffusion Probabilistic Models in MRI
by: Fan, Yuheng, et al.
Published: (2023)

Improving Zero-Shot Object-Level Change Detection by Incorporating Visual Correspondence
by: Nguyen, Hung Huy, et al.
Published: (2025)

STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
by: Ma, Xiaoxiao, et al.
Published: (2025)

FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation
by: Fang, Xueji, et al.
Published: (2026)

Defining and Extracting generalizable interaction primitives from DNNs
by: Chen, Lu, et al.
Published: (2024)

SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
by: Nguyen, Kien, et al.
Published: (2025)