| Main Authors: | Fu, Shuai, Zhou, Jian, Chen, Qi, Jing, Huang, Nguyen, Huy Anh, Liu, Xiaohan, Zeng, Zhixiong, Ma, Lin, Zhang, Quanshi, Wu, Qi |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.13080 |
Similar Items
CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base
by: Nguyen, Cong-Duy, et al.
Published: (2025)
Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner
by: Chen, Lei, et al.
Published: (2025)
Overthinking Causes Hallucination: Tracing Confounder Propagation in Vision Language Models
by: Shoby, Abin, et al.
Published: (2026)
SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion
by: Duong, Huy, et al.
Published: (2026)
DeepSketcher: Internalizing Visual Manipulation for Multimodal Reasoning
by: Zhang, Chi, et al.
Published: (2025)
HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild
by: Narasimhaswamy, Supreeth, et al.
Published: (2024)
DocTron-Formula: Generalized Formula Recognition in Complex and Structured Scenarios
by: Zhong, Yufeng, et al.
Published: (2025)
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
by: Chen, Lei, et al.
Published: (2025)
Detecting Precise Hand Touch Moments in Egocentric Video
by: Nguyen, Huy Anh, et al.
Published: (2026)
Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
by: Nguyen, Minh Khoi, et al.
Published: (2026)
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
by: Zhong, Yufeng, et al.
Published: (2026)
Beyond the Global Scores: Fine-Grained Token Grounding as a Robust Detector of LVLM Hallucinations
by: Nguyen, Tuan Dung, et al.
Published: (2026)
Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
by: Nguyen, Viet, et al.
Published: (2024)
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
by: Rahmanzadehgervi, Pooyan, et al.
Published: (2024)
Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
by: Le, Minh, et al.
Published: (2025)
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
by: Le, Huy, et al.
Published: (2023)
One-Shot Crowd Counting With Density Guidance For Scene Adaptation
by: Chen, Jiwei, et al.
Published: (2026)
Streaming Video Diffusion: Online Video Editing with Diffusion Models
by: Chen, Feng, et al.
Published: (2024)
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
by: Qi, Zhangyang, et al.
Published: (2025)
Localizing Before Answering: A Hallucination Evaluation Benchmark for Grounded Medical Multimodal LLMs
by: Nguyen, Dung, et al.
Published: (2025)
FA-Seg: A Fast and Accurate Diffusion-Based Method for Open-Vocabulary Segmentation
by: Che, Huy, et al.
Published: (2025)
SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation
by: Nguyen, Thuan Hoang, et al.
Published: (2023)
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking
by: Tran, Huu-Loc, et al.
Published: (2025)
GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations
by: Chen, Boyuan, et al.
Published: (2026)
VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning
by: Zhao, Xuanle, et al.
Published: (2025)
UItron: Foundational GUI Agent with Advanced Perception and Planning
by: Zeng, Zhixiong, et al.
Published: (2025)
OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
by: Yang, Longrong, et al.
Published: (2025)
FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing
by: Nguyen, Trong-Tung, et al.
Published: (2024)
VMambaCC: A Visual State Space Model for Crowd Counting
by: Ma, Hao-Yuan, et al.
Published: (2024)
SynMVCrowd: A Large Synthetic Benchmark for Multi-view Crowd Counting and Localization
by: Zhang, Qi, et al.
Published: (2026)
AbductiveMLLM: Boosting Visual Abductive Reasoning Within MLLMs
by: Chang, Boyu, et al.
Published: (2026)
Bridging Classification and Segmentation in Osteosarcoma Assessment via Foundation and Discrete Diffusion Models
by: Nguyen, Manh Duong, et al.
Published: (2025)
Improving Generalization in Visual Reasoning via Self-Ensemble
by: Nguyen, Tien-Huy, et al.
Published: (2024)
GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning
by: Nguyen, Huy Hoang, et al.
Published: (2024)
A Survey of Emerging Applications of Diffusion Probabilistic Models in MRI
by: Fan, Yuheng, et al.
Published: (2023)
Improving Zero-Shot Object-Level Change Detection by Incorporating Visual Correspondence
by: Nguyen, Hung Huy, et al.
Published: (2025)
STAGE: Stable and Generalizable GRPO for Autoregressive Image Generation
by: Ma, Xiaoxiao, et al.
Published: (2025)
FocusDiT: Masking Queries in Diffusion Transformers for Fine-grained Image Generation
by: Fang, Xueji, et al.
Published: (2026)
Defining and Extracting generalizable interaction primitives from DNNs
by: Chen, Lu, et al.
Published: (2024)
SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
by: Nguyen, Kien, et al.
Published: (2025)