:: Library Catalog

Bibliographic Details
Main Author:	Nguyen, Van Quang
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.24020

Similar Items

CSD-VAR: Content-Style Decomposition in Visual Autoregressive Models
by: Nguyen, Quang-Binh, et al.
Published: (2025)

MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation
by: Vu, Huu-An, et al.
Published: (2025)

KTVIC: A Vietnamese Image Captioning Dataset on the Life Domain
by: Pham, Anh-Cuong, et al.
Published: (2024)

FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System
by: Nguyen, Toan, et al.
Published: (2025)

MambaU-Lite: A Lightweight Model based on Mamba and Integrated Channel-Spatial Attention for Skin Lesion Segmentation
by: Nguyen, Thi-Nhu-Quynh, et al.
Published: (2024)

VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages
by: Atuhurra, Jesse, et al.
Published: (2025)

SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models
by: Nguyen, Hung, et al.
Published: (2024)

Learning Generative Interactive Environments By Trained Agent Exploration
by: Kazemi, Naser, et al.
Published: (2024)

V-Math: An Agentic Approach to the Vietnamese National High School Graduation Mathematics Exams
by: Nguyen, Duong Q., et al.
Published: (2025)

360° Image Perception with MLLMs: A Comprehensive Benchmark and a Training-Free Method
by: Tran, Huyen T. T., et al.
Published: (2026)

VRD-IU: Lessons from Visually Rich Document Intelligence and Understanding
by: Ding, Yihao, et al.
Published: (2025)

V2P-Bench: Evaluating Video-Language Understanding with Visual Prompts for Better Human-Model Interaction
by: Zhao, Yiming, et al.
Published: (2025)

Non-verbal Real-time Human-AI Interaction in Constrained Robotic Environments
by: Costea, Dragos, et al.
Published: (2026)

Training Deep Visual Networks Beyond Loss and Accuracy Through a Dynamical Systems Approach
by: La Quang, Hai, et al.
Published: (2026)

Robustness Evaluation of OCR-based Visual Document Understanding under Multi-Modal Adversarial Attacks
by: Tien, Dong Nguyen, et al.
Published: (2025)

Discovering Hidden Visual Concepts Beyond Linguistic Input in Infant Learning
by: Ke, Xueyi, et al.
Published: (2025)

Generation and Detection of Sign Language Deepfakes - A Linguistic and Visual Analysis
by: Naeem, Shahzeb, et al.
Published: (2024)

UniHOI: Unified Human-Object Interaction Understanding via Unified Token Space
by: Yang, Panqi, et al.
Published: (2025)

A Multimodal Framework for Aligning Human Linguistic Descriptions with Visual Perceptual Data
by: Bingham, Joseph
Published: (2026)

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
by: Blume, Ansel, et al.
Published: (2025)

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
by: Liu, Rui, et al.
Published: (2024)

Reinforced Embodied Active Defense: Exploiting Adaptive Interaction for Robust Visual Perception in Adversarial 3D Environments
by: Yang, Xiao, et al.
Published: (2025)

SUGAR: A Sweeter Spot for Generative Unlearning of Many Identities
by: Nguyen, Dung Thuy, et al.
Published: (2025)

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)

GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
by: Quang, Ngoc Bui Lam, et al.
Published: (2025)

Lang2Act: Fine-Grained Visual Reasoning through Self-Emergent Linguistic Toolchains
by: Xiong, Yuqi, et al.
Published: (2026)

VDC: Versatile Data Cleanser based on Visual-Linguistic Inconsistency by Multimodal Large Language Models
by: Zhu, Zihao, et al.
Published: (2023)

Towards Understanding Visual Grounding in Visual Language Models
by: Pantazopoulos, Georgios, et al.
Published: (2025)

PerspectiveNet: Multi-View Perception for Dynamic Scene Understanding
by: Nguyen, Vinh
Published: (2024)

Semi-Supervised Semantic Segmentation using Redesigned Self-Training for White Blood Cells
by: Luu, Vinh Quoc, et al.
Published: (2024)

LangXAI: Integrating Large Vision Models for Generating Textual Explanations to Enhance Explainability in Visual Perception Tasks
by: Nguyen, Truong Thanh Hung, et al.
Published: (2024)

Solving Scene Understanding for Autonomous Navigation in Unstructured Environments
by: Renji, Naveen Mathews, et al.
Published: (2025)

Contrastive Integrated Gradients: A Feature Attribution-Based Method for Explaining Whole Slide Image Classification
by: Vu, Anh Mai, et al.
Published: (2025)

Human-Object Interaction from Human-Level Instructions
by: Wu, Zhen, et al.
Published: (2024)

Aligning Machine and Human Visual Representations across Abstraction Levels
by: Muttenthaler, Lukas, et al.
Published: (2024)

Vision-Aware Text Features in Referring Image Segmentation: From Object Understanding to Context Understanding
by: Nguyen-Truong, Hai, et al.
Published: (2024)

A Survey of Video Datasets for Grounded Event Understanding
by: Sanders, Kate, et al.
Published: (2024)

HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
by: Dong, Yifei, et al.
Published: (2025)

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
by: Nguyen, Le Thien Phuc, et al.
Published: (2025)

VITAL: Interactive Few-Shot Imitation Learning via Visual Human-in-the-Loop Corrections
by: Kasaei, Hamidreza, et al.
Published: (2024)