| Main Authors: | Rahmanzadehgervi, Pooyan, Nguyen, Hung Huy, Liu, Rosanne, Mai, Long, Nguyen, Anh Totti |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2412.18675 |
Similar Items
Improving Zero-Shot Object-Level Change Detection by Incorporating Visual Correspondence
by: Nguyen, Hung Huy, et al.
Published: (2025)
Vision language models are blind: Failing to translate detailed visual features into words
by: Rahmanzadehgervi, Pooyan, et al.
Published: (2024)
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
by: Collins, Brandon, et al.
Published: (2026)
Vision Language Models are Biased
by: Vo, An, et al.
Published: (2025)
PEEB: Part-based Image Classifiers with an Explainable and Editable Language Bottleneck
by: Pham, Thang M., et al.
Published: (2024)
Leveraging Habitat Information for Fine-grained Bird Identification
by: Nguyen, Tin, et al.
Published: (2023)
Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
by: Nguyen, Minh Khoi, et al.
Published: (2026)
LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task
by: Le-Duc, Khai, et al.
Published: (2024)
PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans
by: Giang, et al.
Published: (2023)
Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
by: Nguyen, Huu Tien, et al.
Published: (2025)
WAVER: Writing-style Agnostic Text-Video Retrieval via Distilling Vision-Language Models Through Open-Vocabulary Knowledge
by: Le, Huy, et al.
Published: (2023)
Exploring Self-Supervised Vision Transformers for Deepfake Detection: A Comparative Analysis
by: Nguyen, Huy H., et al.
Published: (2024)
STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models
by: Nguyen-Nhu, Tinh-Anh, et al.
Published: (2025)
Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models
by: Pham, Hieu Dinh Trung, et al.
Published: (2025)
ITSELF: Attention Guided Fine-Grained Alignment for Vision-Language Retrieval
by: Nguyen, Tien-Huy, et al.
Published: (2026)
Revisit Visual Prompt Tuning: The Expressiveness of Prompt Experts
by: Le, Minh, et al.
Published: (2025)
GraspMamba: A Mamba-based Language-driven Grasp Detection Framework with Hierarchical Feature Learning
by: Nguyen, Huy Hoang, et al.
Published: (2024)
Detecting Precise Hand Touch Moments in Egocentric Video
by: Nguyen, Huy Anh, et al.
Published: (2026)
Beyond Traditional Approaches: Multi-Task Network for Breast Ultrasound Diagnosis
by: Chung, Dat T., et al.
Published: (2024)
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
by: Le, Anh Duy, et al.
Published: (2026)
Emotic Masked Autoencoder with Attention Fusion for Facial Expression Recognition
by: Nguyen-Xuan, Bach, et al.
Published: (2024)
Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
by: Bui, Phuoc-Nguyen, et al.
Published: (2025)
Blurry-Consistency Segmentation Framework with Selective Stacking on Differential Interference Contrast 3D Breast Cancer Spheroid
by: Nguyen, Thanh-Huy, et al.
Published: (2024)
Debugging Concept Bottleneck Models through Removal and Retraining
by: Enouen, Eric, et al.
Published: (2025)
VisionGuard: Synergistic Framework for Helmet Violation Detection
by: Nguyen, Lam-Huy, et al.
Published: (2025)
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking
by: Tran, Huu-Loc, et al.
Published: (2025)
A smart fridge with AI-enabled food computing
by: Thuc, Khue Nong, et al.
Published: (2025)
Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
by: Nguyen, Thong, et al.
Published: (2025)
Deep-Wide Learning Assistance for Insect Pest Classification
by: Nguyen, Toan, et al.
Published: (2024)
Leveraging Chat-Based Large Vision Language Models for Multimodal Out-Of-Context Detection
by: Shalabi, Fatma, et al.
Published: (2024)
Semi-supervised 3D Semantic Scene Completion with 2D Vision Foundation Model Guidance
by: Pham, Duc-Hai, et al.
Published: (2024)
HOIST-Former: Hand-held Objects Identification, Segmentation, and Tracking in the Wild
by: Narasimhaswamy, Supreeth, et al.
Published: (2024)
VideoSAVi: Self-Aligned Video Language Models without Human Supervision
by: Kulkarni, Yogesh, et al.
Published: (2024)
CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling
by: Pham, Trong-Thang, et al.
Published: (2025)
RT-VLM: Re-Thinking Vision Language Model with 4-Clues for Real-World Object Recognition Robustness
by: Park, Junghyun, et al.
Published: (2025)
A Lightweight Moment Retrieval System with Global Re-Ranking and Robust Adaptive Bidirectional Temporal Search
by: Nguyen-Nhu, Tinh-Anh, et al.
Published: (2025)
From Coarse to Fine: Learnable Discrete Wavelet Transforms for Efficient 3D Gaussian Splatting
by: Nguyen, Hung, et al.
Published: (2025)
DWTNeRF: Boosting Few-shot Neural Radiance Fields via Discrete Wavelet Transform
by: Nguyen, Hung, et al.
Published: (2025)
SwiftPie: Lightning-fast Subject-driven Image Personalization via One step Diffusion
by: Duong, Huy, et al.
Published: (2026)
Cycle Training with Semi-Supervised Domain Adaptation: Bridging Accuracy and Efficiency for Real-Time Mobile Scene Detection
by: Phan-Nguyen, Huu-Phong, et al.
Published: (2025)