| Main Authors: | Pham, Thang M., Chen, Peijie, Nguyen, Tin, Yoon, Seunghyun, Bui, Trung, Nguyen, Anh Totti |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2403.05297 |
Similar Items
Leveraging Habitat Information for Fine-grained Bird Identification
by: Nguyen, Tin, et al.
Published: (2023)
TAB: Transformer Attention Bottlenecks enable User Intervention and Debugging in Vision-Language Models
by: Rahmanzadehgervi, Pooyan, et al.
Published: (2024)
FIFA: Unified Faithfulness Evaluation Framework for Text-to-Video and Video-to-Text Generation
by: Jing, Liqiang, et al.
Published: (2025)
SketchVLM: Vision language models can annotate images to explain thoughts and guide users
by: Collins, Brandon, et al.
Published: (2026)
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
by: Lee, Saehyung, et al.
Published: (2024)
MS4UI: A Dataset for Multi-modal Summarization of User Interface Instructional Videos
by: Zang, Yuan, et al.
Published: (2025)
Understanding Generative AI Capabilities in Everyday Image Editing Tasks
by: Taesiri, Mohammad Reza, et al.
Published: (2025)
LiteGPT: Large Vision-Language Model for Joint Chest X-ray Localization and Classification Task
by: Le-Duc, Khai, et al.
Published: (2024)
HoT: Highlighted Chain of Thought for Referencing Supporting Facts from Inputs
by: Nguyen, Tin, et al.
Published: (2025)
PCNN: Probable-Class Nearest-Neighbor Explanations Improve Fine-Grained Image Classification Accuracy for AIs and Humans
by: Giang, et al.
Published: (2023)
Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
by: Nguyen, Viet, et al.
Published: (2024)
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
by: Yoon, Yejun, et al.
Published: (2024)
Linguistically Informed Multimodal Fusion for Vietnamese Scene-Text Image Captioning: Dataset, Graph Framework, and Phonological Attention
by: Nguyen, Nhi Ngoc-Yen, et al.
Published: (2026)
FlexEdit: Flexible and Controllable Diffusion-based Object-centric Image Editing
by: Nguyen, Trong-Tung, et al.
Published: (2024)
MedSteer: Counterfactual Endoscopic Synthesis via Training-Free Activation Steering
by: Pham, Trong-Thang, et al.
Published: (2026)
SlimLM: An Efficient Small Language Model for On-Device Document Assistance
by: Pham, Thang M., et al.
Published: (2024)
Vision Language Models are Biased
by: Vo, An, et al.
Published: (2025)
Improving Zero-Shot Object-Level Change Detection by Incorporating Visual Correspondence
by: Nguyen, Hung Huy, et al.
Published: (2025)
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
by: Lee, Daeun, et al.
Published: (2025)
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
by: Nguyen, Thong, et al.
Published: (2023)
CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base
by: Nguyen, Cong-Duy, et al.
Published: (2025)
READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
by: Nguyen, Thong, et al.
Published: (2023)
TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
by: Vo, Khang H. N., et al.
Published: (2025)
PageGuide: Browser extension to assist users in navigating a webpage and locating information
by: Nguyen, Tin, et al.
Published: (2026)
Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media
by: Nguyen, Thi Huyen, et al.
Published: (2026)
VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation
by: Nguyen, Anh Tien, et al.
Published: (2025)
CT-ScanGaze: A Dataset and Baselines for 3D Volumetric Scanpath Modeling
by: Pham, Trong-Thang, et al.
Published: (2025)
PADM: A Physics-aware Diffusion Model for Attenuation Correction
by: Pham, Trung Kien, et al.
Published: (2025)
SuMa: A Subspace Mapping Approach for Robust and Effective Concept Erasure in Text-to-Image Diffusion Models
by: Nguyen, Kien, et al.
Published: (2025)
SwiftEdit: Lightning Fast Text-Guided Image Editing via One-Step Diffusion
by: Nguyen, Trong-Tung, et al.
Published: (2024)
Vision language models are blind: Failing to translate detailed visual features into words
by: Rahmanzadehgervi, Pooyan, et al.
Published: (2024)
A Hybrid Vision Transformer Approach for Mathematical Expression Recognition
by: Le, Anh Duy, et al.
Published: (2026)
Stable Messenger: Steganography for Message-Concealed Image Generation
by: Nguyen, Quang, et al.
Published: (2023)
MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning
by: Nguyen, Thong, et al.
Published: (2024)
GPC: Generative and General Pathology Image Classifier
by: Nguyen, Anh Tien, et al.
Published: (2024)
Fourier-Attentive Representation Learning: A Fourier-Guided Framework for Few-Shot Generalization in Vision-Language Models
by: Pham, Hieu Dinh Trung, et al.
Published: (2025)
Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
by: Bui, Phuoc-Nguyen, et al.
Published: (2025)
Frequency Adapter with SAM for Generalized Medical Image Segmentation
by: Bui, Phuoc-Nguyen, et al.
Published: (2026)
VLind-Bench: Measuring Language Priors in Large Vision-Language Models
by: Lee, Kang-il, et al.
Published: (2024)
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
by: Nguyen, Nghia Hieu, et al.
Published: (2024)