Library Catalog

Bibliographic Details
Main Authors:	Gu, Chao; Lin, Ke; Luo, Yiyang; Hou, Jiahui; Li, Xiang-Yang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2409.00909

Similar Items

ViPO: Visual Preference Optimization at Scale
by: Li, Ming, et al.
Published: (2026)

ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
by: Luo, Rundong, et al.
Published: (2025)

ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)

Natural Language Supervision for Low-light Image Enhancement
by: Tang, Jiahui, et al.
Published: (2025)

ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
by: Yang, Ziteng, et al.
Published: (2025)

VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
by: Wu, Meiqi, et al.
Published: (2025)

Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction
by: Khan, Muhammad Tayyab, et al.
Published: (2024)

Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors
by: Meng, Ke, et al.
Published: (2024)

CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)

LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
by: Wu, Linquan, et al.
Published: (2026)

Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
by: Wu, Junfei, et al.
Published: (2025)

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory
by: Li, Quanjiang, et al.
Published: (2026)

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
by: Xiao, Hongcan, et al.
Published: (2026)

ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)

Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
by: Li, Chengtai, et al.
Published: (2026)

SFMViT: SlowFast Meet ViT in Chaotic World
by: Lin, Jiaying, et al.
Published: (2024)

ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
by: Tong, Haoyu, et al.
Published: (2026)

Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition
by: Zuo, Lin, et al.
Published: (2024)

From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
by: Khan, Muhammad Tayyab, et al.
Published: (2025)

ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
by: Liao, Bencheng, et al.
Published: (2024)

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
by: Yan, Siming, et al.
Published: (2024)

Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer
by: Khan, Muhammad Tayyab, et al.
Published: (2025)

Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
by: Liu, Xianlin, et al.
Published: (2025)

Context-Aware Indoor Point Cloud Object Generation through User Instructions
by: Luo, Yiyang, et al.
Published: (2023)

Neural Proteomics Fields for Super-resolved Spatial Proteomics Prediction
by: Zhao, Bokai, et al.
Published: (2025)

Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
by: Luo, Yongdong, et al.
Published: (2024)

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
by: Qin, Luozheng, et al.
Published: (2026)

Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning
by: Xu, Shihao, et al.
Published: (2024)

GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
by: Wu, Fengyi, et al.
Published: (2025)

ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition
by: Park, Minjeong, et al.
Published: (2025)

ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
by: Zhang, Ben, et al.
Published: (2025)

LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation
by: Jeon, Hyunsik, et al.
Published: (2025)

Frequency-Dynamic Attention Modulation for Dense Prediction
by: Chen, Linwei, et al.
Published: (2025)

ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
by: Hansen-Estruch, Philippe, et al.
Published: (2026)

Frequency Dynamic Convolution for Dense Image Prediction
by: Chen, Linwei, et al.
Published: (2025)

Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)

CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
by: Tang, Zicong, et al.
Published: (2025)

MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition
by: Kim, Ye-eun, et al.
Published: (2025)

Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
by: Yang, Yi, et al.
Published: (2025)

VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
by: Wang, Xiyao, et al.
Published: (2026)