| Main Authors: | Gu, Chao, Lin, Ke, Luo, Yiyang, Hou, Jiahui, Li, Xiang-Yang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.00909 |
Similar Items
ViPO: Visual Preference Optimization at Scale
by: Li, Ming, et al.
Published: (2026)
ShadowDraw: From Any Object to Shadow-Drawing Compositional Art
by: Luo, Rundong, et al.
Published: (2025)
ViGoR-Bench: How Far Are Visual Generative Models From Zero-Shot Visual Reasoners?
by: Han, Haonan, et al.
Published: (2026)
Natural Language Supervision for Low-light Image Enhancement
by: Tang, Jiahui, et al.
Published: (2025)
ViP$^2$-CLIP: Visual-Perception Prompting with Unified Alignment for Zero-Shot Anomaly Detection
by: Yang, Ziteng, et al.
Published: (2025)
VS-LLM: Visual-Semantic Depression Assessment based on LLM for Drawing Projection Test
by: Wu, Meiqi, et al.
Published: (2025)
Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction
by: Khan, Muhammad Tayyab, et al.
Published: (2024)
Navigating Efficiency in MobileViT through Gaussian Process on Global Architecture Factors
by: Meng, Ke, et al.
Published: (2024)
CLiViS: Unleashing Cognitive Map through Linguistic-Visual Synergy for Embodied Visual Reasoning
by: Li, Kailing, et al.
Published: (2025)
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
by: Wu, Linquan, et al.
Published: (2026)
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
by: Wu, Junfei, et al.
Published: (2025)
Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory
by: Li, Quanjiang, et al.
Published: (2026)
3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
by: Xiao, Hongcan, et al.
Published: (2026)
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)
Predictive Reasoning with Augmented Anomaly Contrastive Learning for Compositional Visual Relations
by: Li, Chengtai, et al.
Published: (2026)
SFMViT: SlowFast Meet ViT in Chaotic World
by: Lin, Jiaying, et al.
Published: (2024)
ViSA-Enhanced Aerial VLN: A Visual-Spatial Reasoning Enhanced Framework for Aerial Vision-Language Navigation
by: Tong, Haoyu, et al.
Published: (2026)
Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition
by: Zuo, Lin, et al.
Published: (2024)
From Drawings to Decisions: A Hybrid Vision-Language Framework for Parsing 2D Engineering Drawings into Structured Manufacturing Knowledge
by: Khan, Muhammad Tayyab, et al.
Published: (2025)
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
by: Liao, Bencheng, et al.
Published: (2024)
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
by: Yan, Siming, et al.
Published: (2024)
Automated Parsing of Engineering Drawings for Structured Information Extraction Using a Fine-tuned Document Understanding Transformer
by: Khan, Muhammad Tayyab, et al.
Published: (2025)
Text-Enhanced Panoptic Symbol Spotting in CAD Drawings
by: Liu, Xianlin, et al.
Published: (2025)
Context-Aware Indoor Point Cloud Object Generation through User Instructions
by: Luo, Yiyang, et al.
Published: (2023)
Neural Proteomics Fields for Super-resolved Spatial Proteomics Prediction
by: Zhao, Bokai, et al.
Published: (2025)
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
by: Luo, Yongdong, et al.
Published: (2024)
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
by: Qin, Luozheng, et al.
Published: (2026)
Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning
by: Xu, Shihao, et al.
Published: (2024)
GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
by: Wu, Fengyi, et al.
Published: (2025)
ViTA-PAR: Visual and Textual Attribute Alignment with Attribute Prompting for Pedestrian Attribute Recognition
by: Park, Minjeong, et al.
Published: (2025)
ViFP: A Framework for Visual False Positive Detection to Enhance Reasoning Reliability in VLMs
by: Zhang, Ben, et al.
Published: (2025)
LaViC: Adapting Large Vision-Language Models to Visually-Aware Conversational Recommendation
by: Jeon, Hyunsik, et al.
Published: (2025)
Frequency-Dynamic Attention Modulation for Dense Prediction
by: Chen, Linwei, et al.
Published: (2025)
ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
by: Hansen-Estruch, Philippe, et al.
Published: (2026)
Frequency Dynamic Convolution for Dense Image Prediction
by: Chen, Linwei, et al.
Published: (2025)
Visual Position Prompt for MLLM based Visual Grounding
by: Tang, Wei, et al.
Published: (2025)
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
by: Tang, Zicong, et al.
Published: (2025)
MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition
by: Kim, Ye-eun, et al.
Published: (2025)
Mantis: A Versatile Vision-Language-Action Model with Disentangled Visual Foresight
by: Yang, Yi, et al.
Published: (2025)
VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs
by: Wang, Xiyao, et al.
Published: (2026)