| Main Authors: | Zhou, Yinan, Chen, Yuxin, Lin, Haokun, Wu, Yichen, Yang, Shuyu, Qi, Zhongang, Ma, Chen, Zhu, Li, Shan, Ying |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2411.17125 |
Similar Items
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
by: Liu, Ye, et al.
Published: (2025)
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
by: Liu, Ye, et al.
Published: (2024)
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
by: Yang, Tao, et al.
Published: (2024)
Scale Up Composed Image Retrieval Learning via Modification Text Generation
by: Zhou, Yinan, et al.
Published: (2025)
Taming Rectified Flow for Inversion and Editing
by: Wang, Jiangshan, et al.
Published: (2024)
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
by: Yu, Songsong, et al.
Published: (2025)
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
by: Chen, Yuxin, et al.
Published: (2024)
OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)
DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers
by: Yang, Lianwei, et al.
Published: (2024)
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
by: Zheng, Guangcong, et al.
Published: (2023)
EA-VTR: Event-Aware Video-Text Retrieval
by: Ma, Zongyang, et al.
Published: (2024)
SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
by: Tan, Chaolei, et al.
Published: (2024)
FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
by: Wen, Haokun, et al.
Published: (2026)
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
by: Yang, Yuxin, et al.
Published: (2026)
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
by: Wu, Tao, et al.
Published: (2024)
SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model
by: Wu, Tao, et al.
Published: (2024)
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
by: Li, Xuewei, et al.
Published: (2023)
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)
StyleAdapter: A Unified Stylized Image Generation Model
by: Wang, Zhouxia, et al.
Published: (2023)
Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)
BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
by: Zhou, Haokun
Published: (2026)
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
by: Liang, Tianming, et al.
Published: (2025)
Joint Reference Frame Synthesis and Post Filter Enhancement for Versatile Video Coding
by: Bao, Weijie, et al.
Published: (2024)
AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing
by: Yang, Fan, et al.
Published: (2023)
VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)
VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
by: Lin, Yuxin, et al.
Published: (2025)
LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Image and Video Generation
by: Yang, Lianwei, et al.
Published: (2025)
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
by: Huang, Ziqi, et al.
Published: (2024)
Object-centric Video Question Answering with Visual Grounding and Referring
by: Wang, Haochen, et al.
Published: (2025)
Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning
by: Du, Jia-Run, et al.
Published: (2022)
Grounded 3D-LLM with Referent Tokens
by: Chen, Yilun, et al.
Published: (2024)
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding
by: Zheng, Minghang, et al.
Published: (2024)
TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)
Multimodal Reference Visual Grounding
by: Lu, Yangxiao, et al.
Published: (2025)
Singular Value Fine-tuning for Few-Shot Class-Incremental Learning
by: Wang, Zhiwu, et al.
Published: (2025)
CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding
by: Chen, Zhou, et al.
Published: (2025)
Visual Grounding with Multi-modal Conditional Adaptation
by: Yao, Ruilin, et al.
Published: (2024)
See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
by: Zhang, Yongchang, et al.
Published: (2026)
Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)