:: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Yinan; Chen, Yuxin; Lin, Haokun; Wu, Yichen; Yang, Shuyu; Qi, Zhongang; Ma, Chen; Zhu, Li; Shan, Ying
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2411.17125

Similar Items

UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
by: Liu, Ye, et al.
Published: (2025)

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
by: Liu, Ye, et al.
Published: (2024)

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM
by: Yang, Tao, et al.
Published: (2024)

Scale Up Composed Image Retrieval Learning via Modification Text Generation
by: Zhou, Yinan, et al.
Published: (2025)

Taming Rectified Flow for Inversion and Editing
by: Wang, Jiangshan, et al.
Published: (2024)

Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion
by: Yu, Songsong, et al.
Published: (2025)

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
by: Chen, Yuxin, et al.
Published: (2024)

OmniScript: Towards Audio-Visual Script Generation for Long-Form Cinematic Video
by: Pu, Junfu, et al.
Published: (2026)

DopQ-ViT: Towards Distribution-Friendly and Outlier-Aware Post-Training Quantization for Vision Transformers
by: Yang, Lianwei, et al.
Published: (2024)

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
by: Zheng, Guangcong, et al.
Published: (2023)

EA-VTR: Event-Aware Video-Text Retrieval
by: Ma, Zongyang, et al.
Published: (2024)

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses
by: Tan, Chaolei, et al.
Published: (2024)

FashionLens: Toward Versatile Fashion Image Retrieval via Task-Adaptive Learning
by: Wen, Haokun, et al.
Published: (2026)

Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
by: Yang, Yuxin, et al.
Published: (2026)

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities
by: Wu, Tao, et al.
Published: (2024)

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model
by: Wu, Tao, et al.
Published: (2024)

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation
by: Li, Xuewei, et al.
Published: (2023)

VividMed: Vision Language Model with Versatile Visual Grounding for Medicine
by: Luo, Lingxiao, et al.
Published: (2024)

StyleAdapter: A Unified Stylized Image Generation Model
by: Wang, Zhouxia, et al.
Published: (2023)

Learning to Reason in 4D: Dynamic Spatial Understanding for Vision Language Models
by: Zhou, Shengchao, et al.
Published: (2025)

Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
by: Zhu, Ziyu, et al.
Published: (2025)

BINO: Encoder Centric Self Supervised Stereo With Native Pair Input
by: Zhou, Haokun
Published: (2026)

ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
by: Liang, Tianming, et al.
Published: (2025)

Joint Reference Frame Synthesis and Post Filter Enhancement for Versatile Video Coding
by: Bao, Weijie, et al.
Published: (2024)

AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing
by: Yang, Fan, et al.
Published: (2023)

VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models
by: Wu, Tao, et al.
Published: (2024)

VLM-Assisted Continual learning for Visual Question Answering in Self-Driving
by: Lin, Yuxin, et al.
Published: (2025)

LRQ-DiT: Log-Rotation Post-Training Quantization of Diffusion Transformers for Image and Video Generation
by: Yang, Lianwei, et al.
Published: (2025)

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models
by: Huang, Ziqi, et al.
Published: (2024)

Object-centric Video Question Answering with Visual Grounding and Referring
by: Wang, Haochen, et al.
Published: (2025)

Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning
by: Du, Jia-Run, et al.
Published: (2022)

Grounded 3D-LLM with Referent Tokens
by: Chen, Yilun, et al.
Published: (2024)

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding
by: Zheng, Minghang, et al.
Published: (2024)

TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
by: Lin, Haokun, et al.
Published: (2025)

Multimodal Reference Visual Grounding
by: Lu, Yangxiao, et al.
Published: (2025)

Singular Value Fine-tuning for Few-Shot Class-Incremental Learning
by: Wang, Zhiwu, et al.
Published: (2025)

CRAFT: A Neuro-Symbolic Framework for Visual Functional Affordance Grounding
by: Chen, Zhou, et al.
Published: (2025)

Visual Grounding with Multi-modal Conditional Adaptation
by: Yao, Ruilin, et al.
Published: (2024)

See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs
by: Zhang, Yongchang, et al.
Published: (2026)

Towards Omnimodal Expressions and Reasoning in Referring Audio-Visual Segmentation
by: Ying, Kaining, et al.
Published: (2025)