Library Catalog

Bibliographic Details
Main Authors:	Zhu, Muzhi, Tian, Yuzhuo, Chen, Hao, Zhou, Chunluan, Guo, Qingpei, Liu, Yang, Yang, Ming, Shen, Chunhua
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.08625

Similar Items

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
by: Dong, Xingning, et al.
Published: (2024)

VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)

MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
by: Dai, Ming, et al.
Published: (2025)

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
by: Liu, Yang, et al.
Published: (2023)

DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data
by: Fan, Chengxiang, et al.
Published: (2024)

Generative Active Learning for Long-tailed Instance Segmentation
by: Zhu, Muzhi, et al.
Published: (2024)

A Simple Image Segmentation Framework via In-Context Examples
by: Liu, Yang, et al.
Published: (2024)

From Text to Pixel: Advancing Long-Context Understanding in MLLMs
by: Lu, Yujie, et al.
Published: (2024)

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
by: Wu, Weijia, et al.
Published: (2023)

Exploring Spatial Intelligence from a Generative Perspective
by: Zhu, Muzhi, et al.
Published: (2026)

Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation
by: Zhu, Muzhi, et al.
Published: (2024)

DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
by: Han, Yudong, et al.
Published: (2024)

Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
by: Xuan, Shiyu, et al.
Published: (2023)

Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
by: Zhu, Muzhi, et al.
Published: (2025)

Referencing Where to Focus: Improving Visual Grounding with Referential Query
by: Wang, Yabing, et al.
Published: (2024)

From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval
by: Wang, Yabing, et al.
Published: (2025)

SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
by: Yu, Xuzheng, et al.
Published: (2024)

Unified Open-World Segmentation with Multi-Modal Prompts
by: Liu, Yang, et al.
Published: (2025)

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
by: Jia, Yiduo, et al.
Published: (2026)

TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)

X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
by: Sun, Peiwen, et al.
Published: (2026)

HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
by: Zhou, Ting, et al.
Published: (2024)

Social Debiasing for Fair Multi-modal LLMs
by: Cheng, Harry, et al.
Published: (2024)

FlattenGPT: Depth Compression for Transformer with Layer Flattening
by: Xu, Ruihan, et al.
Published: (2026)

Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)

Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
by: Li, Liyang, et al.
Published: (2026)

Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus
by: Deng, Bangchao, et al.
Published: (2024)

SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
by: Ma, Ziping, et al.
Published: (2024)

Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
by: Zhong, Hao, et al.
Published: (2025)

OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
by: Lin, Jingli, et al.
Published: (2025)

Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
by: Luo, Zekai, et al.
Published: (2025)

From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
by: Chen, Yiming, et al.
Published: (2025)

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
by: Zhao, Canyu, et al.
Published: (2025)

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
by: Chen, Cong, et al.
Published: (2025)

Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization
by: Li, Hanxi, et al.
Published: (2024)

An Improved Social Force Model‐Driven Multi‐Agent Generative Adversarial Imitation Learning Framework for Pedestrian Trajectory Prediction
by: Wen Zhou, et al.
Published: (2025)

HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)

Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert
by: Liu, Mingyu, et al.
Published: (2025)

M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
by: Guo, Qingpei, et al.
Published: (2024)

An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)