| Main Authors: | Zhu, Muzhi, Tian, Yuzhuo, Chen, Hao, Zhou, Chunluan, Guo, Qingpei, Liu, Yang, Yang, Ming, Shen, Chunhua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.08625 |
Similar Items
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
by: Dong, Xingning, et al.
Published: (2024)
VideoScaffold: Elastic-Scale Visual Hierarchies for Streaming Video Understanding in MLLMs
by: Zheng, Naishan, et al.
Published: (2025)
MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
by: Dai, Ming, et al.
Published: (2025)
Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching
by: Liu, Yang, et al.
Published: (2023)
DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data
by: Fan, Chengxiang, et al.
Published: (2024)
Generative Active Learning for Long-tailed Instance Segmentation
by: Zhu, Muzhi, et al.
Published: (2024)
A Simple Image Segmentation Framework via In-Context Examples
by: Liu, Yang, et al.
Published: (2024)
From Text to Pixel: Advancing Long-Context Understanding in MLLMs
by: Lu, Yujie, et al.
Published: (2024)
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
by: Wu, Weijia, et al.
Published: (2023)
Exploring Spatial Intelligence from a Generative Perspective
by: Zhu, Muzhi, et al.
Published: (2026)
Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation
by: Zhu, Muzhi, et al.
Published: (2024)
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
by: Han, Yudong, et al.
Published: (2024)
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs
by: Xuan, Shiyu, et al.
Published: (2023)
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO
by: Zhu, Muzhi, et al.
Published: (2025)
Referencing Where to Focus: Improving Visual Grounding with Referential Query
by: Wang, Yabing, et al.
Published: (2024)
From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval
by: Wang, Yabing, et al.
Published: (2025)
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval
by: Yu, Xuzheng, et al.
Published: (2024)
Unified Open-World Segmentation with Multi-Modal Prompts
by: Liu, Yang, et al.
Published: (2025)
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
by: Jia, Yiduo, et al.
Published: (2026)
TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs
by: Xu, Pengju, et al.
Published: (2025)
X-Stream: Exploring MLLMs as Multiplexers for Multi-Stream Understanding
by: Sun, Peiwen, et al.
Published: (2026)
HumanVBench: Probing Human-Centric Video Understanding in MLLMs with Automatically Synthesized Benchmarks
by: Zhou, Ting, et al.
Published: (2024)
Social Debiasing for Fair Multi-modal LLMs
by: Cheng, Harry, et al.
Published: (2024)
FlattenGPT: Depth Compression for Transformer with Layer Flattening
by: Xu, Ruihan, et al.
Published: (2026)
Video-MSR: Benchmarking Multi-hop Spatial Reasoning Capabilities of MLLMs
by: Zhu, Rui, et al.
Published: (2026)
Where to Look: Can Foundation Models Reach a Target Viewpoint Through Active Exploration?
by: Li, Liyang, et al.
Published: (2026)
Revisiting Synthetic Human Trajectories: Imitative Generation and Benchmarks Beyond Datasaurus
by: Deng, Bangchao, et al.
Published: (2024)
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
by: Ma, Ziping, et al.
Published: (2024)
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration
by: Zhong, Hao, et al.
Published: (2025)
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding
by: Lin, Jingli, et al.
Published: (2025)
Preserving Source Video Realism: High-Fidelity Face Swapping for Cinematic Quality
by: Luo, Zekai, et al.
Published: (2025)
From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
by: Chen, Yiming, et al.
Published: (2025)
DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
by: Zhao, Canyu, et al.
Published: (2025)
HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation
by: Chen, Cong, et al.
Published: (2025)
Towards Efficient Pixel Labeling for Industrial Anomaly Detection and Localization
by: Li, Hanxi, et al.
Published: (2024)
An Improved Social Force Model‐Driven Multi‐Agent Generative Adversarial Imitation Learning Framework for Pedestrian Trajectory Prediction
by: Wen Zhou, et al.
Published: (2025)
HumanVideo-MME: Benchmarking MLLMs for Human-Centric Video Understanding
by: Cai, Yuxuan, et al.
Published: (2025)
Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert
by: Liu, Mingyu, et al.
Published: (2025)
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
by: Guo, Qingpei, et al.
Published: (2024)
An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability
by: Wu, Daiqing, et al.
Published: (2025)