| Main Authors: | Tang, Shixiang, Wang, Yizhou, Chen, Lu, Wang, Yuan, Peng, Sida, Xu, Dan, Ouyang, Wanli |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.08556 |
Similar Items
EgoAgent: A Joint Predictive Agent Model in Egocentric Worlds
by: Chen, Lu, et al.
Published: (2025)
Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision
by: Wu, Haoning, et al.
Published: (2023)
Synthetic Perception: Can Generated Images Unlock Latent Visual Prior for Text-Centric Reasoning?
by: Huang, Yuesheng, et al.
Published: (2025)
HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
by: Chen, Liyang, et al.
Published: (2025)
EgoEsportsQA: An Egocentric Video Benchmark for Perception and Reasoning in Esports
by: Ma, Jianzhe, et al.
Published: (2026)
Lumos-1: On Autoregressive Video Generation with Discrete Diffusion from a Unified Model Perspective
by: Yuan, Hangjie, et al.
Published: (2025)
NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching
by: Luo, Run, et al.
Published: (2025)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
Video-EM: Event-Centric Episodic Memory for Long-Form Video Understanding
by: Wang, Yun, et al.
Published: (2025)
Scaling Spatial Intelligence with Multimodal Foundation Models
by: Cai, Zhongang, et al.
Published: (2025)
A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis
by: Hei, Nailei, et al.
Published: (2024)
Distilling Vision-Language Foundation Models: A Data-Free Approach via Prompt Diversification
by: Xuan, Yunyi, et al.
Published: (2024)
HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer
by: Cai, Qi, et al.
Published: (2025)
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
by: Li, Liupeng, et al.
Published: (2026)
Q-Ponder: A Unified Training Pipeline for Reasoning-based Visual Quality Assessment
by: Cai, Zhuoxuan, et al.
Published: (2025)
TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
by: Shi, Chuancheng, et al.
Published: (2026)
Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward
by: Tang, Yolo Yunlong, et al.
Published: (2022)
Hulk: A Universal Knowledge Translator for Human-Centric Tasks
by: Wang, Yizhou, et al.
Published: (2023)
Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision
by: Yin, Kangsheng, et al.
Published: (2025)
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
by: Zhang, Yuang, et al.
Published: (2024)
LongInsightBench: A Comprehensive Benchmark for Evaluating Omni-Modal Models on Human-Centric Long-Video Understanding
by: Han, ZhaoYang, et al.
Published: (2025)
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
by: Sun, Zeyi, et al.
Published: (2024)
Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
by: Cao, Pu, et al.
Published: (2023)
Robust Fuzzy Multi-view Learning under View Conflict
by: Duan, Siyuan, et al.
Published: (2026)
HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
by: Ouyang, Shuyi, et al.
Published: (2024)
HiDream-O1-Image: A Natively Unified Image Generative Foundation Model with Pixel-level Unified Transformer
by: Cai, Qi, et al.
Published: (2026)
EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation
by: Qu, Qiang, et al.
Published: (2025)
Phi-Ground Tech Report: Advancing Perception in GUI Grounding
by: Zhang, Miaosen, et al.
Published: (2025)
VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents
by: Yi, Hongzhu, et al.
Published: (2026)
Can Large Language Models Grasp Event Signals? Exploring Pure Zero-Shot Event-based Recognition
by: Yu, Zongyou, et al.
Published: (2024)
SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation
by: Chen, Tong, et al.
Published: (2024)
Crab$^{+}$: A Scalable and Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Cai, Dongnuan, et al.
Published: (2026)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
PanMatch: Unleashing the Potential of Large Vision Models for Unified Matching Models
by: Zhang, Yongjian, et al.
Published: (2025)
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
by: Yu, Lijun, et al.
Published: (2023)
Navigating the Mirage: A Dual-Path Agentic Framework for Robust Misleading Chart Question Answering
by: Zhang, Yanjie, et al.
Published: (2026)
Smart Transfer: Leveraging Vision Foundation Model for Rapid Building Damage Mapping with Post-Earthquake VHR Imagery
by: Li, Hao, et al.
Published: (2026)
One Size, Many Fits: Aligning Diverse Group-Wise Click Preferences in Large-Scale Advertising Image Generation
by: Lu, Shuo, et al.
Published: (2026)
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
by: Li, Xinyue, et al.
Published: (2026)
TIDE : Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation
by: Huang, Victor Shea-Jay, et al.
Published: (2025)