| Main Authors: | Guo, Jiahao; Du, Sinan; Yao, Jingfeng; Liu, Wenyu; Li, Bo; Cao, Haoxiang; Gai, Kun; Yuan, Chun; Wu, Kai; Wang, Xinggang |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.23469 |
Similar Items
VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
by: Du, Sinan, et al.
Published: (2025)
Matte Anything: Interactive Natural Image Matting with Segment Anything Models
by: Yao, Jingfeng, et al.
Published: (2023)
FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification
by: Yao, Jingfeng, et al.
Published: (2024)
Towards Scalable Pre-training of Visual Tokenizers for Generation
by: Yao, Jingfeng, et al.
Published: (2025)
EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning
by: Yao, Jingfeng, et al.
Published: (2024)
Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
by: Zou, Ya, et al.
Published: (2025)
ViTGaze: Gaze Following with Interaction Features in Vision Transformers
by: Song, Yuehao, et al.
Published: (2024)
DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
by: Zeng, Lunbin, et al.
Published: (2025)
Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
by: Yao, Jingfeng, et al.
Published: (2025)
XS-VID: An Extremely Small Video Object Detection Dataset
by: Guo, Jiahao, et al.
Published: (2024)
LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels
by: Cui, Ziwei, et al.
Published: (2024)
Visual Text Generation in the Wild
by: Zhu, Yuanzhi, et al.
Published: (2024)
DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
by: Xia, Tianze, et al.
Published: (2025)
MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
by: Zhang, Shuai, et al.
Published: (2025)
PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification
by: Hu, Bin, et al.
Published: (2024)
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
by: Xu, Zhengzhuo, et al.
Published: (2023)
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
by: Li, Yingyue, et al.
Published: (2025)
SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild
by: Liu, Jiawei, et al.
Published: (2025)
4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
by: Wu, Xianfeng, et al.
Published: (2025)
Gait Recognition via Collaborating Discriminative and Generative Diffusion Models
by: Xiong, Haijun, et al.
Published: (2025)
AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
by: Jiang, Bo, et al.
Published: (2025)
OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
by: Zou, Jialv, et al.
Published: (2025)
2D Gaussians Meet Visual Tokenizer
by: Shi, Yiang, et al.
Published: (2025)
MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning
by: Zhang, Wenrui, et al.
Published: (2025)
GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images
by: Xu, Ziyang, et al.
Published: (2024)
Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition
by: Xiong, Haijun, et al.
Published: (2024)
Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation
by: Cheng, Tianheng, et al.
Published: (2026)
Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
by: Li, Yongkang, et al.
Published: (2024)
Fast High Dynamic Range Radiance Fields for Dynamic Scenes
by: Wu, Guanjun, et al.
Published: (2024)
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
by: Zhu, Lianghui, et al.
Published: (2024)
TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation
by: Cao, Yiyang, et al.
Published: (2026)
TransLight: Image-Guided Customized Lighting Control with Generative Decoupling
by: Li, Zongming, et al.
Published: (2025)
Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
by: Jiang, Haoyi, et al.
Published: (2026)
Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects
by: Zhang, Shuai, et al.
Published: (2024)
Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting
by: Zhao, Zhengqi, et al.
Published: (2024)
GaitGS: Temporal Feature Learning in Granularity and Span Dimension for Gait Recognition
by: Xiong, Haijun, et al.
Published: (2023)
DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis
by: Zhu, Yueting, et al.
Published: (2025)
STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting
by: Deng, Yunze, et al.
Published: (2025)
MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
by: Zou, Jialv, et al.
Published: (2024)
Boosting Latent Diffusion Models via Disentangled Representation Alignment
by: Page, John, et al.
Published: (2026)