| Main authors: | Shi, Ruixiao; Feng, Fu; Xie, Yucheng; Yang, Xu; Wang, Jing; Geng, Xin |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2603.17895 |
Similar items
Self-Supervised Weight Templates for Scalable Vision Model Initialization
by: Xie, Yucheng, et al.
Published: (2026)
Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
by: Feng, Fu, et al.
Published: (2024)
FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot Learning
by: Shi, Ruixiao, et al.
Published: (2025)
Distribution-Conditional Generation: From Class Distribution to Creative Generation
by: Feng, Fu, et al.
Published: (2025)
KIND: Knowledge Integration and Diversion for Training Decomposable Models
by: Xie, Yucheng, et al.
Published: (2024)
DivControl: Knowledge Diversion for Controllable Image Generation
by: Xie, Yucheng, et al.
Published: (2025)
FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
by: Xie, Yucheng, et al.
Published: (2024)
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
by: Wang, Feng, et al.
Published: (2025)
An Image is Worth 32 Tokens for Reconstruction and Generation
by: Yu, Qihang, et al.
Published: (2024)
A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
by: Kerssies, Tommie, et al.
Published: (2026)
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)
Equivariant Image Modeling
by: Dong, Ruixiao, et al.
Published: (2025)
Tokenize Image as a Set
by: Geng, Zigang, et al.
Published: (2025)
Images are Worth Variable Length of Representations
by: Mao, Lingjun, et al.
Published: (2025)
Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
by: Liu, Shuo, et al.
Published: (2026)
Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
by: Wei, Dongxu, et al.
Published: (2026)
CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
by: Yang, Hongji, et al.
Published: (2026)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)
Metadata-Driven Federated Learning of Connectional Brain Templates in Non-IID Multi-Domain Scenarios
by: Chen, Geng, et al.
Published: (2024)
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
by: Urbanek, Jack, et al.
Published: (2023)
Spectral-Structured Diffusion for Single-Image Rain Removal
by: Xing, Yucheng, et al.
Published: (2026)
Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge
by: Chen, Ruiming, et al.
Published: (2025)
Vript: A Video Is Worth Thousands of Words
by: Yang, Dongjie, et al.
Published: (2024)
A Video Is Not Worth a Thousand Words
by: Pollard, Sam, et al.
Published: (2025)
An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
by: Yan, Xingguang, et al.
Published: (2024)
When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
by: Li, Tianqi, et al.
Published: (2026)
VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
by: Wang, Feng, et al.
Published: (2026)
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
by: Guo, Yiwei, et al.
Published: (2026)
An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control
by: Feng, Aosong, et al.
Published: (2024)
A LoRA is Worth a Thousand Pictures
by: Liu, Chenxi, et al.
Published: (2024)
RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
by: Liu, Isabella, et al.
Published: (2025)
Concept-Centric Token Interpretation for Vector-Quantized Generative Models
by: Yang, Tianze, et al.
Published: (2025)
Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
by: Nguyen, Phat, et al.
Published: (2026)
Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
by: Yang, Huzheng, et al.
Published: (2025)
FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
by: Cai, Kaitong, et al.
Published: (2025)
Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning
by: Lin, Yukang, et al.
Published: (2025)
MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation
by: Wang, Yucheng, et al.
Published: (2025)
Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
by: Xu, Zhou, et al.
Published: (2026)
Multimodal-Enhanced Objectness Learner for Corner Case Detection in Autonomous Driving
by: Xiao, Lixing, et al.
Published: (2024)
NormAUG: Normalization-guided Augmentation for Domain Generalization
by: Qi, Lei, et al.
Published: (2023)