Library Catalog

Bibliographic Details
Main Authors:	Shi, Ruixiao; Feng, Fu; Xie, Yucheng; Yang, Xu; Wang, Jing; Geng, Xin
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.17895

Similar Items

Self-Supervised Weight Templates for Scalable Vision Model Initialization
by: Xie, Yucheng, et al.
Published: (2026)

Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation
by: Feng, Fu, et al.
Published: (2024)

FAD: Frequency Adaptation and Diversion for Cross-domain Few-shot Learning
by: Shi, Ruixiao, et al.
Published: (2025)

Distribution-Conditional Generation: From Class Distribution to Creative Generation
by: Feng, Fu, et al.
Published: (2025)

KIND: Knowledge Integration and Diversion for Training Decomposable Models
by: Xie, Yucheng, et al.
Published: (2024)

DivControl: Knowledge Diversion for Controllable Image Generation
by: Xie, Yucheng, et al.
Published: (2025)

FINE: Factorizing Knowledge for Initialization of Variable-sized Diffusion Models
by: Xie, Yucheng, et al.
Published: (2024)

Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
by: Wang, Feng, et al.
Published: (2025)

An Image is Worth 32 Tokens for Reconstruction and Generation
by: Yu, Qihang, et al.
Published: (2024)

A Frame is Worth One Token: Efficient Generative World Modeling with Delta Tokens
by: Kerssies, Tommie, et al.
Published: (2026)

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
by: Hu, Lianyu, et al.
Published: (2024)

Equivariant Image Modeling
by: Dong, Ruixiao, et al.
Published: (2025)

Tokenize Image as a Set
by: Geng, Zigang, et al.
Published: (2025)

Images are Worth Variable Length of Representations
by: Mao, Lingjun, et al.
Published: (2025)

Information Coordination as a Bridge: A Neuro-Symbolic Architecture for Reliable Autonomous Driving Scene Understanding
by: Liu, Shuo, et al.
Published: (2026)

Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale
by: Wei, Dongxu, et al.
Published: (2026)

CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
by: Yang, Hongji, et al.
Published: (2026)

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)

Metadata-Driven Federated Learning of Connectional Brain Templates in Non-IID Multi-Domain Scenarios
by: Chen, Geng, et al.
Published: (2024)

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
by: Urbanek, Jack, et al.
Published: (2023)

Spectral-Structured Diffusion for Single-Image Rain Removal
by: Xing, Yucheng, et al.
Published: (2026)

Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge
by: Chen, Ruiming, et al.
Published: (2025)

Vript: A Video Is Worth Thousands of Words
by: Yang, Dongjie, et al.
Published: (2024)

A Video Is Not Worth a Thousand Words
by: Pollard, Sam, et al.
Published: (2025)

An Object is Worth 64x64 Pixels: Generating 3D Object via Image Diffusion
by: Yan, Xingguang, et al.
Published: (2024)

When W4A4 Breaks Camouflaged Object Detection: Token-Group Dual-Constraint Activation Quantization
by: Li, Tianqi, et al.
Published: (2026)

VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
by: Wang, Feng, et al.
Published: (2026)

WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
by: Guo, Yiwei, et al.
Published: (2026)

An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control
by: Feng, Aosong, et al.
Published: (2024)

A LoRA is Worth a Thousand Pictures
by: Liu, Chenxi, et al.
Published: (2024)

RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets
by: Liu, Isabella, et al.
Published: (2025)

Concept-Centric Token Interpretation for Vector-Quantized Generative Models
by: Yang, Tianze, et al.
Published: (2025)

Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
by: Nguyen, Phat, et al.
Published: (2026)

Vibe Spaces for Creatively Connecting and Expressing Visual Concepts
by: Yang, Huzheng, et al.
Published: (2025)

FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models
by: Cai, Kaitong, et al.
Published: (2025)

Creative4U: MLLMs-based Advertising Creative Image Selector with Comparative Reasoning
by: Lin, Yukang, et al.
Published: (2025)

MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation
by: Wang, Yucheng, et al.
Published: (2025)

Spatio-Temporal Token Pruning for Efficient High-Resolution GUI Agents
by: Xu, Zhou, et al.
Published: (2026)

Multimodal-Enhanced Objectness Learner for Corner Case Detection in Autonomous Driving
by: Xiao, Lixing, et al.
Published: (2024)

NormAUG: Normalization-guided Augmentation for Domain Generalization
by: Qi, Lei, et al.
Published: (2023)