| Main Authors: | Ren, Sucheng, Chen, Chen, Wang, Zhenbang, Song, Liangchen, Zhu, Xiangxin, Yuille, Alan, Chen, Liang-Chieh, Lu, Jiasen |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.04040 |
Similar Items
Autoregressive Video Generation beyond Next Frames Prediction
by: Ren, Sucheng, et al.
Published: (2025)
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
by: Ren, Sucheng, et al.
Published: (2025)
Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers
by: Ren, Sucheng, et al.
Published: (2025)
Frequency-Aware Flow Matching for High-Quality Image Generation
by: Ren, Sucheng, et al.
Published: (2026)
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
by: Ren, Sucheng, et al.
Published: (2024)
Rejuvenating image-GPT as Strong Visual Representation Learners
by: Ren, Sucheng, et al.
Published: (2023)
ViMix-14M: A Curated Multi-Source Video-Text Dataset with Long-Form, High-Quality Captions and Crawl-Free Access
by: Yang, Timing, et al.
Published: (2025)
ARVideo: Autoregressive Pretraining for Self-Supervised Video Representation Learning
by: Ren, Sucheng, et al.
Published: (2024)
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
by: Ren, Sucheng, et al.
Published: (2024)
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
by: Chen, Jieneng, et al.
Published: (2024)
SPFormer: Enhancing Vision Transformer with Superpixel Representation
by: Mei, Jieru, et al.
Published: (2024)
ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling
by: Liu, Qihao, et al.
Published: (2025)
Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane
by: Liu, Haoyu, et al.
Published: (2026)
ViT-5: Vision Transformers for The Mid-2020s
by: Wang, Feng, et al.
Published: (2026)
Efficient Large Multi-modal Models via Visual Context Compression
by: Chen, Jieneng, et al.
Published: (2024)
AToken: A Unified Tokenizer for Vision
by: Lu, Jiasen, et al.
Published: (2025)
Medical Vision Generalist: Unifying Medical Imaging Tasks in Context
by: Ren, Sucheng, et al.
Published: (2024)
Mamba-R: Vision Mamba ALSO Needs Registers
by: Wang, Feng, et al.
Published: (2024)
ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning
by: Wang, Yuxuan, et al.
Published: (2024)
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
by: Guan, Yaohan, et al.
Published: (2026)
WorldEdit: Towards Open-World Image Editing with a Knowledge-Informed Benchmark
by: Lin, Wang, et al.
Published: (2026)
CAR-Flow: Condition-Aware Reparameterization Aligns Source and Target for Better Flow Matching
by: Chen, Chen, et al.
Published: (2025)
Generative World Explorer
by: Lu, Taiming, et al.
Published: (2024)
HResFormer: Hybrid Residual Transformer for Volumetric Medical Image Segmentation
by: Ren, Sucheng, et al.
Published: (2024)
Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models
by: Wang, Xingrui, et al.
Published: (2025)
Large-Scale Label Quality Assessment for Medical Segmentation via a Vision-Language Judge and Synthetic Data
by: Chen, Yixiong, et al.
Published: (2026)
A Simple Video Segmenter by Tracking Objects Along Axial Trajectories
by: He, Ju, et al.
Published: (2023)
Adventurer: Optimizing Vision Mamba Architecture Designs for Efficiency
by: Wang, Feng, et al.
Published: (2024)
Quality Sentinel: Estimating Label Quality and Errors in Medical Segmentation Datasets
by: Chen, Yixiong, et al.
Published: (2024)
Play to Generalize: Learning to Reason Through Game Play
by: Xie, Yunfei, et al.
Published: (2025)
RNN as Linear Transformer: A Closer Investigation into Representational Potentials of Visual Mamba Models
by: Yang, Timing, et al.
Published: (2025)
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
by: Wang, Feng, et al.
Published: (2023)
Thinking with Spatial Code for Physical-World Video Reasoning
by: Chen, Jieneng, et al.
Published: (2026)
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
by: Qian, Yusu, et al.
Published: (2025)
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
by: Ma, Wufei, et al.
Published: (2025)
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
by: Zhang, Tiezheng, et al.
Published: (2025)
Randomized Autoregressive Visual Generation
by: Yu, Qihang, et al.
Published: (2024)
From Perception to Reasoning: Deep Thinking Empowers Multimodal Large Language Models
by: Zhu, Wenxin, et al.
Published: (2025)
Autoregressive Pretraining with Mamba in Vision
by: Ren, Sucheng, et al.
Published: (2024)