| Main Authors: | Wu, Pingyu, Zhu, Kai, Liu, Yu, Tang, Longxiang, Yang, Jian, Peng, Yansong, Zhai, Wei, Cao, Yang, Zha, Zheng-Jun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.05289 |
Similar Items
Improved Video VAE for Latent Video Diffusion Model
by: Wu, Pingyu, et al.
Published: (2024)
FACM: Flow-Anchored Consistency Models
by: Peng, Yansong, et al.
Published: (2025)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
by: Yang, Jian, et al.
Published: (2024)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)
ViewPoint: Panoramic Video Generation with Pretrained Diffusion Models
by: Fang, Zixun, et al.
Published: (2025)
Exploiting Discriminative Codebook Prior for Autoregressive Image Generation
by: Tang, Longxiang, et al.
Published: (2025)
HERO: Human Reaction Generation from Videos
by: Yu, Chengjun, et al.
Published: (2025)
Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
by: Wang, Chenfeng, et al.
Published: (2026)
SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets
by: Yang, Yuhang, et al.
Published: (2025)
MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer
by: Cao, Jianjian, et al.
Published: (2024)
Anchoring Values in Temporal and Group Dimensions for Flow Matching Model Alignment
by: Shao, Yawen, et al.
Published: (2025)
LEMON: Learning 3D Human-Object Interaction Relation from 2D Images
by: Yang, Yuhang, et al.
Published: (2023)
End-to-End Spatial-Temporal Transformer for Real-time 4D HOI Reconstruction
by: Zhang, Haoyu, et al.
Published: (2026)
TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
by: Han, Guangyi, et al.
Published: (2025)
EMoTive: Event-guided Trajectory Modeling for 3D Motion Estimation
by: Wan, Zengyu, et al.
Published: (2025)
ViViD: Video Virtual Try-on using Diffusion Models
by: Fang, Zixun, et al.
Published: (2024)
EgoChoir: Capturing 3D Human-Object Interaction Regions from Egocentric Views
by: Yang, Yuhang, et al.
Published: (2024)
Event Stream Filtering via Probability Flux Estimation
by: Chen, Jinze, et al.
Published: (2025)
Visual-Geometric Collaborative Guidance for Affordance Learning
by: Luo, Hongchen, et al.
Published: (2024)
Leverage Task Context for Object Affordance Ranking
by: Huang, Haojie, et al.
Published: (2024)
Event-based Visual Deformation Measurement
by: Wu, Yuliang, et al.
Published: (2026)
VanGogh: A Unified Multimodal Diffusion-based Framework for Video Colorization
by: Fang, Zixun, et al.
Published: (2025)
EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning
by: Yu, Chengjun, et al.
Published: (2026)
Event-based Asynchronous HDR Imaging by Temporal Incident Light Modulation
by: Wu, Yuliang, et al.
Published: (2024)
GREAT: Geometry-Intention Collaborative Inference for Open-Vocabulary 3D Object Affordance Grounding
by: Shao, Yawen, et al.
Published: (2024)
Towards Better De-raining Generalization via Rainy Characteristics Memorization and Replay
by: Wang, Kunyu, et al.
Published: (2025)
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning
by: Lu, Fan, et al.
Published: (2024)
HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
by: Guo, Yansong, et al.
Published: (2026)
TokenAR: Multiple Subject Generation via Autoregressive Token-level enhancement
by: Sun, Haiyue, et al.
Published: (2025)
Does Your Vision-Language Model Get Lost in the Long Video Sampling Dilemma?
by: Qu, Tianyuan, et al.
Published: (2025)
HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models
by: Zhou, Ziqin, et al.
Published: (2025)
MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking
by: Han, Han, et al.
Published: (2024)
GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images
by: Wang, Chengfeng, et al.
Published: (2025)
$\text{S}^{3}$Mamba: Arbitrary-Scale Super-Resolution via Scaleable State Space Model
by: Xia, Peizhe, et al.
Published: (2024)
Grounding 3D Scene Affordance From Egocentric Interactions
by: Liu, Cuiyu, et al.
Published: (2024)
NextStep-1: Toward Autoregressive Image Generation with Continuous Tokens at Scale
by: NextStep Team, et al.
Published: (2025)
FC3DNet: A Fully Connected Encoder-Decoder for Efficient Demoiréing
by: Du, Zhibo, et al.
Published: (2024)
Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models
by: Ma, Xu, et al.
Published: (2025)
Rethinking Discrete Tokens: Treating Them as Conditions for Continuous Autoregressive Image Synthesis
by: Zheng, Peng, et al.
Published: (2025)
Unbiased Gradient Estimation for Event Binning via Functional Backpropagation
by: Chen, Jinze, et al.
Published: (2026)