:: Library Catalog

Bibliographic Details
Main Authors:	Guo, Jiahao; Du, Sinan; Yao, Jingfeng; Liu, Wenyu; Li, Bo; Cao, Haoxiang; Gai, Kun; Yuan, Chun; Wu, Kai; Wang, Xinggang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.23469

Similar Items

VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction
by: Du, Sinan, et al.
Published: (2025)

Matte Anything: Interactive Natural Image Matting with Segment Anything Models
by: Yao, Jingfeng, et al.
Published: (2023)

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification
by: Yao, Jingfeng, et al.
Published: (2024)

Towards Scalable Pre-training of Visual Tokenizers for Generation
by: Yao, Jingfeng, et al.
Published: (2025)

EVA-X: A Foundation Model for General Chest X-ray Analysis with Self-supervised Learning
by: Yao, Jingfeng, et al.
Published: (2024)

Turbo-VAED: Fast and Stable Transfer of Video-VAEs to Mobile Devices
by: Zou, Ya, et al.
Published: (2025)

ViTGaze: Gaze Following with Interaction Features in Vision Transformers
by: Song, Yuehao, et al.
Published: (2024)

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models
by: Zeng, Lunbin, et al.
Published: (2025)

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
by: Yao, Jingfeng, et al.
Published: (2025)

XS-VID: An Extremely Small Video Object Detection Dataset
by: Guo, Jiahao, et al.
Published: (2024)

LKCell: Efficient Cell Nuclei Instance Segmentation with Large Convolution Kernels
by: Cui, Ziwei, et al.
Published: (2024)

Visual Text Generation in the Wild
by: Zhu, Yuanzhi, et al.
Published: (2024)

DriveLaW: Unifying Planning and Video Generation in a Latent Driving World
by: Xia, Tianze, et al.
Published: (2025)

MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices
by: Zhang, Shuai, et al.
Published: (2025)

PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification
by: Hu, Bin, et al.
Published: (2024)

ChartBench: A Benchmark for Complex Visual Reasoning in Charts
by: Xu, Zhengzhuo, et al.
Published: (2023)

MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling
by: Li, Yingyue, et al.
Published: (2025)

SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild
by: Liu, Jiawei, et al.
Published: (2025)

4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
by: Wu, Xianfeng, et al.
Published: (2025)

Gait Recognition via Collaborating Discriminative and Generative Diffusion Models
by: Xiong, Haijun, et al.
Published: (2025)

AlphaDrive: Unleashing the Power of VLMs in Autonomous Driving via Reinforcement Learning and Reasoning
by: Jiang, Bo, et al.
Published: (2025)

OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models
by: Zou, Jialv, et al.
Published: (2025)

2D Gaussians Meet Visual Tokenizer
by: Shi, Yiang, et al.
Published: (2025)

MolSight: Optical Chemical Structure Recognition with SMILES Pretraining, Multi-Granularity Learning and Reinforcement Learning
by: Zhang, Wenrui, et al.
Published: (2025)

GaraMoSt: Parallel Multi-Granularity Motion and Structural Modeling for Efficient Multi-Frame Interpolation in DSA Images
by: Xu, Ziyang, et al.
Published: (2024)

Causality-inspired Discriminative Feature Learning in Triple Domains for Gait Recognition
by: Xiong, Haijun, et al.
Published: (2024)

Cross-Layer Attentive Feature Upsampling for Low-latency Semantic Segmentation
by: Cheng, Tianheng, et al.
Published: (2026)

Mask-Adapter: The Devil is in the Masks for Open-Vocabulary Segmentation
by: Li, Yongkang, et al.
Published: (2024)

Fast High Dynamic Range Radiance Fields for Dynamic Scenes
by: Wu, Guanjun, et al.
Published: (2024)

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
by: Zhu, Lianghui, et al.
Published: (2024)

TriC-Motion: Tri-Domain Causal Modeling Grounded Text-to-Motion Generation
by: Cao, Yiyang, et al.
Published: (2026)

TransLight: Image-Guided Customized Lighting Control with Generative Decoupling
by: Li, Zongming, et al.
Published: (2025)

Spa3R: Predictive Spatial Field Modeling for 3D Visual Reasoning
by: Jiang, Haoyi, et al.
Published: (2026)

Dynamic 2D Gaussians: Geometrically Accurate Radiance Fields for Dynamic Objects
by: Zhang, Shuai, et al.
Published: (2024)

Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting
by: Zhao, Zhengqi, et al.
Published: (2024)

GaitGS: Temporal Feature Learning in Granularity and Span Dimension for Gait Recognition
by: Xiong, Haijun, et al.
Published: (2023)

DeltaMIL: Gated Memory Integration for Efficient and Discriminative Whole Slide Image Analysis
by: Zhu, Yueting, et al.
Published: (2025)

STP4D: Spatio-Temporal-Prompt Consistent Modeling for Text-to-4D Gaussian Splatting
by: Deng, Yunze, et al.
Published: (2025)

MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
by: Zou, Jialv, et al.
Published: (2024)

Boosting Latent Diffusion Models via Disentangled Representation Alignment
by: Page, John, et al.
Published: (2026)