| Main Authors: | Jin, Xin; Li, Siyuan; Jian, Siyong; Yu, Kai; Wang, Huan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.23479 |
Similar Items
SSD: Spatial-Semantic Head Decoupling for Efficient Autoregressive Image Generation
by: Jian, Siyong, et al.
Published: (2025)
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution
by: Jian, Siyong, et al.
Published: (2026)
MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging
by: Zhang, Luyuan, et al.
Published: (2026)
Mix-Modality Person Re-Identification: A New and Practical Paradigm
by: Liu, Wei, et al.
Published: (2024)
MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization
by: Li, Siyuan, et al.
Published: (2025)
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
by: Jia, Zi-Yi, et al.
Published: (2026)
Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation
by: Li, Xinyao, et al.
Published: (2024)
Unified Sequence-to-Sequence Learning for Single- and Multi-Modal Visual Object Tracking
by: Chen, Xin, et al.
Published: (2023)
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
by: Wang, Wei, et al.
Published: (2024)
MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding
by: Wang, Jiaze, et al.
Published: (2024)
EarlyTom: Early Token Compression Completes Fast Video Understanding
by: Wang, Hesong, et al.
Published: (2026)
AV-Unified: A Unified Framework for Audio-visual Scene Understanding
by: Li, Guangyao, et al.
Published: (2026)
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
by: Li, Teng, et al.
Published: (2025)
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
by: Zhang, Yiwei, et al.
Published: (2026)
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)
Adversarial Robustness for Unified Multi-Modal Encoders via Efficient Calibration
by: Liao, Chih-Ting, et al.
Published: (2025)
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning
by: Zhang, Liang, et al.
Published: (2024)
Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
by: Huang, Yuqing, et al.
Published: (2026)
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
by: Dai, Yifan, et al.
Published: (2026)
TerraGen: A Unified Multi-Task Layout Generation Framework for Remote Sensing Data Augmentation
by: Tang, Datao, et al.
Published: (2025)
Unifying Heterogeneous Multi-Modal Remote Sensing Detection Via Language-Pivoted Pretraining
by: Li, Yuxuan, et al.
Published: (2026)
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies
by: Song, Wei, et al.
Published: (2025)
Unified Autoregressive Visual Generation and Understanding with Continuous Tokens
by: Fan, Lijie, et al.
Published: (2025)
VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
by: Chen, Jian, et al.
Published: (2025)
CloudEye: A New Paradigm of Video Analysis System for Mobile Visual Scenarios
by: Cui, Huan, et al.
Published: (2024)
GaussianDWM: 3D Gaussian Driving World Model for Unified Scene Understanding and Multi-Modal Generation
by: Deng, Tianchen, et al.
Published: (2025)
Dataset Augmentation by Mixing Visual Concepts
by: Rahat, Abdullah Al, et al.
Published: (2024)
UniTok: A Unified Tokenizer for Visual Generation and Understanding
by: Ma, Chuofan, et al.
Published: (2025)
Training-Free Model Merging for Multi-target Domain Adaptation
by: Li, Wenyi, et al.
Published: (2024)
The Shape of Sight: A Homological Framework for Unifying Visual Perception
by: Li, Xin
Published: (2018)
Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
by: Li, Yuanbo, et al.
Published: (2026)
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
by: Jin, Jiandong, et al.
Published: (2023)
Towards More Unified In-context Visual Understanding
by: Sheng, Dianmo, et al.
Published: (2023)
UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)
USegMix: Unsupervised Segment Mix for Efficient Data Augmentation in Pathology Images
by: Wang, Jiamu, et al.
Published: (2025)
DynImg: Key Frames with Visual Prompts are Good Representation for Multi-Modal Video Understanding
by: Bao, Xiaoyi, et al.
Published: (2025)
UniMoCo: Unified Modality Completion for Robust Multi-Modal Embeddings
by: Qin, Jiajun, et al.
Published: (2025)
StreamingAssistant: Efficient Visual Token Pruning for Accelerating Online Video Understanding
by: Jin, Xinqi, et al.
Published: (2025)
VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis
by: Yu, Jian, et al.
Published: (2026)
MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
by: Zhan, Chenlu, et al.
Published: (2024)