| Main Authors: | Xiao, Yicheng, Song, Lin, Huang, Shaoli, Wang, Jiangshan, Song, Siyu, Ge, Yixiao, Li, Xiu, Shan, Ying |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.02395 |
Similar Items
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding
by: Yang, Rui, et al.
Published: (2025)
HaploOmni: Unified Single Transformer for Multimodal Video Understanding and Generation
by: Xiao, Yicheng, et al.
Published: (2025)
LoRA-Gen: Specializing Large Language Model via Online LoRA Generation
by: Xiao, Yicheng, et al.
Published: (2025)
Memory augment is All You Need for image restoration
by: Zhang, Xiao Feng, et al.
Published: (2023)
Realistic Human Motion Generation with Cross-Diffusion Models
by: Ren, Zeping, et al.
Published: (2023)
COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing
by: Wang, Jiangshan, et al.
Published: (2024)
Aligning Latent Spaces with Flow Priors
by: Li, Yizhuo, et al.
Published: (2025)
YOLO-World: Real-Time Open-Vocabulary Object Detection
by: Cheng, Tianheng, et al.
Published: (2024)
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model
by: Cheng, Cheng, et al.
Published: (2023)
MaTe: Images Are All You Need for Material Transfer via Diffusion Transformer
by: Huang, Nisha, et al.
Published: (2026)
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction
by: Cheng, Junhao, et al.
Published: (2025)
Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
by: Cao, Pu, et al.
Published: (2023)
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
by: Ge, Yuying, et al.
Published: (2024)
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
by: Li, Yizhuo, et al.
Published: (2024)
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
by: Ding, Xiaohan, et al.
Published: (2023)
From Prediction to Perfection: Introducing Refinement to Autoregressive Image Generation
by: Cheng, Cheng, et al.
Published: (2025)
VL-Mamba: Exploring State Space Models for Multimodal Learning
by: Qiao, Yanyuan, et al.
Published: (2024)
ST-LLM: Large Language Models Are Effective Temporal Learners
by: Liu, Ruyang, et al.
Published: (2024)
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers
by: Ma, Shijie, et al.
Published: (2025)
SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing
by: Ge, Yuying, et al.
Published: (2024)
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
by: Li, Bohao, et al.
Published: (2024)
Is Hyperbolic Space All You Need for Medical Anomaly Detection?
by: Gonzalez-Jimenez, Alvaro, et al.
Published: (2025)
Supervised Fine-tuning in turn Improves Visual Foundation Models
by: Jiang, Xiaohu, et al.
Published: (2024)
[MASK] is All You Need
by: Hu, Vincent Tao, et al.
Published: (2024)
Is Intermediate Fusion All You Need for UAV-based Collaborative Perception?
by: Hao, Jiuwu, et al.
Published: (2025)
Programmable Motion Generation for Open-Set Motion Control Tasks
by: Liu, Hanchao, et al.
Published: (2024)
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
by: Qiu, Lu, et al.
Published: (2024)
Distraction is All You Need for Multimodal Large Language Model Jailbreaking
by: Yang, Zuopeng, et al.
Published: (2025)
Boosting Domain Incremental Learning: Selecting the Optimal Parameters is All You Need
by: Wang, Qiang, et al.
Published: (2025)
Pairwise Comparisons Are All You Need
by: Chahine, Nicolas, et al.
Published: (2024)
Positive Label Is All You Need for Multi-Label Classification
by: Yuan, Zhixiang, et al.
Published: (2023)
CORDIC Is All You Need
by: Kokane, Omkar, et al.
Published: (2025)
SEED-Story: Multimodal Long Story Generation with Large Language Model
by: Yang, Shuai, et al.
Published: (2024)
Zoom and Shift are All You Need
by: Qin, Jiahao
Published: (2024)
Large Language Models for Lossless Image Compression: Next-Pixel Prediction in Language Space is All You Need
by: Chen, Kecheng, et al.
Published: (2024)
One Snapshot is All You Need: A Generalized Method for mmWave Signal Generation
by: Huang, Teng, et al.
Published: (2025)
ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
by: Pu, Junfu, et al.
Published: (2025)
AnimeShooter: A Multi-Shot Animation Dataset for Reference-Guided Video Generation
by: Qiu, Lu, et al.
Published: (2025)
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
by: Cheng, Junhao, et al.
Published: (2025)