| Main Authors: | Jin, Peng, Takanobu, Ryuichi, Zhang, Wancai, Cao, Xiaochun, Yuan, Li |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2311.08046 |
Similar Items
SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding
by: Sheng, Yuan, et al.
Published: (2025)
UniTok: A Unified Tokenizer for Visual Generation and Understanding
by: Ma, Chuofan, et al.
Published: (2025)
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
by: Qin, Luozheng, et al.
Published: (2026)
UniVideo: Unified Understanding, Generation, and Editing for Videos
by: Wei, Cong, et al.
Published: (2025)
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
by: Wang, Peiyu, et al.
Published: (2025)
UniViTAR: Unified Vision Transformer with Native Resolution
by: Qiao, Limeng, et al.
Published: (2025)
UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation
by: Zhang, Chi, et al.
Published: (2025)
ViUniT: Visual Unit Tests for More Robust Visual Programming
by: Panagopoulou, Artemis, et al.
Published: (2024)
RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
by: Liu, Fanfan, et al.
Published: (2024)
UniVBench: Towards Unified Evaluation for Video Foundation Models
by: Wei, Jianhui, et al.
Published: (2026)
LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model
by: AI, Inclusion, et al.
Published: (2026)
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
by: Maaz, Muhammad, et al.
Published: (2023)
UniMesh: Unifying 3D Mesh Understanding and Generation
by: Huang, Peng, et al.
Published: (2026)
UniLumos: Fast and Unified Image and Video Relighting with Physics-Plausible Feedback
by: Liu, Ropeway, et al.
Published: (2025)
LinVT: Empower Your Image-level Large Language Model to Understand Videos
by: Gao, Lishuai, et al.
Published: (2024)
Uni-Sign: Toward Unified Sign Language Understanding at Scale
by: Li, Zecheng, et al.
Published: (2025)
UniGeo: Taming Video Diffusion for Unified Consistent Geometry Estimation
by: Sun, Yang-Tian, et al.
Published: (2025)
Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation
by: Ding, Hongyu, et al.
Published: (2026)
ViSpeak: Visual Instruction Feedback in Streaming Videos
by: Fu, Shenghao, et al.
Published: (2025)
Harmonizing Visual Representations for Unified Multimodal Understanding and Generation
by: Wu, Size, et al.
Published: (2025)
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)
ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
by: Zhang, Juntian, et al.
Published: (2025)
UniParser: Multi-Human Parsing with Unified Correlation Representation Learning
by: Chu, Jiaming, et al.
Published: (2023)
UniVid: The Open-Source Unified Video Model
by: Luo, Jiabin, et al.
Published: (2025)
ViKey: Enhancing Temporal Understanding in Videos via Visual Prompting
by: Lee, Yeonkyung, et al.
Published: (2026)
UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions
by: Zhang, Guozhen, et al.
Published: (2025)
CogVLM2: Visual Language Models for Image and Video Understanding
by: Hong, Wenyi, et al.
Published: (2024)
ViLLa: Video Reasoning Segmentation with Large Language Model
by: Zheng, Rongkun, et al.
Published: (2024)
UniCompress: Token Compression for Unified Vision-Language Understanding and Generation
by: Wang, Ziyao, et al.
Published: (2026)
UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation
by: Wang, Xiang, et al.
Published: (2024)
UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction
by: Cao, Jin, et al.
Published: (2025)
CAT: Enhancing Multimodal Large Language Model to Answer Questions in Dynamic Audio-Visual Scenarios
by: Ye, Qilang, et al.
Published: (2024)
UniCustom: Unified Visual Conditioning for Multi-Reference Image Generation
by: Xu, Yiyan, et al.
Published: (2026)
ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models
by: Zhou, Kaiwen, et al.
Published: (2023)
UniLight: A Unified Representation for Lighting
by: Zhang, Zitian, et al.
Published: (2025)
UniFormer: Unifying Convolution and Self-attention for Visual Recognition
by: Li, Kunchang, et al.
Published: (2022)
UniSVG: A Unified Dataset for Vector Graphic Understanding and Generation with Multimodal Large Language Models
by: Li, Jinke, et al.
Published: (2025)
UniNote: A Unified Embedding Model for Multimodal Representation and Ranking
by: Zhao, Jinghan, et al.
Published: (2026)
TBAC-UniImage: Unified Understanding and Generation by Ladder-Side Diffusion Tuning
by: Xu, Junzhe, et al.
Published: (2025)
UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception
by: Song, Xinyang, et al.
Published: (2025)