| Main Authors: | He, Xuehai, Feng, Weixi, Zheng, Kaizhi, Lu, Yujie, Zhu, Wanrong, Li, Jiachen, Fan, Yue, Wang, Jianfeng, Li, Linjie, Yang, Zhengyuan, Lin, Kevin, Wang, William Yang, Wang, Lijuan, Wang, Xin Eric |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.08407 |
Similar Items
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
by: Zheng, Kaizhi, et al.
Published: (2024)
LiVOS: Light Video Object Segmentation with Gated Linear Matching
by: Liu, Qin, et al.
Published: (2024)
List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
by: Yan, An, et al.
Published: (2024)
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2023)
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
by: Liu, Fuxiao, et al.
Published: (2023)
Idea2Img: Iterative Self-Refinement with GPT-4V(ision) for Automatic Image Design and Generation
by: Yang, Zhengyuan, et al.
Published: (2023)
Bring Metric Functions into Diffusion Models
by: An, Jie, et al.
Published: (2024)
Entity6K: A Large Open-Domain Evaluation Dataset for Real-World Entity Recognition
by: Qiu, Jielin, et al.
Published: (2024)
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
by: Zhai, Yuanhao, et al.
Published: (2024)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
by: Hao, Yunzhuo, et al.
Published: (2025)
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
by: Wang, Alex Jinpeng, et al.
Published: (2025)
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
by: Wang, Alex Jinpeng, et al.
Published: (2024)
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
by: Zhai, Yuanhao, et al.
Published: (2024)
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
by: Zheng, Kaizhi, et al.
Published: (2023)
MM-Vet v2: A Challenging Benchmark to Evaluate Large Multimodal Models for Integrated Capabilities
by: Yu, Weihao, et al.
Published: (2024)
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
by: Liao, Jiaqi, et al.
Published: (2025)
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
by: Lin, Kevin Qinghong, et al.
Published: (2024)
Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation
by: Cho, Jaemin, et al.
Published: (2023)
Computer-Use Agents as Judges for Generative User Interface
by: Lin, Kevin Qinghong, et al.
Published: (2025)
GenXD: Generating Any 3D and 4D Scenes
by: Zhao, Yuyang, et al.
Published: (2024)
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
by: Hong, Yining, et al.
Published: (2024)
Self-Evolving 3D Scene Generation from a Single Image
by: Zheng, Kaizhi, et al.
Published: (2025)
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
by: Lin, Yan-Bo, et al.
Published: (2025)
Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning
by: Ni, Minheng, et al.
Published: (2025)
VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents
by: Wang, Kangrui, et al.
Published: (2025)
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
by: Wang, Xiyao, et al.
Published: (2024)
JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents
by: Zheng, Kaizhi, et al.
Published: (2022)
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
by: Lin, Yiqi, et al.
Published: (2025)
TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation
by: Feng, Weixi, et al.
Published: (2024)
MorphoSim: An Interactive, Controllable, and Editable Language-guided 4D World Simulator
by: He, Xuehai, et al.
Published: (2025)
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement
by: Wang, Xiyao, et al.
Published: (2025)
Reward Guided Latent Consistency Distillation
by: Li, Jiachen, et al.
Published: (2024)
DisCo: Disentangled Control for Realistic Human Dance Generation
by: Wang, Tan, et al.
Published: (2023)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
ExpStar: Towards Automatic Commentary Generation for Multi-discipline Scientific Experiments
by: Chen, Jiali, et al.
Published: (2025)
Glance: Accelerating Diffusion Models with 1 Sample
by: Dong, Zhuobai, et al.
Published: (2025)
V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models
by: Zheng, Xiangxi, et al.
Published: (2025)
Interfacing Foundation Models' Embeddings
by: Zou, Xueyan, et al.
Published: (2023)
Measurement of LLM's Philosophies of Human Nature
by: Ni, Minheng, et al.
Published: (2025)
TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
by: Mao, Dongxing, et al.
Published: (2026)