| Main Authors: | Wang, Guanqun, Wei, Xinyu, Liu, Jiaming, Zhang, Ray, Zhang, Yichi, Zhang, Kevin, Chong, Maurice, Zhang, Shanghang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.15768 |
Similar Items
Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer
by: Liu, Jiaming, et al.
Published: (2022)
A Self-Correcting Vision-Language-Action Model for Fast and Slow System Manipulation
by: Li, Chenxuan, et al.
Published: (2024)
dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought
by: Wen, Junjie, et al.
Published: (2025)
MoSA: Mixture of Sparse Adapters for Visual Efficient Tuning
by: Zhang, Qizhe, et al.
Published: (2023)
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine
by: Zhang, Renrui, et al.
Published: (2024)
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
by: Chen, Xinyan, et al.
Published: (2023)
RenderOcc: Vision-Centric 3D Occupancy Prediction with 2D Rendering Supervision
by: Pan, Mingjie, et al.
Published: (2023)
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation
by: Liu, Jiaming, et al.
Published: (2024)
NTO3D: Neural Target Object 3D Reconstruction with Segment Anything
by: Wei, Xiaobao, et al.
Published: (2023)
SCBench: A Sports Commentary Benchmark for Video LLMs
by: Ge, Kuangzhi, et al.
Published: (2024)
FreeKD: Knowledge Distillation via Semantic Frequency Prompt
by: Zhang, Yuan, et al.
Published: (2023)
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning
by: Huang, Qihan, et al.
Published: (2025)
DIVA: Harnessing the Representation Divergence in Unified Multimodal Models for Mutual Reinforcement
by: Lu, Renjie, et al.
Published: (2026)
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
by: Li, Xiaotong, et al.
Published: (2024)
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)
Cantor: Inspiring Multimodal Chain-of-Thought of MLLM
by: Gao, Timin, et al.
Published: (2024)
A Vanilla Multi-Task Framework for Dense Visual Prediction Solution to 1st VCL Challenge -- Multi-Task Robustness Track
by: Chen, Zehui, et al.
Published: (2024)
Abstractive Visual Understanding of Multi-modal Structured Knowledge: A New Perspective for MLLM Evaluation
by: Zhang, Yichi, et al.
Published: (2025)
Follow-Your-Instruction: A Comprehensive MLLM Agent for World Data Synthesis
by: Feng, Kunyu, et al.
Published: (2025)
MLLM-Enhanced Face Forgery Detection: A Vision-Language Fusion Solution
by: Peng, Siran, et al.
Published: (2025)
Can Large Vision-Language Models Understand Multimodal Sarcasm?
by: Wang, Xinyu, et al.
Published: (2025)
ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation
by: Liu, Jiaming, et al.
Published: (2023)
A Conditional Generative Framework for Synthetic Data Augmentation in Segmenting Thin and Elongated Structures in Biological Images
by: Liu, Yi, et al.
Published: (2025)
OmniIndoor3D: Comprehensive Indoor 3D Reconstruction
by: Wei, Xiaobao, et al.
Published: (2025)
LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model
by: Luo, Yulin, et al.
Published: (2024)
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation
by: Liu, Jiaming, et al.
Published: (2023)
CAD-MLLM: Unifying Multimodality-Conditioned CAD Generation With MLLM
by: Xu, Jingwei, et al.
Published: (2024)
M$^{2}$Chat: Empowering VLM for Multimodal LLM Interleaved Text-Image Generation
by: Chi, Xiaowei, et al.
Published: (2023)
SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics
by: Liu, Mengzhen, et al.
Published: (2026)
MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
by: Zhang, Yin, et al.
Published: (2026)
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought
by: Zhang, Shuyi, et al.
Published: (2025)
BEVUDA++: Geometric-aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection
by: Zhang, Rongyu, et al.
Published: (2025)
Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
by: Luo, Yulin, et al.
Published: (2025)
MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection
by: Wang, Shuyu, et al.
Published: (2025)
GAP-MLLM: Geometry-Aligned Pre-training for Activating 3D Spatial Perception in Multimodal Large Language Models
by: Zhang, Jiaxin, et al.
Published: (2026)
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
by: Tan, Huajie, et al.
Published: (2025)
BEVUDA: Multi-geometric Space Alignments for Domain Adaptive BEV 3D Object Detection
by: Liu, Jiaming, et al.
Published: (2022)
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance
by: Zhao, Haozhe, et al.
Published: (2024)
RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video
by: Xun, Shuhang, et al.
Published: (2025)
UV-M3TL: A Unified and Versatile Multimodal Multi-Task Learning Framework for Assistive Driving Perception
by: Liu, Wenzhuo, et al.
Published: (2026)