| Main Authors: | Lu, Jianglin, Wang, Hailing, Xu, Yi, Wang, Yizhou, Yang, Kuo, Fu, Yun |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2510.05184 |
Similar Items
The Indra Representation Hypothesis for Multimodal Alignment
by: Lu, Jianglin, et al.
Published: (2026)
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
by: Dong, Qihua, et al.
Published: (2026)
Scale-Free Graph-Language Models
by: Lu, Jianglin, et al.
Published: (2025)
Embodied Representation Alignment with Mirror Neurons
by: Zhu, Wentao, et al.
Published: (2025)
A Survey of Resource-efficient LLM and Multimodal Foundation Models
by: Xu, Mengwei, et al.
Published: (2024)
Unveiling the Unseen: A Comprehensive Survey on Explainable Anomaly Detection in Images and Videos
by: Wang, Yizhou, et al.
Published: (2023)
A Theoretical Survey on Foundation Models
by: Fu, Shi, et al.
Published: (2024)
Don't Judge by the Look: Towards Motion Coherent Video Representation
by: Zhang, Yitian, et al.
Published: (2024)
Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
by: Chen, Zhawnen, et al.
Published: (2024)
Learning to Rotate: Temporal and Semantic Rotary Encoding for Sequential Modeling
by: Cheng, Hailing, et al.
Published: (2026)
AI Alignment: A Comprehensive Survey
by: Ji, Jiaming, et al.
Published: (2023)
D-CoDe: Scaling Image-Pretrained VLMs to Video via Dynamic Compression and Question Decomposition
by: Huang, Yiyang, et al.
Published: (2025)
Trajectory Prediction Meets Large Language Models: A Survey
by: Xu, Yi, et al.
Published: (2025)
Distorted or Fabricated? A Survey on Hallucination in Video LLMs
by: Huang, Yiyang, et al.
Published: (2026)
IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
by: Fu, Deqing, et al.
Published: (2024)
RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models
by: Wu, Hao, et al.
Published: (2026)
Human-Centric Foundation Models: Perception, Generation and Agentic Modeling
by: Tang, Shixiang, et al.
Published: (2025)
BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning
by: Wu, Yizhou, et al.
Published: (2026)
MIO: A Foundation Model on Multimodal Tokens
by: Wang, Zekun, et al.
Published: (2024)
Multimodal Representation Alignment for Cross-modal Information Retrieval
by: Xu, Fan, et al.
Published: (2025)
Beyond Interleaving: Causal Attention Reformulations for Generative Recommender Systems
by: Cheng, Hailing
Published: (2026)
A Survey of Multimodal Mathematical Reasoning: From Perception, Alignment to Reasoning
by: Yang, Tianyu, et al.
Published: (2026)
Session-Level Spoken Language Assessment with a Multimodal Foundation Model via Multi-Target Learning
by: Lin, Hong-Yun, et al.
Published: (2025)
Understanding the Emergence of Multimodal Representation Alignment
by: Tjandrasuwita, Megan, et al.
Published: (2025)
A Survey on Benchmarks of Multimodal Large Language Models
by: Li, Jian, et al.
Published: (2024)
When Tabular Foundation Models Meet Strategic Tabular Data: A Prior Alignment Approach
by: Lv, Xinpeng, et al.
Published: (2026)
Reinforced MLLM: A Survey on RL-Based Reasoning in Multimodal Large Language Models
by: Zhou, Guanghao, et al.
Published: (2025)
Foundations and Recent Trends in Multimodal Mobile Agents: A Survey
by: Wu, Biao, et al.
Published: (2024)
A Survey of Scientific Large Language Models: From Data Foundations to Agent Frontiers
by: Hu, Ming, et al.
Published: (2025)
From Efficient Multimodal Models to World Models: A Survey
by: Mai, Xinji, et al.
Published: (2024)
Accessing Vision Foundation Models via ImageNet-1K
by: Zhang, Yitian, et al.
Published: (2024)
Endogenous Reprompting: Self-Evolving Cognitive Alignment for Unified Multimodal Models
by: Tang, Zhenchen, et al.
Published: (2026)
Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment
by: Yi, Lingjie, et al.
Published: (2025)
Deploying Foundation Model Powered Agent Services: A Survey
by: Xu, Wenchao, et al.
Published: (2024)
Synergizing Foundation Models and Federated Learning: A Survey
by: Li, Shenghui, et al.
Published: (2024)
EGRA: Toward Enhanced Behavior Graphs and Representation Alignment for Multimodal Recommendation
by: Zhang, Xiaoxiong, et al.
Published: (2025)
Revisiting Model Stitching In the Foundation Model Era
by: Mai, Zheda, et al.
Published: (2026)
Boosting Large Language Models with Mask Fine-Tuning
by: Zhang, Mingyuan, et al.
Published: (2025)
ECG-MoE: Mixture-of-Expert Electrocardiogram Foundation Model
by: Xu, Yuhao, et al.
Published: (2026)
From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models
by: Zhou, Chenyue, et al.
Published: (2025)