| Main Authors: | Hou, Zhi, Zhang, Tianyi, Xiong, Yuwen, Duan, Haonan, Pu, Hengjun, Tong, Ronglei, Zhao, Chengyang, Zhu, Xizhou, Qiao, Yu, Dai, Jifeng, Chen, Yuntao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.19757 |
Similar Items
Diffusion Transformer Policy
by: Hou, Zhi, et al.
Published: (2024)
Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
by: Zhang, Tianyi, et al.
Published: (2025)
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
by: Xiong, Yuwen, et al.
Published: (2024)
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
by: Luo, Gen, et al.
Published: (2025)
big.LITTLE Vision Transformer for Efficient Visual Recognition
by: Guo, He, et al.
Published: (2024)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)
CSU-PCAST: A Dual-Branch Transformer Framework for medium-range ensemble Precipitation Forecasting
by: Xiong, Tianyi, et al.
Published: (2025)
CoMemo: LVLMs Need Image Context with Image Memory
by: Liu, Shi, et al.
Published: (2025)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
by: Ge, Junqi, et al.
Published: (2024)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)
FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
by: Reuss, Moritz, et al.
Published: (2025)
Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
by: Yang, Ganlin, et al.
Published: (2025)
Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives
by: He, Chengyang, et al.
Published: (2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
by: Wang, Yi, et al.
Published: (2026)
LangBridge: Interpreting Image as a Combination of Language Embeddings
by: Liao, Jiaqi, et al.
Published: (2025)
Learning A Low-Level Vision Generalist via Visual Task Prompt
by: Chen, Xiangyu, et al.
Published: (2024)
Demystify Transformers & Convolutions in Modern Image Deep Networks
by: Hu, Xiaowei, et al.
Published: (2022)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
by: Chen, Xinyi, et al.
Published: (2025)
HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies
by: Du, Zhiying, et al.
Published: (2025)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)
Data and code used for paper entitled "The Bear Attack as a Warning: From Clouded Skies to Collapsing Ecosystems"
by: Xiao, Hengjun
Published: (2026)
Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
by: Tian, Changyao, et al.
Published: (2023)
A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks
by: Ji, Ronglei, et al.
Published: (2024)
Multi-Field De-interlacing using Deformable Convolution Residual Blocks and Self-Attention
by: Ji, Ronglei, et al.
Published: (2022)
DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
by: Wang, Hebaixu, et al.
Published: (2025)
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)
Turning Video Models into Generalist Robot Policies
by: Li, Sizhe Lester, et al.
Published: (2026)
ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
by: Dai, Yuntao, et al.
Published: (2025)
SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones
by: Adang, Maximilian, et al.
Published: (2025)
UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning
by: Ma, Oubo, et al.
Published: (2025)
What Matters in Building Vision-Language-Action Models for Generalist Robots
by: Li, Xinghang, et al.
Published: (2024)