Library Catalog

Bibliographic Details
Main Authors:	Hou, Zhi; Zhang, Tianyi; Xiong, Yuwen; Duan, Haonan; Pu, Hengjun; Tong, Ronglei; Zhao, Chengyang; Zhu, Xizhou; Qiao, Yu; Dai, Jifeng; Chen, Yuntao
Format:	Preprint
Published:	2025
Subjects:	Robotics; Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2503.19757

Similar Items

Diffusion Transformer Policy
by: Hou, Zhi, et al.
Published: (2024)

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
by: Zhang, Tianyi, et al.
Published: (2025)

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
by: Xiong, Yuwen, et al.
Published: (2024)

Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
by: Luo, Gen, et al.
Published: (2025)

big.LITTLE Vision Transformer for Efficient Visual Recognition
by: Guo, He, et al.
Published: (2024)

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
by: Wu, Jiannan, et al.
Published: (2024)

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
by: Tian, Changyao, et al.
Published: (2024)

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
by: Duan, Yuchen, et al.
Published: (2024)

CSU-PCAST: A Dual-Branch Transformer Framework for medium-range ensemble Precipitation Forecasting
by: Xiong, Tianyi, et al.
Published: (2025)

CoMemo: LVLMs Need Image Context with Image Memory
by: Liu, Shi, et al.
Published: (2025)

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
by: Chen, Zhe, et al.
Published: (2023)

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
by: Ge, Junqi, et al.
Published: (2024)

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
by: Tao, Chenxin, et al.
Published: (2024)

FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies
by: Reuss, Moritz, et al.
Published: (2025)

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning
by: Yang, Ganlin, et al.
Published: (2025)

Demystifying Diffusion Policies: Action Memorization and Simple Lookup Table Alternatives
by: He, Chengyang, et al.
Published: (2025)

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
by: Luo, Gen, et al.
Published: (2024)

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies
by: Wang, Yi, et al.
Published: (2026)

LangBridge: Interpreting Image as a Combination of Language Embeddings
by: Liao, Jiaqi, et al.
Published: (2025)

Learning A Low-Level Vision Generalist via Visual Task Prompt
by: Chen, Xiangyu, et al.
Published: (2024)

Demystify Transformers & Convolutions in Modern Image Deep Networks
by: Hu, Xiaowei, et al.
Published: (2022)

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
by: Meng, Fanqing, et al.
Published: (2024)

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy
by: Chen, Xinyi, et al.
Published: (2025)

HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist Vision-Language-Action Policies
by: Du, Zhiying, et al.
Published: (2025)

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
by: Yang, Chenyu, et al.
Published: (2024)

Data and code used for paper entitled "The Bear Attack as a Warning: From Clouded Skies to Collapsing Ecosystems"
by: Xiao, Hengjun
Published: (2026)

Parameter-Inverted Image Pyramid Networks
by: Zhu, Xizhou, et al.
Published: (2024)

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
by: Yang, Chenyu, et al.
Published: (2024)

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
by: Tian, Changyao, et al.
Published: (2023)

A New Multi-Picture Architecture for Learned Video Deinterlacing and Demosaicing with Parallel Deformable Convolution and Self-Attention Blocks
by: Ji, Ronglei, et al.
Published: (2024)

Multi-Field De-interlacing using Deformable Convolution Residual Blocks and Self-Attention
by: Ji, Ronglei, et al.
Published: (2022)

DGSolver: Diffusion Generalist Solver with Universal Posterior Sampling for Image Restoration
by: Wang, Hebaixu, et al.
Published: (2025)

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity
by: Liu, Yangzhou, et al.
Published: (2024)

Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
by: Li, Hao, et al.
Published: (2023)

Turning Video Models into Generalist Robot Policies
by: Li, Sizhe Lester, et al.
Published: (2026)

ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
by: Dai, Yuntao, et al.
Published: (2025)

SINGER: An Onboard Generalist Vision-Language Navigation Policy for Drones
by: Adang, Maximilian, et al.
Published: (2025)

UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning
by: Ma, Oubo, et al.
Published: (2025)

What Matters in Building Vision-Language-Action Models for Generalist Robots
by: Li, Xinghang, et al.
Published: (2024)