| Main Authors: | Sarch, Gabriel, Somani, Sahil, Kapoor, Raghav, Tarr, Michael J., Fragkiadaki, Katerina |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.19065 |
Similar Items
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
by: Sarch, Gabriel, et al.
Published: (2024)
Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)
Unifying 2D and 3D Vision-Language Understanding
by: Jain, Ayush, et al.
Published: (2025)
ODIN: A Single Model for 2D and 3D Segmentation
by: Jain, Ayush, et al.
Published: (2024)
Revealing the Inherent Instructability of Pre-Trained Language Models
by: An, Seokhyun, et al.
Published: (2024)
Reanimating Images using Neural Representations of Dynamic Stimuli
by: Yeung, Jacob, et al.
Published: (2024)
Unified Multimodal Discrete Diffusion
by: Swerdlow, Alexander, et al.
Published: (2025)
Scaling Instructable Agents Across Many Simulated Worlds
by: SIMA Team, et al.
Published: (2024)
DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos
by: Chu, Wen-Hsuan, et al.
Published: (2024)
Neurosymbolic AI for Enhancing Instructability in Generative AI
by: Sheth, Amit, et al.
Published: (2024)
SALMON: Self-Alignment with Instructable Reward Models
by: Sun, Zhiqing, et al.
Published: (2023)
Environmental Understanding Vision-Language Model for Embodied Agent
by: Bang, Jinsik, et al.
Published: (2026)
Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
by: Galliena, Tommaso, et al.
Published: (2026)
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)
3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
by: Ke, Tsung-Wei, et al.
Published: (2024)
Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
by: Kuang, Yuxuan, et al.
Published: (2026)
Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
by: Shibata, Yuto, et al.
Published: (2026)
RetroMotion: Retrocausal Motion Forecasting Models are Instructable
by: Wagner, Royden, et al.
Published: (2025)
Aligning Text-to-Image Diffusion Models with Reward Backpropagation
by: Prabhudesai, Mihir, et al.
Published: (2023)
TAPIP3D: Tracking Any Point in Persistent 3D Geometry
by: Zhang, Bowei, et al.
Published: (2025)
Ella: Embodied Social Agents with Lifelong Memory
by: Zhang, Hongxin, et al.
Published: (2025)
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
by: Ling, Yiran, et al.
Published: (2026)
Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
by: Liu, Ying, et al.
Published: (2026)
AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
by: Qian, Kangan, et al.
Published: (2025)
Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)
LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics
by: Glocker, Marc, et al.
Published: (2025)
Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
by: Xu, Huilin, et al.
Published: (2025)
EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
by: Cai, Xinyan, et al.
Published: (2025)
An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance
by: Yang, Hsuan-Kung, et al.
Published: (2025)
Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
by: Zhang, Zhizhen, et al.
Published: (2025)
Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation
by: Ding, Hongyu, et al.
Published: (2026)
Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
by: Huang, Saffron, et al.
Published: (2025)
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
by: Zawar, Rushikesh, et al.
Published: (2024)
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)
Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
by: Wan, Zhongwei, et al.
Published: (2022)
Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
by: Sohn, Tin Stribor, et al.
Published: (2025)
Video Diffusion Alignment via Reward Gradients
by: Prabhudesai, Mihir, et al.
Published: (2024)
Diffusion Beats Autoregressive in Data-Constrained Settings
by: Prabhudesai, Mihir, et al.
Published: (2025)