:: Library Catalog

Bibliographic Details
Main Authors:	Sarch, Gabriel; Somani, Sahil; Kapoor, Raghav; Tarr, Michael J.; Fragkiadaki, Katerina
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2404.19065

Similar Items

VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
by: Sarch, Gabriel, et al.
Published: (2024)

Grounded Reinforcement Learning for Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2025)

Unifying 2D and 3D Vision-Language Understanding
by: Jain, Ayush, et al.
Published: (2025)

ODIN: A Single Model for 2D and 3D Segmentation
by: Jain, Ayush, et al.
Published: (2024)

Revealing the Inherent Instructability of Pre-Trained Language Models
by: An, Seokhyun, et al.
Published: (2024)

Reanimating Images using Neural Representations of Dynamic Stimuli
by: Yeung, Jacob, et al.
Published: (2024)

Unified Multimodal Discrete Diffusion
by: Swerdlow, Alexander, et al.
Published: (2025)

Scaling Instructable Agents Across Many Simulated Worlds
by: SIMA Team, et al.
Published: (2024)

DreamScene4D: Dynamic Multi-Object Scene Generation from Monocular Videos
by: Chu, Wen-Hsuan, et al.
Published: (2024)

Neurosymbolic AI for Enhancing Instructability in Generative AI
by: Sheth, Amit, et al.
Published: (2024)

SALMON: Self-Alignment with Instructable Reward Models
by: Sun, Zhiqing, et al.
Published: (2023)

Environmental Understanding Vision-Language Model for Embodied Agent
by: Bang, Jinsik, et al.
Published: (2026)

Memory-Augmented Vision-Language Agents for Persistent and Semantically Consistent Object Captioning
by: Galliena, Tommaso, et al.
Published: (2026)

EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents
by: Yang, Rui, et al.
Published: (2025)

3D Diffuser Actor: Policy Diffusion with 3D Scene Representations
by: Ke, Tsung-Wei, et al.
Published: (2024)

Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation
by: Kuang, Yuxuan, et al.
Published: (2026)

Learning to Assist: Physics-Grounded Human-Human Control via Multi-Agent Reinforcement Learning
by: Shibata, Yuto, et al.
Published: (2026)

RetroMotion: Retrocausal Motion Forecasting Models are Instructable
by: Wagner, Royden, et al.
Published: (2025)

Aligning Text-to-Image Diffusion Models with Reward Backpropagation
by: Prabhudesai, Mihir, et al.
Published: (2023)

TAPIP3D: Tracking Any Point in Persistent 3D Geometry
by: Zhang, Bowei, et al.
Published: (2025)

Ella: Embodied Social Agents with Lifelong Memory
by: Zhang, Hongxin, et al.
Published: (2025)

Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
by: Ling, Yiran, et al.
Published: (2026)

Mema: Memory-Augmented Adapter for Enhanced Vision-Language Understanding
by: Liu, Ying, et al.
Published: (2026)

AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving
by: Qian, Kangan, et al.
Published: (2025)

Vision-Language Agents for Interactive Forest Change Analysis
by: Brock, James, et al.
Published: (2026)

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics
by: Glocker, Marc, et al.
Published: (2025)

Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning
by: Xu, Huilin, et al.
Published: (2025)

EVLP: Learning Unified Embodied Vision-Language Planner with Reinforced Supervised Fine-Tuning
by: Cai, Xinyan, et al.
Published: (2025)

An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance
by: Yang, Hsuan-Kung, et al.
Published: (2025)

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents
by: Zhang, Zhizhen, et al.
Published: (2025)

Uni-LaViRA: Language-Vision-Robot Actions Translation for Unified Embodied Navigation
by: Ding, Hongyu, et al.
Published: (2026)

Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions
by: Huang, Saffron, et al.
Published: (2025)

StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
by: Zawar, Rushikesh, et al.
Published: (2024)

Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement
by: Gkanatsios, Nikolaos, et al.
Published: (2023)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

Vero: An Open RL Recipe for General Visual Reasoning
by: Sarch, Gabriel, et al.
Published: (2026)

G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
by: Wan, Zhongwei, et al.
Published: (2022)

Embodied4C: Measuring What Matters for Embodied Vision-Language Navigation
by: Sohn, Tin Stribor, et al.
Published: (2025)

Video Diffusion Alignment via Reward Gradients
by: Prabhudesai, Mihir, et al.
Published: (2024)

Diffusion Beats Autoregressive in Data-Constrained Settings
by: Prabhudesai, Mihir, et al.
Published: (2025)