:: Library Catalog

Bibliographic Details
Main Authors:	Wu, Jing; Barretto, Daphne; Chen, Yiye; Gydé, Nicholas; Jian, Yanan; He, Yuhang; Vineet, Vibhav
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.20650

Similar Items

HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)

Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)

StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)

GASP: Gaussian Avatars with Synthetic Priors
by: Saunders, Jack, et al.
Published: (2024)

On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)

Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)

Grounding Task Assistance with Multimodal Cues from a Single Demonstration
by: Sarch, Gabriel, et al.
Published: (2025)

Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System
by: Chen, Yiye, et al.
Published: (2025)

A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)

OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)

PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)

CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
by: Ghosh, Akash, et al.
Published: (2026)

Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
by: Zhang, Yue, et al.
Published: (2024)

CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)

Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)

MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
by: Joshi, Siddharth, et al.
Published: (2025)

BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)

Fara-7B: An Efficient Agentic Model for Computer Use
by: Awadallah, Ahmed, et al.
Published: (2025)

VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
by: Chen, Yiye, et al.
Published: (2026)

DreamDistribution: Learning Prompt Distribution for Diverse In-distribution Generation
by: Zhao, Brian Nlong, et al.
Published: (2023)

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
by: Ravi, Sahithya, et al.
Published: (2025)

ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
by: Abaskohi, Amirhossein, et al.
Published: (2026)

A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
by: Zhang, Yue, et al.
Published: (2026)

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
by: Hu, Xueyu, et al.
Published: (2025)

Robustness Analysis on Foundational Segmentation Models
by: Schiappa, Madeline Chantry, et al.
Published: (2023)

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
by: Jia, Hongrui, et al.
Published: (2025)

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)

SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
by: Zhang, Shiduo, et al.
Published: (2024)

EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
by: Liu, Lulin, et al.
Published: (2026)

Generalizable Dense Reward for Long-Horizon Robotic Tasks
by: Yong, Silong, et al.
Published: (2026)

Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
by: Guo, Xinying, et al.
Published: (2026)

Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
by: Song, Xinshuai, et al.
Published: (2024)

ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
by: Wang, Kaijun, et al.
Published: (2025)

Future Predictive Success-or-Failure Classification for Long-Horizon Robotic Tasks
by: Sogi, Naoya, et al.
Published: (2024)

Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
by: Kar, Indrajit, et al.
Published: (2025)

VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
by: Zhang, Ruiyang, et al.
Published: (2026)

Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation
by: Xu, Wenbo, et al.
Published: (2024)

AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
by: Shi, Yibo, et al.
Published: (2026)

CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
by: Sun, Zeyi, et al.
Published: (2025)