| Main Authors: | Wu, Jing, Barretto, Daphne, Chen, Yiye, Gydé, Nicholas, Jian, Yanan, He, Yuhang, Vineet, Vibhav |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2601.20650 |
Similar Items
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
by: Azad, Shehreen, et al.
Published: (2025)
Physics Knowledge in Frontier Models: A Diagnostic Study of Failure Modes
by: Bagdonaviciute, Ieva, et al.
Published: (2025)
StreamReady: Learning What to Answer and When in Long Streaming Videos
by: Azad, Shehreen, et al.
Published: (2026)
GASP: Gaussian Avatars with Synthetic Priors
by: Saunders, Jack, et al.
Published: (2024)
On Occlusions in Video Action Detection: Benchmark Datasets And Training Recipes
by: Modi, Rajat, et al.
Published: (2024)
Navigating Hallucinations for Reasoning of Unintentional Activities
by: Grover, Shresth, et al.
Published: (2024)
Grounding Task Assistance with Multimodal Cues from a Single Demonstration
by: Sarch, Gabriel, et al.
Published: (2025)
Schema-Guided Scene-Graph Reasoning based on Multi-Agent Large Language Model System
by: Chen, Yiye, et al.
Published: (2025)
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
by: Kumar, Akash, et al.
Published: (2025)
OmViD: Omni-supervised active learning for video action detection
by: Rana, Aayush, et al.
Published: (2025)
PEEKABOO: Interactive Video Generation via Masked-Diffusion
by: Jain, Yash, et al.
Published: (2023)
CarePilot: A Multi-Agent Framework for Long-Horizon Computer Task Automation in Healthcare
by: Ghosh, Akash, et al.
Published: (2026)
Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization
by: Zhang, Yue, et al.
Published: (2024)
CoSPlan: Corrective Sequential Planning via Scene Graph Incremental Updates
by: Grover, Shresth, et al.
Published: (2025)
Understanding Depth and Height Perception in Large Visual-Language Models
by: Azad, Shehreen, et al.
Published: (2024)
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation
by: Joshi, Siddharth, et al.
Published: (2025)
BOSS: Benchmark for Observation Space Shift in Long-Horizon Task
by: Yang, Yue, et al.
Published: (2025)
Fara-7B: An Efficient Agentic Model for Computer Use
by: Awadallah, Ahmed, et al.
Published: (2025)
VISTA: Enhancing Visual Conditioning via Track-Following Preference Optimization in Vision-Language-Action Models
by: Chen, Yiye, et al.
Published: (2026)
DreamDistribution: Learning Prompt Distribution for Diverse In-distribution Generation
by: Zhao, Brian Nlong, et al.
Published: (2023)
Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames
by: Ravi, Sahithya, et al.
Published: (2025)
ReVision: Scaling Computer-Use Agents via Temporal Visual Redundancy Reduction
by: Abaskohi, Amirhossein, et al.
Published: (2026)
A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding
by: Zhang, Yue, et al.
Published: (2026)
OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
by: Hu, Xueyu, et al.
Published: (2025)
Robustness Analysis on Foundational Segmentation Models
by: Schiappa, Madeline Chantry, et al.
Published: (2023)
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
by: Jia, Hongrui, et al.
Published: (2025)
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
by: Wang, Jiayu, et al.
Published: (2024)
SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience
by: Sun, Zeyi, et al.
Published: (2025)
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
by: Zhang, Shiduo, et al.
Published: (2024)
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
by: Liu, Lulin, et al.
Published: (2026)
Generalizable Dense Reward for Long-Horizon Robotic Tasks
by: Yong, Silong, et al.
Published: (2026)
Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation
by: Guo, Xinying, et al.
Published: (2026)
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
by: Song, Xinshuai, et al.
Published: (2024)
ODYSSEY: Open-World Quadrupeds Exploration and Manipulation for Long-Horizon Tasks
by: Wang, Kaijun, et al.
Published: (2025)
Future Predictive Success-or-Failure Classification for Long-Horizon Robotic Tasks
by: Sogi, Naoya, et al.
Published: (2024)
Curriculum Guided Massive Multi Agent System Solving For Robust Long Horizon Tasks
by: Kar, Indrajit, et al.
Published: (2025)
VSearcher: Long-Horizon Multimodal Search Agent via Reinforcement Learning
by: Zhang, Ruiyang, et al.
Published: (2026)
Task Consistent Prototype Learning for Incremental Few-shot Semantic Segmentation
by: Xu, Wenbo, et al.
Published: (2024)
AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents
by: Shi, Yibo, et al.
Published: (2026)
CODA: Coordinating the Cerebrum and Cerebellum for a Dual-Brain Computer Use Agent with Decoupled Reinforcement Learning
by: Sun, Zeyi, et al.
Published: (2025)