:: Library Catalog

Bibliographic Details
Main Authors:	Zheng, Xinyue; Lin, Haowei; He, Kaichen; Wang, Zihao; Zheng, Zilong; Liang, Yitao
Format:	Preprint
Published:	2023
Subjects:	Artificial Intelligence; Computation and Language; Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2310.08367

Similar Items

JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
by: Li, Muyao, et al.
Published: (2025)

PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
by: Li, Hengzhi, et al.
Published: (2025)

MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)

Open-World Skill Discovery from Unsegmented Demonstrations
by: Deng, Jingwen, et al.
Published: (2025)

FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
by: He, Zheqi, et al.
Published: (2025)

In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
by: Earle, Sam, et al.
Published: (2026)

Agent S: An Open Agentic Framework that Uses Computers Like a Human
by: Agashe, Saaket, et al.
Published: (2024)

Contextual Experience Replay for Self-Improvement of Language Agents
by: Liu, Yitao, et al.
Published: (2025)

Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
by: Zhang, Fan, et al.
Published: (2024)

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)

JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents
by: Zheng, Kaizhi, et al.
Published: (2022)

MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
by: Zhang, Kai, et al.
Published: (2024)

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)

OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)

Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
by: Ye, Junyan, et al.
Published: (2025)

WebGuard: Building a Generalizable Guardrail for Web Agents
by: Zheng, Boyuan, et al.
Published: (2025)

Holistic Evaluation for Interleaved Text-and-Image Generation
by: Liu, Minqian, et al.
Published: (2024)

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
by: Zhao, Haozhe, et al.
Published: (2026)

Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)

Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment
by: Chen, Lu, et al.
Published: (2024)

ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
by: Cai, Shaofei, et al.
Published: (2024)

GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
by: Ouyang, Mingyu, et al.
Published: (2026)

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
by: Wang, Yunzhe, et al.
Published: (2026)

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
by: Shi, Zhengpeng, et al.
Published: (2025)

TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
by: Liu, Daixian, et al.
Published: (2026)

ADAM: An Embodied Causal Agent in Open-World Environments
by: Yu, Shu, et al.
Published: (2024)

VITA: Towards Open-Source Interactive Omni Multimodal LLM
by: Fu, Chaoyou, et al.
Published: (2024)

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
by: Zhao, Yilun, et al.
Published: (2025)

Revisiting the Role of Language Priors in Vision-Language Models
by: Lin, Zhiqiu, et al.
Published: (2023)

UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection
by: Sun, Fuxiang, et al.
Published: (2026)

Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
by: Yi, Hanling, et al.
Published: (2026)

The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
by: Hu, Siyuan, et al.
Published: (2024)

CLoG: Benchmarking Continual Learning of Image Generation Models
by: Zhang, Haotian, et al.
Published: (2024)

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
by: Liu, Xiao, et al.
Published: (2024)

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
by: He, Xuehai, et al.
Published: (2024)

From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)

DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
by: Yim, Wen-wai, et al.
Published: (2025)

HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
by: Zhang, Haowei, et al.
Published: (2026)

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
by: Ahn, Jaewoo, et al.
Published: (2025)

VC4VG: Optimizing Video Captions for Text-to-Video Generation
by: Du, Yang, et al.
Published: (2025)