| Main Authors: | Zheng, Xinyue, Lin, Haowei, He, Kaichen, Wang, Zihao, Zheng, Zilong, Liang, Yitao |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2310.08367 |
Similar Items
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
by: Li, Muyao, et al.
Published: (2025)
PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
by: Li, Hengzhi, et al.
Published: (2025)
MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI
by: Yao, Huanjin, et al.
Published: (2025)
Open-World Skill Discovery from Unsegmented Demonstrations
by: Deng, Jingwen, et al.
Published: (2025)
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
by: He, Zheqi, et al.
Published: (2025)
In Search of the Ingredients of Open-Endedness: Replicating Picbreeder with Large Vision-Language Models
by: Earle, Sam, et al.
Published: (2026)
Agent S: An Open Agentic Framework that Uses Computers Like a Human
by: Agashe, Saaket, et al.
Published: (2024)
Contextual Experience Replay for Self-Improvement of Language Agents
by: Liu, Yitao, et al.
Published: (2025)
Evaluation Agent: Efficient and Promptable Evaluation Framework for Visual Generative Models
by: Zhang, Fan, et al.
Published: (2024)
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)
JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents
by: Zheng, Kaizhi, et al.
Published: (2022)
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
by: Zhang, Kai, et al.
Published: (2024)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis
by: Cheng, Kanzhi, et al.
Published: (2026)
Echo-4o: Harnessing the Power of GPT-4o Synthetic Images for Improved Image Generation
by: Ye, Junyan, et al.
Published: (2025)
WebGuard: Building a Generalizable Guardrail for Web Agents
by: Zheng, Boyuan, et al.
Published: (2025)
Holistic Evaluation for Interleaved Text-and-Image Generation
by: Liu, Minqian, et al.
Published: (2024)
Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs
by: Zhao, Haozhe, et al.
Published: (2026)
Probing and Inducing Combinational Creativity in Vision-Language Models
by: Peng, Yongqian, et al.
Published: (2025)
Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment
by: Chen, Lu, et al.
Published: (2024)
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
by: Cai, Shaofei, et al.
Published: (2024)
GameWorld: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents
by: Ouyang, Mingyu, et al.
Published: (2026)
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents
by: Wang, Yunzhe, et al.
Published: (2026)
v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound
by: Shi, Zhengpeng, et al.
Published: (2025)
TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
by: Liu, Daixian, et al.
Published: (2026)
ADAM: An Embodied Causal Agent in Open-World Environments
by: Yu, Shu, et al.
Published: (2024)
VITA: Towards Open-Source Interactive Omni Multimodal LLM
by: Fu, Chaoyou, et al.
Published: (2024)
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
by: Zhao, Yilun, et al.
Published: (2025)
Revisiting the Role of Language Priors in Vision-Language Models
by: Lin, Zhiqiu, et al.
Published: (2023)
UniPCB: A Unified Vision-Language Benchmark for Open-Ended PCB Quality Inspection
by: Sun, Fuxiang, et al.
Published: (2026)
Bridging Coarse and Fine Recognition: A Hybrid Approach for Open-Ended Multi-Granularity Object Recognition in Interactive Educational Games
by: Yi, Hanling, et al.
Published: (2026)
The Dawn of GUI Agent: A Preliminary Case Study with Claude 3.5 Computer Use
by: Hu, Siyuan, et al.
Published: (2024)
CLoG: Benchmarking Continual Learning of Image Generation Models
by: Zhang, Haotian, et al.
Published: (2024)
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
by: Liu, Xiao, et al.
Published: (2024)
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
by: He, Xuehai, et al.
Published: (2024)
From Summary to Action: Enhancing Large Language Models for Complex Tasks with Open World APIs
by: Liu, Yulong, et al.
Published: (2024)
DermaVQA-DAS: Dermatology Assessment Schema (DAS) & Datasets for Closed-Ended Question Answering & Segmentation in Patient-Generated Dermatology Images
by: Yim, Wen-wai, et al.
Published: (2025)
HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
by: Zhang, Haowei, et al.
Published: (2026)
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
by: Ahn, Jaewoo, et al.
Published: (2025)
VC4VG: Optimizing Video Captions for Text-to-Video Generation
by: Du, Yang, et al.
Published: (2025)