:: Library Catalog

Bibliographic Details
Main Authors:	Pal, Ankit; Sankarasubbu, Malaikannan
Format:	Preprint
Published:	2024
Subjects:	Computation and Language; Artificial Intelligence; Computer Vision and Pattern Recognition; Human-Computer Interaction; Machine Learning
Online Access:	https://arxiv.org/abs/2402.07023

Similar Items

A Review on Large Language Models for Visual Analytics
by: Agarwal, Navya Sonal, et al.
Published: (2025)

MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
by: Li, Xiao, et al.
Published: (2025)

VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
by: Mazumdar, Amrita, et al.
Published: (2026)

SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model
by: Cazzaniga, Luca
Published: (2026)

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
by: Zhao, Henry Hengyuan, et al.
Published: (2025)

ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding
by: Pal, Ankit, et al.
Published: (2025)

Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)

E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
by: Lin, Ronghao, et al.
Published: (2025)

How Can Large Language Models Enable Better Socially Assistive Human-Robot Interaction: A Brief Survey
by: Shi, Zhonghao, et al.
Published: (2024)

ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation
by: Xia, Ding, et al.
Published: (2025)

Can Large Language Models Capture Video Game Engagement?
by: Melhart, David, et al.
Published: (2025)

CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
by: Verma, Arnav, et al.
Published: (2025)

Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
by: Asseri, Bushra, et al.
Published: (2025)

GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents
by: Luo, Run, et al.
Published: (2025)

Measuring Agreeableness Bias in Multimodal Models
by: Lim, Jaehyuk, et al.
Published: (2024)

AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)

A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
by: Betala, Siddharth, et al.
Published: (2025)

Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
by: Lopez-Cardona, Angela, et al.
Published: (2024)

ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
by: Hsiao, Yu-Chung, et al.
Published: (2022)

UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
by: Liu, Xinyi, et al.
Published: (2025)

Learning Multimodal Cues of Children's Uncertainty
by: Cheng, Qi, et al.
Published: (2024)

UIClip: A Data-driven Model for Assessing User Interface Design
by: Wu, Jason, et al.
Published: (2024)

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)

GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
by: Yang, Kaichun, et al.
Published: (2025)

Voting-based Multimodal Automatic Deception Detection
by: Touma, Lana, et al.
Published: (2023)

Detoxifying Large Language Models via Knowledge Editing
by: Wang, Mengru, et al.
Published: (2024)

Knowledge Mechanisms in Large Language Models: A Survey and Perspective
by: Wang, Mengru, et al.
Published: (2024)

ReLearn: Unlearning via Learning for Large Language Models
by: Xu, Haoming, et al.
Published: (2025)

A Comprehensive Study of Knowledge Editing for Large Language Models
by: Zhang, Ningyu, et al.
Published: (2024)

GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)

Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)

ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)

InstructEdit: Instruction-based Knowledge Editing for Large Language Models
by: Zhang, Ningyu, et al.
Published: (2024)

Position and Rotation Invariant Sign Language Recognition from 3D Kinect Data with Recurrent Neural Networks
by: Roy, Prasun, et al.
Published: (2020)

Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
by: da Silva, Neemias, et al.
Published: (2026)

Semantic and Expressive Variation in Image Captions Across Languages
by: Ye, Andre, et al.
Published: (2023)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways
by: Chen, Zhen, et al.
Published: (2025)

GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT
by: Gao, Nan, et al.
Published: (2023)