| Main Authors: | Pal, Ankit, Sankarasubbu, Malaikannan |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.07023 |
Similar Items
A Review on Large Language Models for Visual Analytics
by: Agarwal, Navya Sonal, et al.
Published: (2025)
MedFoundationHub: A Lightweight and Secure Toolkit for Deploying Medical Vision Language Foundation Models
by: Li, Xiao, et al.
Published: (2025)
VideoFDB: Evaluating Full-Duplex Vision-Speech Capabilities in Conversational Agents
by: Mazumdar, Amrita, et al.
Published: (2026)
SCHEMA for Gemini 3 Pro Image: A Structured Methodology for Controlled AI Image Generation on Google's Native Multimodal Model
by: Cazzaniga, Luca
Published: (2026)
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
by: Zhao, Henry Hengyuan, et al.
Published: (2025)
ReXVQA: A Large-scale Visual Question Answering Benchmark for Generalist Chest X-ray Understanding
by: Pal, Ankit, et al.
Published: (2025)
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
by: You, Keen, et al.
Published: (2024)
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
by: Lin, Ronghao, et al.
Published: (2025)
How Can Large Language Models Enable Better Socially Assistive Human-Robot Interaction: A Brief Survey
by: Shi, Zhonghao, et al.
Published: (2024)
ColorGPT: Leveraging Large Language Models for Multimodal Color Recommendation
by: Xia, Ding, et al.
Published: (2025)
Can Large Language Models Capture Video Game Engagement?
by: Melhart, David, et al.
Published: (2025)
CHART-6: Human-Centered Evaluation of Data Visualization Understanding in Vision-Language Models
by: Verma, Arnav, et al.
Published: (2025)
Deciphering Emotions in Children Storybooks: A Comparative Analysis of Multimodal LLMs in Educational Applications
by: Asseri, Bushra, et al.
Published: (2025)
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
by: Luo, Run, et al.
Published: (2025)
Measuring Agreeableness Bias in Multimodal Models
by: Lim, Jaehyuk, et al.
Published: (2024)
AIN: The Arabic INclusive Large Multimodal Model
by: Heakl, Ahmed, et al.
Published: (2025)
A Picture is Worth a Thousand (Correct) Captions: A Vision-Guided Judge-Corrector System for Multimodal Machine Translation
by: Betala, Siddharth, et al.
Published: (2025)
Seeing Eye to AI: Human Alignment via Gaze-Based Response Rewards for Large Language Models
by: Lopez-Cardona, Angela, et al.
Published: (2024)
ScreenQA: Large-Scale Question-Answer Pairs over Mobile App Screenshots
by: Hsiao, Yu-Chung, et al.
Published: (2022)
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis
by: Liu, Xinyi, et al.
Published: (2025)
Learning Multimodal Cues of Children's Uncertainty
by: Cheng, Qi, et al.
Published: (2024)
UIClip: A Data-driven Model for Assessing User Interface Design
by: Wu, Jason, et al.
Published: (2024)
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
by: Wu, Zhiyong, et al.
Published: (2024)
GPT-5 Model Corrected GPT-4V's Chart Reading Errors, Not Prompting
by: Yang, Kaichun, et al.
Published: (2025)
Voting-based Multimodal Automatic Deception Detection
by: Touma, Lana, et al.
Published: (2023)
Detoxifying Large Language Models via Knowledge Editing
by: Wang, Mengru, et al.
Published: (2024)
Knowledge Mechanisms in Large Language Models: A Survey and Perspective
by: Wang, Mengru, et al.
Published: (2024)
ReLearn: Unlearning via Learning for Large Language Models
by: Xu, Haoming, et al.
Published: (2025)
A Comprehensive Study of Knowledge Editing for Large Language Models
by: Zhang, Ningyu, et al.
Published: (2024)
GUICourse: From General Vision Language Models to Versatile GUI Agents
by: Chen, Wentong, et al.
Published: (2024)
Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)
ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows
by: Sun, Qiushi, et al.
Published: (2025)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
InstructEdit: Instruction-based Knowledge Editing for Large Language Models
by: Zhang, Ningyu, et al.
Published: (2024)
Position and Rotation Invariant Sign Language Recognition from 3D Kinect Data with Recurrent Neural Networks
by: Roy, Prasun, et al.
Published: (2020)
Analyzing Persona Effects in Generated Explanations from Multimodal LLM Agents in Urban Perception
by: da Silva, Neemias, et al.
Published: (2026)
Semantic and Expressive Variation in Image Captions Across Languages
by: Ye, Andre, et al.
Published: (2023)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)
MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways
by: Chen, Zhen, et al.
Published: (2025)
GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT
by: Gao, Nan, et al.
Published: (2023)