:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yongji; Li, Siqi; Gao, Yue; Jiang, Yu
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Human-Computer Interaction
Online Access:	https://arxiv.org/abs/2511.10250

Similar Items

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)

From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition
by: Liu, Yu, et al.
Published: (2025)

H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification
by: Zhang, Yongji, et al.
Published: (2025)

The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
by: Taylor, Jordan, et al.
Published: (2026)

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
by: Yang, Saelyne, et al.
Published: (2026)

FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
by: Chen, Haodong, et al.
Published: (2024)

Large Language Models estimate fine-grained human color-concept associations
by: Mukherjee, Kushin, et al.
Published: (2024)

ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
by: Li, Zhikai, et al.
Published: (2024)

ControlGUI: Guiding Generative GUI Exploration through Perceptual Visual Flow
by: Garg, Aryan, et al.
Published: (2025)

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
by: Hu, Haobo, et al.
Published: (2026)

Benchmarking XAI Explanations with Human-Aligned Evaluations
by: Kazmierczak, Rémi, et al.
Published: (2024)

GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation
by: Lu, Haoran, et al.
Published: (2024)

Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding
by: Song, Yaoxian, et al.
Published: (2023)

Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation
by: Jaiswal, Abhishek, et al.
Published: (2025)

ReSemAct: Advancing Fine-Grained Robotic Manipulation via Semantic Structuring and Affordance Refinement
by: Su, Chenyu, et al.
Published: (2025)

Triple Spectral Fusion for Sensor-based Human Activity Recognition
by: Zhang, Ye, et al.
Published: (2026)

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)

Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)

Yume: An Interactive World Generation Model
by: Mao, Xiaofeng, et al.
Published: (2025)

ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
by: Zhu, Ruijie, et al.
Published: (2025)

ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation
by: Yang, Boyin, et al.
Published: (2025)

ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
by: Yin, Kayo, et al.
Published: (2024)

A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)

HandS3C: 3D Hand Mesh Reconstruction with State Space Spatial Channel Attention from RGB images
by: Jiao, Zixun, et al.
Published: (2024)

Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality
by: Cao, Zidong, et al.
Published: (2024)

Design description of Wisdom Computing Persperctive
by: Yu, TianYi
Published: (2025)

Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces
by: Jiang, Yue, et al.
Published: (2024)

Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence
by: Fang, Zhengqing, et al.
Published: (2024)

VerSe: Integrating Multiple Queries as Prompts for Versatile Cardiac MRI Segmentation
by: Guo, Bangwei, et al.
Published: (2024)

Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
by: Wang, Zaitian, et al.
Published: (2025)

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
by: Zhao, Yiming, et al.
Published: (2026)

EyeTrAES: Fine-grained, Low-Latency Eye Tracking via Adaptive Event Slicing
by: Sen, Argha, et al.
Published: (2024)

Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation
by: Zhang, He, et al.
Published: (2025)

Generative AI for Cel-Animation: A Survey
by: Tang, Yolo Y., et al.
Published: (2025)

Visual Evaluative AI: A Hypothesis-Driven Tool with Concept-Based Explanations and Weight of Evidence
by: Le, Thao, et al.
Published: (2024)

T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
by: Chen, Chieh-Yun, et al.
Published: (2025)

ScreenAgent: A Vision Language Model-driven Computer Control Agent
by: Niu, Runliang, et al.
Published: (2024)

UI-UG: A Unified MLLM for UI Understanding and Generation
by: Yang, Hao, et al.
Published: (2025)

AutoTour: Automatic Photo Tour Guide with Smartphones and LLMs
by: Xu, Huatao, et al.
Published: (2026)