| Main Authors: | Zhang, Yongji, Li, Siqi, Gao, Yue, Jiang, Yu |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.10250 |
Similar Items
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)
From Coarse to Nuanced: Cross-Modal Alignment of Fine-Grained Linguistic Cues and Visual Salient Regions for Dynamic Emotion Recognition
by: Liu, Yu, et al.
Published: (2025)
H3Former: Hypergraph-based Semantic-Aware Aggregation via Hyperbolic Hierarchical Contrastive Loss for Fine-Grained Visual Classification
by: Zhang, Yongji, et al.
Published: (2025)
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
by: Taylor, Jordan, et al.
Published: (2026)
GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks
by: Yang, Saelyne, et al.
Published: (2026)
FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs
by: Chen, Haodong, et al.
Published: (2024)
Large Language Models estimate fine-grained human color-concept associations
by: Mukherjee, Kushin, et al.
Published: (2024)
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by: Lin, Kevin Qinghong, et al.
Published: (2024)
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
by: Li, Zhikai, et al.
Published: (2024)
ControlGUI: Guiding Generative GUI Exploration through Perceptual Visual Flow
by: Garg, Aryan, et al.
Published: (2025)
CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing
by: Hu, Haobo, et al.
Published: (2026)
Benchmarking XAI Explanations with Human-Aligned Evaluations
by: Kazmierczak, Rémi, et al.
Published: (2024)
GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation
by: Lu, Haoran, et al.
Published: (2024)
Learning 6-DoF Fine-grained Grasp Detection Based on Part Affordance Grounding
by: Song, Yaoxian, et al.
Published: (2023)
Real-Time Feedback and Benchmark Dataset for Isometric Pose Evaluation
by: Jaiswal, Abhishek, et al.
Published: (2025)
ReSemAct: Advancing Fine-Grained Robotic Manipulation via Semantic Structuring and Affordance Refinement
by: Su, Chenyu, et al.
Published: (2025)
Triple Spectral Fusion for Sensor-based Human Activity Recognition
by: Zhang, Ye, et al.
Published: (2026)
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
by: Kapoor, Raghav, et al.
Published: (2024)
Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)
Yume: An Interactive World Generation Model
by: Mao, Xiaofeng, et al.
Published: (2025)
ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting
by: Zhu, Ruijie, et al.
Published: (2025)
ImageTalk: Designing a Multimodal AAC Text Generation System Driven by Image Recognition and Natural Language Generation
by: Yang, Boyin, et al.
Published: (2025)
ASL STEM Wiki: Dataset and Benchmark for Interpreting STEM Articles
by: Yin, Kayo, et al.
Published: (2024)
A Survey on (M)LLM-Based GUI Agents
by: Tang, Fei, et al.
Published: (2025)
HandS3C: 3D Hand Mesh Reconstruction with State Space Spatial Channel Attention from RGB images
by: Jiao, Zixun, et al.
Published: (2024)
Learning High-Quality Navigation and Zooming on Omnidirectional Images in Virtual Reality
by: Cao, Zidong, et al.
Published: (2024)
Design description of Wisdom Computing Persperctive
by: Yu, TianYi
Published: (2025)
Graph4GUI: Graph Neural Networks for Representing Graphical User Interfaces
by: Jiang, Yue, et al.
Published: (2024)
Enabling Collaborative Clinical Diagnosis of Infectious Keratitis by Integrating Expert Knowledge and Interpretable Data-driven Intelligence
by: Fang, Zhengqing, et al.
Published: (2024)
VerSe: Integrating Multiple Queries as Prompts for Versatile Cardiac MRI Segmentation
by: Guo, Bangwei, et al.
Published: (2024)
Milmer: a Framework for Multiple Instance Learning based Multimodal Emotion Recognition
by: Wang, Zaitian, et al.
Published: (2025)
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
by: Zhao, Yiming, et al.
Published: (2026)
EyeTrAES: Fine-grained, Low-Latency Eye Tracking via Adaptive Event Slicing
by: Sen, Argha, et al.
Published: (2024)
Augmenting Image Annotation: A Human-LMM Collaborative Framework for Efficient Object Selection and Label Generation
by: Zhang, He, et al.
Published: (2025)
Generative AI for Cel-Animation: A Survey
by: Tang, Yolo Y., et al.
Published: (2025)
Visual Evaluative AI: A Hypothesis-Driven Tool with Concept-Based Explanations and Weight of Evidence
by: Le, Thao, et al.
Published: (2024)
T2I-Copilot: A Training-Free Multi-Agent Text-to-Image System for Enhanced Prompt Interpretation and Interactive Generation
by: Chen, Chieh-Yun, et al.
Published: (2025)
ScreenAgent: A Vision Language Model-driven Computer Control Agent
by: Niu, Runliang, et al.
Published: (2024)
UI-UG: A Unified MLLM for UI Understanding and Generation
by: Yang, Hao, et al.
Published: (2025)
AutoTour: Automatic Photo Tour Guide with Smartphones and LLMs
by: Xu, Huatao, et al.
Published: (2026)