:: Library Catalog



Bibliographic Details
Main Authors:	Zhou, Weijie; Xiong, Xuantang; Hu, Zhenlin; Zhu, Xiaomeng; Zhao, Chaoyang; Dong, Honghui; Zhang, Zhengyou; Tang, Ming; Wang, Jinqiao
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.07966

Similar Items

PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments
by: Zhou, Weijie, et al.
Published: (2025)

ProAct: A Benchmark and Multimodal Framework for Structure-Aware Proactive Response
by: Zhu, Xiaomeng, et al.
Published: (2026)

LightPlanner: Unleashing the Reasoning Capabilities of Lightweight Large Language Models in Task Planning
by: Zhou, Weijie, et al.
Published: (2025)

ESearch-R1: Learning Cost-Aware MLLM Agents for Interactive Embodied Search via Reinforcement Learning
by: Zhou, Weijie, et al.
Published: (2025)

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability
by: Zhou, Weijie, et al.
Published: (2025)

FOCUS: Fine-grained Optimization with Semantic Guided Understanding for Pedestrian Attributes Recognition
by: An, Hongyan, et al.
Published: (2025)

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
by: An, Yongqi, et al.
Published: (2026)

Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
by: Jeong, Jihoon, et al.
Published: (2026)

Efficient Masked Autoencoders with Self-Consistency
by: Li, Zhaowen, et al.
Published: (2023)

EgoEdit: Dataset, Real-Time Streaming Model, and Benchmark for Egocentric Video Editing
by: Li, Runjia, et al.
Published: (2025)

CoSLight: Co-optimizing Collaborator Selection and Decision-making to Enhance Traffic Signal Control
by: Ruan, Jingqing, et al.
Published: (2024)

GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models
by: Zheng, Shurong, et al.
Published: (2026)

TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
by: Yang, Fan, et al.
Published: (2026)

In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
by: Peng, Taiying, et al.
Published: (2025)

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
by: Lai, Bolin, et al.
Published: (2023)

Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring
by: Zhan, Yufei, et al.
Published: (2024)

Listening Across the Cosmic Time: Standard Sirens from Ground- and Space-Based Missions in the Next Decade
by: Salvarese, Alberto, et al.
Published: (2025)

Fine-grained Spatiotemporal Grounding on Egocentric Videos
by: Liang, Shuo, et al.
Published: (2025)

Improving Generalization in LLM Structured Pruning via Function-Aware Neuron Grouping
by: Yu, Tao, et al.
Published: (2025)

Friend or Foe? Harnessing Controllable Overfitting for Anomaly Detection
by: Qian, Long, et al.
Published: (2024)

Quality-Aware Language-Conditioned Local Auto-Regressive Anomaly Synthesis and Detection
by: Qian, Long, et al.
Published: (2025)

MathPhys-Guided Coarse-to-Fine Anomaly Synthesis with SQE-Driven Bi-Level Optimization for Anomaly Detection
by: Qian, Long, et al.
Published: (2025)

A Benchmark for Crime Surveillance Video Analysis with Large Models
by: Chen, Haoran, et al.
Published: (2025)

MME-Industry: A Cross-Industry Multimodal Evaluation Benchmark
by: Yi, Dongyi, et al.
Published: (2025)

Ego-Grounding for Personalized Question-Answering in Egocentric Videos
by: Xiao, Junbin, et al.
Published: (2026)

VoiceAssistant-Eval: Benchmarking AI Assistants across Listening, Speaking, and Viewing
by: Wang, Ke, et al.
Published: (2025)

Chapter 3 Autophony: Listening to your Eyes Move
by: Harris, Anna
Published: (2019)

EgoSound: Benchmarking Sound Understanding in Egocentric Videos
by: Zhu, Bingwen, et al.
Published: (2026)

Can Speech LLMs Think while Listening?
by: Shih, Yi-Jen, et al.
Published: (2025)

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
by: Yu, Shoubin, et al.
Published: (2026)

UniBYD: A Unified Framework for Learning Robotic Manipulation Across Embodiments Beyond Imitation of Human Demonstrations
by: Yuan, Tingyu, et al.
Published: (2025)

Highlights of Research Activities in Advanced Materials at Wuhan University
by: Lei Fu, et al.
Published: (2024)

FiLo++: Zero-/Few-Shot Anomaly Detection by Fused Fine-Grained Descriptions and Deformable Localization
by: Gu, Zhaopeng, et al.
Published: (2025)

UniVAD: A Training-free Unified Model for Few-shot Visual Anomaly Detection
by: Gu, Zhaopeng, et al.
Published: (2024)

MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Image Segmentation
by: Zhu, Yuanbing, et al.
Published: (2024)

An Industry Study on Thermoplastic Elastomer (TPE) Materials: Innovations and Applications by Dongguan Renergy Plastic Technology Co., Ltd.
by: Hossain, Md Anwar, et al.
Published: (2026)

Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval
by: Zhao, Chuang, et al.
Published: (2025)

ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
by: Su, Yuejiao, et al.
Published: (2025)

Visual Intention Grounding for Egocentric Assistants
by: Sun, Pengzhan, et al.
Published: (2025)

In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation
by: Lai, Bolin, et al.
Published: (2022)