:: Library Catalog

Bibliographic Details
Main Authors:	Jun, Sejoon; Nguyen-Truong, Hai; Seminara, Luigi; Torresani, Lorenzo
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.20388

Similar Items

RECIPE: Procedural Planning via Grounding in Instructional Video
by: Seminara, Luigi, et al.
Published: (2026)

NeIn: Telling What You Don't Want
by: Bui, Nhat-Tan, et al.
Published: (2024)

You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
by: Koley, Subhadeep, et al.
Published: (2024)

Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
by: Seminara, Luigi, et al.
Published: (2025)

Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos
by: Seminara, Luigi, et al.
Published: (2024)

Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
by: Choi, Yura, et al.
Published: (2026)

Step Differences in Instructional Video
by: Nagarajan, Tushar, et al.
Published: (2024)

ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
by: Seminara, Luigi, et al.
Published: (2026)

Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)

What You Have is What You Track: Adaptive and Robust Multimodal Tracking
by: Tan, Yuedong, et al.
Published: (2025)

Tell Me What You See: Text-Guided Real-World Image Denoising
by: Yosef, Erez, et al.
Published: (2023)

Where Do You Go? Pedestrian Trajectory Prediction using Scene Features
by: Rezaei, Mohammad Ali, et al.
Published: (2025)

What Are You Doing? A Closer Look at Controllable Human Video Generation
by: Bugliarello, Emanuele, et al.
Published: (2025)

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
by: Jung, Minjoon, et al.
Published: (2026)

Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
by: Magid, Salma Abdel, et al.
Published: (2024)

Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
by: Wang, Zan, et al.
Published: (2024)

Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs
by: Chen, Qinyu, et al.
Published: (2025)

Tell Me Where You Are: Multimodal LLMs Meet Place Recognition
by: Lyu, Zonglin, et al.
Published: (2024)

What You See is What You Ask: Evaluating Audio Descriptions
by: Kala, Divy, et al.
Published: (2025)

What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation
by: Cheng, Yihua, et al.
Published: (2024)

Moving Object Segmentation: All You Need Is SAM (and Flow)
by: Xie, Junyu, et al.
Published: (2024)

Smart Feature is What You Need
by: Hu, Zhaoxin, et al.
Published: (2024)

What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities
by: Bahng, Muchang, et al.
Published: (2025)

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
by: Li, Senmao, et al.
Published: (2024)

What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
by: Lin, Jianghang, et al.
Published: (2025)

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
by: Li, Yifan, et al.
Published: (2026)

Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction
by: Zou, Ziqian, et al.
Published: (2024)

Anatomy Might Be All You Need: Forecasting What to Do During Surgery
by: Sarwin, Gary, et al.
Published: (2025)

What You See is What You Classify: Black Box Attributions
by: Stalder, Steven, et al.
Published: (2022)

Fall Forecast: What You'll Be Reading Next.
by: Hoffert, Barbara
Published: (1997)

Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
by: Bora, Maheswar, et al.
Published: (2025)

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)

EgoNav: Egocentric Scene-aware Human Trajectory Prediction
by: Wang, Weizhuo, et al.
Published: (2024)

Decom-CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map
by: Yang, Yuguang, et al.
Published: (2023)

Semantic Compositions Enhance Vision-Language Contrastive Learning
by: Aladago, Maxwell, et al.
Published: (2024)

Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
by: Qian, Yijie, et al.
Published: (2025)

Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation
by: Truong, Hoang M., et al.
Published: (2026)

Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
by: Jiang, Tianchong, et al.
Published: (2025)

Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation
by: Bui-Tran, Quang-Khai, et al.
Published: (2025)

SeTformer is What You Need for Vision and Language
by: Shamsolmoali, Pourya, et al.
Published: (2024)