| Main Authors: | Jun, Sejoon, Nguyen-Truong, Hai, Seminara, Luigi, Torresani, Lorenzo |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.20388 |
Similar Items
RECIPE: Procedural Planning via Grounding in Instructional Video
by: Seminara, Luigi, et al.
Published: (2026)
NeIn: Telling What You Don't Want
by: Bui, Nhat-Tan, et al.
Published: (2024)
You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval
by: Koley, Subhadeep, et al.
Published: (2024)
Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
by: Seminara, Luigi, et al.
Published: (2025)
Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos
by: Seminara, Luigi, et al.
Published: (2024)
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
by: Choi, Yura, et al.
Published: (2026)
Step Differences in Instructional Video
by: Nagarajan, Tushar, et al.
Published: (2024)
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
by: Seminara, Luigi, et al.
Published: (2026)
Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)
What You Have is What You Track: Adaptive and Robust Multimodal Tracking
by: Tan, Yuedong, et al.
Published: (2025)
Tell Me What You See: Text-Guided Real-World Image Denoising
by: Yosef, Erez, et al.
Published: (2023)
Where Do You Go? Pedestrian Trajectory Prediction using Scene Features
by: Rezaei, Mohammad Ali, et al.
Published: (2025)
What Are You Doing? A Closer Look at Controllable Human Video Generation
by: Bugliarello, Emanuele, et al.
Published: (2025)
EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
by: Jung, Minjoon, et al.
Published: (2026)
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
by: Magid, Salma Abdel, et al.
Published: (2024)
Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
by: Wang, Zan, et al.
Published: (2024)
Eye Gaze Tells You Where to Compute: Gaze-Driven Efficient VLMs
by: Chen, Qinyu, et al.
Published: (2025)
Tell Me Where You Are: Multimodal LLMs Meet Place Recognition
by: Lyu, Zonglin, et al.
Published: (2024)
What You See is What You Ask: Evaluating Audio Descriptions
by: Kala, Divy, et al.
Published: (2025)
What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation
by: Cheng, Yihua, et al.
Published: (2024)
Moving Object Segmentation: All You Need Is SAM (and Flow)
by: Xie, Junyu, et al.
Published: (2024)
Smart Feature is What You Need
by: Hu, Zhaoxin, et al.
Published: (2024)
What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities
by: Bahng, Muchang, et al.
Published: (2025)
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
by: Li, Senmao, et al.
Published: (2024)
What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
by: Lin, Jianghang, et al.
Published: (2025)
Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation
by: Li, Yifan, et al.
Published: (2026)
Who Walks With You Matters: Perceiving Social Interactions with Groups for Pedestrian Trajectory Prediction
by: Zou, Ziqian, et al.
Published: (2024)
Anatomy Might Be All You Need: Forecasting What to Do During Surgery
by: Sarwin, Gary, et al.
Published: (2025)
What You See is What You Classify: Black Box Attributions
by: Stalder, Steven, et al.
Published: (2022)
Fall Forecast: What You'll Be Reading Next.
by: Hoffert, Barbara
Published: (1997)
Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
by: Bora, Maheswar, et al.
Published: (2025)
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)
EgoNav: Egocentric Scene-aware Human Trajectory Prediction
by: Wang, Weizhuo, et al.
Published: (2024)
Decom--CAM: Tell Me What You See, In Details! Feature-Level Interpretation via Decomposition Class Activation Map
by: Yang, Yuguang, et al.
Published: (2023)
Semantic Compositions Enhance Vision-Language Contrastive Learning
by: Aladago, Maxwell, et al.
Published: (2024)
Think Before You Move: Latent Motion Reasoning for Text-to-Motion Generation
by: Qian, Yijie, et al.
Published: (2025)
Semantic Alignment in Hyperbolic Space for Open-Vocabulary Semantic Segmentation
by: Truong, Hoang M., et al.
Published: (2026)
Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
by: Jiang, Tianchong, et al.
Published: (2025)
Aligning What You Separate: Denoised Patch Mixing for Source-Free Domain Adaptation in Medical Image Segmentation
by: Bui-Tran, Quang-Khai, et al.
Published: (2025)
SeTformer is What You Need for Vision and Language
by: Shamsolmoali, Pourya, et al.
Published: (2024)