| Main Authors: | Wang, Zihan; Li, Songlin; Hao, Lingyan; Hu, Xinyu; Song, Bowen |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2411.13609 |
Similar Items
Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
by: Ni, Ziqi, et al.
Published: (2025)
What You See is What You Ask: Evaluating Audio Descriptions
by: Kala, Divy, et al.
Published: (2025)
FrameOracle: Learning What to See and How Much to See in Videos
by: Li, Chaoyu, et al.
Published: (2025)
What You See is What You Classify: Black Box Attributions
by: Stalder, Steven, et al.
Published: (2022)
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
by: Pei, Gensheng, et al.
Published: (2025)
Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)
Seeing What You Say: Expressive Image Generation from Speech
by: Lee, Jiyoung, et al.
Published: (2025)
Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
by: Choi, Yura, et al.
Published: (2026)
What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities
by: Bahng, Muchang, et al.
Published: (2025)
See What You Are Told: Visual Attention Sink in Large Multimodal Models
by: Kang, Seil, et al.
Published: (2025)
Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation
by: Corvi, Riccardo, et al.
Published: (2025)
Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
by: Yi, Ariana, et al.
Published: (2025)
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
by: Song, Junha, et al.
Published: (2026)
Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
by: De Nadai, Marco, et al.
Published: (2025)
What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation
by: Cheng, Yihua, et al.
Published: (2024)
Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
by: Bora, Maheswar, et al.
Published: (2025)
Focus On What Matters: Separated Models For Visual-Based RL Generalization
by: Zhang, Di, et al.
Published: (2024)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
by: Lin, Weifeng, et al.
Published: (2024)
What Matters to You? Towards Visual Representation Alignment for Robot Learning
by: Tian, Ran, et al.
Published: (2023)
Generating 360° Video is What You Need For a 3D Scene
by: Zhang, Zhaoyang, et al.
Published: (2025)
Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
by: Chen, Seng Nam, et al.
Published: (2026)
Smart Feature is What You Need
by: Hu, Zhaoxin, et al.
Published: (2024)
Visually Dehallucinative Instruction Generation: Know What You Don't Know
by: Cha, Sungguk, et al.
Published: (2024)
Point What You Mean: Visually Grounded Instruction Policy
by: Yu, Hang, et al.
Published: (2025)
What You Have is What You Track: Adaptive and Robust Multimodal Tracking
by: Tan, Yuedong, et al.
Published: (2025)
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
by: Li, Senmao, et al.
Published: (2024)
Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
by: Cho, Beomsik, et al.
Published: (2025)
What Are You Doing? A Closer Look at Controllable Human Video Generation
by: Bugliarello, Emanuele, et al.
Published: (2025)
From Sora What We Can See: A Survey of Text-to-Video Generation
by: Sun, Rui, et al.
Published: (2024)
See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)
See What You Seek: Semantic Contextual Integration for Cloth-Changing Person Re-Identification
by: Han, Xiyu, et al.
Published: (2024)
What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
by: Trevithick, Alex, et al.
Published: (2024)
What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
by: Lin, Jianghang, et al.
Published: (2025)
Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
by: Nguyen, Gia Khanh, et al.
Published: (2025)
NowYouSee Me: Context-Aware Automatic Audio Description
by: Lee, Seon-Ho, et al.
Published: (2024)
What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)
Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality
by: Luo, Ge Ya, et al.
Published: (2024)
Masked Generative Transformer Is What You Need for Image Editing
by: Chow, Wei, et al.
Published: (2026)
Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
by: Magid, Salma Abdel, et al.
Published: (2024)
What Do You See in Common? Learning Hierarchical Prototypes over Tree-of-Life to Discover Evolutionary Traits
by: Manogaran, Harish Babu, et al.
Published: (2024)