Library Catalog

Bibliographic Details
Main Authors:	Wang, Zihan; Li, Songlin; Hao, Lingyan; Hu, Xinyu; Song, Bowen
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2411.13609

Similar Items

Seeing What Matters: Visual Preference Policy Optimization for Visual Generation
by: Ni, Ziqi, et al.
Published: (2025)

What You See is What You Ask: Evaluating Audio Descriptions
by: Kala, Divy, et al.
Published: (2025)

FrameOracle: Learning What to See and How Much to See in Videos
by: Li, Chaoyu, et al.
Published: (2025)

What You See is What You Classify: Black Box Attributions
by: Stalder, Steven, et al.
Published: (2022)

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
by: Pei, Gensheng, et al.
Published: (2025)

Tell What You Hear From What You See -- Video to Audio Generation Through Text
by: Liu, Xiulong, et al.
Published: (2024)

Seeing What You Say: Expressive Image Generation from Speech
by: Lee, Jiyoung, et al.
Published: (2025)

Do You See What I Am Pointing At? Gesture-Based Egocentric Video Question Answering
by: Choi, Yura, et al.
Published: (2026)

What You See is (Usually) What You Get: Multimodal Prototype Networks that Abstain from Expensive Modalities
by: Bahng, Muchang, et al.
Published: (2025)

See What You Are Told: Visual Attention Sink in Large Multimodal Models
by: Kang, Seil, et al.
Published: (2025)

Seeing What Matters: Generalizable AI-generated Video Detection with Forensic-Oriented Augmentation
by: Corvi, Riccardo, et al.
Published: (2025)

Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
by: Yi, Ariana, et al.
Published: (2025)

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
by: Song, Junha, et al.
Published: (2026)

Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
by: De Nadai, Marco, et al.
Published: (2025)

What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation
by: Cheng, Yihua, et al.
Published: (2024)

Do You See What I Say? Generalizable Deepfake Detection based on Visual Speech Recognition
by: Bora, Maheswar, et al.
Published: (2025)

Focus On What Matters: Separated Models For Visual-Based RL Generalization
by: Zhang, Di, et al.
Published: (2024)

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
by: Lin, Weifeng, et al.
Published: (2024)

What Matters to You? Towards Visual Representation Alignment for Robot Learning
by: Tian, Ran, et al.
Published: (2023)

Generating 360° Video is What You Need For a 3D Scene
by: Zhang, Zhaoyang, et al.
Published: (2025)

Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark
by: Chen, Seng Nam, et al.
Published: (2026)

Smart Feature is What You Need
by: Hu, Zhaoxin, et al.
Published: (2024)

Visually Dehallucinative Instruction Generation: Know What You Don't Know
by: Cha, Sungguk, et al.
Published: (2024)

Point What You Mean: Visually Grounded Instruction Policy
by: Yu, Hang, et al.
Published: (2025)

What You Have is What You Track: Adaptive and Robust Multimodal Tracking
by: Tan, Yuedong, et al.
Published: (2025)

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
by: Li, Senmao, et al.
Published: (2024)

Revisit What You See: Revealing Visual Semantics in Vision Tokens to Guide LVLM Decoding
by: Cho, Beomsik, et al.
Published: (2025)

What Are You Doing? A Closer Look at Controllable Human Video Generation
by: Bugliarello, Emanuele, et al.
Published: (2025)

From Sora What We Can See: A Survey of Text-to-Video Generation
by: Sun, Rui, et al.
Published: (2024)

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
by: Sun, Boyuan, et al.
Published: (2026)

See What You Seek: Semantic Contextual Integration for Cloth-Changing Person Re-Identification
by: Han, Xiyu, et al.
Published: (2024)

What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs
by: Trevithick, Alex, et al.
Published: (2024)

What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation
by: Lin, Jianghang, et al.
Published: (2025)

Can Current AI Models Count What We Mean, Not What They See? A Benchmark and Systematic Evaluation
by: Nguyen, Gia Khanh, et al.
Published: (2025)

NowYouSee Me: Context-Aware Automatic Audio Description
by: Lee, Seon-Ho, et al.
Published: (2024)

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
by: Abdelhamed, Abdelrahman, et al.
Published: (2024)

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Quality
by: Luo, Ge Ya, et al.
Published: (2024)

Masked Generative Transformer Is What You Need for Image Editing
by: Chow, Wei, et al.
Published: (2026)

Is What You Ask For What You Get? Investigating Concept Associations in Text-to-Image Models
by: Magid, Salma Abdel, et al.
Published: (2024)

What Do You See in Common? Learning Hierarchical Prototypes over Tree-of-Life to Discover Evolutionary Traits
by: Manogaran, Harish Babu, et al.
Published: (2024)