:: Library Catalog

Bibliographic Details
Main Authors:	Hojel, Alberto; Bai, Yutong; Darrell, Trevor; Globerson, Amir; Bar, Amir
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.05729

Similar Items

Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)

Lifting Embodied World Models for Planning and Control
by: Wang, Alex N., et al.
Published: (2026)

EgoPet: Egomotion and Interaction Data from an Animal's Perspective
by: Bar, Amir, et al.
Published: (2024)

Whole-Body Conditioned Egocentric Video Prediction
by: Bai, Yutong, et al.
Published: (2025)

Stochastic positional embeddings improve masked image modeling
by: Bar, Amir, et al.
Published: (2023)

Vector Quantized Feature Fields for Fast 3D Semantic Lifting
by: Tang, George, et al.
Published: (2025)

Navigation World Models
by: Bar, Amir, et al.
Published: (2024)

From Generated Human Videos to Physically Plausible Robot Trajectories
by: Ni, James, et al.
Published: (2025)

Analyzing The Language of Visual Tokens
by: Chan, David M., et al.
Published: (2024)

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
by: Assouel, Rim, et al.
Published: (2026)

REOrdering Patches Improves Vision Models
by: Kutscher, Declan, et al.
Published: (2025)

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
by: Huang, Brandon, et al.
Published: (2024)

Recursive Visual Programming
by: Ge, Jiaxin, et al.
Published: (2023)

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
by: Bitton-Guetta, Nitzan, et al.
Published: (2024)

Latent Implicit Visual Reasoning
by: Li, Kelvin, et al.
Published: (2025)

An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence
by: Weissman, Gil, et al.
Published: (2025)

Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
by: Golovanevsky, Michal, et al.
Published: (2025)

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
by: Niu, Dantong, et al.
Published: (2024)

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
by: Lian, Long, et al.
Published: (2023)

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
by: Ng, Evonne, et al.
Published: (2024)

Segment Anything without Supervision
by: Wang, XuDong, et al.
Published: (2024)

Scaling Language-Free Visual Representation Learning
by: Fan, David, et al.
Published: (2025)

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
by: Wu, Tsung-Han, et al.
Published: (2024)

Visual Lexicon: Rich Image Features in Language Space
by: Wang, XuDong, et al.
Published: (2024)

When Do We Not Need Larger Vision Models?
by: Shi, Baifeng, et al.
Published: (2024)

Questioning the Stability of Visual Question Answering
by: Rosenfeld, Amir, et al.
Published: (2025)

DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding
by: Ahmadian, Mona, et al.
Published: (2025)

A Dataset for Mechanical Mechanisms
by: Ghezelbash, Farshid, et al.
Published: (2024)

Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
by: Mansourian, Amir M., et al.
Published: (2025)

Readout Guidance: Learning Control from Diffusion Features
by: Luo, Grace, et al.
Published: (2023)

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
by: Luo, Grace, et al.
Published: (2023)

Dual-Process Image Generation
by: Luo, Grace, et al.
Published: (2025)

Solving Vision Tasks with Simple Photoreceptors Instead of Cameras
by: Atanov, Andrei, et al.
Published: (2024)

Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
by: El-Ghoussani, Amir, et al.
Published: (2026)

UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
by: Yu, Junwei, et al.
Published: (2025)

DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
by: Huang, Brandon, et al.
Published: (2025)

Fast Image-based Neural Relighting with Translucency-Reflection Modeling
by: Zhu, Shizhan, et al.
Published: (2023)

Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
by: Qin, Yiming, et al.
Published: (2025)

Visual Autoregressive Modelling for Monocular Depth Estimation
by: El-Ghoussani, Amir, et al.
Published: (2025)

Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding
by: Abdollahi, Hamid, et al.
Published: (2025)