| Main Authors: | Hojel, Alberto, Bai, Yutong, Darrell, Trevor, Globerson, Amir, Bar, Amir |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.05729 |
Similar Items
Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)
Lifting Embodied World Models for Planning and Control
by: Wang, Alex N., et al.
Published: (2026)
EgoPet: Egomotion and Interaction Data from an Animal's Perspective
by: Bar, Amir, et al.
Published: (2024)
Whole-Body Conditioned Egocentric Video Prediction
by: Bai, Yutong, et al.
Published: (2025)
Stochastic positional embeddings improve masked image modeling
by: Bar, Amir, et al.
Published: (2023)
Vector Quantized Feature Fields for Fast 3D Semantic Lifting
by: Tang, George, et al.
Published: (2025)
Navigation World Models
by: Bar, Amir, et al.
Published: (2024)
From Generated Human Videos to Physically Plausible Robot Trajectories
by: Ni, James, et al.
Published: (2025)
Analyzing The Language of Visual Tokens
by: Chan, David M., et al.
Published: (2024)
PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs
by: Assouel, Rim, et al.
Published: (2026)
REOrdering Patches Improves Vision Models
by: Kutscher, Declan, et al.
Published: (2025)
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
by: Huang, Brandon, et al.
Published: (2024)
Recursive Visual Programming
by: Ge, Jiaxin, et al.
Published: (2023)
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
by: Bitton-Guetta, Nitzan, et al.
Published: (2024)
Latent Implicit Visual Reasoning
by: Li, Kelvin, et al.
Published: (2025)
An Automated Tip-and-Cue Framework for Optimized Satellite Tasking and Visual Intelligence
by: Weissman, Gil, et al.
Published: (2025)
Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts
by: Golovanevsky, Michal, et al.
Published: (2025)
LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
by: Niu, Dantong, et al.
Published: (2024)
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
by: Lian, Long, et al.
Published: (2023)
From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
by: Ng, Evonne, et al.
Published: (2024)
Segment Anything without Supervision
by: Wang, XuDong, et al.
Published: (2024)
Scaling Language-Free Visual Representation Learning
by: Fan, David, et al.
Published: (2025)
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
by: Wu, Tsung-Han, et al.
Published: (2024)
Visual Lexicon: Rich Image Features in Language Space
by: Wang, XuDong, et al.
Published: (2024)
When Do We Not Need Larger Vision Models?
by: Shi, Baifeng, et al.
Published: (2024)
Questioning the Stability of Visual Question Answering
by: Rosenfeld, Amir, et al.
Published: (2025)
DEL: Dense Event Localization for Multi-modal Audio-Visual Understanding
by: Ahmadian, Mona, et al.
Published: (2025)
A Dataset for Mechanical Mechanisms
by: Ghezelbash, Farshid, et al.
Published: (2024)
Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
by: Mansourian, Amir M., et al.
Published: (2025)
Readout Guidance: Learning Control from Diffusion Features
by: Luo, Grace, et al.
Published: (2023)
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
by: Luo, Grace, et al.
Published: (2023)
Dual-Process Image Generation
by: Luo, Grace, et al.
Published: (2025)
Solving Vision Tasks with Simple Photoreceptors Instead of Cameras
by: Atanov, Andrei, et al.
Published: (2024)
Prompt-Guided Image Editing with Masked Logit Nudging in Visual Autoregressive Models
by: El-Ghoussani, Amir, et al.
Published: (2026)
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
by: Yu, Junwei, et al.
Published: (2025)
DAVE: A VLM Vision Encoder for Document Understanding and Web Agents
by: Huang, Brandon, et al.
Published: (2025)
Fast Image-based Neural Relighting with Translucency-Reflection Modeling
by: Zhu, Shizhan, et al.
Published: (2023)
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens
by: Qin, Yiming, et al.
Published: (2025)
Visual Autoregressive Modelling for Monocular Depth Estimation
by: El-Ghoussani, Amir, et al.
Published: (2025)
Probing Multimodal Fusion in the Brain: The Dominance of Audiovisual Streams in Naturalistic Encoding
by: Abdollahi, Hamid, et al.
Published: (2025)