| Main Authors: | Tong, Shengbang, Fan, David, Nguyen, John, Brown, Ellis, Zhou, Gaoyue, Qian, Shengyi, Zheng, Boyang, Vallaeys, Théophane, Han, Junlin, Fergus, Rob, Murray, Naila, Ghazvininejad, Marjan, Lewis, Mike, Ballas, Nicolas, Bar, Amir, Rabbat, Michael, Verbeek, Jakob, Zettlemoyer, Luke, Sinha, Koustuv, LeCun, Yann, Xie, Saining |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2603.03276 |
Similar Items
Scaling Language-Free Visual Representation Learning
by: Fan, David, et al.
Published: (2025)
Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
by: Balestriero, Randall, et al.
Published: (2025)
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
by: Tong, Shengbang, et al.
Published: (2024)
Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
by: Tong, Shengbang, et al.
Published: (2026)
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
by: Mur-Labadia, Lorenzo, et al.
Published: (2026)
SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
by: Vallaeys, Théophane, et al.
Published: (2025)
Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks
by: Vallaeys, Théophane, et al.
Published: (2025)
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
by: Vallaeys, Théophane, et al.
Published: (2024)
Learning Latent Action World Models In The Wild
by: Garrido, Quentin, et al.
Published: (2026)
Navigation World Models
by: Bar, Amir, et al.
Published: (2024)
Parallel Stochastic Gradient-Based Planning for World Models
by: Psenka, Michael, et al.
Published: (2026)
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)
Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
by: Yasunaga, Michihiro, et al.
Published: (2025)
VUGEN: Visual Understanding priors for GENeration
by: Chen, Xiangyi, et al.
Published: (2025)
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
by: Zhou, Gaoyue, et al.
Published: (2024)
A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
by: Terver, Basile, et al.
Published: (2026)
World Models for Learning Dexterous Hand-Object Interactions from Human Videos
by: Goswami, Raktim Gautam, et al.
Published: (2025)
Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
by: Denton, Remi, et al.
Published: (2014)
Revisiting Feature Prediction for Learning Visual Representations from Video
by: Bardes, Adrien, et al.
Published: (2024)
Intuitive physics understanding emerges from self-supervised pretraining on natural videos
by: Garrido, Quentin, et al.
Published: (2025)
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)
Cambrian-S: Towards Spatial Supersensing in Video
by: Yang, Shusheng, et al.
Published: (2025)
Diffusion Transformers with Representation Autoencoders
by: Zheng, Boyang, et al.
Published: (2025)
Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
by: Brown, Ellis, et al.
Published: (2025)
Learning and Leveraging World Models in Visual Representation Learning
by: Garrido, Quentin, et al.
Published: (2024)
Fast and Exact Enumeration of Deep Networks Partitions Regions
by: Balestriero, Randall, et al.
Published: (2024)
Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence
by: Dawid, Anna, et al.
Published: (2023)
LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
by: Balestriero, Randall, et al.
Published: (2025)
Learning by Reconstruction Produces Uninformative Features For Perception
by: Balestriero, Randall, et al.
Published: (2024)
Stochastic positional embeddings improve masked image modeling
by: Bar, Amir, et al.
Published: (2023)
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
by: Han, Junlin, et al.
Published: (2025)
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
by: Brown, Ellis, et al.
Published: (2025)
PaintBench: Deterministic Evaluation of Precise Visual Editing
by: Xu, Kai, et al.
Published: (2026)
Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
by: Hu, Yushi, et al.
Published: (2025)
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
by: Kamath, Amita, et al.
Published: (2025)
Temporal Straightening for Latent Planning
by: Wang, Ying, et al.
Published: (2026)
LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
by: Huang, Hai, et al.
Published: (2025)
Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
by: Huang, Hai, et al.
Published: (2026)
Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science
by: Dupoux, Emmanuel, et al.
Published: (2026)