:: Library Catalog

Bibliographic Details
Main Authors:	Tong, Shengbang; Fan, David; Nguyen, John; Brown, Ellis; Zhou, Gaoyue; Qian, Shengyi; Zheng, Boyang; Vallaeys, Théophane; Han, Junlin; Fergus, Rob; Murray, Naila; Ghazvininejad, Marjan; Lewis, Mike; Ballas, Nicolas; Bar, Amir; Rabbat, Michael; Verbeek, Jakob; Zettlemoyer, Luke; Sinha, Koustuv; LeCun, Yann; Xie, Saining
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.03276

Similar Items

Scaling Language-Free Visual Representation Learning
by: Fan, David, et al.
Published: (2025)

Gaussian Embeddings: How JEPAs Secretly Learn Your Data Density
by: Balestriero, Randall, et al.
Published: (2025)

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
by: Tong, Shengbang, et al.
Published: (2024)

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders
by: Tong, Shengbang, et al.
Published: (2026)

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
by: Mur-Labadia, Lorenzo, et al.
Published: (2026)

SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
by: Vallaeys, Théophane, et al.
Published: (2025)

Qinco2: Vector Compression and Search with Improved Implicit Neural Codebooks
by: Vallaeys, Théophane, et al.
Published: (2025)

Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
by: Vallaeys, Théophane, et al.
Published: (2024)

Learning Latent Action World Models In The Wild
by: Garrido, Quentin, et al.
Published: (2026)

Navigation World Models
by: Bar, Amir, et al.
Published: (2024)

Parallel Stochastic Gradient-Based Planning for World Models
by: Psenka, Michael, et al.
Published: (2026)

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)

Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models
by: Yasunaga, Michihiro, et al.
Published: (2025)

VUGEN: Visual Understanding priors for GENeration
by: Chen, Xiangyi, et al.
Published: (2025)

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
by: Zhou, Gaoyue, et al.
Published: (2024)

A Lightweight Library for Energy-Based Joint-Embedding Predictive Architectures
by: Terver, Basile, et al.
Published: (2026)

World Models for Learning Dexterous Hand-Object Interactions from Human Videos
by: Goswami, Raktim Gautam, et al.
Published: (2025)

Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation
by: Denton, Remi, et al.
Published: (2014)

Revisiting Feature Prediction for Learning Visual Representations from Video
by: Bardes, Adrien, et al.
Published: (2024)

Intuitive physics understanding emerges from self-supervised pretraining on natural videos
by: Garrido, Quentin, et al.
Published: (2025)

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
by: Tong, Shengbang, et al.
Published: (2024)

Cambrian-S: Towards Spatial Supersensing in Video
by: Yang, Shusheng, et al.
Published: (2025)

Diffusion Transformers with Representation Autoencoders
by: Zheng, Boyang, et al.
Published: (2025)

Benchmark Designers Should "Train on the Test Set" to Expose Exploitable Non-Visual Shortcuts
by: Brown, Ellis, et al.
Published: (2025)

Learning and Leveraging World Models in Visual Representation Learning
by: Garrido, Quentin, et al.
Published: (2024)

Fast and Exact Enumeration of Deep Networks Partitions Regions
by: Balestriero, Randall, et al.
Published: (2024)

Introduction to Latent Variable Energy-Based Models: A Path Towards Autonomous Machine Intelligence
by: Dawid, Anna, et al.
Published: (2023)

LeJEPA: Provable and Scalable Self-Supervised Learning Without the Heuristics
by: Balestriero, Randall, et al.
Published: (2025)

Learning by Reconstruction Produces Uninformative Features For Perception
by: Balestriero, Randall, et al.
Published: (2024)

Stochastic positional embeddings improve masked image modeling
by: Bar, Amir, et al.
Published: (2023)

Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
by: Han, Junlin, et al.
Published: (2025)

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning
by: Zhai, Yuexiang, et al.
Published: (2024)

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
by: Brown, Ellis, et al.
Published: (2025)

PaintBench: Deterministic Evaluation of Precise Visual Editing
by: Xu, Kai, et al.
Published: (2026)

Multimodal RewardBench 2: Evaluating Omni Reward Models for Interleaved Text and Image
by: Hu, Yushi, et al.
Published: (2025)

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
by: Kamath, Amita, et al.
Published: (2025)

Temporal Straightening for Latent Planning
by: Wang, Ying, et al.
Published: (2026)

LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures
by: Huang, Hai, et al.
Published: (2025)

Semantic Tube Prediction: Beating LLM Data Efficiency with JEPA
by: Huang, Hai, et al.
Published: (2026)

Why AI systems don't learn and what to do about it: Lessons on autonomous learning from cognitive science
by: Dupoux, Emmanuel, et al.
Published: (2026)