| Main Authors: | Dunlap, Lisa, Gonzalez, Joseph E., Darrell, Trevor, Heilbron, Fabian Caba, Sivic, Josef, Russell, Bryan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2509.08940 |
Similar Items
EditDuet: A Multi-Agent System for Video Non-Linear Editing
by: Sandoval-Castaneda, Marcelo, et al.
Published: (2025)
ResidualViT for Efficient Temporally Dense Video Encoding
by: Soldan, Mattia, et al.
Published: (2025)
Improving Personalized Search with Regularized Low-Rank Parameter Updates
by: Ryan, Fiona, et al.
Published: (2025)
Generative Timelines for Instructed Visual Assembly
by: Pardo, Alejandro, et al.
Published: (2024)
Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
by: Cheng, Jiacheng, et al.
Published: (2024)
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
by: Kwon, Gihyun, et al.
Published: (2024)
Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
by: Luo, Grace, et al.
Published: (2023)
Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
by: Dave, Ishan Rajendrakumar, et al.
Published: (2024)
Describing Differences in Image Sets with Natural Language
by: Dunlap, Lisa, et al.
Published: (2023)
NewMove: Customizing text-to-video models with novel motions
by: Materzynska, Joanna, et al.
Published: (2023)
VisionArena: 230K Real World User-VLM Conversations with Preference Labels
by: Chou, Christopher, et al.
Published: (2024)
CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition
by: Phung, Quynh, et al.
Published: (2025)
FocalPose++: Focal Length and Object Pose Estimation via Render and Compare
by: Cífka, Martin, et al.
Published: (2023)
LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
by: Lian, Long, et al.
Published: (2023)
Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior
by: Lin, David Chuan-En, et al.
Published: (2022)
VideoMap: Supporting Video Editing Exploration, Brainstorming, and Prototyping in the Latent Space
by: Lin, David Chuan-En, et al.
Published: (2022)
Scaling Up Video Summarization Pretraining with Large Language Models
by: Argaw, Dawit Mureja, et al.
Published: (2024)
Towards Automated Movie Trailer Generation
by: Argaw, Dawit Mureja, et al.
Published: (2024)
Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2024)
Large-scale Pre-training for Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2025)
Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
by: Bardhan, Jai, et al.
Published: (2026)
Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
by: Wu, Tsung-Han, et al.
Published: (2025)
GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
by: Souček, Tomáš, et al.
Published: (2023)
Visually Prompted Benchmarks Are Surprisingly Fragile
by: Feng, Haiwen, et al.
Published: (2025)
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
by: Vobecky, Antonin, et al.
Published: (2024)
Video Action Differencing
by: Burgess, James, et al.
Published: (2025)
AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
by: Mikeštíková, Anna Šárová, et al.
Published: (2025)
Fast Image-based Neural Relighting with Translucency-Reflection Modeling
by: Zhu, Shizhan, et al.
Published: (2023)
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
by: Souček, Tomáš, et al.
Published: (2024)
Dual-Process Image Generation
by: Luo, Grace, et al.
Published: (2025)
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
by: Dunlap, Lisa, et al.
Published: (2024)
Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
by: Wu, Tsung-Han, et al.
Published: (2024)
Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation
by: Vobecky, Antonin, et al.
Published: (2022)
When Do We Not Need Larger Vision Models?
by: Shi, Baifeng, et al.
Published: (2024)
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
by: Ponimatkin, Georgy, et al.
Published: (2025)
PhysPose: Refining 6D Object Poses with Physical Constraints
by: Malenický, Martin, et al.
Published: (2025)
ALOHa: A New Measure for Hallucination in Captioning Models
by: Petryk, Suzanne, et al.
Published: (2024)
Segment Anything without Supervision
by: Wang, XuDong, et al.
Published: (2024)
xT: Nested Tokenization for Larger Context in Large Images
by: Gupta, Ritwik, et al.
Published: (2024)