:: Library Catalog

Bibliographic Details
Main Authors:	Dunlap, Lisa; Gonzalez, Joseph E.; Darrell, Trevor; Heilbron, Fabian Caba; Sivic, Josef; Russell, Bryan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2509.08940

Similar Items

EditDuet: A Multi-Agent System for Video Non-Linear Editing
by: Sandoval-Castaneda, Marcelo, et al.
Published: (2025)

ResidualViT for Efficient Temporally Dense Video Encoding
by: Soldan, Mattia, et al.
Published: (2025)

Improving Personalized Search with Regularized Low-Rank Parameter Updates
by: Ryan, Fiona, et al.
Published: (2025)

Generative Timelines for Instructed Visual Assembly
by: Pardo, Alejandro, et al.
Published: (2024)

Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval
by: Cheng, Jiacheng, et al.
Published: (2024)

Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models
by: Kwon, Gihyun, et al.
Published: (2024)

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence
by: Luo, Grace, et al.
Published: (2023)

Sync from the Sea: Retrieving Alignable Videos from Large-Scale Datasets
by: Dave, Ishan Rajendrakumar, et al.
Published: (2024)

Describing Differences in Image Sets with Natural Language
by: Dunlap, Lisa, et al.
Published: (2023)

NewMove: Customizing text-to-video models with novel motions
by: Materzynska, Joanna, et al.
Published: (2023)

VisionArena: 230K Real World User-VLM Conversations with Preference Labels
by: Chou, Christopher, et al.
Published: (2024)

CineVerse: Consistent Keyframe Synthesis for Cinematic Scene Composition
by: Phung, Quynh, et al.
Published: (2025)

FocalPose++: Focal Length and Object Pose Estimation via Render and Compare
by: Cífka, Martin, et al.
Published: (2023)

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
by: Lian, Long, et al.
Published: (2023)

Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior
by: Lin, David Chuan-En, et al.
Published: (2022)

VideoMap: Supporting Video Editing Exploration, Brainstorming, and Prototyping in the Latent Space
by: Lin, David Chuan-En, et al.
Published: (2022)

Scaling Up Video Summarization Pretraining with Large Language Models
by: Argaw, Dawit Mureja, et al.
Published: (2024)

Towards Automated Movie Trailer Generation
by: Argaw, Dawit Mureja, et al.
Published: (2024)

Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2024)

Large-scale Pre-training for Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2025)

Persistent Robot World Models: Stabilizing Multi-Step Rollouts via Reinforcement Learning
by: Bardhan, Jai, et al.
Published: (2026)

Vision-Language Models Create Cross-Modal Task Representations
by: Luo, Grace, et al.
Published: (2024)

Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
by: Wu, Tsung-Han, et al.
Published: (2025)

GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos
by: Souček, Tomáš, et al.
Published: (2023)

Visually Prompted Benchmarks Are Surprisingly Fragile
by: Feng, Haiwen, et al.
Published: (2025)

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
by: Vobecky, Antonin, et al.
Published: (2024)

Video Action Differencing
by: Burgess, James, et al.
Published: (2025)

AlignPose: Generalizable 6D Pose Estimation via Multi-view Feature-metric Alignment
by: Mikeštíková, Anna Šárová, et al.
Published: (2025)

Fast Image-based Neural Relighting with Translucency-Reflection Modeling
by: Zhu, Shizhan, et al.
Published: (2023)

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
by: Souček, Tomáš, et al.
Published: (2024)

Dual-Process Image Generation
by: Luo, Grace, et al.
Published: (2025)

VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
by: Dunlap, Lisa, et al.
Published: (2024)

Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark
by: Wu, Tsung-Han, et al.
Published: (2024)

Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation
by: Vobecky, Antonin, et al.
Published: (2022)

When Do We Not Need Larger Vision Models?
by: Shi, Baifeng, et al.
Published: (2024)

6D Object Pose Tracking in Internet Videos for Robotic Manipulation
by: Ponimatkin, Georgy, et al.
Published: (2025)

PhysPose: Refining 6D Object Poses with Physical Constraints
by: Malenický, Martin, et al.
Published: (2025)

ALOHa: A New Measure for Hallucination in Captioning Models
by: Petryk, Suzanne, et al.
Published: (2024)

Segment Anything without Supervision
by: Wang, XuDong, et al.
Published: (2024)

xT: Nested Tokenization for Larger Context in Large Images
by: Gupta, Ritwik, et al.
Published: (2024)