:: Library Catalog

Bibliographic Details
Main Authors:	Chen, Shizhe; Garcia, Ricardo; Laptev, Ivan; Schmid, Cordelia
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.01491

Similar Items

Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
by: Garcia, Ricardo, et al.
Published: (2024)

ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos
by: Chen, Zerui, et al.
Published: (2024)

Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation
by: Pacaud, Paul, et al.
Published: (2025)

Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
by: Chen, Shizhe, et al.
Published: (2025)

ComposeAnything: Composite Object Priors for Text-to-Image Generation
by: Khan, Zeeshan, et al.
Published: (2025)

Online 3D Scene Reconstruction Using Neural Object Priors
by: Chabal, Thomas, et al.
Published: (2025)

PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
by: Chen, Shizhe, et al.
Published: (2026)

Large-scale Pre-training for Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2025)

FOM-Nav: Frontier-Object Maps for Object Goal Navigation
by: Chabal, Thomas, et al.
Published: (2025)

HORT: Monocular Hand-held Objects Reconstruction with Transformers
by: Chen, Zerui, et al.
Published: (2025)

HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching
by: Chen, Zerui, et al.
Published: (2026)

SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
by: Ye, Qilang, et al.
Published: (2025)

4D Visual Pre-training for Robot Learning
by: Hou, Chengkai, et al.
Published: (2025)

A Generative Approach for Wikipedia-Scale Visual Entity Recognition
by: Caron, Mathilde, et al.
Published: (2024)

Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
by: Caron, Mathilde, et al.
Published: (2024)

BrickNet: Graph-Backed Generative Brick Assembly
by: Kulits, Peter, et al.
Published: (2026)

Learning text-to-video retrieval from image captioning
by: Ventura, Lucas, et al.
Published: (2024)

Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2024)

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)

Time-, Memory- and Parameter-Efficient Visual Adaptation
by: Mercea, Otniel-Bogdan, et al.
Published: (2024)

ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
by: Boudjoghra, Mohamed El Amine, et al.
Published: (2025)

Dense Video Object Captioning from Disjoint Supervision
by: Zhou, Xingyi, et al.
Published: (2023)

Dense Optical Tracking: Connecting the Dots
by: Le Moing, Guillaume, et al.
Published: (2023)

GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving
by: Xu, Shaoqing, et al.
Published: (2024)

Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
by: Ventura, Lucas, et al.
Published: (2025)

OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
by: Wysoczańska, Monika, et al.
Published: (2025)

CoVR-2: Automatic Data Construction for Composed Video Retrieval
by: Ventura, Lucas, et al.
Published: (2023)

Retrieval-Enhanced Contrastive Vision-Text Models
by: Iscen, Ahmet, et al.
Published: (2023)

AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars
by: Fazylov, Ramazan, et al.
Published: (2025)

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations
by: Yan, Xiangchao, et al.
Published: (2023)

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
by: Gao, Yipeng, et al.
Published: (2023)

Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
by: Zhou, Jiaming, et al.
Published: (2024)

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
by: Huang, Ian, et al.
Published: (2025)

InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
by: Dwivedi, Sai Kumar, et al.
Published: (2025)

Learning Correlation Structures for Vision Transformers
by: Kim, Manjin, et al.
Published: (2024)

LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance
by: Kim, Jae Myung, et al.
Published: (2025)

Pre-trained Visual Dynamics Representations for Efficient Policy Learning
by: Luo, Hao, et al.
Published: (2024)

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
by: Fiastre, Gabriel, et al.
Published: (2025)

Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
by: Albastaki, Shahad, et al.
Published: (2025)

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
by: Souček, Tomáš, et al.
Published: (2024)