| Main Authors: | Chen, Shizhe, Garcia, Ricardo, Laptev, Ivan, Schmid, Cordelia |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.01491 |
Similar Items
Towards Generalizable Vision-Language Robotic Manipulation: A Benchmark and LLM-guided 3D Policy
by: Garcia, Ricardo, et al.
Published: (2024)
ViViDex: Learning Vision-based Dexterous Manipulation from Human Videos
by: Chen, Zerui, et al.
Published: (2024)
Scaling Cross-Environment Failure Reasoning Data for Vision-Language Robotic Manipulation
by: Pacaud, Paul, et al.
Published: (2025)
Gondola: Grounded Vision Language Planning for Generalizable Robotic Manipulation
by: Chen, Shizhe, et al.
Published: (2025)
ComposeAnything: Composite Object Priors for Text-to-Image Generation
by: Khan, Zeeshan, et al.
Published: (2025)
Online 3D Scene Reconstruction Using Neural Object Priors
by: Chabal, Thomas, et al.
Published: (2025)
PointACT: Vision-Language-Action Models with Multi-Scale Point-Action Interaction
by: Chen, Shizhe, et al.
Published: (2026)
Large-scale Pre-training for Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2025)
FOM-Nav: Frontier-Object Maps for Object Goal Navigation
by: Chabal, Thomas, et al.
Published: (2025)
HORT: Monocular Hand-held Objects Reconstruction with Transformers
by: Chen, Zerui, et al.
Published: (2025)
HO-Flow: Generalizable Hand-Object Interaction Generation with Latent Flow Matching
by: Chen, Zerui, et al.
Published: (2026)
SUGAR: Learning Skeleton Representation with Visual-Motion Knowledge for Action Recognition
by: Ye, Qilang, et al.
Published: (2025)
4D Visual Pre-training for Robot Learning
by: Hou, Chengkai, et al.
Published: (2025)
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
by: Caron, Mathilde, et al.
Published: (2024)
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
by: Caron, Mathilde, et al.
Published: (2024)
BrickNet: Graph-Backed Generative Brick Assembly
by: Kulits, Peter, et al.
Published: (2026)
Learning text-to-video retrieval from image captioning
by: Ventura, Lucas, et al.
Published: (2024)
Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2024)
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
by: Bousselham, Walid, et al.
Published: (2025)
Time-, Memory- and Parameter-Efficient Visual Adaptation
by: Mercea, Otniel-Bogdan, et al.
Published: (2024)
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing
by: Boudjoghra, Mohamed el amine, et al.
Published: (2025)
Dense Video Object Captioning from Disjoint Supervision
by: Zhou, Xingyi, et al.
Published: (2023)
Dense Optical Tracking: Connecting the Dots
by: Moing, Guillaume Le, et al.
Published: (2023)
GaussianPretrain: A Simple Unified 3D Gaussian Representation for Visual Pre-training in Autonomous Driving
by: Xu, Shaoqing, et al.
Published: (2024)
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
by: Ventura, Lucas, et al.
Published: (2025)
OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
by: Wysoczańska, Monika, et al.
Published: (2025)
CoVR-2: Automatic Data Construction for Composed Video Retrieval
by: Ventura, Lucas, et al.
Published: (2023)
Retrieval-Enhanced Contrastive Vision-Text Models
by: Iscen, Ahmet, et al.
Published: (2023)
AGORA: Adversarial Generation Of Real-time Animatable 3D Gaussian Head Avatars
by: Fazylov, Ramazan, et al.
Published: (2025)
SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations
by: Yan, Xiangchao, et al.
Published: (2023)
Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training
by: Gao, Yipeng, et al.
Published: (2023)
Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
by: Zhou, Jiaming, et al.
Published: (2024)
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
by: Huang, Ian, et al.
Published: (2025)
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
by: Dwivedi, Sai Kumar, et al.
Published: (2025)
Learning Correlation Structures for Vision Transformers
by: Kim, Manjin, et al.
Published: (2024)
LoFT: LoRA-fused Training Dataset Generation with Few-shot Guidance
by: Kim, Jae Myung, et al.
Published: (2025)
Pre-trained Visual Dynamics Representations for Efficient Policy Learning
by: Luo, Hao, et al.
Published: (2024)
CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects
by: Fiastre, Gabriel, et al.
Published: (2025)
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation
by: Albastaki, Shahad, et al.
Published: (2025)
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions
by: Souček, Tomáš, et al.
Published: (2024)