| Main Authors: | Li, Jialu, Padmakumar, Aishwarya, Sukhatme, Gaurav, Bansal, Mohit |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2402.03561 |
Similar Items
DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
by: Wang, Zun, et al.
Published: (2024)
Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
by: Li, Jialu, et al.
Published: (2025)
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
by: Yu, Shoubin, et al.
Published: (2024)
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
by: Lee, Daeun, et al.
Published: (2025)
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
by: Wang, Zun, et al.
Published: (2024)
RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
by: Yoon, Jaehong, et al.
Published: (2024)
Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
by: Lee, Daeun, et al.
Published: (2024)
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)
VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
by: Lin, Han, et al.
Published: (2023)
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)
Error-Driven Scene Editing for 3D Grounding in Large Language Models
by: Zhang, Yue, et al.
Published: (2025)
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
by: Pothiraj, Atin, et al.
Published: (2025)
Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)
EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
by: Wang, Zun, et al.
Published: (2025)
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
by: Prasad, Archiki, et al.
Published: (2023)
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)
AgriVLN: Vision-and-Language Navigation for Agricultural Robots
by: Zhao, Xiaobei, et al.
Published: (2025)
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
by: Sung, Yi-Lin, et al.
Published: (2023)
VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
by: Yu, Shoubin, et al.
Published: (2025)
Planning with Sketch-Guided Verification for Physics-Aware Video Generation
by: Huang, Yidong, et al.
Published: (2025)
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
by: Wan, David, et al.
Published: (2024)
UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation
by: Dai, Guangzhao, et al.
Published: (2024)
TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)
Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
by: Maharana, Adyasha, et al.
Published: (2023)
Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
by: Yu, Shoubin, et al.
Published: (2026)
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
by: Lee, Daeun, et al.
Published: (2026)
GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
by: Yang, Jiahao, et al.
Published: (2026)
Zero-Shot Generalization of Vision-Based RL Without Data Augmentation
by: Batra, Sumeet, et al.
Published: (2024)
RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
by: Niu, Tianyi, et al.
Published: (2025)
StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
by: Lee, Daeun, et al.
Published: (2025)
DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)
EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)
Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
by: Zhang, Yue, et al.
Published: (2024)
See It from My Perspective: How Language Affects Cultural Bias in Image Understanding
by: Ananthram, Amith, et al.
Published: (2024)
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
by: Lin, Han, et al.
Published: (2025)
GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
by: Nguyen, Duy, et al.
Published: (2025)
Unbounded: A Generative Infinite Game of Character Life Simulation
by: Li, Jialu, et al.
Published: (2024)
CRAFT: Video Diffusion for Bimanual Robot Data Generation
by: Chen, Jason, et al.
Published: (2026)
Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation
by: Wang, Zehao, et al.
Published: (2024)