Library Catalog

Bibliographic Details
Main Authors:	Li, Jialu; Padmakumar, Aishwarya; Sukhatme, Gaurav; Bansal, Mohit
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2402.03561

Similar Items

DreamRunner: Fine-Grained Compositional Story-to-Video Generation with Retrieval-Augmented Motion Adaptation
by: Wang, Zun, et al.
Published: (2024)

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization
by: Li, Jialu, et al.
Published: (2025)

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
by: Yu, Shoubin, et al.
Published: (2024)

Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
by: Lee, Daeun, et al.
Published: (2025)

Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
by: Wang, Zun, et al.
Published: (2024)

RACCooN: A Versatile Instructional Video Editing Framework with Auto-Generated Narratives
by: Yoon, Jaehong, et al.
Published: (2024)

Self-Correcting Text-to-Video Generation with Misalignment Detection and Localized Refinement
by: Lee, Daeun, et al.
Published: (2024)

VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
by: Wang, Ziyang, et al.
Published: (2024)

VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning
by: Lin, Han, et al.
Published: (2023)

Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning
by: Wang, Ziyang, et al.
Published: (2025)

Error-Driven Scene Editing for 3D Grounding in Large Language Models
by: Zhang, Yue, et al.
Published: (2025)

CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
by: Pothiraj, Atin, et al.
Published: (2025)

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding
by: Wang, Ziyang, et al.
Published: (2025)

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
by: Wang, Zun, et al.
Published: (2025)

Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
by: Prasad, Archiki, et al.
Published: (2023)

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
by: Li, Jialu, et al.
Published: (2024)

AgriVLN: Vision-and-Language Navigation for Agricultural Robots
by: Zhao, Xiaobei, et al.
Published: (2025)

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
by: Sung, Yi-Lin, et al.
Published: (2023)

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation
by: Yu, Shoubin, et al.
Published: (2025)

Planning with Sketch-Guided Verification for Physics-Aware Video Generation
by: Huang, Yidong, et al.
Published: (2025)

Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
by: Wan, David, et al.
Published: (2024)

UnitedVLN: Generalizable Gaussian Splatting for Continuous Vision-Language Navigation
by: Dai, Guangzhao, et al.
Published: (2024)

TimeRefine: Temporal Grounding with Time Refining Video LLM
by: Wang, Xizi, et al.
Published: (2024)

Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models
by: Maharana, Adyasha, et al.
Published: (2023)

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos
by: Yu, Shoubin, et al.
Published: (2026)

VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
by: Lee, Daeun, et al.
Published: (2026)

GA-VLN: Geometry-Aware BEV Representation for Efficient Vision-Language Navigation
by: Yang, Jiahao, et al.
Published: (2026)

Zero-Shot Generalization of Vision-Based RL Without Data Augmentation
by: Batra, Sumeet, et al.
Published: (2024)

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
by: Niu, Tianyi, et al.
Published: (2025)

StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos
by: Lee, Daeun, et al.
Published: (2025)

DAM: Dynamic Adapter Merging for Continual Video QA Learning
by: Cheng, Feng, et al.
Published: (2024)

EgoMemReason: A Memory-Driven Reasoning Benchmark for Long-Horizon Egocentric Video Understanding
by: Wang, Ziyang, et al.
Published: (2026)

Vision-and-Language Navigation Today and Tomorrow: A Survey in the Era of Foundation Models
by: Zhang, Yue, et al.
Published: (2024)

See It from My Perspective: How Language Affects Cultural Bias in Image Understanding
by: Ananthram, Amith, et al.
Published: (2024)

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts
by: Zhou, Gengze, et al.
Published: (2024)

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
by: Lin, Han, et al.
Published: (2025)

GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
by: Nguyen, Duy, et al.
Published: (2025)

Unbounded: A Generative Infinite Game of Character Life Simulation
by: Li, Jialu, et al.
Published: (2024)

CRAFT: Video Diffusion for Bimanual Robot Data Generation
by: Chen, Jason, et al.
Published: (2026)

Navigating the Nuances: A Fine-grained Evaluation of Vision-Language Navigation
by: Wang, Zehao, et al.
Published: (2024)