| Main Authors: | Ranasinghe, Kanchana, Shukla, Satya Narayan, Poursaeed, Omid, Ryoo, Michael S., Lin, Tsung-Yu |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.07449 |
Similar Items
CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
by: Mata, Cristina, et al.
Published: (2025)
Language Repository for Long Video Understanding
by: Kahatapitiya, Kumara, et al.
Published: (2024)
Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)
Pixel Motion Diffusion is What We Need for Robot Control
by: Nguyen, E-Ro, et al.
Published: (2025)
Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
by: Park, Jongwoo, et al.
Published: (2024)
Pixel Motion as Universal Representation for Robot Control
by: Ranasinghe, Kanchana, et al.
Published: (2025)
Future Optical Flow Prediction Improves Robot Control & Video Generation
by: Ranasinghe, Kanchana, et al.
Published: (2026)
LatentCRF: Continuous CRF for Efficient Latent Diffusion
by: Ranasinghe, Kanchana, et al.
Published: (2024)
Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning
by: Watawana, Hasindri, et al.
Published: (2024)
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
by: Gupta, Shashank, et al.
Published: (2025)
Robotic VLA Benefits from Joint Learning with Motion Image Diffusion
by: Fang, Yu, et al.
Published: (2025)
Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes
by: Marshall, Kelly O., et al.
Published: (2025)
Predicting Penalty Kick Direction Using Multi-Modal Deep Learning with Pose-Guided Attention
by: Ranasinghe, Pasindu, et al.
Published: (2025)
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)
Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation
by: De Silva, Ulindu, et al.
Published: (2025)
Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
by: Cai, Yusen, et al.
Published: (2025)
RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images
by: Fatima, Mishal, et al.
Published: (2026)
WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
by: Kotovenko, Dmytro, et al.
Published: (2024)
Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
by: Li, Xiang, et al.
Published: (2023)
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
by: Ryoo, Michael S., et al.
Published: (2024)
MambaGlue: Fast and Robust Local Feature Matching With Mamba
by: Ryoo, Kihwan, et al.
Published: (2025)
SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
by: Li, Yian, et al.
Published: (2026)
WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection
by: Tsou, Tsung-Lin, et al.
Published: (2023)
Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception
by: Lu, Jingpei, et al.
Published: (2026)
Learning GUI Grounding with Spatial Reasoning from Visual Feedback
by: Zhao, Yu, et al.
Published: (2025)
Image Translation with Kernel Prediction Networks for Semantic Segmentation
by: Mata, Cristina, et al.
Published: (2025)
StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
by: Yang, Yanlai, et al.
Published: (2025)
Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning
by: Yang, Jingru, et al.
Published: (2024)
Improving Object Detection via Local-global Contrastive Learning
by: Triantafyllidou, Danai, et al.
Published: (2024)
Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
by: Kancheti, Sai Srinivas, et al.
Published: (2026)
Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
by: Chen, Yang, et al.
Published: (2025)
Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding
by: Mirjalili, Vahid, et al.
Published: (2025)
CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
by: Ma, Wenxin, et al.
Published: (2026)
Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
by: Wang, Jialou, et al.
Published: (2024)
Improving Open-World Object Localization by Discovering Background
by: Singh, Ashish, et al.
Published: (2025)
CompCap: Improving Multimodal Large Language Models with Composite Captions
by: Chen, Xiaohui, et al.
Published: (2024)
Text-guided Explorable Image Super-resolution
by: Gandikota, Kanchana Vaishnavi, et al.
Published: (2024)
SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
by: Pham, Tan-Hanh, et al.
Published: (2024)
Structured Spatial Reasoning with Open Vocabulary Object Detectors
by: Nejatishahidin, Negar, et al.
Published: (2024)
A Versatile and Differentiable Hand-Object Interaction Representation
by: Morales, Théo, et al.
Published: (2024)