Bibliographic Details
Main Authors:	Ranasinghe, Kanchana; Shukla, Satya Narayan; Poursaeed, Omid; Ryoo, Michael S.; Lin, Tsung-Yu
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.07449

Similar Items

CoPT: Unsupervised Domain Adaptive Segmentation using Domain-Agnostic Text Embeddings
by: Mata, Cristina, et al.
Published: (2025)

Language Repository for Long Video Understanding
by: Kahatapitiya, Kumara, et al.
Published: (2024)

Understanding Long Videos with Multimodal Language Models
by: Ranasinghe, Kanchana, et al.
Published: (2024)

Pixel Motion Diffusion is What We Need for Robot Control
by: Nguyen, E-Ro, et al.
Published: (2025)

Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA
by: Park, Jongwoo, et al.
Published: (2024)

Pixel Motion as Universal Representation for Robot Control
by: Ranasinghe, Kanchana, et al.
Published: (2025)

Future Optical Flow Prediction Improves Robot Control & Video Generation
by: Ranasinghe, Kanchana, et al.
Published: (2026)

LatentCRF: Continuous CRF for Efficient Latent Diffusion
by: Ranasinghe, Kanchana, et al.
Published: (2024)

Hierarchical Text-to-Vision Self Supervised Alignment for Improved Histopathology Representation Learning
by: Watawana, Hasindri, et al.
Published: (2024)

A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
by: Gupta, Shashank, et al.
Published: (2025)

Robotic VLA Benefits from Joint Learning with Motion Image Diffusion
by: Fang, Yu, et al.
Published: (2025)

Decorum: A Language-Based Approach For Style-Conditioned Synthesis of Indoor 3D Scenes
by: Marshall, Kelly O., et al.
Published: (2025)

Predicting Penalty Kick Direction Using Multi-Modal Deep Learning with Pose-Guided Attention
by: Ranasinghe, Pasindu, et al.
Published: (2025)

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
by: Li, Xiang, et al.
Published: (2024)

Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation
by: De Silva, Ulindu, et al.
Published: (2025)

Learning to See Through a Baby's Eyes: Early Visual Diets Enable Robust Visual Intelligence in Humans and Machines
by: Cai, Yusen, et al.
Published: (2025)

RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images
by: Fatima, Mishal, et al.
Published: (2026)

WaSt-3D: Wasserstein-2 Distance for Scene-to-Scene Stylization on 3D Gaussians
by: Kotovenko, Dmytro, et al.
Published: (2024)

Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
by: Li, Xiang, et al.
Published: (2023)

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
by: Ryoo, Michael S., et al.
Published: (2024)

MambaGlue: Fast and Robust Local Feature Matching With Mamba
by: Ryoo, Kihwan, et al.
Published: (2025)

SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
by: Li, Yian, et al.
Published: (2026)

WLST: Weak Labels Guided Self-training for Weakly-supervised Domain Adaptation on 3D Object Detection
by: Tsou, Tsung-Lin, et al.
Published: (2023)

Seeing Through Smoke: Surgical Desmoking for Improved Visual Perception
by: Lu, Jingpei, et al.
Published: (2026)

Learning GUI Grounding with Spatial Reasoning from Visual Feedback
by: Zhao, Yu, et al.
Published: (2025)

Image Translation with Kernel Prediction Networks for Semantic Segmentation
by: Mata, Cristina, et al.
Published: (2025)

StreamMem: Query-Agnostic KV Cache Memory for Streaming Video Understanding
by: Yang, Yanlai, et al.
Published: (2025)

Visual-Linguistic Agent: Towards Collaborative Contextual Object Reasoning
by: Yang, Jingru, et al.
Published: (2024)

Improving Object Detection via Local-global Contrastive Learning
by: Triantafyllidou, Danai, et al.
Published: (2024)

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs
by: Kancheti, Sai Srinivas, et al.
Published: (2026)

Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback
by: Chen, Yang, et al.
Published: (2025)

Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding
by: Mirjalili, Vahid, et al.
Published: (2025)

CausalSpatial: A Benchmark for Object-Centric Causal Spatial Reasoning
by: Ma, Wenxin, et al.
Published: (2026)

Detect2Interact: Localizing Object Key Field in Visual Question Answering (VQA) with LLMs
by: Wang, Jialou, et al.
Published: (2024)

Improving Open-World Object Localization by Discovering Background
by: Singh, Ashish, et al.
Published: (2025)

CompCap: Improving Multimodal Large Language Models with Composite Captions
by: Chen, Xiaohui, et al.
Published: (2024)

Text-guided Explorable Image Super-resolution
by: Gandikota, Kanchana Vaishnavi, et al.
Published: (2024)

SilVar: Speech Driven Multimodal Model for Reasoning Visual Question Answering and Object Localization
by: Pham, Tan-Hanh, et al.
Published: (2024)

Structured Spatial Reasoning with Open Vocabulary Object Detectors
by: Nejatishahidin, Negar, et al.
Published: (2024)

A Versatile and Differentiable Hand-Object Interaction Representation
by: Morales, Théo, et al.
Published: (2024)