:: Library Catalog

Bibliographic Details
Main Authors:	Ott, Joachim; Wang, Zuowen; Liu, Shih-Chii
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; 68T99; I.2.6; I.2.7; I.2.10
Online Access:	https://arxiv.org/abs/2406.03439

Similar Items

Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
by: Tu, Songjun, et al.
Published: (2025)

Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
by: Masrourisaadat, Nila, et al.
Published: (2024)

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment
by: Zhao, Weiyi, et al.
Published: (2025)

TensLoRA: Tensor Alternatives for Low-Rank Adaptation
by: Marmoret, Axel, et al.
Published: (2025)

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
by: Rovai, Fabio
Published: (2026)

WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
by: Liu, Bingnan, et al.
Published: (2026)

SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving
by: Bajpai, Ashutosh, et al.
Published: (2026)

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
by: Ye, Hua, et al.
Published: (2025)

Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
by: Gkountouras, John, et al.
Published: (2025)

The MSR-Video to Text Dataset with Clean Annotations
by: Chen, Haoran, et al.
Published: (2021)

A Survey on Vision-Language-Action Models for Embodied AI
by: Ma, Yueen, et al.
Published: (2024)

Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models
by: Wang, Guoyan, et al.
Published: (2025)

HuMoCon: Concept Discovery for Human Motion Understanding
by: Fang, Qihang, et al.
Published: (2025)

Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)

Hateful Meme Detection through Context-Sensitive Prompting and Fine-Grained Labeling
by: Ouyang, Rongxin, et al.
Published: (2024)

ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)

Unpacking Hateful Memes: Presupposed Context and False Claims
by: Cai, Weibin, et al.
Published: (2025)

VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
by: Meng, Ziyang, et al.
Published: (2024)

VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
by: Liao, Xinyao, et al.
Published: (2025)

CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models
by: Yao, Dongyu, et al.
Published: (2024)

Survey Transfer Learning: Recycling Data with Silicon Responses
by: Amini, Ali
Published: (2025)

GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

Training a Student Expert via Semi-Supervised Foundation Model Distillation
by: Taghavi, Pardis, et al.
Published: (2026)

Spatially Optimized Compact Deep Metric Learning Model for Similarity Search
by: Islam, Md. Farhadul, et al.
Published: (2024)

Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays
by: Ko, Hanbin, et al.
Published: (2025)

Semantically Guided Adversarial Testing of Vision Models Using Language Models
by: Filus, Katarzyna, et al.
Published: (2025)

Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video
by: Moore, Alexander, et al.
Published: (2025)

Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
by: Bajpai, Ashutosh, et al.
Published: (2026)

CulinaryCut-VLAP: A Vision-Language-Action-Physics Framework for Food Cutting via a Force-Aware Material Point Method
by: Koh, Hyunseo, et al.
Published: (2026)

jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
by: Koukounas, Andreas, et al.
Published: (2024)

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
by: Malikussaid, et al.
Published: (2026)

Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging
by: Buchnajzer, Zuzanna, et al.
Published: (2024)

ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
by: Parcalabescu, Letitia, et al.
Published: (2021)

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
by: Parcalabescu, Letitia, et al.
Published: (2022)

VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
by: Mercat, Jean, et al.
Published: (2026)

eStonefish-Scenes: A Sim-to-Real Validated and Robot-Centric Event-based Optical Flow Dataset for Underwater Vehicles
by: Mansour, Jad, et al.
Published: (2025)

Optimized Gradient Clipping for Noisy Label Learning
by: Ye, Xichen, et al.
Published: (2024)

Biomedical Visual Instruction Tuning with Clinician Preference Alignment
by: Cui, Hejie, et al.
Published: (2024)