| Main Authors: | Ott, Joachim, Wang, Zuowen, Liu, Shih-Chii |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2406.03439 |
Similar Items
Perception-Consistency Multimodal Large Language Models Reasoning via Caption-Regularized Policy Optimization
by: Tu, Songjun, et al.
Published: (2025)
Analyzing Quality, Bias, and Performance in Text-to-Image Generative Models
by: Masrourisaadat, Nila, et al.
Published: (2024)
OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment
by: Zhao, Weiyi, et al.
Published: (2025)
TensLoRA: Tensor Alternatives for Low-Rank Adaptation
by: Marmoret, Axel, et al.
Published: (2025)
Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
by: Rovai, Fabio
Published: (2026)
WildRoadBench: A Wild Aerial Road-Damage Grounding Benchmark for Vision-Language Models and Autonomous Agents
by: Liu, Bingnan, et al.
Published: (2026)
SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving
by: Bajpai, Ashutosh, et al.
Published: (2026)
Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment
by: Ye, Hua, et al.
Published: (2025)
Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces
by: Gkountouras, John, et al.
Published: (2025)
The MSR-Video to Text Dataset with Clean Annotations
by: Chen, Haoran, et al.
Published: (2021)
A Survey on Vision-Language-Action Models for Embodied AI
by: Ma, Yueen, et al.
Published: (2024)
Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models
by: Wang, Guoyan, et al.
Published: (2025)
HuMoCon: Concept Discovery for Human Motion Understanding
by: Fang, Qihang, et al.
Published: (2025)
Learning the meanings of function words from grounded language using a visual question answering model
by: Portelance, Eva, et al.
Published: (2023)
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning
by: Yang, Shan
Published: (2026)
Hateful Meme Detection through Context-Sensitive Prompting and Fine-Grained Labeling
by: Ouyang, Rongxin, et al.
Published: (2024)
ReSpace: Text-Driven Autoregressive 3D Indoor Scene Synthesis and Editing
by: Bucher, Martin JJ., et al.
Published: (2025)
Unpacking Hateful Memes: Presupposed Context and False Claims
by: Cai, Weibin, et al.
Published: (2025)
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
by: Meng, Ziyang, et al.
Published: (2024)
VA-$π$: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
by: Liao, Xinyao, et al.
Published: (2025)
CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models
by: Yao, Dongyu, et al.
Published: (2024)
Survey Transfer Learning: Recycling Data with Silicon Responses
by: Amini, Ali
Published: (2025)
GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)
Training a Student Expert via Semi-Supervised Foundation Model Distillation
by: Taghavi, Pardis, et al.
Published: (2026)
Spatially Optimized Compact Deep Metric Learning Model for Similarity Search
by: Islam, Md. Farhadul, et al.
Published: (2024)
Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays
by: Ko, Hanbin, et al.
Published: (2025)
Semantically Guided Adversarial Testing of Vision Models Using Language Models
by: Filus, Katarzyna, et al.
Published: (2025)
Training for X-Ray Vision: Amodal Segmentation, Amodal Content Completion, and View-Invariant Object Representation from Multi-Camera Video
by: Moore, Alexander, et al.
Published: (2025)
Waking Up Blind: Cold-Start Optimization of Supervision-Free Agentic Trajectories for Grounded Visual Perception
by: Bajpai, Ashutosh, et al.
Published: (2026)
CulinaryCut-VLAP: A Vision-Language-Action-Physics Framework for Food Cutting via a Force-Aware Material Point Method
by: Koh, Hyunseo, et al.
Published: (2026)
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
by: Koukounas, Andreas, et al.
Published: (2024)
A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection
by: Malikussaid, et al.
Published: (2026)
Autoassociative Learning of Structural Representations for Modeling and Classification in Medical Imaging
by: Buchnajzer, Zuzanna, et al.
Published: (2024)
ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena
by: Parcalabescu, Letitia, et al.
Published: (2021)
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
by: Parcalabescu, Letitia, et al.
Published: (2022)
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
by: Mercat, Jean, et al.
Published: (2026)
eStonefish-Scenes: A Sim-to-Real Validated and Robot-Centric Event-based Optical Flow Dataset for Underwater Vehicles
by: Mansour, Jad, et al.
Published: (2025)
Optimized Gradient Clipping for Noisy Label Learning
by: Ye, Xichen, et al.
Published: (2024)
Biomedical Visual Instruction Tuning with Clinician Preference Alignment
by: Cui, Hejie, et al.
Published: (2024)