:: Library Catalog

Bibliographic Details
Main Author:	Shihata, Yusuf
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; I.4; I.2
Online Access:	https://arxiv.org/abs/2507.02985

Similar Items

On the Limitations of Vision-Language Models in Understanding Image Transforms
by: Anis, Ahmad Mustafa, et al.
Published: (2025)

Transfer-learning for video classification: Video Swin Transformer on multiple domains
by: Oliveira, Daniel A. P., et al.
Published: (2022)

Detection Transformers Under the Knife: A Neuroscience-Inspired Approach to Ablations
by: Hütten, Nils, et al.
Published: (2025)

From Rule-Based Models to Deep Learning Transformers Architectures for Natural Language Processing and Sign Language Translation Systems: Survey, Taxonomy and Performance Evaluation
by: Shahin, Nada, et al.
Published: (2024)

ADAT: Time-Series-Aware Adaptive Transformer Architecture for Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

Multimodal Approaches for Visually-Rich Document Type Classification: A Comparative Analysis
by: Heyne, Catyana, et al.
Published: (2026)

GLoT: A Novel Gated-Logarithmic Transformer for Efficient Sign Language Translation
by: Shahin, Nada, et al.
Published: (2025)

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't
by: Nemitz, Jonathan, et al.
Published: (2026)

WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
by: Niu, Yuwei, et al.
Published: (2025)

VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers
by: Deng, Juncan, et al.
Published: (2024)

Multimodal Ensemble with Conditional Feature Fusion for Dysgraphia Diagnosis in Children from Handwriting Samples
by: Kunhoth, Jayakanth, et al.
Published: (2024)

Using Deep Learning to Generate Semantically Correct Hindi Captions
by: Khan, Wasim Akram, et al.
Published: (2026)

LLM-Guided Exemplar Selection for Few-Shot Wearable-Sensor Human Activity Recognition
by: Ronando, Elsen, et al.
Published: (2025)

myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition
by: Thu, Ye Kyaw, et al.
Published: (2026)

The Influence of Iconicity in Transfer Learning for Sign Language Recognition
by: Artiaga, Keren, et al.
Published: (2026)

RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
by: Agarwal, Amit, et al.
Published: (2025)

Intrinsic Image Fusion for Multi-View 3D Material Reconstruction
by: Kocsis, Peter, et al.
Published: (2025)

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models
by: Rudman, William, et al.
Published: (2026)

PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
by: Patel, Hitesh Laxmichand, et al.
Published: (2025)

Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing
by: Oskooei, Amirkia Rafiei, et al.
Published: (2025)

MPCC: A Novel Benchmark for Multimodal Planning with Complex Constraints in Multimodal Large Language Models
by: Ji, Yiyan, et al.
Published: (2025)

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
by: Iyengar, Anirudh Iyengar Kaniyar Narayana, et al.
Published: (2025)

Enhancing Sports Strategy with Video Analytics and Data Mining: Assessing the effectiveness of Multimodal LLMs in tennis video analysis
by: Teo, Charlton
Published: (2025)

Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
by: Ji, Binbin, et al.
Published: (2025)

Adapting Multimodal Foundation Models for Few-Shot Learning: A Comprehensive Study on Contrastive Captioners
by: Narasinghe, N. K. B. M. P. K. B., et al.
Published: (2025)

RPCASSM: Robust PCA State Space Model For Infrared Small Target Detection
by: Liu, Pingping, et al.
Published: (2026)

THIRDEYE: Cue-Aware Monocular Depth Estimation via Brain-Inspired Multi-Stage Fusion
by: Ioan, Calin Teodor
Published: (2025)

A Two-stage Transformer Framework for Temporal Localization of Distracted Driver Behaviors
by: Doan, Gia-Bao, et al.
Published: (2026)

Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion
by: Dell'Erba, Samuele, et al.
Published: (2025)

Obtaining Favorable Layouts for Multiple Object Generation
by: Battash, Barak, et al.
Published: (2024)

Fruit Classification System with Deep Learning and Neural Architecture Search
by: Dewi, Christine, et al.
Published: (2024)

SynCo: Synthetic Hard Negatives for Contrastive Visual Representation Learning
by: Giakoumoglou, Nikolaos, et al.
Published: (2024)

Long Tail Image Generation Through Feature Space Augmentation and Iterated Learning
by: Elberg, Rafael, et al.
Published: (2024)

From Latent to Engine Manifolds: Analyzing ImageBind's Multimodal Embedding Space
by: Hamara, Andrew, et al.
Published: (2024)

Scaling Large Vision-Language Models for Enhanced Multimodal Comprehension In Biomedical Image Analysis
by: Umeike, Robinson, et al.
Published: (2025)

TV-TREES: Multimodal Entailment Trees for Neuro-Symbolic Video Reasoning
by: Sanders, Kate, et al.
Published: (2024)

HATL: Hierarchical Adaptive-Transfer Learning Framework for Sign Language Machine Translation
by: Shahin, Nada, et al.
Published: (2026)

NOCTIS: Novel Object Cyclic Threshold based Instance Segmentation
by: Gandyra, Max, et al.
Published: (2025)

SoccerNet-v3D: Leveraging Sports Broadcast Replays for 3D Scene Understanding
by: Gutiérrez-Pérez, Marc, et al.
Published: (2025)

Methods and strategies for improving the novel view synthesis quality of neural radiation field
by: Fang, Shun, et al.
Published: (2024)