:: Library Catalog

Bibliographic Details
Main Authors:	Dubois, Léa; Schmidt, Klaus; Wang, Chengyu; Park, Ji-Hoon; Wang, Lin; Munoz, Santiago
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition CS I.2.10
Online Access:	https://arxiv.org/abs/2507.05822

Similar Items

Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
by: Sharma, Aditya, et al.
Published: (2025)

CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
by: Raoufi, Behnam, et al.
Published: (2025)

PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
by: Zhang, Sinin, et al.
Published: (2026)

Video-CoE: Reinforcing Video Event Prediction via Chain of Events
by: Su, Qile, et al.
Published: (2026)

Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
by: Wang, Wentao, et al.
Published: (2025)

GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
by: Michalkiewicz, Mateusz, et al.
Published: (2025)

Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention
by: Durrani, Hamza Ahmed, et al.
Published: (2026)

Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
by: Wang, Gaojian, et al.
Published: (2025)

An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
by: Li, Zhi, et al.
Published: (2025)

Vision-based Situational Graphs Exploiting Fiducial Markers for the Integration of Semantic Entities
by: Tourani, Ali, et al.
Published: (2023)

EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction
by: Su, Qile, et al.
Published: (2025)

CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
by: Wang, Zhaohui, et al.
Published: (2025)

Robust Confidence Intervals in Stereo Matching using Possibility Theory
by: Malinowski, Roman, et al.
Published: (2024)

DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
by: Han, Yudong, et al.
Published: (2024)

From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
by: Le, Van-Truong
Published: (2026)

VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
by: Cui, Shaoyang, et al.
Published: (2026)

Computer Vision for Clinical Gait Analysis: A Gait Abnormality Video Dataset
by: Ranjan, Rahm, et al.
Published: (2024)

3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
by: Ko, Hyun-kyu, et al.
Published: (2026)

Predicting When to Trust Vision-Language Models for Spatial Reasoning
by: Imran, Muhammad, et al.
Published: (2026)

VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
by: Pu, Qingwen, et al.
Published: (2026)

Universal Adversarial Attack on Aligned Multimodal LLMs
by: Rahmatullaev, Temurbek, et al.
Published: (2025)

Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
by: Rovai, Fabio
Published: (2026)

A Vision-Language Model for Focal Liver Lesion Classification
by: Jian, Song, et al.
Published: (2025)

Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos
by: Li, Xiaoyang, et al.
Published: (2025)

PC-SNN: Predictive Coding-based Local Hebbian Plasticity Learning in Spiking Neural Networks
by: Wang, Haidong, et al.
Published: (2022)

EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
by: Yehia, Ahmad, et al.
Published: (2026)

OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
by: Chen, Junting, et al.
Published: (2025)

Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
by: Wang, Yiming, et al.
Published: (2026)

U-Net-Like Spiking Neural Networks for Single Image Dehazing
by: Li, Huibin, et al.
Published: (2025)

Parameter-efficient fine-tuning (PEFT) of Vision Foundation Models for Atypical Mitotic Figure Classification
by: Ramchandani, Lavish, et al.
Published: (2025)

TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading
by: Lee, Byung Hoon, et al.
Published: (2025)

Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input
by: Ott, Joachim, et al.
Published: (2024)

MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction
by: Wang, Chao, et al.
Published: (2025)

RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
by: Qian, Wenxu, et al.
Published: (2025)

CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models
by: Foss, Aaron, et al.
Published: (2025)

Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
by: Ji, Binbin, et al.
Published: (2025)

Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight
by: Romero, Angel, et al.
Published: (2025)

Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
by: Aida, Adriana, et al.
Published: (2026)

StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
by: Han, Yudong, et al.
Published: (2026)