| Main authors: | Dubois, Léa; Schmidt, Klaus; Wang, Chengyu; Park, Ji-Hoon; Wang, Lin; Munoz, Santiago |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2507.05822 |
Similar items
Inducing Causal World Models in LLMs for Zero-Shot Physical Reasoning
by: Sharma, Aditya, et al.
Published: (2025)
CLIP-Joint-Detect: End-to-End Joint Training of Object Detectors with Contrastive Vision-Language Supervision
by: Raoufi, Behnam, et al.
Published: (2025)
PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model
by: Zhang, Sinin, et al.
Published: (2026)
Video-CoE: Reinforcing Video Event Prediction via Chain of Events
by: Su, Qile, et al.
Published: (2026)
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
by: Wang, Wentao, et al.
Published: (2025)
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
by: Michalkiewicz, Mateusz, et al.
Published: (2025)
Lifelong Learning in Vision-Language Models: Enhanced EWC with Cross-Modal Knowledge Retention
by: Durrani, Hamza Ahmed, et al.
Published: (2026)
Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
by: Wang, Gaojian, et al.
Published: (2025)
An Empirical Study for Representations of Videos in Video Question Answering via MLLMs
by: Li, Zhi, et al.
Published: (2025)
Vision-based Situational Graphs Exploiting Fiducial Markers for the Integration of Semantic Entities
by: Tourani, Ali, et al.
Published: (2023)
EventFormer: A Node-graph Hierarchical Attention Transformer for Action-centric Video Event Prediction
by: Su, Qile, et al.
Published: (2025)
CoT4AD: A Vision-Language-Action Model with Explicit Chain-of-Thought Reasoning for Autonomous Driving
by: Wang, Zhaohui, et al.
Published: (2025)
Robust Confidence Intervals in Stereo Matching using Possibility Theory
by: Malinowski, Roman, et al.
Published: (2024)
DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding
by: Han, Yudong, et al.
Published: (2024)
From Benchmarking to Reasoning: A Dual-Aspect, Large-Scale Evaluation of LLMs on Vietnamese Legal Text
by: Le, Van-Truong
Published: (2026)
VidNum-1.4K: A Comprehensive Benchmark for Video-based Numerical Reasoning
by: Cui, Shaoyang, et al.
Published: (2026)
Computer Vision for Clinical Gait Analysis: A Gait Abnormality Video Dataset
by: Ranjan, Rahm, et al.
Published: (2024)
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model
by: Ko, Hyun-kyu, et al.
Published: (2026)
Predicting When to Trust Vision-Language Models for Spatial Reasoning
by: Imran, Muhammad, et al.
Published: (2026)
VLM-VPI: A Vision-Language Reasoning Framework for Improving Automated Vehicle-Pedestrian Interactions
by: Pu, Qingwen, et al.
Published: (2026)
Universal Adversarial Attack on Aligned Multimodal LLMs
by: Rahmatullaev, Temurbek, et al.
Published: (2025)
Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
by: Rovai, Fabio
Published: (2026)
A Vision-Language Model for Focal Liver Lesion Classification
by: Jian, Song, et al.
Published: (2025)
Context-Aware Network Based on Multi-scale Spatio-temporal Attention for Action Recognition in Videos
by: Li, Xiaoyang, et al.
Published: (2025)
PC-SNN: Predictive Coding-based Local Hebbian Plasticity Learning in Spiking Neural Networks
by: Wang, Haidong, et al.
Published: (2022)
EgoTraj: Real-World Egocentric Human Trajectory Dataset for Multimodal Prediction
by: Yehia, Ahmad, et al.
Published: (2026)
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
by: Chen, Junting, et al.
Published: (2025)
Motion-Guided Semantic Alignment with Negative Prompts for Zero-Shot Video Action Recognition
by: Wang, Yiming, et al.
Published: (2026)
U-Net-Like Spiking Neural Networks for Single Image Dehazing
by: Li, Huibin, et al.
Published: (2025)
Parameter-efficient fine-tuning (PEFT) of Vision Foundation Models for Atypical Mitotic Figure Classification
by: Ramchandani, Lavish, et al.
Published: (2025)
TD3Net: A temporal densely connected multi-dilated convolutional network for lipreading
by: Lee, Byung Hoon, et al.
Published: (2025)
Text-to-Events: Synthetic Event Camera Streams from Conditional Text Input
by: Ott, Joachim, et al.
Published: (2024)
MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction
by: Wang, Chao, et al.
Published: (2025)
RDPO: Real Data Preference Optimization for Physics Consistency Video Generation
by: Qian, Wenxu, et al.
Published: (2025)
CausalVQA: A Physically Grounded Causal Reasoning Benchmark for Video Models
by: Foss, Aaron, et al.
Published: (2025)
Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning
by: Ji, Binbin, et al.
Published: (2025)
Dream to Fly: Model-Based Reinforcement Learning for Vision-Based Drone Flight
by: Romero, Angel, et al.
Published: (2025)
Cortex 2.0: Grounding World Models in Real-World Industrial Deployment
by: Aida, Adriana, et al.
Published: (2026)
StratXplore: Strategic Novelty-seeking and Instruction-aligned Exploration for Vision and Language Navigation
by: Gopinathan, Muraleekrishna, et al.
Published: (2024)
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
by: Han, Yudong, et al.
Published: (2026)