:: Library Catalog

Bibliographic Details
Main Authors:	Takenaka, Patrick; Maucher, Johannes; Huber, Marco F.
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2407.09537

Similar Items

ViPro-2: Unsupervised State Estimation via Integrated Dynamics for Guiding Video Prediction
by: Takenaka, Patrick, et al.
Published: (2025)

Guiding Video Prediction with Explicit Procedural Knowledge
by: Takenaka, Patrick, et al.
Published: (2024)

Anonymization of Documents for Law Enforcement with Machine Learning
by: Eberhardinger, Manuel, et al.
Published: (2025)

Classification of Inkjet Printers based on Droplet Statistics
by: Takenaka, Patrick, et al.
Published: (2024)

ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
by: Loison, António, et al.
Published: (2026)

Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation
by: Yuan, Kun, et al.
Published: (2024)

Self-supervised Optimization of Hand Pose Estimation using Anatomical Features and Iterative Learning
by: Jauch, Christian, et al.
Published: (2023)

Enabling Versatile Controls for Video Diffusion Models
by: Zhang, Xu, et al.
Published: (2025)

LoViT: Long Video Transformer for Surgical Phase Recognition
by: Liu, Yang, et al.
Published: (2023)

ViRED: Prediction of Visual Relations in Engineering Drawings
by: Gu, Chao, et al.
Published: (2024)

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
by: Yan, Peizheng, et al.
Published: (2026)

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models
by: Si, Shengyu, et al.
Published: (2026)

Knowledge Distillation in YOLOX-ViT for Side-Scan Sonar Object Detection
by: Aubard, Martin, et al.
Published: (2024)

ViT-Linearizer: Distilling Quadratic Knowledge into Linear-Time Vision Models
by: Wei, Guoyizhe, et al.
Published: (2025)

ViPRA: Video Prediction for Robot Actions
by: Routray, Sandeep, et al.
Published: (2025)

An Intermediate Fusion ViT Enables Efficient Text-Image Alignment in Diffusion Models
by: Hu, Zizhao, et al.
Published: (2024)

FlowExtract: Procedural Knowledge Extraction from Maintenance Flowcharts
by: de Avalle, Guillermo Gil, et al.
Published: (2026)

ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
by: Rawte, Vipula, et al.
Published: (2024)

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers
by: Marmon, Andrew, et al.
Published: (2024)

ViSAudio: End-to-End Video-Driven Binaural Spatial Audio Generation
by: Zhang, Mengchen, et al.
Published: (2025)

Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios
by: Wang, Kai, et al.
Published: (2024)

MC-ViViT: Multi-branch Classifier-ViViT to detect Mild Cognitive Impairment in older adults using facial videos
by: Sun, Jian, et al.
Published: (2023)

Less is More: Label-Guided Summarization of Procedural and Instructional Videos
by: Rajpal, Shreya, et al.
Published: (2026)

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
by: Qin, Luozheng, et al.
Published: (2026)

Controllable Pedestrian Video Editing for Multi-View Driving Scenarios via Motion Sequence
by: Fu, Danzhen, et al.
Published: (2025)

LoViF 2026 The First Challenge on Weather Removal in Videos
by: Qian, Chenghao, et al.
Published: (2026)

GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model
by: Fu, Yongjie, et al.
Published: (2024)

ProFashion: Prototype-guided Fashion Video Generation with Multiple Reference Images
by: Kong, Xianghao, et al.
Published: (2025)

EgoPro-Bench: Benchmarking Personalized Proactive Interaction in Egocentric Video Streams
by: Ran, Dongchuan, et al.
Published: (2026)

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video
by: Li, Ao, et al.
Published: (2026)

SceneX: Procedural Controllable Large-scale Scene Generation
by: Zhou, Mengqi, et al.
Published: (2024)

ViSketch-GPT: Collaborative Multi-Scale Feature Extraction for Sketch Recognition and Generation
by: Federico, Giulio, et al.
Published: (2025)

Procedural Knowledge Extraction from Industrial Troubleshooting Guides Using Vision Language Models
by: de Avalle, Guillermo Gil, et al.
Published: (2026)

MMeViT: Multi-Modal ensemble ViT for Post-Stroke Rehabilitation Action Recognition
by: Kim, Ye-eun, et al.
Published: (2025)

OCK: Unsupervised Dynamic Video Prediction with Object-Centric Kinematics
by: Song, Yeon-Ji, et al.
Published: (2024)

Twins-PainViT: Towards a Modality-Agnostic Vision Transformer Framework for Multimodal Automatic Pain Assessment using Facial Videos and fNIRS
by: Gkikas, Stefanos, et al.
Published: (2024)

CoProSketch: Controllable and Progressive Sketch Generation with Diffusion Model
by: Zhan, Ruohao, et al.
Published: (2025)

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models
by: Mou, Tingshu, et al.
Published: (2026)

ProGait: A Multi-Purpose Video Dataset and Benchmark for Transfemoral Prosthesis Users
by: Yin, Xiangyu, et al.
Published: (2025)

ProMISe: Promptable Medical Image Segmentation using SAM
by: Wang, Jinfeng, et al.
Published: (2024)