Gespeichert in:
| Hauptverfasser: | , , |
|---|---|
| Format: | Preprint |
| Veröffentlicht: |
2026
|
| Schlagworte: | |
| Online-Zugang: | https://arxiv.org/abs/2604.07034 |
| Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
| _version_ | 1866914457955860480 |
|---|---|
| author | Hosseinzadeh, Mehdi Wong, King Hang Dayoub, Feras |
| author_facet | Hosseinzadeh, Mehdi Wong, King Hang Dayoub, Feras |
| contents | We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/ |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2604_07034 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| spellingShingle | KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis Hosseinzadeh, Mehdi Wong, King Hang Dayoub, Feras Robotics Artificial Intelligence Computer Vision and Pattern Recognition We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/ |
| title | KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis |
| topic | Robotics Artificial Intelligence Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2604.07034 |