Saved in:
| Main Authors: | , , , |
|---|---|
| Format: | Preprint |
| Published: |
2023
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2310.07324 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866929482893361152 |
|---|---|
| author | Radouane, Karim Lagarde, Julien Ranwez, Sylvie Tchechmedjiev, Andon |
| author_facet | Radouane, Karim Lagarde, Julien Ranwez, Sylvie Tchechmedjiev, Andon |
| contents | Diverse and extensive work has recently been conducted on text-conditioned human motion generation. However, progress in the reverse direction, motion captioning, has seen less comparable advancement. In this paper, we introduce a novel architecture design that enhances text generation quality by emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. To encourage human-like reasoning, we propose methods for guiding attention during training, emphasizing relevant skeleton areas over time and distinguishing motion-related words. We discuss and quantify our model's interpretability using relevant histograms and density distributions. Furthermore, we leverage interpretability to derive fine-grained information about human motion, including action localization, body part identification, and the distinction of motion-related words. Finally, we discuss the transferability of our approaches to other tasks. Our experiments demonstrate that attention guidance leads to interpretable captioning while enhancing performance compared to higher parameter-count, non-interpretable state-of-the-art systems. The code is available at: https://github.com/rd20karim/M2T-Interpretable. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2310_07324 |
| institution | arXiv |
| publishDate | 2023 |
| record_format | arxiv |
| spellingShingle | Guided Attention for Interpretable Motion Captioning Radouane, Karim Lagarde, Julien Ranwez, Sylvie Tchechmedjiev, Andon Computer Vision and Pattern Recognition Diverse and extensive work has recently been conducted on text-conditioned human motion generation. However, progress in the reverse direction, motion captioning, has seen less comparable advancement. In this paper, we introduce a novel architecture design that enhances text generation quality by emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. To encourage human-like reasoning, we propose methods for guiding attention during training, emphasizing relevant skeleton areas over time and distinguishing motion-related words. We discuss and quantify our model's interpretability using relevant histograms and density distributions. Furthermore, we leverage interpretability to derive fine-grained information about human motion, including action localization, body part identification, and the distinction of motion-related words. Finally, we discuss the transferability of our approaches to other tasks. Our experiments demonstrate that attention guidance leads to interpretable captioning while enhancing performance compared to higher parameter-count, non-interpretable state-of-the-art systems. The code is available at: https://github.com/rd20karim/M2T-Interpretable. |
| title | Guided Attention for Interpretable Motion Captioning |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2310.07324 |