Staff View :: Library Catalog


Bibliographic Details
Main Authors:	Yang, Zhuoran; Zhang, Yanyong
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.03213

_version_	1866910017368621056
author	Yang, Zhuoran; Zhang, Yanyong
author_facet	Yang, Zhuoran; Zhang, Yanyong
contents	Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_03213
institution	arXiv
publishDate	2026
record_format	arxiv
title	ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.03213
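
The contents field above describes Instance-Masked Attention as restricting each visual token to its corresponding instance features across space and time. The following is a minimal sketch of that general idea only, not the authors' implementation. The function names, the integer instance ids, the convention that id 0 means background, and the choice to give unmatched tokens all-zero weights are all assumptions made for illustration.

```python
import math


def instance_attention_mask(query_ids, key_ids):
    """Return mask[i][j] == True when query token i may attend to key j.

    Assumption: a visual token attends only to instance-feature tokens
    that carry the same instance id.
    """
    return [[q == k for k in key_ids] for q in query_ids]


def masked_softmax(scores, mask):
    """Row-wise softmax over allowed positions; blocked positions get 0.

    Assumption: a row with no allowed position (for example a background
    token) receives all-zero weights instead of raising an error.
    """
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, a in zip(row, allowed) if a]
        if not kept:
            weights.append([0.0] * len(row))
            continue
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if a else 0.0
                for s, a in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical toy input: three visual tokens and three instance-feature
# tokens. Instance 1 appears twice among the keys, standing in for the
# same object seen in two frames.
query_ids = [0, 1, 2]   # 0 = background (assumed convention)
key_ids = [1, 1, 2]
scores = [
    [0.2, 0.4, 0.6],
    [0.0, 0.0, 5.0],
    [1.0, 2.0, 3.0],
]

mask = instance_attention_mask(query_ids, key_ids)
weights = masked_softmax(scores, mask)
```

In this toy input the token with instance id 1 splits its attention evenly between the two keys of instance 1, even though the raw score for the instance 2 key is much higher. That is the behavior the mask is meant to enforce.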
