| Main Authors: | Tang, Jason; Law, Stephen |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computers and Society; Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2604.22103 |
| _version_ | 1866908990207688704 |
|---|---|
| author | Tang, Jason; Law, Stephen |
| author_facet | Tang, Jason; Law, Stephen |
| contents | Street-view perception models predict subjective attributes such as safety at scale, but remain correlational: they do not identify which localized visual changes would plausibly shift human judgement for a specific scene. We propose a lever-based interventional counterfactual framework that recasts scene-level explainability as a bounded search over structured counterfactual edits. Each lever specifies a semantic concept, spatial support, intervention direction, and constrained edit template. Candidate edits are generated through prompt-conditioned image editing and retained only if they satisfy validity checks for same-place preservation, locality, realism, and plausibility. In a pilot across 50 scenes from five cities, the framework reveals preliminary proxy-based directional patterns and a practical failure taxonomy under prompt-only editing, with Mobility Infrastructure and Physical Maintenance showing the largest auxiliary safety shifts. Human pairwise judgements remain the ground-truth endpoint for future validation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2604_22103 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | How Many Visual Levers Drive Urban Perception? Interventional Counterfactuals via Multiple Localised Edits |
| topic | Computers and Society; Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2604.22103 |
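The abstract describes each lever as a combination of a semantic concept, spatial support, intervention direction, and constrained edit template, with candidate edits retained only if they pass four validity checks. A minimal sketch of that structure follows. It is an illustration under stated assumptions, not the authors' implementation: the names `Lever`, `CHECKS`, and `retain_valid`, the string and integer field encodings, the template placeholders, and the boolean check callables are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Lever:
    """One structured counterfactual edit (all field encodings are assumed)."""
    concept: str    # semantic concept, e.g. "street lighting" (illustrative value)
    support: str    # spatial support, here a plain region label (assumption)
    direction: int  # intervention direction: +1 = add, -1 = remove (assumption)
    template: str   # constrained edit template with named placeholders

    def prompt(self) -> str:
        """Fill the template to produce an editing prompt."""
        verb = "add" if self.direction > 0 else "remove"
        return self.template.format(
            verb=verb, concept=self.concept, support=self.support
        )


# The four validity checks named in the abstract; the key names are assumed.
CHECKS = ("same_place", "locality", "realism", "plausibility")


def retain_valid(
    candidates: List[dict],
    checks: Dict[str, Callable[[dict], bool]],
) -> List[dict]:
    """Keep only candidates that pass every one of the four checks."""
    missing = [name for name in CHECKS if name not in checks]
    if missing:
        raise ValueError(f"missing validity checks: {missing}")
    return [c for c in candidates if all(checks[name](c) for name in CHECKS)]
```

In this sketch a candidate is a plain dictionary and each check is a caller-supplied predicate, so the filter stays independent of whichever image editor or scoring model produces the candidates.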