Staff View: Neural-Driven Image Editing :: Library Catalog

Bibliographic Details
Main Authors:	Zhou, Pengfei; Xia, Jie; Peng, Xiaopeng; Zhao, Wangbo; Ye, Zilong; Li, Zekai; Yang, Suorong; Pan, Jiadong; Chen, Yuanxiang; Wang, Ziqiao; Wang, Kai; Zheng, Qian; Jin, Hao; Chang, Xiaojun; Pan, Gang; Dong, Shurong; Zhang, Kaipeng; You, Yang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.05397

_version_	1866912810426957824
author	Zhou, Pengfei; Xia, Jie; Peng, Xiaopeng; Zhao, Wangbo; Ye, Zilong; Li, Zekai; Yang, Suorong; Pan, Jiadong; Chen, Yuanxiang; Wang, Ziqiao; Wang, Kai; Zheng, Qian; Jin, Hao; Chang, Xiaojun; Pan, Gang; Dong, Shurong; Zhang, Kaipeng; You, Yang
author_facet	Zhou, Pengfei; Xia, Jie; Peng, Xiaopeng; Zhao, Wangbo; Ye, Zilong; Li, Zekai; Yang, Suorong; Pan, Jiadong; Chen, Yuanxiang; Wang, Ziqiao; Wang, Kai; Zheng, Qian; Jin, Hao; Chang, Xiaojun; Pan, Gang; Dong, Shurong; Zhang, Kaipeng; You, Yang
contents	Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. The code and dataset are released on the project website: https://loongx1.github.io.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_05397
institution	arXiv
publishDate	2025
record_format	arxiv
title	Neural-Driven Image Editing
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2507.05397
