Bibliographic Details
Main Authors: Zhou, Pengfei; Xia, Jie; Peng, Xiaopeng; Zhao, Wangbo; Ye, Zilong; Li, Zekai; Yang, Suorong; Pan, Jiadong; Chen, Yuanxiang; Wang, Ziqiao; Wang, Kai; Zheng, Qian; Jin, Hao; Chang, Xiaojun; Pan, Gang; Dong, Shurong; Zhang, Kaipeng; You, Yang
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2507.05397
author Zhou, Pengfei
Xia, Jie
Peng, Xiaopeng
Zhao, Wangbo
Ye, Zilong
Li, Zekai
Yang, Suorong
Pan, Jiadong
Chen, Yuanxiang
Wang, Ziqiao
Wang, Kai
Zheng, Qian
Jin, Hao
Chang, Xiaojun
Pan, Gang
Dong, Shurong
Zhang, Kaipeng
You, Yang
contents Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. The code and dataset are released on the project website: https://loongx1.github.io.
format Preprint
id arxiv_https___arxiv_org_abs_2507_05397
institution arXiv
publishDate 2025
record_format arxiv
title Neural-Driven Image Editing
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2507.05397
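
Illustrative note on the abstract: the record's abstract says the dynamic gated fusion (DGF) module aggregates modality-specific features into a unified latent space, but gives no implementation details. The sketch below shows one generic way a gated fusion of per-modality feature vectors can work: each modality gets a scalar score, the scores are normalized with a softmax, and the fused vector is the gate-weighted sum. This is not the authors' DGF module. The function names, the `gate_weights` parameter, and the toy numbers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fusion(features, gate_weights):
    """Fuse per-modality feature vectors with softmax gates.

    features:     dict mapping modality name -> list of floats (same length).
    gate_weights: dict mapping modality name -> list of floats (same length);
                  a hypothetical stand-in for learned gate parameters.
    Returns (fused_vector, gates_by_modality).
    """
    names = sorted(features)
    dim = len(features[names[0]])
    for name in names:
        if len(features[name]) != dim or len(gate_weights[name]) != dim:
            raise ValueError(f"dimension mismatch for modality {name!r}")

    # One scalar score per modality: dot product of gate weights and features.
    scores = [
        sum(w * x for w, x in zip(gate_weights[name], features[name]))
        for name in names
    ]
    gates = softmax(scores)

    # Gate-weighted sum of the modality vectors, computed per dimension.
    fused = [
        sum(g * features[name][i] for g, name in zip(gates, names))
        for i in range(dim)
    ]
    return fused, dict(zip(names, gates))


if __name__ == "__main__":
    # Toy feature vectors, invented for this example only.
    toy_features = {
        "eeg": [0.2, 0.9, -0.1],
        "fnirs": [0.5, 0.1, 0.3],
        "ppg": [0.0, 0.4, 0.4],
        "motion": [0.1, 0.1, 0.1],
    }
    toy_gate_weights = {name: [1.0, 1.0, 1.0] for name in toy_features}
    fused, gates = gated_fusion(toy_features, toy_gate_weights)
    print("gates:", gates)
    print("fused:", fused)
```

The gates always sum to 1, so a modality with a higher score contributes more to the fused vector. With all gate weights set to zero, every modality receives the same gate and the result reduces to a plain average.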