Bibliographic Details
Main Authors: Chen, Guoqiang, Ying, Lingyun, Song, Ziyang, Liu, Daguang, Wang, Qiang, Wang, Zhiqi, Hu, Li, Cheng, Shaoyin, Zhang, Weiming, Yu, Nenghai
Format: Preprint
Published: 2025
Subjects: Software Engineering, Artificial Intelligence
Online Access: https://arxiv.org/abs/2512.10393
author Chen, Guoqiang
Ying, Lingyun
Song, Ziyang
Liu, Daguang
Wang, Qiang
Wang, Zhiqi
Hu, Li
Cheng, Shaoyin
Zhang, Weiming
Yu, Nenghai
contents Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, identifying the binary functions that are semantically relevant to a user query among thousands of candidates is challenging, because the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, and BinSeek-Reranker learns to carefully judge the relevance of each candidate function to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate the construction of training data, which also yields a domain benchmark for future research. Our evaluation results show that BinSeek achieves state-of-the-art performance, surpassing models of the same scale by 31.42% in Rec@3 and 27.17% in MRR@3, and outperforming advanced general-purpose models that have 16 times more parameters.
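The two-stage design and the Rec@3 and MRR@3 metrics named in the abstract can be illustrated with a minimal sketch. This is not BinSeek's implementation: the function names, the toy vectors, the cosine-similarity shortlist, and the `rerank_score` callable are all hypothetical stand-ins, and the metric definitions are the common ones (fraction of relevant items in the top k, and reciprocal rank of the first relevant item in the top k), which the paper may define differently.

```python
from typing import Callable, Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def two_stage_search(
    query_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    rerank_score: Callable[[str], float],
    shortlist: int = 10,
    k: int = 3,
) -> List[str]:
    """Retrieve-then-rerank: cheap embedding shortlist, then a finer scoring pass."""
    # Stage 1 (assumed): rank every candidate by embedding similarity to the query.
    coarse = sorted(
        candidates,
        key=lambda fid: cosine(query_vec, candidates[fid]),
        reverse=True,
    )[:shortlist]
    # Stage 2 (assumed): re-score only the shortlist with a more expensive judge.
    return sorted(coarse, key=rerank_score, reverse=True)[:k]


def recall_at_k(ranked: List[str], relevant: Set[str], k: int = 3) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 3) -> float:
    """Reciprocal rank of the first relevant item within the top k, else 0."""
    for rank, fid in enumerate(ranked[:k], start=1):
        if fid in relevant:
            return 1.0 / rank
    return 0.0


# Toy data: three hypothetical function embeddings and hand-picked reranker scores.
toy_candidates = {"f1": [1.0, 0.0], "f2": [0.9, 0.1], "f3": [0.0, 1.0]}
toy_rerank = {"f1": 0.2, "f2": 0.9, "f3": 0.5}
toy_result = two_stage_search(
    [1.0, 0.0], toy_candidates, lambda fid: toy_rerank[fid], shortlist=2, k=3
)
```

In a real system the reported figures would be these per-query values averaged over all benchmark queries; the shortlist size trades first-stage recall against the cost of the second-stage model.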
format Preprint
id arxiv_https___arxiv_org_abs_2512_10393
institution arXiv
publishDate 2025
record_format arxiv
title Cross-modal Retrieval Models for Stripped Binary Analysis
topic Software Engineering
Artificial Intelligence
url https://arxiv.org/abs/2512.10393