Staff View: Cross-modal Retrieval Models for Stripped Binary Analysis :: Library Catalog

Bibliographic Details
Main Authors:	Chen, Guoqiang; Ying, Lingyun; Song, Ziyang; Liu, Daguang; Wang, Qiang; Wang, Zhiqi; Hu, Li; Cheng, Shaoyin; Zhang, Weiming; Yu, Nenghai
Format:	Preprint
Published:	2025
Subjects:	Software Engineering; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2512.10393

_version_	1866915706330677248
author	Chen, Guoqiang; Ying, Lingyun; Song, Ziyang; Liu, Daguang; Wang, Qiang; Wang, Zhiqi; Hu, Li; Cheng, Shaoyin; Zhang, Weiming; Yu, Nenghai
author_facet	Chen, Guoqiang; Ying, Lingyun; Song, Ziyang; Liu, Daguang; Wang, Qiang; Wang, Zhiqi; Hu, Li; Cheng, Shaoyin; Zhang, Weiming; Yu, Nenghai
contents	Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, identifying the binary functions that are semantically relevant to a user query among thousands of candidates is challenging, because the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on a large-scale dataset to learn the semantic relevance between binary code and natural language descriptions, and BinSeek-Reranker learns to carefully judge the relevance of each candidate to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training data construction, which also yields a domain benchmark for future research. Our evaluation results show that BinSeek achieves state-of-the-art performance, surpassing models of the same scale by 31.42% in Rec@3 and 27.17% in MRR@3, and also outperforming advanced general-purpose models with 16 times as many parameters.
format	Preprint
id	arxiv_https___arxiv_org_abs_2512_10393
institution	arXiv
publishDate	2025
record_format	arxiv
title	Cross-modal Retrieval Models for Stripped Binary Analysis
topic	Software Engineering; Artificial Intelligence
url	https://arxiv.org/abs/2512.10393

Similar Items