Bibliographic Details
Main Authors: VenkataKeerthy, S.; Banerjee, Soumya; Dey, Sayan; Andaluri, Yashas; PS, Raghul; Kalyanasundaram, Subrahmanyam; Pereira, Fernando Magno Quintão; Upadrasta, Ramakrishna
Format: Preprint
Published: 2023
Subjects: Programming Languages; Cryptography and Security; Machine Learning
Online Access: https://arxiv.org/abs/2312.00507
author VenkataKeerthy, S.
Banerjee, Soumya
Dey, Sayan
Andaluri, Yashas
PS, Raghul
Kalyanasundaram, Subrahmanyam
Pereira, Fernando Magno Quintão
Upadrasta, Ramakrishna
contents Binary similarity involves determining whether two binary programs exhibit similar functionality, often because they originate from the same source code. In this work, we propose VexIR2Vec, an approach for binary similarity using VEX-IR, an architecture-neutral Intermediate Representation (IR). We extract embeddings from sequences of basic blocks, termed peepholes, derived by random walks on the control-flow graph. The peepholes are normalized using transformations inspired by compiler optimizations. With these transformations, the VEX-IR Normalization Engine mitigates architectural and compiler-induced variations in binaries while exposing semantic similarities. We then learn a vocabulary of representations at the entity level of the IR using knowledge graph embedding techniques in an unsupervised manner. This vocabulary is used to derive function embeddings for similarity assessment using VexNet, a feed-forward Siamese network designed to position similar functions closely and separate dissimilar ones in an n-dimensional space. This approach is amenable to both diffing and searching tasks and is robust against Out-Of-Vocabulary (OOV) issues. We evaluate VexIR2Vec on a dataset comprising 2.7M functions and 15.5K binaries from 7 projects compiled across 12 compilers targeting x86 and ARM architectures. In diffing experiments, VexIR2Vec outperforms the nearest baselines by $40\%$, $18\%$, $21\%$, and $60\%$ in cross-optimization, cross-compilation, cross-architecture, and obfuscation settings, respectively. In the searching experiment, VexIR2Vec achieves a mean average precision of $0.76$, outperforming the nearest baseline by $46\%$. Our framework is highly scalable and is built as a lightweight, multi-threaded, parallel library using only open-source tools. VexIR2Vec is $3.1$-$3.5\times$ faster than the closest baselines and orders of magnitude faster than other tools.
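The abstract describes peepholes as sequences of basic blocks obtained by random walks on the control-flow graph. The sketch below illustrates that idea only in outline and is not the authors' implementation: the function name `sample_peepholes`, the parameters `walk_len` and `num_walks`, the choice to start every walk at the entry block, and the toy CFG are all assumptions made for illustration.

```python
import random


def sample_peepholes(cfg, entry, walk_len=3, num_walks=4, seed=0):
    """Return basic-block sequences (peepholes) from random walks on a CFG.

    cfg maps a basic-block id to the list of its successor ids.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    peepholes = []
    for _ in range(num_walks):
        block = entry
        walk = [entry]
        # Follow random successors until the walk is long enough
        # or the current block has no successors.
        while len(walk) < walk_len and cfg.get(block):
            block = rng.choice(cfg[block])
            walk.append(block)
        peepholes.append(walk)
    return peepholes


# Toy CFG of an if/else diamond (illustrative only).
cfg = {
    "entry": ["then", "else"],
    "then": ["exit"],
    "else": ["exit"],
    "exit": [],
}
peepholes = sample_peepholes(cfg, "entry")
```

Each returned walk is a path through the graph, so every consecutive pair of blocks in a peephole is a control-flow edge. In the described approach, such sequences would then be normalized before embeddings are derived from them.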
format Preprint
id arxiv_https___arxiv_org_abs_2312_00507
institution arXiv
publishDate 2023
record_format arxiv
title VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity
topic Programming Languages
Cryptography and Security
Machine Learning
url https://arxiv.org/abs/2312.00507