VexIR2Vec is a robust, architecture-neutral framework for binary similarity, leveraging VEX-IR to overcome challenges from compiler, architecture, and obfuscation variations. It combines peephole-based extraction, normalization, and embedding with a Siamese network.
This repository contains the source code and information described in our paper (arXiv).
You can try out an online demo at https://compilers.cse.iith.ac.in/VexIR2Vec.
S. VenkataKeerthy, Soumya Banerjee, Sayan Dey, Yashas Andaluri, Raghul PS, Subrahmanyam Kalyanasundaram, Fernando Magno Quintão Pereira, Ramakrishna Upadrasta. "VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity", ACM TOSEM 2025.
- Requirements
- Generating Binaries
- Generating Initial Embeddings
- Binary Similarity Tasks - Diffing and Searching
- Citation
- Contributions
- Base OS: All experiments are conducted on Ubuntu 20.04.
- Python Version: Python 3.6.7 is used for running all scripts and experiments.
- Conda Environment: For all VexIR2Vec-related workflows, use the Conda environment from `vexir2vec.yml`. You can create the environment with `conda env create -f vexir2vec.yml`.
- FastText Model: The binary similarity tasks require a FastText model. You can download and extract it with `wget -q https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz -O cc.en.300.bin.gz` followed by `gunzip cc.en.300.bin.gz`, and then update the FastText model path in `utils.py`.
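As a quick sanity check that the downloaded model loads correctly, you can open it with the `fasttext` Python package. The path below is a placeholder; point it at wherever you extracted `cc.en.300.bin` and mirror the same path in `utils.py` (the variable name here is hypothetical, not the one used in the repository).

```python
import fasttext

# Placeholder path: replace with the location of the extracted model,
# and use the same path in utils.py.
FASTTEXT_MODEL_PATH = "/path/to/cc.en.300.bin"

model = fasttext.load_model(FASTTEXT_MODEL_PATH)
print(model.get_dimension())             # 300 for cc.en.300.bin
print(model.get_word_vector("add")[:5])  # first few components of a token vector
```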
To carry out the diffing and searching experiments described in our paper, you may need to generate data. This involves generating binaries, disassembly, and VEX IR extraction, followed by generating the embeddings and performing a task like diffing or searching.
- Data generation - Generating binaries, disassembly, and VEX IR generation
- Generating the embeddings - Initial embedding generation (seed embeddings and pretraining), and finetuning with VexNet model.
- Binary Similarity Tasks - Diffing and searching by using the embeddings obtained from the VexNet model.
In our experiments, we consider the binaries generated from the `coreutils`, `diffutils`, `findutils`, `lua`, `curl`, `putty`, and `gzip` projects compiled using different compilation configurations:
- Compilers: `Clang` (V6, 8, 12), `GCC` (V6, 8, 10)
- Architectures: `x86`, `ARM`
- Compiler Optimizations: `O0`, `O1`, `O2`, `O3`, `Os`
For further details on binary generation, see the BinGen directory.
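As a rough illustration, the experiments cover the cross-product of these settings; the snippet below only enumerates that configuration space, while the actual build recipes live in the BinGen directory.

```python
from itertools import product

# Enumerate the compilation configurations listed above (illustrative only;
# the real build scripts are in the BinGen directory).
compilers = ["clang-6", "clang-8", "clang-12", "gcc-6", "gcc-8", "gcc-10"]
architectures = ["x86", "ARM"]
opt_levels = ["O0", "O1", "O2", "O3", "Os"]

configs = list(product(compilers, architectures, opt_levels))
print(len(configs), "configurations, e.g.", configs[0])  # 60 configurations
```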
Given the binaries, the initial embeddings are generated as `.data` files using `driver.py`. Each `.data` file includes:
- The function address, file source information, and function name
- The corresponding embedding vector
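Conceptually, each record in a `.data` file pairs function metadata with its embedding. The structure below only illustrates those fields; it is not the on-disk format written by `driver.py`.

```python
from dataclasses import dataclass
from typing import List

# Illustrative view of the per-function record described above;
# the actual .data layout produced by driver.py may differ.
@dataclass
class FunctionRecord:
    address: int             # function address
    source_file: str         # file/source information
    function_name: str       # function name
    embedding: List[float]   # VexIR2Vec initial embedding vector

record = FunctionRecord(0x401000, "src/ls.c", "main", [0.0] * 100)
```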
Initial embedding generation for VexIR2Vec is supported through two approaches:
- With Database: Stores and retrieves processed instruction embeddings and metadata using a structured database backend. Ideal for systematic, end-to-end replication of experiments on a large collection of binaries.
- Without Database: Processes and uses the binaries directly without relying on persistent storage. Can be used for quick generation of embeddings for a binary. For more details, refer to Data Generation.
Seed Embeddings: For training seed embeddings, refer to `seed_embeddings`. A pre-trained vocabulary is available.
The task-independent initial embeddings are fine-tuned to capture similarity by training the VexNet model, which results in the final VexIR2Vec embeddings.
For training details, usage examples, and parameters for `vexir2vec_training.py`, refer to VexIR2Vec Training. A trained model is available if you do not want to train one from scratch.
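The actual VexNet architecture and training loop live in `vexir2vec_training.py`; the sketch below only illustrates the general idea of Siamese fine-tuning over pairs of initial embeddings with a cosine-embedding loss. Layer sizes, the loss, and the synthetic batch are assumptions for illustration, not the repository's configuration.

```python
import torch
import torch.nn as nn

# Illustrative Siamese encoder over pre-computed initial embeddings.
# Dimensions, depth, and loss are assumptions, not VexNet's actual design.
class SiameseEncoder(nn.Module):
    def __init__(self, in_dim=128, out_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        return self.net(x)

encoder = SiameseEncoder()
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)
loss_fn = nn.CosineEmbeddingLoss(margin=0.5)

# Synthetic batch: pairs of initial embeddings with labels
# (+1 for similar function pairs, -1 for dissimilar ones).
x1 = torch.randn(32, 128)
x2 = torch.randn(32, 128)
labels = (torch.randint(0, 2, (32,)) * 2 - 1).float()

for _ in range(5):  # a few toy training steps
    optimizer.zero_grad()
    loss = loss_fn(encoder(x1), encoder(x2), labels)
    loss.backward()
    optimizer.step()
```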
The final VexIR2Vec embeddings are used to perform similarity tasks like diffing and searching. For more details, refer to the Diffing and Searching experiments.
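The diffing and searching drivers in this repository handle the full pipeline; the snippet below is only a minimal sketch of the searching step, ranking candidate functions by cosine similarity against a query embedding (the random vectors stand in for embeddings produced by the trained model).

```python
import numpy as np

# Minimal sketch of searching: rank candidate function embeddings by
# cosine similarity to a query embedding. Random vectors stand in for
# real VexIR2Vec embeddings.
def cosine_search(query, candidates, top_k=5):
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

query = np.random.rand(100)            # embedding of the query function
candidates = np.random.rand(500, 100)  # embeddings of candidate functions
idx, sims = cosine_search(query, candidates)
print(idx, sims)
```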
@article{VenkataKeerthy-2025-VexIR2Vec,
author = {VenkataKeerthy, S. and Banerjee, Soumya and Dey, Sayan and Andaluri, Yashas and PS, Raghul and Kalyanasundaram, Subrahmanyam and Pereira, Fernando Magno Quint\~{a}o and Upadrasta, Ramakrishna},
title = {VexIR2Vec: An Architecture-Neutral Embedding Framework for Binary Similarity},
year = {2025},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
issn = {1049-331X},
url = {https://doi.org/10.1145/3721481},
doi = {10.1145/3721481},
journal = {ACM Trans. Softw. Eng. Methodol.},
month = mar,
keywords = {Binary Similarity, Program Embedding, Representation Learning}
}
Please feel free to raise issues to file a bug, pose a question, or initiate any related discussions. Pull requests are welcome :)