LLM4Decompile
Reverse Engineering: Decompiling Binary Code with Large Language Models
Updates
- [2025-10-04]: Release SK²Decompile, an LLM-based two-phase binary decompilation approach, from skeleton to skin. Phase 1, Structure Recovery (Skeleton): transforms binary/pseudo-code into obfuscated intermediate representations (🤗 HF Link). Phase 2, Identifier Naming (Skin): generates human-readable source code with meaningful identifiers (🤗 HF Link).
- [2025-05-20]: Release decompile-bench, which contains two million binary-source function pairs for training and 70K function pairs for evaluation. Please refer to the decompile-bench folder for details.
- [2024-10-17]: Release decompile-ghidra-100k, a subset of 100K training samples (25K per optimization level). We provide a training script that runs in ~3.5 hours on a single A100 40G GPU and achieves a 0.26 re-executability rate, for a total cost of under $20, enabling quick replication of LLM4Decompile.
- [2024-09-26]: Add a Colab notebook demonstrating the usage of the LLM4Decompile models, including examples for the LLM4Decompile-End and LLM4Decompile-Ref models.
- [2024-09-23]: Release LLM4Decompile-9B-v2, fine-tuned from Yi-Coder-9B, which achieves a re-executability rate of 0.6494 on the Decompile benchmark.
- [2024-06-19]: Release the V2 series (LLM4Decompile-Ref). The V2 models (1.3B~22B) build upon Ghidra and are trained on 2 billion tokens to refine the pseudo-code decompiled by Ghidra. The 22B-V2 version outperforms the 6.7B-V1.5 by an additional 40.1%. Please check the ghidra folder for details.
- [2024-05-13]: Release the V1.5 series (LLM4Decompile-End, which directly decompiles binaries with an LLM). The V1.5 models are trained on a larger dataset (15B tokens) with a maximum token length of 4,096, and show remarkable performance (over a 100% improvement) compared to the previous models.
- [2024-03-16]: Add the llm4decompile-6.7b-uo model, which is trained without prior knowledge of the optimization levels (O0~O3); its average re-executability rate is around 0.219, the best among our models.
About
- LLM4Decompile is the pioneering open-source large language model dedicated to decompilation. Its current version supports decompiling Linux x86_64 binaries, compiled with GCC at optimization levels O0 through O3, into human-readable C source code. Our team is committed to expanding the tool's capabilities, with ongoing efforts to support a broader range of architectures and configurations.
- LLM4Decompile-End focuses on decompiling the binary directly. LLM4Decompile-Ref refines the pseudo-code decompiled by Ghidra.
Evaluation
Framework
<p align="center"> <img src="https://github.com/albertan017/LLM4Decompile/blob/main/samples/compile-decompile.png" alt="image" width="400" height="auto"> </p>During compilation, the Preprocessor processes the source code (SRC) to eliminate comments and expand macros or includes. The cleaned code is then forwarded to the Compiler, which converts it into assembly code (ASM). This ASM is transformed into binary code (0s and 1s) by the Assembler. The Linker finalizes the process by linking function calls to create an executable file. Decompilation, on the other hand, involves converting binary code back into a source file. LLMs, being trained on text, lack the ability to process binary data directly. Therefore, binaries must be disassembled by Objdump into assembly language (ASM) first. It should be noted that binary and disassembled ASM are equivalent, they can be interconverted, and thus we refer to them interchangeably. Finally, the loss is computed between the decompiled code and source code to guide the training. To assess the quality of the decompiled code (SRC'), it is tested for its functionality through test assertions (re-executability).
Metrics
- Re-executability evaluates whether the decompiled code can execute properly and pass all the predefined test cases (a minimal checker is sketched below).
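As a rough illustration (not the repository's evaluation harness), re-executability can be checked by appending an assertion-based test harness to the decompiled function, compiling the result, and treating a compile failure or non-zero exit code as a failure. The `re_executable` helper, `func0`, and the harness below are hypothetical:

```python
import subprocess, tempfile, os

def re_executable(decompiled_c: str, test_main: str) -> bool:
    """Hypothetical checker: a decompiled function passes if it compiles
    and all assertions in the test harness hold (exit code 0)."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, 'check.c')
        exe = os.path.join(d, 'check')
        with open(src, 'w') as f:
            f.write(decompiled_c + '\n' + test_main)
        # A compilation failure counts as not re-executable.
        if subprocess.run(f'gcc {src} -o {exe} -lm', shell=True).returncode != 0:
            return False
        # A failed assert aborts with a non-zero exit code.
        return subprocess.run(exe, timeout=10).returncode == 0

# Example: a trivial decompiled function plus an assertion-based test.
func = 'int func0(int a, int b) { return a + b; }'
test = '#include <assert.h>\nint main() { assert(func0(1, 2) == 3); return 0; }'
print(re_executable(func, test))  # True
```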
Benchmarks
- HumanEval-Decompile: a collection of 164 C functions that rely exclusively on standard C libraries.
- ExeBench: a collection of 2,621 functions drawn from real projects, each utilizing user-defined functions, structures, and macros.
Results
<p align="center"> <img src="https://github.com/albertan017/LLM4Decompile/blob/main/samples/results_end_final.png" alt="results" width="800" height="auto"> </p> <p align="center"> <img src="https://github.com/albertan017/LLM4Decompile/blob/main/samples/results_refine_final.png" alt="image" width="800" height="auto"> </p>Models
Our LLM4Decompile includes models with sizes between 1.3 billion and 33 billion parameters, and we have made these models available on Hugging Face.
| Model | Checkpoint | Size | Re-executability | Note |
|-------|------------|------|------------------|------|
| llm4decompile-1.3b-v1.5 | 🤗 HF Link | 1.3B | 27.3% | Note 3 |
| llm4decompile-6.7b-v1.5 | 🤗 HF Link | 6.7B | 45.4% | Note 3 |
| llm4decompile-1.3b-v2 | 🤗 HF Link | 1.3B | 46.0% | Note 4 |
| llm4decompile-6.7b-v2 | 🤗 HF Link | 6.7B | 52.7% | Note 4 |
| llm4decompile-9b-v2 | 🤗 HF Link | 9B | 64.9% | Note 4 |
| llm4decompile-22b-v2 | 🤗 HF Link | 22B | 63.6% | Note 4 |
Note 3: The V1.5 series is trained on a larger dataset (15B tokens) with a maximum token length of 4,096, and shows remarkable performance (over a 100% improvement) compared to the previous models.
Note 4: The V2 series is built upon Ghidra and trained on 2 billion tokens to refine the pseudo-code decompiled by Ghidra. Check the ghidra folder for details.
Quick Start
Setup: Please use the script below to install the necessary environment.
```bash
git clone https://github.com/albertan017/LLM4Decompile.git
cd LLM4Decompile
conda create -n 'llm4decompile' python=3.9 -y
conda activate llm4decompile
pip install -r requirements.txt
```
Here is an example of how to use our model (revised for V1.5; for previous models, please check the corresponding model page on HF). Note: replace "func0" with the name of the function you want to decompile.
Preprocessing: Compile the C code into binary, and disassemble the binary into assembly instructions.
```python
import subprocess
import os

func_name = 'func0'
OPT = ["O0", "O1", "O2", "O3"]  # optimization levels
fileName = 'samples/sample'  # 'path/to/file'
for opt_state in OPT:
    output_file = fileName + '_' + opt_state
    input_file = fileName + '.c'
    # compile the code with GCC on Linux
    compile_command = f'gcc -o {output_file}.o {input_file} -{opt_state} -lm'
    subprocess.run(compile_command, shell=True, check=True)
    # disassemble the binary file into assembly instructions
    compile_command = f'objdump -d {output_file}.o > {output_file}.s'
    subprocess.run(compile_command, shell=True, check=True)

    input_asm = ''
    with open(output_file + '.s') as f:  # asm file
        asm = f.read()
        if '<' + func_name + '>:' not in asm:  # IMPORTANT: replace func0 with the function name
            raise ValueError("compile fails")
        # keep only the target function's section of the disassembly
        asm = '<' + func_name + '>:' + asm.split('<' + func_name + '>:')[-1].split('\n\n')[0]
        asm_clean = ""
        asm_sp = asm.split("\n")
        for tmp in asm_sp:
            if len(tmp.split("\t")) < 3 and '00' in tmp:
                continue
            idx = min(len(tmp.split("\t")) - 1, 2)
            tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the raw bytes, keep the instruction
            tmp_asm = tmp_asm.split("#")[0].strip()  # remove inline comments
            asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()
    before = "# This is the assembly code:\n"  # prompt prefix
    after = "\n# What is the source code?\n"  # prompt suffix
    input_asm_prompt = before + input_asm.strip() + after
    with open(fileName + '_' + opt_state + '.asm', 'w', encoding='utf-8') as f:
        f.write(input_asm_prompt)
```
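Decompilation: With the `.asm` prompt files written by the preprocessing step, the model can be loaded through Hugging Face `transformers` for inference. The following is a minimal sketch; the checkpoint id `LLM4Binary/llm4decompile-6.7b-v1.5` is assumed from the model table above, and a CUDA GPU is assumed for the bfloat16 weights.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = 'LLM4Binary/llm4decompile-6.7b-v1.5'  # assumed V1.5 checkpoint id; see the table above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + '_' + OPT[0] + '.asm', 'r') as f:  # the O0 prompt produced above
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    # keep prompt + new tokens within the model's 4,096-token context
    outputs = model.generate(**inputs, max_new_tokens=2048)
c_func_decompile = tokenizer.decode(outputs[0][len(inputs[0]):-1])  # strip the prompt and EOS
print(c_func_decompile)
```

The decoded string `c_func_decompile` holds the recovered C function, which can then be checked with test assertions as described in the Metrics section.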