bin2ml
bin2ml is a command line tool to extract machine learning ready data from software binaries. It is aimed at researchers and hackers who want to easily extract data from binaries that is suitable for training machine learning models, such as natural language processing (NLP) or graph neural network (GNN) models.
- Extract a range of different data from binaries, such as Attributed Control Flow Graphs, Basic Block random walks and function instruction strings, powered by Radare2.
- Multithreaded data processing throughout, powered by Rayon.
- Save processed data in ready-to-go formats, such as graphs saved as NetworkX compatible JSON objects.
- Experimental support for creating machine learning embedded basic block CFGs using tch-rs and TorchScript traced models.
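As a rough illustration of how a NetworkX compatible JSON graph might be consumed downstream, the sketch below parses a node-link style document using only Python's standard library and builds a successor map. The layout (a `nodes` list with `id` fields and a `links` or `edges` list with `source`/`target` fields) follows NetworkX's node-link convention. The exact fields bin2ml emits are not described here, so the sample data and the helper name `load_successors` are assumptions for illustration only.

```python
import json

# Hypothetical sample in NetworkX node-link style. Real bin2ml output
# may carry different or additional attributes on nodes and edges.
SAMPLE = """
{
  "directed": true,
  "multigraph": false,
  "graph": {},
  "nodes": [{"id": 0}, {"id": 1}, {"id": 2}],
  "links": [
    {"source": 0, "target": 1},
    {"source": 0, "target": 2},
    {"source": 1, "target": 2}
  ]
}
"""


def load_successors(text):
    """Return {node_id: [successor ids]} from node-link JSON text.

    Edges are treated as directed (source -> target).
    """
    data = json.loads(text)
    successors = {node["id"]: [] for node in data.get("nodes", [])}
    # NetworkX versions differ on whether edges live under "links" or "edges".
    edges = data.get("links", data.get("edges", []))
    for edge in edges:
        successors.setdefault(edge["source"], []).append(edge["target"])
        # Make sure the target is present even if "nodes" omitted it.
        successors.setdefault(edge["target"], [])
    return successors


cfg = load_successors(SAMPLE)
for node_id, targets in cfg.items():
    print(node_id, "->", targets)
```

If NetworkX is installed, `networkx.readwrite.json_graph.node_link_graph` can rebuild a full graph object from the same kind of dictionary.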
bin2ml is under active development and is in an alpha state. Things will change as the tool is developed and built upon further.
Pre-Requisites
- Radare2 installed. See the Radare2 project's installation instructions for how to do this.
Quickstart
git clone https://github.com/br0kej/bin2ml
cd bin2ml
cargo build --release
Alternatively, two Dockerfiles are provided. Dockerfile.build can be used to build the bin2ml binary without needing cargo on your workstation, while Dockerfile builds bin2ml and also installs radare2, so that processing can be done within the container.
Docs
bin2ml comes with some documentation (albeit incomplete), which is written using mdbook. To serve the documentation locally, install the version of mdbook for your platform and then run the commands below:
cd bin2ml/docs
mdbook serve
Alternatively, the raw documentation files can be read directly in the docs folder of the repository.
License
The bin2ml source and documentation are released under the MIT license.
Citation
@misc{collyer2023bin2ml,
  author = {Josh Collyer},
  title = {bin2ml},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/br0kej/bin2ml/}},
}
