Refinery
The data scientist's open-source choice to scale, assess and maintain natural language data. Treat training data like a software artifact.

Does one of these scenarios sound familiar to you?
- You are working on your own side-project in NLP, but you don't have enough labeled data to train a good model.
- You are working in a team and already have some labeled data, but your training data is just stored in a spreadsheet or some TXT-file, and you have no idea how good it actually is.
- You are working in a team about to start a new project with limited resources (annotators, budget, time), and now want to understand how to make the best use of them.
If so, you are one of the people we've built refinery for. refinery helps you build better NLP models with a data-centric approach. Semi-automate your labeling, find low-quality subsets in your training data, and monitor your data in one place.
refinery doesn't get rid of manual labeling, but it makes sure that your valuable time is spent well. The makers of refinery are also working on integrations with other labeling tools, so that you can easily switch between them.

DEMO: You can interact with the application in a (mostly read-only) online playground. Check it out here
refinery is a multi-repository project; you can find all integrated services in the architecture below. The app builds on top of 🤗 Hugging Face and spaCy to leverage pre-built language models for your NLP tasks, as well as qdrant for neural search.
Table of contents
- 🧑‍💻 Why refinery?
- Your benefits
- How does Kern AI make money, if refinery is open-source?
- 🤓 Features
- ☕ Installation
- 📘 Documentation and tutorials
- 😵‍💫 Need help?
- 🪢 Community and contact
- 🙌 Contributing
- ❓ FAQ
- 🐍 Python SDK
- 🏠 Architecture
- 🏫 Glossary
- 👩‍💻👨‍💻 Team and contributors
- 🌟 Star History
- 📃 License
🧑‍💻 Why refinery?
There are already many other tools available to build training data. Why did we decide to build yet another one?
Enabling ideas of one-person-armies
We believe that developers can have crazy ideas, and we want to lower the barrier for them to go for that idea. refinery is designed to build labeled training data much faster, so that it takes you very little time to prototype an idea. We've received much love for exactly that, so make sure to give it a try for your next project.
Extending your existing labeling approach
refinery is more than a labeling tool. It has a built-in labeling editor, but its main advantages come from automation and data management. You can integrate any kind of heuristic to label whatever can be labeled automatically, and then focus on the headache-causing subsets. Whether you do the labeling in refinery or in any other tool (or even via crowd labeling) doesn't matter!
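As a flavor of what such a heuristic can look like, here is a minimal sketch of a labeling-function-style rule. The record shape (a plain dict with a `text` field) is an assumption for illustration; refinery's actual record interface may differ.

```python
# Sketch of a labeling-function-style heuristic.
# Assumption: a record is a plain dict with a "text" field --
# refinery's actual record interface may look different.
def label_urgent(record):
    """Return a label for records that look urgent, else None (abstain)."""
    if "urgent" in record["text"].lower():
        return "urgent"
    return None
```

Heuristics like this don't need to be perfect; the point is that many weak, noisy rules can be combined and weighed against each other, and abstaining (returning `None`) is always allowed.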
Put structure into unstructured data
refinery is the tool that brings new perspectives into your data. Working on multilingual, human-written texts? Via our integration with bricks, you can easily enrich your texts with metadata such as the detected language or sentence complexity. You can use this both to analyze your data and to orchestrate your labeling workflow.
Pushing collaboration
While doing so, we aim to improve the collaboration between engineers and subject matter experts (SMEs). In the past, we've seen our application used in meetings to discuss label patterns in the form of labeling functions and distant supervisors. We believe that data-centric AI is the best way to leverage collaboration.
Open-source, and treating training data as a software artifact
We hate the idea that there are still use cases in which the training data is just a plain CSV file. That is fine if you just want to quickly prototype something with a few records, but any serious software should be maintainable. We believe an open-source solution for training data management is what's needed here. refinery helps you document your data. That's how you treat training data as a software artifact.
Integrations
Lastly, refinery supports SDK actions like pulling and pushing data. Data-centric AI redefines labeling as an iterative workflow rather than a one-time job, so we aim to provide end-to-end capabilities that grow the availability of high-quality training data. Use our SDK to program integrations with your existing landscape.
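To sketch what such an iterative pull → filter → push loop could look like: the client methods (`get_record_export`, `post_records`) and record fields used below are illustrative assumptions made for this example, not refinery's documented SDK surface.

```python
# Illustrative sketch of an iterative data-centric loop via an SDK client.
# Assumption: the client exposes get_record_export() returning a list of
# dicts and post_records() for pushing data back -- these names are
# hypothetical, not refinery's confirmed API.
def route_low_confidence(client, threshold=0.7):
    """Pull labeled records and send uncertain ones back for manual review."""
    records = client.get_record_export()
    unsure = [r for r in records if r.get("confidence", 0.0) < threshold]
    client.post_records(unsure)  # queue the uncertain slice for relabeling
    return unsure
```

The design point is that labeling becomes a loop you can script: pull the current state, decide programmatically which records deserve human attention, and hand only those back.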
Your benefits
You can automate tons of repetitive tasks, gain better insights into the data labeling workflow, receive implicit documentation for your training data, and ultimately build better models in less time.
Our goal is to make training data building feel more like a programmatic and enjoyable task, instead of something tedious and repetitive. refinery is our contribution to this goal. And we're constantly aiming to improve this contribution.
If you like what we're working on, please leave a ⭐!
How does Kern AI make money, if refinery is open-source?
You won't believe how often we get that question - and it is a fair one 🙂 In short, the open-source version of refinery is currently a single-user version; you can get access to a multi-user environment with our commercial options. Additionally, we offer commercial products on top of refinery, e.g. using refinery's automations as an actual real-time prediction API.
Generally, we are passionate about open-source and want to contribute as much as possible.
🤓 Features
For a detailed overview of features, please look into our docs.
