# 🌊 uniflow
<p align="center"> <a href="/LICENSE"><img alt="License Apache-2.0" src="https://img.shields.io/github/license/cambioml/uniflow?style=flat-square"></a> <a href="https://pypi.org/project/uniflow"><img src="https://img.shields.io/pypi/v/uniflow.svg" alt="pypi_status" /></a> <a href="https://github.com/cambioml/uniflow/graphs/commit-activity"><img alt="Commit activity" src="https://img.shields.io/github/commit-activity/m/cambioml/uniflow?style=flat-square"/></a> <a href="https://join.slack.com/t/cambiomlworkspace/shared_invite/zt-1zes33rmt-20Rag043uvExUaUdvt5_xQ"><img src="https://badgen.net/badge/Join/Community/cyan?icon=slack" alt="Slack" /></a> </p>

uniflow provides a unified LLM interface to extract and transform raw documents.
- Document types: Uniflow enables data extraction from PDFs, HTMLs, and TXTs.
- LLM agnostic: Uniflow supports the most commonly used LLMs for text transformation, including
    - OpenAI models (GPT-3.5 and GPT-4),
    - Google Gemini models (Gemini 1.5, MultiModal),
    - AWS Bedrock models,
    - Huggingface open-source models including Mistral-7B,
    - Azure OpenAI models, etc.
## :question: The Problems to Tackle
Uniflow addresses two key challenges in preparing LLM training data for ML scientists:
- First, extracting legacy documents such as PDFs and Word files into clean text that LLMs can learn from is tricky: PDF layouts are complex, and information is often lost during extraction.
- Second, transforming the extracted data into a format suitable for training LLMs is labor-intensive: feedback-based learning techniques require datasets that pair each question with both a preferred and a rejected answer.

Hence, we built Uniflow, a unified LLM interface to extract and transform raw documents.
## :seedling: Use Cases
Uniflow aims to help every data scientist generate their own privacy-preserved, ready-to-use training datasets for LLM finetuning, and hence make finetuning LLMs more accessible to everyone :rocket:.
Check out Uniflow's hands-on solutions:

- Extract financial reports (PDFs) into summaries
- Extract financial reports (PDFs) and finetune financial LLMs
- Extract a math book (HTMLs) into your question-answer dataset
- Extract PDFs into your question-answer dataset
- Build RLHF/RLAIF preference datasets for LLM finetuning
## :computer: Installation

Installing uniflow takes about 5-10 minutes if you follow the 3 steps below:

1. Create a conda environment on your terminal using:

    ```bash
    conda create -n uniflow python=3.10 -y
    conda activate uniflow  # some OS requires `source activate uniflow`
    ```

2. Install the compatible pytorch based on your OS.

    - If you are on a GPU instance, install pytorch based on your CUDA version. You can find your CUDA version via `nvcc -V`.

        ```bash
        pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121  # cu121 means CUDA 12.1
        ```

    - If you are on a CPU instance:

        ```bash
        pip3 install torch
        ```

3. Install `uniflow`:

    ```bash
    pip3 install uniflow
    ```

4. (Optional) If you are running one of the `OpenAI` flows, you will have to set up your OpenAI API key. To do so, create a `.env` file in your root uniflow folder, then add the following line to the `.env` file:

    ```
    OPENAI_API_KEY=YOUR_API_KEY
    ```

5. (Optional) If you are running the `HuggingfaceModelFlow`, you will also need to install the `transformers`, `accelerate`, `bitsandbytes`, and `scipy` libraries:

    ```bash
    pip3 install transformers accelerate bitsandbytes scipy
    ```

6. (Optional) If you are running the `LMQGModelFlow`, you will also need to install the `lmqg` and `spacy` libraries:

    ```bash
    pip3 install lmqg spacy
    ```

Congrats, you have finished the installation!
## :man_technologist: Dev Setup

If you are interested in contributing, here is our preliminary development setup:

```bash
conda create -n uniflow python=3.10 -y
conda activate uniflow
cd uniflow
pip3 install poetry
poetry install --no-root
```
### AWS EC2 Dev Setup

If you are on EC2, you can launch a GPU instance with the following config:

- EC2 `g4dn.xlarge` (if you want to run a pretrained LLM with 7B parameters)
- Deep Learning AMI PyTorch GPU 2.0.1 (Ubuntu 20.04)
    <img src="example/image/readme_ec2_ami.jpg" alt="Alt text" width="50%" height="50%"/>
- EBS: at least 100G
    <img src="example/image/readme_ec2_storage.png" alt="Alt text" width="50%" height="50%"/>
### API keys

If you are running one of the OpenAI flows, you will have to set up your OpenAI API key.

To do so, create a `.env` file in your root uniflow folder. Then add the following line to the `.env` file:

```
OPENAI_API_KEY=YOUR_API_KEY
```
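At runtime, the key just needs to end up in the process environment. As a stdlib-only sketch of how a `.env` file of this shape can be read (uniflow's own loading mechanism may differ, e.g. it may rely on a helper such as `python-dotenv`; this only illustrates the file format):

```python
# Minimal stdlib-only .env loader sketch. Illustrative only -- not uniflow's API.
import os

def load_env_file(path: str = ".env") -> None:
    """Parse KEY=VALUE lines and export them into the process environment."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env_file()`, the key is available as `os.environ["OPENAI_API_KEY"]`.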
## :scroll: Uniflow Manual

### Overview

To use uniflow, follow three main steps:

1. **Pick a `Config`**

    This determines the LLM and the different configurable parameters.

2. **Construct your `Prompts`**

    Construct the context that you want to use to prompt your model. You can configure custom instructions and examples using the `PromptTemplate` class.

3. **Run your `Flow`**

    Run the flow on your input data and generate output from your LLM.

> Note: We are currently building `Preprocessing` flows as well to help process data from different sources, such as `html`, `Markdown`, and more.
### 1. Config

The `Config` determines which LLM is used and how the input data is serialized and deserialized. It also has parameters that are specific to the LLM.

Here is a table of the different pre-defined configurations you can use and their corresponding LLMs:

| Config | LLM |
| ------------- | ------------- |
| `Config` | gpt-3.5-turbo-1106 |
| `OpenAIConfig` | gpt-3.5-turbo-1106 |
| `HuggingfaceConfig` | mistralai/Mistral-7B-Instruct-v0.1 |
| `LMQGConfig` | lmqg/t5-base-squad-qg-ae |

You can run each config with the defaults, or you can pass in custom parameters, such as `temperature` or `batch_size`, to the config for your use case. See the advanced custom configuration section for more details.
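The defaults-with-overrides pattern can be sketched with a plain dataclass. `ExampleConfig` below is illustrative only, not uniflow's actual `Config` class, though `temperature` and `batch_size` are the parameters named above:

```python
# Illustrative config object showing the override pattern; not uniflow's real Config.
from dataclasses import dataclass

@dataclass
class ExampleConfig:
    model_name: str = "gpt-3.5-turbo-1106"  # default LLM from the table above
    temperature: float = 0.9
    batch_size: int = 1

# Run with the defaults...
default_config = ExampleConfig()

# ...or override individual parameters for your use case.
custom_config = ExampleConfig(temperature=0.2, batch_size=8)
```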
### 2. Prompting

By default, uniflow is set up to generate questions and answers based on the `Context` you pass in. To do so, it has a default instruction and few-shot examples that it uses to guide the LLM.

Here is the default instruction:

> Generate one question and its corresponding answer based on the last context in the last example. Follow the format of the examples below to include context, question, and answer in the response.
Here are the default few-shot examples:

```
context="The quick brown fox jumps over the lazy brown dog.",
question="What is the color of the fox?",
answer="brown."

context="The quick brown fox jumps over the lazy black dog.",
question="What is the color of the dog?",
answer="black."
```
To run with these default instructions and examples, all you need to do is pass in a list of `Context` objects to the flow. uniflow will then generate a custom prompt with the instructions and few-shot examples for each `Context` object to send to the LLM. See the Running the flow section for more details.
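Conceptually, the generated prompt stitches together the instruction, the few-shot examples, and your new context, leaving the final question and answer for the LLM to fill in. A stdlib-only sketch of that assembly (uniflow's internal prompt formatting may differ; this only illustrates the idea):

```python
# Illustrative prompt assembly; uniflow's internal formatting may differ.
INSTRUCTION = (
    "Generate one question and its corresponding answer based on the last "
    "context in the last example."
)

FEW_SHOT = [
    {"context": "The quick brown fox jumps over the lazy brown dog.",
     "question": "What is the color of the fox?",
     "answer": "brown."},
    {"context": "The quick brown fox jumps over the lazy black dog.",
     "question": "What is the color of the dog?",
     "answer": "black."},
]

def build_prompt(user_context: str) -> str:
    """Combine the instruction, few-shot examples, and the user's context."""
    parts = [INSTRUCTION]
    for ex in FEW_SHOT:
        parts.append(f"context: {ex['context']}\n"
                     f"question: {ex['question']}\n"
                     f"answer: {ex['answer']}")
    # The last context has no question/answer -- the LLM generates them.
    parts.append(f"context: {user_context}")
    return "\n\n".join(parts)
```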
### Context

The `Context` class is used to pass in the context for the LLM prompt. A `Context` consists of a `context` property, which is a string of text.

To run uniflow with the default instructions and few-shot examples, you can pass in a list of `Context` objects to the flow. For example:

```python
from uniflow.op.prompt import Context

data = [
    Context(
        context="The quick brown fox jumps over the lazy brown dog.",
    ),
    ...
]

client.run(data)
```
For a more detailed overview of running the flow, see the Running the flow section.
### PromptTemplate

If you want to run with a custom prompt instruction or few-shot examples, you can use the `PromptTemplate` class.
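A custom template pairs an instruction with few-shot `Context`-style examples. The dataclasses below are stand-ins that only sketch that shape; the field names are assumptions, so consult `uniflow.op.prompt` for the real `PromptTemplate` signature:

```python
# Illustrative stand-ins for uniflow's Context / PromptTemplate classes;
# field names are assumptions -- consult uniflow.op.prompt for the real API.
from dataclasses import dataclass, field

@dataclass
class ExampleContext:
    context: str
    question: str = ""
    answer: str = ""

@dataclass
class ExamplePromptTemplate:
    instruction: str
    few_shot_prompt: list = field(default_factory=list)

template = ExamplePromptTemplate(
    instruction="Generate one question and its corresponding answer "
                "based on the last context in the last example.",
    few_shot_prompt=[
        ExampleContext(
            context="The quick brown fox jumps over the lazy brown dog.",
            question="What is the color of the fox?",
            answer="brown.",
        ),
    ],
)
```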
