# 📝 AutoPrompt

Auto Prompt is a prompt optimization framework designed to enhance and perfect your prompts for real-world use cases.
The framework automatically generates high-quality, detailed prompts tailored to user intentions. It employs a refinement (calibration) process, where it iteratively builds a dataset of challenging edge cases and optimizes the prompt accordingly. This approach not only reduces manual effort in prompt engineering but also effectively addresses common issues such as prompt sensitivity and inherent prompt ambiguity issues.
Our mission: Empower users to produce high-quality robust prompts using the power of large language models (LLMs).
## Why Auto Prompt?
- Prompt Engineering Challenges. The quality of LLMs greatly depends on the prompts used. Even minor changes can significantly affect their performance.
- Benchmarking Challenges. Creating a benchmark for production-grade prompts is often labour-intensive and time-consuming.
- Reliable Prompts. Auto Prompt generates robust, high-quality prompts, offering measured accuracy and performance enhancement using minimal data and annotation steps.
- Modularity and Adaptability. With modularity at its core, Auto Prompt integrates seamlessly with popular open-source tools such as LangChain, Wandb, and Argilla, and can be adapted for a variety of tasks, including data synthesis and prompt migration.
## System Overview

The system is designed for real-world scenarios, such as moderation tasks, which are often challenged by imbalanced data distributions. The system implements the Intent-based Prompt Calibration method. The process begins with a user-provided initial prompt and task description, optionally including user examples. The refinement process iteratively generates diverse samples, annotates them via user/LLM, and evaluates prompt performance, after which an LLM suggests an improved prompt.
The optimization process can be extended to content generation tasks by first devising a ranker prompt and then performing the prompt optimization with this learned ranker. The optimization concludes upon reaching the budget or iteration limit.
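The refinement loop described above can be sketched in simplified Python. This is an illustration only: every function here is a toy stand-in for an LLM-driven component, and none of the names correspond to the framework's actual API.

```python
import random

# Toy stand-ins for the LLM-driven components (illustration only).
def generate_samples(prompt):       # synthesizer LLM: propose challenging edge cases
    return [f"sample-{random.randint(0, 99)}" for _ in range(5)]

def annotate(samples):              # human (e.g. Argilla) or LLM annotator
    return {s: random.choice(["Yes", "No"]) for s in samples}

def evaluate(prompt, annotations):  # predictor LLM scored against the annotations
    return random.random()

def refine(prompt, score):          # analyzer LLM: suggest an improved prompt
    return prompt + " (refined)"

def calibrate(prompt, num_steps=3):
    """Iteratively generate, annotate, evaluate, and refine."""
    history = []
    for _ in range(num_steps):
        samples = generate_samples(prompt)
        annotations = annotate(samples)
        history.append((prompt, evaluate(prompt, annotations)))
        prompt = refine(prompt, history[-1][1])
    return max(history, key=lambda h: h[1])[0]  # best-scoring prompt seen
```

In the real system the loop also stops when the configured budget is exhausted, not only after a fixed number of steps.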
This joint synthetic data generation and prompt optimization approach outperforms traditional methods while requiring minimal data and iterations. Learn more in our paper Intent-based Prompt Calibration: Enhancing prompt optimization with synthetic boundary cases by E. Levi et al. (2024).
Using GPT-4 Turbo, this optimization typically completes in just a few minutes at a cost of under $1. To manage the costs of LLM token usage, the framework lets users set a budget limit for the optimization, in USD or token count, configured as illustrated here.
## Demo

## 📖 Documentation
- How to install (Setup instructions)
- Prompt optimization examples (Use cases: movie review classification, generation, and chat moderation)
- How it works (Explanation of pipelines)
- Architecture guide (Overview of main components)
## Features
- 📝 Boosts prompt quality with a minimal amount of data and annotation steps.
- 🛬 Designed for production use cases like moderation, multi-label classification, and content generation.
- ⚙️ Enables seamless migration of prompts across model versions or LLM providers.
- 🎓 Supports prompt squeezing: combine multiple rules into a single, efficient prompt.
## QuickStart

AutoPrompt requires Python <= 3.10.
<br />
### Step 1 - Download the project

```bash
git clone git@github.com:Eladlev/AutoPrompt.git
cd AutoPrompt
```
<br />
### Step 2 - Install dependencies

Use Conda, pip, or pipenv, depending on your preference.

Using Conda:

```bash
conda env create -f environment_dev.yml
conda activate AutoPrompt
```

Using pip:

```bash
pip install -r requirements.txt
```

Using pipenv:

```bash
pip install pipenv
pipenv sync
```
<br />
### Step 3 - Configure your LLM

Set your OpenAI API key by updating the configuration file `config/llm_env.yml`.

- If you need help locating your API key, visit this link.
- We recommend using OpenAI's GPT-4 as the LLM. Our framework also supports other providers and open-source models, as discussed here.
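As a reference, the file typically maps a provider to its credentials. The fragment below is an illustrative sketch only; the field names are assumptions and may differ from the actual `config/llm_env.yml` shipped with the repo:

```yaml
# Illustrative only: check config/llm_env.yml in the repo for the real keys.
openai:
  OPENAI_API_KEY: 'sk-...'   # your OpenAI API key
```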
### Step 4 - Configure your Annotator

Select an annotation approach for your project:

- We recommend beginning with a human-in-the-loop method, utilizing Argilla. Note that AutoPrompt is compatible with Argilla V1, not with the latest V2. Follow the Argilla setup instructions, with the following modifications:
  - If you are using local Docker, use the `v1.29.0` tag instead of the `latest` tag.
  - For a quick setup using HF, duplicate the following space.
- Alternatively, you can set up an LLM as your annotator by following these configuration steps.
- The default predictor LLM, GPT-3.5, used for estimating prompt performance, is configured in the `predictor` section of `config/config_default.yml`.
- Define your budget in the input config YAML file using the `max_usage` parameter. For OpenAI models, `max_usage` sets the maximum spend in USD. For other LLMs, it limits the maximum token count.
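For illustration, a budget entry could look like the fragment below. The enclosing section name is an assumption; check your config file for the exact layout:

```yaml
# Illustrative fragment: the section name may differ in your config version.
stop_criteria:
  max_usage: 0.5   # USD cap for OpenAI models; token cap for other LLMs
```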
### Step 5 - Run the pipeline

First, configure your labels by editing `config/config_default.yml`:

```yaml
dataset:
  label_schema: ["Yes", "No"]
```
For a classification pipeline, use the following command from your terminal within the appropriate working directory:

```bash
python run_pipeline.py
```

If the initial prompt and task description are not provided directly as input, you will be guided to provide these details. Alternatively, specify them as command-line arguments:

```bash
python run_pipeline.py \
    --prompt "Does this movie review contain a spoiler? answer Yes or No" \
    --task_description "Assistant is an expert classifier that will classify a movie review, and let the user know if it contains a spoiler for the reviewed movie or not." \
    --num_steps 30
```
You can track the optimization progress using the W&B dashboard, with setup instructions available here.
If you are using pipenv, be sure to activate the environment:

```bash
pipenv shell
python run_pipeline.py
```

or alternatively prefix your command with `pipenv run`:

```bash
pipenv run python run_pipeline.py
```
### Generation pipeline

To run the generation pipeline, use the following example command:

```bash
python run_generation_pipeline.py \
    --prompt "Write a good and comprehensive movie review about a specific movie." \
    --task_description "Assistant is a large language model that is tasked with writing movie reviews."
```
For more information, refer to our generation task example.
<br />

### Benchmark optimization (optimize-only mode)

If you already have an annotated dataset and want to skip sample generation and annotation, use the benchmark optimization script. This mode runs a pure optimization loop: predict → evaluate → refine.
Your dataset should be a CSV file with `text` and `annotation` columns:

```csv
text,annotation
"The movie was absolutely fantastic!",Yes
"Waste of time and money.",No
```
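Before running, you may want to sanity-check that your CSV has the expected shape. The helper below is an illustrative sketch (it is not part of the framework); it verifies the two required columns and that every annotation is in the label schema:

```python
import csv

def validate_dataset(path, labels=("Yes", "No")):
    """Check that a CSV has text/annotation columns and only known labels."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = {"text", "annotation"} - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        rows = list(reader)
    bad = {r["annotation"] for r in rows if r["annotation"] not in labels}
    if bad:
        raise ValueError(f"unexpected labels: {sorted(bad)}")
    return len(rows)  # number of annotated samples
```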
Run the optimization:

```bash
python run_benchmark_optimization.py \
    --dataset path/to/your_data.csv \
    --prompt "Is this movie review positive? Answer Yes or No." \
    --task_description "Classify movie reviews as positive or negative." \
    --labels Yes No \
    --num_steps 10 \
    --output results.json
```
Arguments:

- `--dataset` (required): Path to a CSV with `text` and `annotation` columns
- `--prompt`: Initial prompt to optimize (interactive if omitted)
- `--task_description`: Task description (interactive if omitted)
- `--labels`: Label schema (default: `Yes No`)
- `--num_steps`: Number of optimization iterations (default: 10)
- `--output`: Output JSON file for results (default: `benchmark_results.json`)
- `--config`: Configuration file (default: `config/config_benchmark.yml`)
This is useful when:
- You already have labeled benchmark data
- You want faster iteration without sample generation
- You're fine-tuning a prompt for a specific dataset
Enjoy the results! Completing these steps yields a refined (calibrated) prompt tailored to your task, alongside a benchmark featuring challenging samples, stored in the default dump path.
## Tips

- Prompt accuracy may fluctuate during optimization. To identify the best prompts, we recommend continuous refinement after the initial benchmark is generated. Set the number of optimization iterations with `--num_steps`, and control sample generation by specifying `max_samples`.