
PassLLM: AI-Based Targeted Password Guessing

Paper · License: GPL v3 · PyTorch · Python 3.10+ · CUDA · Open in Colab · Email

About The Project

PassLLM is the world's most accurate targeted password-guessing framework, outperforming other models by 15% to 45% in most scenarios. It uses Personally Identifiable Information (PII) - such as names, birthdays, phone numbers, emails, and previous passwords - to predict the specific passwords a target is most likely to use. The model fine-tunes 7B/4B-parameter LLMs on millions of leaked PII records using LoRA, enabling a private, high-accuracy framework that runs entirely on consumer PCs.

<img src="https://github.com/user-attachments/assets/00cafb1e-1c28-4c50-9e12-9e00ad33a32f" alt="PassLLM Demo" width="52%">

Capabilities

  • State-of-the-Art Accuracy: Achieves 15%-45% higher success rates than leading benchmarks (RankGuess, TarGuess) in most scenarios.
  • PII Inference: With sufficient information, it successfully guesses the passwords of 12.5%-31.6% of typical users within just 100 guesses.
  • Efficient Fine-Tuning: Custom training loop utilizing LoRA to lower VRAM usage without sacrificing model reasoning capabilities.
  • Advanced Inference: Implements the paper's algorithm to maximize probability, prioritizing the most likely candidates over random sampling.
  • Data-Driven: Can be trained on millions of real-world credentials to learn the deep statistical patterns of human password creation.
  • Pre-trained Weights: Includes robust models pre-trained on millions of real-world records from major PII breaches (e.g., Post Millennial, ClixSense) combined with the COMB dataset.
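The "Advanced Inference" strategy above - ranking candidates by model probability rather than sampling at random - can be illustrated with a toy character-level model. This is only a sketch of the idea: the `BIGRAM` table and `top_candidates` helper are made up for illustration, while the real engine conditions a LoRA-tuned LLM on the target's PII.

```python
import heapq
import math

# Toy character model: P(next_char | prev_char). Illustrative only --
# the real tool uses a LoRA-tuned LLM conditioned on the target's PII.
BIGRAM = {
    "^": {"a": 0.6, "b": 0.4},            # "^" marks start-of-password
    "a": {"1": 0.5, "b": 0.3, "$": 0.2},  # "$" marks end-of-password
    "b": {"a": 0.5, "$": 0.5},
    "1": {"$": 1.0},
}

def top_candidates(k, max_len=6):
    """Best-first search: always expand the highest-probability prefix,
    so candidates are emitted in strictly decreasing probability order."""
    heap = [(0.0, "^")]  # (negative log-probability, prefix)
    results = []
    while heap and len(results) < k:
        neg_lp, prefix = heapq.heappop(heap)
        last = prefix[-1]
        if last == "$":
            # Reached end-of-password: emit the candidate and its probability.
            results.append((prefix[1:-1], math.exp(-neg_lp)))
            continue
        if len(prefix) > max_len:
            continue
        for ch, p in BIGRAM.get(last, {}).items():
            heapq.heappush(heap, (neg_lp - math.log(p), prefix + ch))
    return results

print(top_candidates(3))
# -> [('a1', 0.3), ('b', 0.2), ('a', 0.12)] (up to floating-point rounding)
```

Because the frontier is a priority queue over log-probabilities, the first k completions popped are guaranteed to be the k most likely strings under the model - the same reason the engine can prioritize likely candidates instead of sampling.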

Use Guide

Tip: You can run this tool instantly without any local installation by opening our Google Colab Demo, providing your target's PII, and simply executing each cell in order.

Installation

  • Python: 3.10+
  • Password Guessing: Runs on any GPU, NVIDIA or AMD. A standard CPU or Apple Silicon Mac (M1/M2) is also sufficient to run the pre-trained model.
  • Training: NVIDIA GPU with CUDA (RTX 3090/4090 recommended, Google Colab's free tier is often enough).
# 1. Clone the repository
   git clone https://github.com/tzohar/PassLLM.git
   cd PassLLM

# 2. Install dependencies (Choose one)
   # Option A: Install from requirements (Recommended)
   pip install -r requirements.txt
   
   # Option B: Manual install
   pip install torch torch-directml "transformers<5.0.0" peft datasets bitsandbytes accelerate gradio

Configuration

Download the trained weights (~126 MB) and place them in the models/ directory. Alternatively, via terminal:

curl -L https://github.com/Tzohar/PassLLM/releases/download/v1.3.0/PassLLM-Qwen3-4B-v1.0.pth -o models/PassLLM-Qwen3-4B-v1.0.pth

Once installed and downloaded, adjust the settings in the WebUI or src/config.py to match your hardware.

| Hardware | OS | Device | 4-Bit Quantization | Torch DType | Inference Batch Size |
| --- | --- | --- | --- | --- | --- |
| NVIDIA | Any | cuda | ✅ On (Recommended) | bfloat16 | High (64+) |
| AMD | Windows | dml | ❌ Off | float16 | Low (8-16) |
| AMD (RDNA 3+) | Linux/WSL | cuda | ❌ Off | bfloat16 | Medium (64+) |
| AMD (Older) | Linux/WSL | cuda | ❌ Off | float16 | Low (8-16) |
| CPU | Any | cpu | ❌ Off | float32 | Low (1-4) |

Note (AMD on Linux/WSL): DirectML (dml) is Windows-only. For AMD GPUs on Linux or WSL, you must install ROCm and PyTorch for ROCm. Once installed, set DEVICE = "cuda", as ROCm uses the CUDA API. 4-bit quantization (bitsandbytes) is not officially supported on ROCm. Newer AMD GPUs (RDNA 3 / RX 7000 series, MI200/MI300) have native bfloat16 support; use it for significant speed improvements.

Tip: Don't forget to customize the Min/Max Password Length, Character Bias, and Epsilon (search strictness) according to your specific target's needs!
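As a rough guide, the NVIDIA row of the table might translate into src/config.py values like the following. The option names here are illustrative guesses, not the project's exact identifiers - check the real src/config.py for the authoritative names.

```python
# Illustrative src/config.py values for the NVIDIA row of the table above.
# The option names are assumptions -- consult the real src/config.py.
DEVICE = "cuda"           # "dml" for AMD on Windows, "cpu" as a fallback
USE_4BIT = True           # bitsandbytes 4-bit quantization (NVIDIA only)
TORCH_DTYPE = "bfloat16"  # "float16" on older GPUs, "float32" on CPU
INFER_BATCH_SIZE = 64     # lower to 8-16 on low-VRAM or DirectML setups

# Search settings mentioned in the tip above (names also illustrative)
MIN_PW_LEN = 6
MAX_PW_LEN = 16
EPSILON = 0.02            # search strictness: lower = wider, slower search
```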

Password Guessing (Pre-Trained)

You can use the graphical interface (WebUI) or the command line to generate candidates.

Option A: WebUI (Recommended)

  1. Launch the Interface:
python webui.py

  2. Generate:
  • Open the local URL (e.g., http://127.0.0.1:7860).
  • Select Model: Choose the most recent model from the dropdown.
  • Enter PII: Fill in the target's Name, Email, Birth Year, etc., into the form.
  • Click Generate: The engine will stream ranked candidates in real-time.

Option B: Command Line (CLI)

Best for automation or headless servers.

  1. Create a Target File: Add a target.jsonl file (or use the existing one) in the main folder. You can include any field defined in src/config.py.
{
  "name": "Johan P.", 
  "birth_year": "1966",
  "email": "johan66@gmail.com",
  "sister_pw": "Johan123"
}

  2. Run the Engine:
python app.py --file target.jsonl --weights models/PassLLM-Qwen3-4B-v1.0.pth --fast 

  • --file: Path to your target PII file.
  • --weights: Path to your downloaded model weights (e.g., the .pth file).
  • --fast: Uses an optimized, shallow beam search (omit for full deep search).
  • --superfast: Very quick but less accurate, mainly for testing.
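For scripted or batch workflows, the target file can also be generated programmatically. This sketch only uses the fields shown in the example above; any other keys you add must match the schema defined in src/config.py.

```python
import json

# Build a target record using the same fields as the example above.
# Any additional keys must match the schema defined in src/config.py.
target = {
    "name": "Johan P.",
    "birth_year": "1966",
    "email": "johan66@gmail.com",
    "sister_pw": "Johan123",
}

# One JSON object per line (.jsonl), as the engine expects.
with open("target.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(target) + "\n")

# Sanity-check: every line must parse back as a JSON object.
with open("target.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
print(records[0]["name"])  # -> Johan P.
```

Writing one object per line (rather than a pretty-printed JSON array) is what makes the file valid JSONL, so multiple targets can be appended and processed independently.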

Training From Databases

To reproduce the paper's results or train on a new breach, you must provide a dataset of PII-to-Password pairs.

  1. Prepare Your Dataset: Create a file at training/passllm_raw_data.jsonl. Each line must be a valid JSON object containing a pii dictionary and the target output password.

    Example passllm_raw_data.jsonl:

    {"pii": {"name": "Alice", "birth_year": "1990"}, "output": "Alice1990!"}
    {"pii": {"email": "bob@test.com", "sister_pw": "iloveyou"}, "output": "iloveyou2"}
    

    Note: Ensure your keys (e.g., first_name, email) match the schema defined in src/config.py.

  2. Configure Parameters: Edit src/config.py to match your hardware and dataset specifics:

    # Hardware Settings
    TRAIN_BATCH_SIZE = 4           # Lower to 1 or 2 if hitting OOM on consumer GPUs
    GRAD_ACCUMULATION = 16   # Simulates larger batches (Effective Batch = 4 * 16 = 64)
    
    # Model Settings
    LORA_R = 16              # Rank dimension (Keep at 16 for standard reproduction)
    VOCAB_BIAS_DIGITS = -4.0 # Penalty strength for non-password patterns
    
  3. Start Training:

    python train.py
    

    This script automates the full pipeline:

    • Freezes the base model (Mistral/Qwen).
    • Injects Trainable LoRA adapters into Attention layers.
    • Masks the loss function so the model only learns to predict the password, not the PII.
    • Saves the lightweight adapter weights to models/PassLLM_LoRA_Weights.pth.
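The loss-masking step above is the key trick: PII tokens are excluded from the loss so gradients only flow through the password tokens. A framework-agnostic sketch of the idea follows - the token IDs are made up, and the -100 "ignore" convention is borrowed from Hugging Face-style trainers, which the project's actual training loop may or may not follow exactly.

```python
# Sketch of PII-masked labels, assuming the Hugging Face-style convention
# that a label of -100 is skipped by the cross-entropy loss.
IGNORE_INDEX = -100

def mask_pii_labels(input_ids, prompt_len):
    """Copy input_ids as labels, but ignore the first prompt_len tokens
    (the PII prompt) so the model only learns to predict the password."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels

# Toy sequence: 5 PII-prompt tokens followed by 3 password tokens.
seq = [101, 17, 42, 8, 99, 301, 302, 303]
print(mask_pii_labels(seq, prompt_len=5))
# -> [-100, -100, -100, -100, -100, 301, 302, 303]
```

Without this mask, the model would waste capacity learning to reproduce the PII prompt itself; with it, every gradient step is spent on the PII-to-password mapping.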

Results & Demo

{"name": "Marcus Thorne", "birth_year": "1976", "username": "mthorne88", "country": "Canada"}:

$ python app.py --file target.jsonl --superfast

--- TOP CANDIDATES ---
CONFIDENCE | PASSWORD
------------------------------
1.96%     | marcus1976   
1.91%     | thorne1976 
1.20%     | mthorne1976 
1.19%     | marc1976 (marc is a common diminutive of Marcus, used in many passwords) 
1.18%     | a123456 (a high-probability global baseline across users with similar PII) 
1.16%     | marci1976 (another common variation of Marcus)
1.01%     | winniethepooh (our training dataset demonstrated Winnie-related passwords to be common in Canada)
... (907 passwords generated)

{"name": "Elena Rodriguez", "birth_year": "1995", "birth_month": "12", "birth_day": "04", "email": "elena1.rod51@gmail.com", "id":"489298321"}:

$ python app.py --file target.jsonl --fast

--- TOP CANDIDATES ---
CONFIDENCE | PASSWORD
------------------------------
8.55%     | elena1204 (all variations of name + birth date are naturally given very high probability)
8.16%     | elena1995
7.77%     | elena951204     
6.29%     | elena9512
5.37%     | Elena1995
5.32%     | elena1.rod51 
5.00%     | 120495
... (5,895 passwords generated)

{"name": "Sophia M. Turner", "birth_year": "2001", "pet_name": "Fluffy", "username": "soph_t", "email": "sturner99@yahoo.com", "country": "England", "sister_pw": ["soph12345", "13rockm4n", "01mamamia"]}:

$ python app.py --file target.jsonl --fast

--- TOP CANDIDATES ---
CONFIDENCE | PASSWORD
------------------------------
2.93%     | sophia123 (this is a mix of the target's first name and the sister password "soph12345")       
2.53%     | mamamia01 (a simple variation of another sister password)       
1.96%     | sophia2001     
1.78%     | sophie123 (UK passwords often interchange between "sophie" and "sophia")
1.45%     | 123456a (a very common password, ranked high due to the "12345" pattern) 
1.39%     | soph