# 🏛️ London Historical LLM (aka helloLondon)

**Complete Guide to Building LLMs from Scratch**
Learn to build Large Language Models from the ground up using historical London texts (1500-1850). This comprehensive 4-part series walks you through every step: data collection, custom tokenization, model training, evaluation, and deployment. We build two identical models - only the size differs (117M vs 354M parameters). Includes working code, published models, and educational inference examples.
⚠️ Educational Purpose: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you'll need much larger datasets, more sophisticated infrastructure, and additional considerations not covered here.
📖 Want to understand the core LLM concepts? This project focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out *Generative AI in Action* by Amit Bahree.
🚀 Ready to Use: The London Historical SLM is already published and available on Hugging Face!
📝 Blog Series:
- Building LLMs from Scratch - Part 1 - Quick start and overview
- Building LLMs from Scratch - Part 2 - Data Collection & Custom Tokenizers
- Building LLMs from Scratch - Part 3 - Training Architecture & GPU Optimization
Complete 4-part series covering the end-to-end process of building historical language models from scratch.
```mermaid
graph LR
    A[📚 Data Collection<br/>218+ sources<br/>1500-1850] --> B[🧹 Data Cleaning<br/>Text normalization<br/>Filtering]
    B --> C[🔤 Tokenizer Training<br/>30k vocab<br/>150+ special tokens]
    C --> D[🏋️ Model Training<br/>Two Identical Models<br/>SLM: 117M / Regular: 354M]
    D --> E[📊 Evaluation<br/>Historical accuracy<br/>ROUGE, MMLU]
    E --> F[🚀 Deployment<br/>Hugging Face<br/>Local Inference]
    F --> G[🎯 Use Cases<br/>Historical text generation<br/>Educational projects<br/>Research applications]
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
## 🎓 What You'll Learn
This isn't just a model repository—it's a complete educational journey that teaches you how to build LLMs from scratch:
**Core LLM Building Skills:**
- Data Collection: Gather and process 218+ historical sources from Archive.org
- Custom Tokenization: Build specialized tokenizers for historical language patterns
- Model Architecture: Implement and train GPT-style models from scratch
- Training Infrastructure: Multi-GPU training, checkpointing, and monitoring
- Evaluation: Comprehensive testing with historical accuracy metrics
- Deployment: Publish to Hugging Face and build inference systems
**Hands-On Experience:**
- Working Code: Every component is fully implemented and documented
- Live Models: Use published models immediately or train your own
- Real Data: 500M+ characters of authentic historical English (1500-1850)
- Educational Focus: Well-structured code designed for learning LLM development
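The evaluation step mentioned above leans on overlap metrics such as ROUGE. At its core, ROUGE-1 is just unigram overlap between a generated text and a reference; a minimal F1 sketch (illustrative only, not the implementation used by the repo's `05_evaluation` scripts):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical texts score 1.0; disjoint texts score 0.0, which makes the metric easy to sanity-check during evaluation runs.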
## 📚 Complete Documentation

📖 **Documentation Index:** `08_documentation/README.md` - Browse all guides here!
## 🎯 Two Model Variants - Identical Architecture, Different Sizes
We build two identical models with the same architecture, tokenizer, and training process. The only difference is the number of parameters:
| Model | Parameters | Iterations | Training Time* | Use Case | Best For |
|-------|------------|------------|----------------|----------|----------|
| SLM (Small) | 117M | 60,000 | ~8-12 hours | Fast inference, resource-constrained | Development, testing, mobile |
| Regular (Full) | 354M | 60,000 | ~28-32 hours | High-quality generation | Advanced learning, research, experimentation |
Why Two Models? The SLM is perfect for learning, testing, and resource-constrained environments. The Regular model provides higher quality generation for more advanced experimentation. Both use identical code - just different configuration files!
*Times based on dual GPU training (2x A30 GPUs). Single GPU will take ~2x longer.
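Since the two models share one codebase and differ only in configuration, the split can be pictured as two config objects of the same type. The layer counts and widths below are assumptions based on the stated parameter counts (GPT-2 small/medium shapes), not the repo's actual configuration files; the estimator is a standard back-of-envelope transformer count:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int
    n_head: int
    n_embd: int
    vocab_size: int = 30_000   # the project's custom historical tokenizer
    block_size: int = 1024     # context length (assumed)

# Hypothetical shapes consistent with ~117M and ~354M parameters
SLM = GPTConfig(n_layer=12, n_head=12, n_embd=768)
REGULAR = GPTConfig(n_layer=24, n_head=16, n_embd=1024)

def approx_params(c: GPTConfig) -> int:
    """Rough count: token + position embeddings, plus ~12*d^2 per block
    (attention q/k/v/proj = 4*d^2, MLP with 4x expansion = 8*d^2)."""
    embed = c.vocab_size * c.n_embd + c.block_size * c.n_embd
    per_block = 12 * c.n_embd ** 2
    return embed + c.n_layer * per_block
```

The estimate ignores layer norms and biases, so it lands a few percent below the published 117M/354M figures but preserves the ~3x size ratio between the two variants.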
## 🚀 Quick Start - Choose Your Learning Path

### Path 1: Use Published Model (Start Here)
Generate historical text in 2 minutes - Perfect for understanding the end result
🎯 Published Model: The London Historical SLM is ready to use!
📖 Detailed Guide: See Inference Quick Start
✅ Status: Both PyTorch checkpoint and Hugging Face inference working perfectly
Prerequisites:
- Python 3.8+ installed
- Internet connection (to download the model)
⚠️ **Ubuntu/Debian Users:** You also need the `python3-venv` package:

```bash
sudo apt install python3-venv      # For Python 3.8-3.11
sudo apt install python3.12-venv   # For Python 3.12+
```
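With `python3-venv` available, a virtual environment can be created in the usual way (the repo's setup scripts handle this for you; shown here only for orientation):

```shell
python3 -m venv .venv        # create an isolated environment
. .venv/bin/activate         # activate it (same as `source` in bash)
```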
**Quick Setup:**

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 07_utilities/setup_inference.py

# Use the model
python3 06_inference/inference_unified.py --interactive
```
**Try it with Python:**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model (automatically downloads)
tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")

# Generate historical text
prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.2,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

# Sample output (sampling is stochastic, so yours will vary):
# "the most extraordinary sight. The Thames flowed dark beneath London Bridge,
# whilst carriages rattled upon the cobblestones with great urgency. Merchants called
# their wares from Cheapside to Billingsgate, and the smoke from countless chimneys
# did obscure the morning sun."
```
**Or test with PyTorch checkpoints** (if you have local models):

```bash
# Test SLM checkpoint (117M parameters)
python 06_inference/inference_pytorch.py \
    --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
    --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Test Regular checkpoint (354M parameters)
python 06_inference/inference_pytorch.py \
    --checkpoint 09_models/checkpoints/checkpoint-60001.pt \
    --prompt "In the year 1834, I walked through the streets of London and witnessed"
```
### Path 2: Build from Scratch (Complete Learning Journey)
Learn every step of LLM development - Data collection, tokenization, training, and deployment
Prerequisites:
- Python 3.8+ installed
- 8GB+ RAM (16GB+ recommended)
- 100GB+ free disk space
- CUDA GPU (recommended, CPU works but slower)
⚠️ **Ubuntu/Debian Users:** You also need the `python3-venv` package:

```bash
sudo apt install python3-venv      # For Python 3.8-3.11
sudo apt install python3.12-venv   # For Python 3.12+
```
**Full Setup:**

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 01_environment/setup_environment.py
source activate_env.sh  # Linux/Mac

# Download data (218+ historical sources)
python3 02_data_collection/historical_data_collector.py

# Train tokenizer (30k vocab + 150+ special tokens)
python3 03_tokenizer/train_historical_tokenizer.py

# Train model
torchrun --nproc_per_node=2 04_training/train_model_slm.py  # SLM (recommended)
# or
torchrun --nproc_per_node=2 04_training/train_model.py      # Regular

# Evaluate & test
python3 05_evaluation/run_evaluation.py --mode quick
python3 06_inference/inference_unified.py --interactive
```
📖 Complete Training Guide: See Training Quick Start for detailed instructions
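The tokenizer-training step can be sketched with the Hugging Face `tokenizers` library. The vocabulary size matches the project's 30k target, but the special tokens and corpus shown here are illustrative assumptions; the repo's actual script is `03_tokenizer/train_historical_tokenizer.py` and its special-token list is much larger (150+):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical special tokens -- not the repo's actual list
special_tokens = ["<|startoftext|>", "<|endoftext|>"]

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # matches the project's 30k vocabulary target
    special_tokens=["<|unk|>"] + special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# A real run would iterate over the cleaned 500M+ character corpus
corpus = ["Thou hast seen the Thames", "The merchant hath wares at Cheapside"]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("historical_tokenizer.json")
```

Training a custom tokenizer on period text keeps archaic forms like "hath" and "thou" from fragmenting into many subword pieces, which is the motivation for not reusing a modern-English vocabulary.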
## 🏛️ What You Get

**Historical Language Capabilities:**
- Tudor English (1500-1600): "thou", "thee", "hath", "doth"
- Stuart Period (1600-1700): Restoration language, court speech
- Georgian Era (1700-1800): Austen-style prose, social commentary
- Victorian Times (1800-1850): Dickens-style narrative, industrial language
**London-Specific Knowledge:**
- Landmarks: Thames, Westminster, Tower, Fleet Street, Cheapside
- Historical Events: Great Fire, Plague, Civil War, Restoration
- Social Classes: Nobles, merchants, apprentices, beggars
- Professions: Apothecaries, coachmen, watermen, chimneysweeps
## 📊 Comprehensive Historical Dataset
This project includes one of the most comprehensive collections of historical English texts available for language model training, spanning 1500-1850 with 218+ sources and 500M+ characters.
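Digitized 1500-1850 texts carry artifacts that the cleaning stage must normalize before training: the long s (ſ), ligatures like æ, and words hyphenated across line breaks. A minimal sketch of that kind of normalization (illustrative only; the repo's actual cleaning lives in its data-collection pipeline):

```python
import re

def normalize_historical(text: str) -> str:
    """Normalize common artifacts of digitized early-modern English text."""
    text = text.replace("\u017f", "s")            # long s (ſ) -> s
    text = text.replace("\u00c6", "Ae").replace("\u00e6", "ae")  # Æ/æ ligatures
    text = text.replace("\u0152", "Oe").replace("\u0153", "oe")  # Œ/œ ligatures
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split by hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces/tabs
    return text.strip()
```

Example: `normalize_historical("Paradiſe Loſt")` yields `"Paradise Lost"`, and a line-break-hyphenated `"witneſ-\nſed"` becomes `"witnessed"`.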
### 🏛️ Historical Coverage

| Period | Sources | Key Content | Language Features |
|--------
