# 🏛️ London Historical LLM (aka helloLondon)

**Complete Guide to Building LLMs from Scratch**
Learn to build Large Language Models from the ground up using historical London texts (1500-1850). This comprehensive 4-part series walks you through every step: data collection, custom tokenization, model training, evaluation, and deployment. We build two identical models - only the size differs (117M vs 354M parameters). Includes working code, published models, and educational inference examples.
⚠️ Educational Purpose: This is a learning project designed to teach LLM development concepts. For production-scale LLMs, you'll need much larger datasets, more sophisticated infrastructure, and additional considerations not covered here.
📖 Want to understand the core LLM concepts? This project focuses on implementation and hands-on building. For a deeper understanding of foundational concepts like tokenizers, prompt engineering, RAG, responsible AI, fine-tuning, and more, check out *Generative AI in Action* by Amit Bahree.
🚀 Ready to Use: The London Historical SLM is already published and available on Hugging Face!
📝 Blog Series:
- Building LLMs from Scratch - Part 1 - Quick start and overview
- Building LLMs from Scratch - Part 2 - Data Collection & Custom Tokenizers
- Building LLMs from Scratch - Part 3 - Training Architecture & GPU Optimization
Complete 4-part series covering the end-to-end process of building historical language models from scratch.
```mermaid
graph LR
    A[📚 Data Collection<br/>218+ sources<br/>1500-1850] --> B[🧹 Data Cleaning<br/>Text normalization<br/>Filtering]
    B --> C[🔤 Tokenizer Training<br/>30k vocab<br/>150+ special tokens]
    C --> D[🏋️ Model Training<br/>Two Identical Models<br/>SLM: 117M / Regular: 354M]
    D --> E[📊 Evaluation<br/>Historical accuracy<br/>ROUGE, MMLU]
    E --> F[🚀 Deployment<br/>Hugging Face<br/>Local Inference]
    F --> G[🎯 Use Cases<br/>Historical text generation<br/>Educational projects<br/>Research applications]
    style A fill:#e1f5fe
    style D fill:#f3e5f5
    style F fill:#e8f5e8
    style G fill:#fff3e0
```
## 🎓 What You'll Learn
This isn't just a model repository—it's a complete educational journey that teaches you how to build LLMs from scratch:
**Core LLM Building Skills:**
- Data Collection: Gather and process 218+ historical sources from Archive.org
- Custom Tokenization: Build specialized tokenizers for historical language patterns
- Model Architecture: Implement and train GPT-style models from scratch
- Training Infrastructure: Multi-GPU training, checkpointing, and monitoring
- Evaluation: Comprehensive testing with historical accuracy metrics
- Deployment: Publish to Hugging Face and build inference systems
**Hands-On Experience:**
- Working Code: Every component is fully implemented and documented
- Live Models: Use published models immediately or train your own
- Real Data: 500M+ characters of authentic historical English (1500-1850)
- Educational Focus: Well-structured code designed for learning LLM development
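The evaluation step mentioned above leans on overlap metrics such as ROUGE. At its core, ROUGE-1 is just unigram overlap between a generated text and a reference; a minimal F1 sketch (illustrative only, not the implementation used by the repo's `05_evaluation` scripts):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Identical texts score 1.0; disjoint texts score 0.0, which makes the metric easy to sanity-check during evaluation runs.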
## 📚 Complete Documentation

📖 **Documentation Index:** `08_documentation/README.md` - Browse all guides here!
## 🎯 Two Model Variants - Identical Architecture, Different Sizes
We build two identical models with the same architecture, tokenizer, and training process. The only difference is the number of parameters:
| Model | Parameters | Iterations | Training Time* | Use Case | Best For |
|-------|------------|------------|----------------|----------|----------|
| SLM (Small) | 117M | 60,000 | ~8-12 hours | Fast inference, resource-constrained | Development, testing, mobile |
| Regular (Full) | 354M | 60,000 | ~28-32 hours | High-quality generation | Advanced learning, research, experimentation |
Why Two Models? The SLM is perfect for learning, testing, and resource-constrained environments. The Regular model provides higher quality generation for more advanced experimentation. Both use identical code - just different configuration files!
*Times based on dual GPU training (2x A30 GPUs). Single GPU will take ~2x longer.
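Since the two models share one codebase and differ only in configuration, the split can be pictured as two config objects of the same type. The layer counts and widths below are assumptions based on the stated parameter counts (GPT-2 small/medium shapes), not the repo's actual configuration files; the estimator is a standard back-of-envelope transformer count:

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_layer: int
    n_head: int
    n_embd: int
    vocab_size: int = 30_000   # the project's custom historical tokenizer
    block_size: int = 1024     # context length (assumed)

# Hypothetical shapes consistent with ~117M and ~354M parameters
SLM = GPTConfig(n_layer=12, n_head=12, n_embd=768)
REGULAR = GPTConfig(n_layer=24, n_head=16, n_embd=1024)

def approx_params(c: GPTConfig) -> int:
    """Rough count: token + position embeddings, plus ~12*d^2 per block
    (attention q/k/v/proj = 4*d^2, MLP with 4x expansion = 8*d^2)."""
    embed = c.vocab_size * c.n_embd + c.block_size * c.n_embd
    per_block = 12 * c.n_embd ** 2
    return embed + c.n_layer * per_block
```

The estimate ignores layer norms and biases, so it lands a few percent below the published 117M/354M figures but preserves the ~3x size ratio between the two variants.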
## 🚀 Quick Start - Choose Your Learning Path

### Path 1: Use Published Model (Start Here)
Generate historical text in 2 minutes - Perfect for understanding the end result
🎯 Published Model: The London Historical SLM is ready to use!
📖 Detailed Guide: See Inference Quick Start
✅ Status: Both PyTorch checkpoint and Hugging Face inference working perfectly
Prerequisites:
- Python 3.8+ installed
- Internet connection (to download the model)
⚠️ **Ubuntu/Debian Users:** You also need the `python3-venv` package:

```bash
sudo apt install python3-venv      # For Python 3.8-3.11
sudo apt install python3.12-venv   # For Python 3.12+
```
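With `python3-venv` available, a virtual environment can be created in the usual way (the repo's setup scripts handle this for you; shown here only for orientation):

```shell
python3 -m venv .venv        # create an isolated environment
. .venv/bin/activate         # activate it (same as `source` in bash)
```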
**Quick Setup:**

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 07_utilities/setup_inference.py

# Use the model
python3 06_inference/inference_unified.py --interactive
```
**Try it with Python:**

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model (automatically downloads)
tokenizer = AutoTokenizer.from_pretrained("bahree/london-historical-slm")
model = AutoModelForCausalLM.from_pretrained("bahree/london-historical-slm")

# Generate historical text
prompt = "In the year 1834, I walked through the streets of London and witnessed"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    inputs["input_ids"],
    max_new_tokens=50,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.2,
)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

# Sample output (sampling is stochastic, so yours will vary):
# "the most extraordinary sight. The Thames flowed dark beneath London Bridge,
# whilst carriages rattled upon the cobblestones with great urgency. Merchants called
# their wares from Cheapside to Billingsgate, and the smoke from countless chimneys
# did obscure the morning sun."
```
**Or test with PyTorch checkpoints** (if you have local models):

```bash
# Test SLM checkpoint (117M parameters)
python 06_inference/inference_pytorch.py \
    --checkpoint 09_models/checkpoints/slm/checkpoint-4000.pt \
    --prompt "In the year 1834, I walked through the streets of London and witnessed"

# Test Regular checkpoint (354M parameters)
python 06_inference/inference_pytorch.py \
    --checkpoint 09_models/checkpoints/checkpoint-60001.pt \
    --prompt "In the year 1834, I walked through the streets of London and witnessed"
```
### Path 2: Build from Scratch (Complete Learning Journey)
Learn every step of LLM development - Data collection, tokenization, training, and deployment
Prerequisites:
- Python 3.8+ installed
- 8GB+ RAM (16GB+ recommended)
- 100GB+ free disk space
- CUDA GPU (recommended, CPU works but slower)
⚠️ **Ubuntu/Debian Users:** You also need the `python3-venv` package:

```bash
sudo apt install python3-venv      # For Python 3.8-3.11
sudo apt install python3.12-venv   # For Python 3.12+
```
**Full Setup:**

```bash
# Clone and setup
git clone https://github.com/bahree/helloLondon.git
cd helloLondon
python3 01_environment/setup_environment.py
source activate_env.sh  # Linux/Mac

# Download data (218+ historical sources)
python3 02_data_collection/historical_data_collector.py

# Train tokenizer (30k vocab + 150+ special tokens)
python3 03_tokenizer/train_historical_tokenizer.py

# Train model
torchrun --nproc_per_node=2 04_training/train_model_slm.py  # SLM (recommended)
# or
torchrun --nproc_per_node=2 04_training/train_model.py      # Regular

# Evaluate & test
python3 05_evaluation/run_evaluation.py --mode quick
python3 06_inference/inference_unified.py --interactive
```
📖 Complete Training Guide: See Training Quick Start for detailed instructions
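The tokenizer-training step can be sketched with the Hugging Face `tokenizers` library. The vocabulary size matches the project's 30k target, but the special tokens and corpus shown here are illustrative assumptions; the repo's actual script is `03_tokenizer/train_historical_tokenizer.py` and its special-token list is much larger (150+):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Hypothetical special tokens -- not the repo's actual list
special_tokens = ["<|startoftext|>", "<|endoftext|>"]

tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=30_000,  # matches the project's 30k vocabulary target
    special_tokens=["<|unk|>"] + special_tokens,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# A real run would iterate over the cleaned 500M+ character corpus
corpus = ["Thou hast seen the Thames", "The merchant hath wares at Cheapside"]
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("historical_tokenizer.json")
```

Training a custom tokenizer on period text keeps archaic forms like "hath" and "thou" from fragmenting into many subword pieces, which is the motivation for not reusing a modern-English vocabulary.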
## 🏛️ What You Get

**Historical Language Capabilities:**
- Tudor English (1500-1600): "thou", "thee", "hath", "doth"
- Stuart Period (1600-1700): Restoration language, court speech
- Georgian Era (1700-1800): Austen-style prose, social commentary
- Victorian Times (1800-1850): Dickens-style narrative, industrial language
**London-Specific Knowledge:**
- Landmarks: Thames, Westminster, Tower, Fleet Street, Cheapside
- Historical Events: Great Fire, Plague, Civil War, Restoration
- Social Classes: Nobles, merchants, apprentices, beggars
- Professions: Apothecaries, coachmen, watermen, chimneysweeps
## 📊 Comprehensive Historical Dataset
This project includes one of the most comprehensive collections of historical English texts available for language model training, spanning 1500-1850 with 218+ sources and 500M+ characters.
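Digitized 1500-1850 texts carry artifacts that the cleaning stage must normalize before training: the long s (ſ), ligatures like æ, and words hyphenated across line breaks. A minimal sketch of that kind of normalization (illustrative only; the repo's actual cleaning lives in its data-collection pipeline):

```python
import re

def normalize_historical(text: str) -> str:
    """Normalize common artifacts of digitized early-modern English text."""
    text = text.replace("\u017f", "s")            # long s (ſ) -> s
    text = text.replace("\u00c6", "Ae").replace("\u00e6", "ae")  # Æ/æ ligatures
    text = text.replace("\u0152", "Oe").replace("\u0153", "oe")  # Œ/œ ligatures
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split by hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)           # collapse runs of spaces/tabs
    return text.strip()
```

Example: `normalize_historical("Paradiſe Loſt")` yields `"Paradise Lost"`, and a line-break-hyphenated `"witneſ-\nſed"` becomes `"witnessed"`.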
### 🏛️ Historical Coverage

| Period | Sources | Key Content | Language Features |
|--------
