NextSteamGame
A search engine based on similarity, using a hierarchical genre umbrella system with vector similarity
Steam Recommender
Find your new favorite game through game similarity. This algorithm attempts to reward video games that can't afford advertising.
Live Demo: https://nextsteamgame.com/
Quick Start
1. Set Up Environment Variables
# Copy the example environment file
cp .env.example .env
# Edit .env and add your OpenAI API key
# Get your key from: https://platform.openai.com/api-keys
nano .env # or use your preferred editor
2. Install Dependencies
# Create virtual environment
python3 -m venv venv
# Activate virtual environment
# On Linux/Mac:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
# Install required packages
pip install -r requirements.txt
3. Run the Application
python app.py
The app will be available at http://localhost:5000
Why This Exists
Ideally, this is a one-shot app that gives you exactly what you were looking for on the first try. If it doesn't, we have done something wrong.
How This Works
Steam Recommender creates tags from three sources: Steam reviews, professional reviews, and video analysis. It weights each tag and adds "unique" tags that separate a game from others in its genre. All data is stored in an indexed SQLite database for sub-second searches.
The Algorithm
Hierarchical Genre Tree + Vector Similarity:
- 80% descriptive tags - Core gameplay elements (combat, exploration, story)
- 20% unique-in-genre tags - What makes this game special within its category
Three-tier niche carving:
main_genre → sub_genre → sub_sub_genre
Broad Category → Specific Style → Unique Defining Element
Example: Action → Methodical Combat → Interconnected World (Dark Souls)
Example: Strategy → Turn-Based → Deck Building (Slay the Spire)
Example: Action → Platformer → Stamina-Based Combat (Hollow Knight)
Similarity rewards by niche specificity:
- Same sub_sub_genre: 0.4 bonus (shares unique defining trait)
- Same sub_genre: 0.25 bonus (similar gameplay style)
- Same main_genre: 0.15 bonus (broad category match)
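The scoring described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: it assumes the 80/20 split is a weighted sum of two cosine similarities, that only the deepest matching tier earns a bonus (tiers are not summed), and that tags are plain weight dictionaries rather than TF-IDF vectors. The function names and the sample tag weights are hypothetical.

```python
import math

# Bonus values come from the list above; applying only the deepest matching
# tier (rather than summing tiers) is an assumption of this sketch.
TIER_BONUS = {"sub_sub_genre": 0.4, "sub_genre": 0.25, "main_genre": 0.15}


def cosine(a, b):
    """Cosine similarity between two sparse tag-weight dicts."""
    dot = sum(w * b.get(tag, 0.0) for tag, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hierarchy_bonus(a, b):
    """Return the bonus for the deepest tier on which two games agree."""
    if a["main_genre"] != b["main_genre"]:
        return 0.0
    if a["sub_genre"] != b["sub_genre"]:
        return TIER_BONUS["main_genre"]
    if a["sub_sub_genre"] != b["sub_sub_genre"]:
        return TIER_BONUS["sub_genre"]
    return TIER_BONUS["sub_sub_genre"]


def similarity(a, b):
    """80% descriptive-tag similarity + 20% unique-tag similarity + tier bonus."""
    base = (0.8 * cosine(a["descriptive"], b["descriptive"])
            + 0.2 * cosine(a["unique"], b["unique"]))
    return base + hierarchy_bonus(a, b)


# Genre paths are taken from the examples above; tag weights are made up.
dark_souls = {
    "main_genre": "Action",
    "sub_genre": "Methodical Combat",
    "sub_sub_genre": "Interconnected World",
    "descriptive": {"combat": 1.0, "exploration": 0.8},
    "unique": {"bonfires": 1.0},
}
slay_the_spire = {
    "main_genre": "Strategy",
    "sub_genre": "Turn-Based",
    "sub_sub_genre": "Deck Building",
    "descriptive": {"cards": 1.0, "roguelike": 0.7},
    "unique": {"relics": 1.0},
}

print(similarity(dark_souls, slay_the_spire))
```

Under these assumptions a score can exceed 1.0, because the tier bonus is added on top of the weighted cosine term; only the relative ranking of candidates matters.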
Architecture
Clean Separation of Concerns
├── frontend/ # Web interface
│ ├── static/ # CSS, images, JS
│ └── templates/ # HTML templates
├── backend/ # Core engine
│ ├── api/ # Flask routes & endpoints
│ ├── core/ # Game search & similarity engine
│ ├── config/ # Dynamic configuration
│ └── database_builder/ # Data pipeline
├── data/ # Databases & models
└── logs/ # Application logs
Tech Stack
- Python - Unified language for entire pipeline
- Flask - Web framework for recommendation API
- SQLite - Hierarchical game database with vector storage
- OpenAI GPT-3.5 - AI-powered tag generation from reviews
- scikit-learn - TF-IDF vectorization for similarity matching
- Beautiful Soup & Selenium - Web scraping for professional reviews
Getting Started
Prerequisites
- Python 3.8+ with pip
- OpenAI API Account - Required for review analysis
- Chrome/Chromium Browser - Required for IGN scraping (optional)
- 3+ days of runtime - Due to API rate limiting
- $100-300 budget - Estimated OpenAI API costs (see Cost Breakdown)
Installation
# Clone the repository
git clone https://github.com/yourusername/Steam_Reccomender.git
cd Steam_Reccomender
# Install Python dependencies
pip install -r requirements.txt
# Set up environment variables (choose one method):
# Method 1: Use .env file (Recommended)
cp .env.example .env
# Edit .env and add your OpenAI API key
# Method 2: Export directly (temporary)
export OPENAI_API_KEY="your-openai-api-key-here"
export FLASK_SECRET_KEY="your-secure-random-key"
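Before starting a long pipeline run, it can help to confirm that the variables from Method 2 are actually visible to Python. The check below is a small sketch: the helper name `missing_env_vars` is hypothetical, and treating both variables as required is an assumption. It only sees variables exported in the current shell; values that live solely in a `.env` file will not appear unless something has loaded that file first.

```python
import os

# Variable names are the two exported above; treating both as required
# is an assumption of this sketch.
REQUIRED_VARS = ("OPENAI_API_KEY", "FLASK_SECRET_KEY")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("Environment looks complete.")
```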
Building the Database (New Modular System)
The database building process has been completely refactored into a modular, stage-based pipeline with advanced checkpointing, error recovery, and monitoring capabilities.
Quick Start with New Modular System
# Run complete pipeline (NEW - RECOMMENDED)
python database_builder.py
# Run specific stage only (NEW)
python database_builder.py --stage data_collection
python database_builder.py --stage review_analysis
python database_builder.py --stage database_creation
# Check pipeline status (NEW)
python database_builder.py --status
# Reset pipeline if needed (NEW)
python database_builder.py --reset
Stage 1: Data Collection (~2 hours)
python database_builder.py --stage data_collection
Enhanced Features:
- Smart checkpointing: Resume from interruptions automatically
- Progress tracking: Real-time progress indicators
- Batch processing: Configurable batch sizes for optimal performance
- Error recovery: Intelligent retry mechanisms with exponential backoff
Outputs: steamspy_all_games.db, steam_api.db
Cost: FREE (only API rate limits)
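For readers unfamiliar with the error-recovery feature listed above, retrying with exponential backoff generally looks like the sketch below. It is a generic illustration, not the pipeline's own code: the function name, the `base_delay` parameter, and the doubling schedule are assumptions.

```python
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))
```

With the defaults, a call that keeps failing waits 1, 2, and 4 seconds between attempts before the last error is re-raised. The `sleep` parameter exists so the delay can be replaced in tests.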
Stage 2: Review Analysis (~1-2 days)
python database_builder.py --stage review_analysis
Enhanced Features:
- Cost estimation: Real-time OpenAI API cost projections
- Granular checkpointing: Resume from exact interruption point
- Quality filtering: Advanced spam and toxicity detection
- Professional reviews: Optional IGN review integration
Outputs: Analysis JSON files, hierarchical classification data
Cost: $100-300 (OpenAI API usage)
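Since this stage is both the slowest and the only paid one, resuming without re-analyzing finished games matters most here. The sketch below shows one common way to checkpoint such a loop. It is not the project's implementation: the JSON checkpoint format, the function names, and the `analyze` callback are all hypothetical, and `checkpoint_interval` merely mirrors the setting of the same name in the Configuration section.

```python
import json
import os


def load_done(path):
    """Return the set of app IDs already processed, or an empty set."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(path, done):
    """Write to a temp file, then rename, so an interruption cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp, path)


def process_all(app_ids, analyze, path, checkpoint_interval=100):
    """Analyze each app ID once, skipping those already in the checkpoint."""
    done = load_done(path)
    since_save = 0
    for app_id in app_ids:
        if app_id in done:
            continue
        analyze(app_id)
        done.add(app_id)
        since_save += 1
        if since_save >= checkpoint_interval:
            save_done(path, done)
            since_save = 0
    save_done(path, done)
    return done
```

If the process dies between saves, at most `checkpoint_interval - 1` games are analyzed again on the next run, so a smaller interval trades extra disk writes for fewer repeated API calls.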
Stage 3: Database Creation (~30 minutes)
python database_builder.py --stage database_creation
Enhanced Features:
- Integrity validation: Comprehensive database validation
- Performance optimization: Automatic index creation
- Statistics reporting: Detailed completion analytics
- Output verification: Automatic file validation
Outputs: steam_recommendations.db, hierarchical_vectorizer.pkl
Cost: FREE (local processing)
Pipeline Status & Monitoring
# Get comprehensive status report
python database_builder.py --status
# Validate configuration and dependencies
python database_builder.py --validate
Legacy Support
The original orchestrator is still available:
# Legacy interface (still functional)
python -m backend.database_builder.pipeline_orchestrator --stage 1
python -m backend.database_builder.pipeline_orchestrator --stage 2
python -m backend.database_builder.pipeline_orchestrator --stage 3
Cost Breakdown
| Component | Estimated Cost | Notes |
|-----------|----------------|-------|
| SteamSpy API | FREE | Public API, 1 second rate limit |
| Steam Store API | FREE | Public API, respects rate limits |
| OpenAI GPT-3.5 | $100-300 | 500-1000 games × ~500 tokens per analysis |
| IGN Scraping | FREE | Web scraping with delays |
| Total Estimated Cost | $100-300 | Mainly OpenAI API usage |
Reducing Costs
- Start small: Modify DATA_COLLECTION['max_games'] in backend/config/settings.py
- Use existing data: Skip Stage 2 if you have analysis JSON files
- OpenAI alternatives: Modify the review analyzer to use local models
- Caching: The pipeline saves checkpoints to resume from interruptions
Running the Application
Once you have the database built (or download pre-built databases):
# Start the Flask web application
python app.py
Visit http://localhost:5000 to use the recommender.
Using Pre-built Databases
If you don't want to spend the time/money building the database:
- Download pre-built databases (if available)
- Place them in the data/ directory:
  - steam_recommendations.db (required)
  - hierarchical_vectorizer.pkl (required)
  - steam_api.db (optional, for images/pricing)
Configuration
All settings are centralized in backend/config/settings.py:
# Customize data collection
DATA_COLLECTION = {
'max_games': 20000, # Reduce for testing
'reviews_per_game': 100, # Reduce to lower OpenAI costs
'batch_size': 1000,
'checkpoint_interval': 100
}
# Adjust rate limits
RATE_LIMITS = {
'openai_max_retries': 3,
'steam_api_delay': 0.5, # Increase if rate limited
}
Current Stats
- 20,000 games in catalog (SteamSpy + Steam Store data)
- 500-1000 games with full AI analysis (Steam reviews, IGN, YouTube)
- 1000-dimensional TF-IDF vectors for similarity
- Sub-second recommendation responses across entire 20k database
- Hierarchical niche carving makes sub_sub_genre matches very valuable at this scale
Data Pipeline Details
Stage 1: Data Collection (1-2 hours)
- SteamSpy API → 20k game catalog
- Steam Store API → metadata, pricing, images
Stage 2: Review Analysis (1-2 days)
- Steam Reviews + OpenAI → intelligent tag generation
- IGN Scraping → professional review scores
- Hierarchical classification → genre taxonomy
Stage 3: Database Creation (30 mins)
- JSON → optimized SQLite schema
- TF-IDF vectorization → binary BLOB storage
- Performance indexing → sub-second queries
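The "binary BLOB storage" step in Stage 3 can be pictured with the sketch below, which packs a vector of floats into bytes and round-trips it through SQLite. It is a standard-library stand-in, not the project's schema: the table name, column names, and the 32-bit float encoding are assumptions, and the real pipeline may serialize its TF-IDF vectors differently.

```python
import sqlite3
from array import array


def vector_to_blob(vec):
    """Pack a list of floats into bytes as 32-bit floats."""
    return array("f", vec).tobytes()


def blob_to_vector(blob):
    """Unpack bytes written by vector_to_blob back into a list of floats."""
    out = array("f")
    out.frombytes(blob)
    return out.tolist()


# Hypothetical table layout, held in memory for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE game_vectors (app_id INTEGER PRIMARY KEY, vec BLOB)")
conn.execute(
    "INSERT INTO game_vectors VALUES (?, ?)",
    (1, vector_to_blob([0.0, 0.5, 0.25])),
)
row = conn.execute(
    "SELECT vec FROM game_vectors WHERE app_id = ?", (1,)
).fetchone()
restored = blob_to_vector(row[0])
print(restored)
```

Storing vectors as raw bytes keeps each row compact and avoids parsing text on every query, at the cost of making the column unreadable without the matching decoder.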
Limitations
A full pipeline run takes 3+ days because of API rate limiting, so the database is rebuilt infrequently and is typically about 3 months old. This trade-off lets us analyze games thoroughly without overwhelming external APIs.
API Rate Limits:
- OpenAI: 3 requests/minute (free tier), 60 requests/minute (paid)
- Steam Store: ~1 request/second (unofficial limit)
- SteamSpy: 1 request/second (official limit)
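A limit such as "1 request/second" is usually honored by enforcing a minimum gap between calls. The class below is a minimal sketch of that idea, not the pipeline's own limiter: the class name and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the behavior can be tested without real waiting.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `limiter.wait()` immediately before each request is enough; for example, `MinIntervalLimiter(1.0)` would match the SteamSpy figure above.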
Todo
- Context-aware review analysis (mention previous games)
- Convert Flask app to FastAPI (hitting performance limits)
- Implement ChromaDB for enhanced semantic similarity
- Humble Bundle affiliate integration
Important Notice
If any reviewing companies want their data removed from this program, please let me know. This is a data science project for educational purposes.
I run minimal ads because I'm a broke college student trying to break even on server costs.
Development
Project St
