
SteamLensAI

Game analytics platform that converts Steam review data into actionable development insights, reducing feedback analysis time by 95% for game studios.

Install / Use

/learn @Matrix030/SteamLensAI

README

 ╔═══════════════════════════════════════════════════════════════════════════════════════════╗
 ║ ███████╗████████╗███████╗ █████╗  ███╗ ███╗ ██╗     ███████╗███╗   ██╗███████╗ █████╗ ██╗ ║
 ║ ██╔════╝╚══██╔══╝██╔════╝██╔══██╗████╗ ████║██║     ██╔════╝████╗  ██║██╔════╝██╔══██╗██║ ║
 ║ ███████╗   ██║   █████╗  ███████║██╔████╔██║██║     █████╗  ██╔██╗ ██║███████╗███████║██║ ║
 ║ ╚════██║   ██║   ██╔══╝  ██╔══██║██║╚██╔╝██║██║     ██╔══╝  ██║╚██╗██║╚════██║██╔══██║██║ ║
 ║ ███████║   ██║   ███████╗██║  ██║██║ ╚═╝ ██║███████╗███████╗██║ ╚████║███████║██║  ██║██║ ║
 ║ ╚══════╝   ╚═╝   ╚══════╝╚═╝ ╚═╝ ╚═╝    ╚═╝╚══════╝ ╚══════╝╚═╝  ╚═══╝╚══════╝╚═╝  ╚═╝╚═╝ ║
 ╚═══════════════════════════════════════════════════════════════════════════════════════════╝

All data has been uploaded to Kaggle: https://www.kaggle.com/datasets/rishikeshgharat/steam-games-data-40-gb

steamLensAI Architecture Documentation

This document explains the technical architecture, distributed processing design, and engineering decisions behind steamLensAI's high-performance Steam review analysis system.

Architecture Overview

steamLensAI is built as a distributed processing pipeline that leverages parallel computing and GPU acceleration to analyze large volumes of Steam reviews efficiently. The pipeline runs in two stages.

Topic assignment (theme-based categorization) using seed values and sentence-transformers:

1.2M reviews in 2 minutes, 30 seconds

Summarization (hierarchical, topic-based) of the data categorized in the previous step:

1.2M reviews in 8 minutes

Execution time metrics for the runs above: [execution_metrics chart]

The core innovation lies in its distributed computing approach where multiple worker processes share a single GPU through intelligent model distribution, achieving maximum hardware utilization while maintaining processing efficiency.

┌──────────────────┐    ┌─────────────────────┐
│  Distributed     │───▶│   Summarization     │
│  Processing      │    │   Pipeline          │
│  (process_files) │    │   (summarization)   │
└──────────────────┘    └─────────────────────┘
         │                         │
         ▼                         ▼
┌─────────────────────────────────────────────┐
│           Dask LocalCluster                 │
│                                             │
│  Multiple Workers → Single GPU Sharing      │
│  • Model Distribution via publish_dataset   │
│  • Coordinated GPU Memory Management        │
│  • Parallel Processing with Shared Models   │
└─────────────────────────────────────────────┘

Core Components

0. Models Used

Sentence Transformer Model

  • Model: all-MiniLM-L6-v2
  • Purpose: Converting review text to numerical embeddings for semantic similarity matching
  • Task: Topic assignment and theme categorization
  • Provider: Sentence Transformers library

Summarization Model

  • Model: sshleifer/distilbart-cnn-12-6
  • Purpose: Generating concise summaries of positive and negative reviews
  • Task: Hierarchical text summarization with sentiment separation
  • Provider: Hugging Face Transformers (DistilBART variant)

Both models support GPU acceleration and are distributed across multiple workers using Dask's publish_dataset() mechanism for efficient parallel processing.

1. Distributed Processing Engine (processing/process_files.py)

  • Role: Multi-worker data processing coordination
  • Key Technologies: Dask + LocalCluster + Sentence Transformers
  • Distributed Computing Features:
    • Creates LocalCluster with multiple worker processes (lighter-weight than a minikube setup)
    • Distributes the sentence-transformer model (all-MiniLM-L6-v2) across workers using publish_dataset()
    • Coordinates parallel processing of review chunks
    • Manages shared GPU resources across workers
    • Handles worker-to-worker communication and synchronization

2. Topic Assignment Module (processing/topic_assignment.py)

  • Role: ML-powered review categorization on distributed workers
  • Key Technologies: Sentence Transformers + Semantic Similarity
  • Worker-Level Responsibilities:
    • Retrieves published models from worker dataset storage
    • Converts review text to numerical embeddings using shared GPU
    • Performs semantic similarity matching against game themes
    • Processes data chunks independently across multiple workers

3. Summarization Pipeline (processing/summarization.py + summarize_processor.py)

  • Role: Distributed text summarization across workers
  • Key Technologies: Transformers (DistilBART) + Multi-Worker GPU Processing
  • Distributed Features:
    • Sets up dedicated Dask cluster for summarization tasks
    • Distributes summarization models to all workers via dataset publishing
    • Coordinates hierarchical summarization across worker processes
    • Manages GPU memory sharing for multiple model instances
    • Aggregates results from parallel summarization workers

Data Flow Architecture

Phase 1: File Processing & Validation

Uploaded Files → Temporary Storage → App ID Extraction → Theme Validation
     │                │                    │                  │
     ▼                ▼                    ▼                  ▼
 [file1.parquet]  [/tmp/uuid/]      [extract_appid()]   [themes.json]
 [file2.parquet]      ...              [12345, 67890]      [lookup]
 [file3.parquet]                           ...              [✓/✗]

Phase 2: Distributed Processing Architecture

┌─────────────────┐
│   Main Process  │
│                 │
│ 1. Load Files   │──┐
│ 2. Combine Data │  │
│ 3. Create Chunks│  │
│ 4. Setup Dask   │  │
└─────────────────┘  │
                     │
                     ▼
┌───────────────────────────────────────────────────────────────────┐
│                 Dask LocalCluster                                 │
│  ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐  │
│  │  Worker 1   │ │  Worker 2   │ │  Worker 3   │ │  Worker 4   │  │
│  │             │ │             │ │             │ │             │  │
│  │ • Get Model │ │ • Get Model │ │ • Get Model │ │ • Get Model │  │
│  │ • Process   │ │ • Process   │ │ • Process   │ │ • Process   │  │
│  │   Chunk A   │ │   Chunk B   │ │   Chunk C   │ │   Chunk D   │  │
│  │ • Save      │ │ • Save      │ │ • Save      │ │ • Save      │  │
│  │   Results   │ │   Results   │ │   Results   │ │   Results   │  │
│  └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘  │
│                                  │                                │
└──────────────────────────────────┼────────────────────────────────┘
                                   │
                                   ▼
                            ┌─────────────┐
                            │  RTX 4080   │
                            │   16GB GPU  │
                            │             │
                            │ • 4 Model   │
                            │   Copies    │
                            │ • Shared    │
                            │   Compute   │
                            └─────────────┘

Phase 3: Result Aggregation

Temporary Files → Aggregation → Final Report → CSV Export
     │               │             │            │
     ▼               ▼             ▼            ▼
┌─────────────┐ ┌─────────────┐ ┌───────────┐ ┌──────────┐
│ /tmp/       │ │ Combine     │ │ Structured│ │ Download │
│ • pos_revs/ │→│ • Count     │→│ DataFrame │→│ CSV File │
│ • neg_revs/ │ │ • Merge     │ │ • Metrics │ │          │
│ • agg_data/ │ │ • Calculate │ │ • Reviews │ │          │
└─────────────┘ └─────────────┘ └───────────┘ └──────────┘

Performance Optimizations

1. GPU Model Distribution Strategy

Challenge: Distribute ML models to multiple workers on single GPU

The Serialization Problem: Imagine you have a recipe book (ML model) and you try to tear out individual pages to give different pages to different chefs (workers). The problem is that recipes are interconnected - Chef A gets page 5 (which says "add the mixture from step 3"), but Chef B has page 3 (which explains what "the mixture" is). Neither chef can cook properly because they only have fragments of the complete recipe.

Technical Details:

  • Dask.scatter() Fragmentation: Traditional scatter tries to break models into pieces and distribute fragments
  • Model Interconnectedness: ML model layers reference each other, weights depend on other weights
  • Incomplete Models Fail: Workers receiving fragmented models cannot perform inference
  • Models Need Integrity: Each worker requires the complete, intact model to function
# This FAILS - CUDA tensors cannot be serialized for transport
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2', device='cuda')  # Model on GPU
client.scatter(model)  # ERROR: Cannot serialize CUDA tensors!

Solution: CPU-Serialize-GPU pattern

# Step 1: "Translate" the recipe book to a common language (move to CPU)
model.to('cpu')                              # Move model to CPU memory
for param in model.parameters():
    param.data = param.data.cpu()            # Ensure ALL tensors are on CPU

# Step 2: "Photocopy and distribute" (serialize and send to workers)
client.publish_dataset(model, name='model')  # Distribute the CPU model

# Step 3: Each kitchen "translates back" (workers move their copy to GPU)
# Inside each worker task:
from dask.distributed import get_client
worker_model = get_client().get_dataset('model')  # Get the CPU copy
worker_model.to('cuda')                           # Move it to the shared GPU
# Result: 4 model instances on 1 GPU (efficient memory sharing)

Why This Works:

  • CPU models are just arrays of numbers (easily serializable)
  • Each worker gets its own independent copy of the model
  • Workers can move their copies to GPU without conflicts
  • Multiple model instances can coexist on the same GPU efficiently

2. Optimal Chunk Sizing

Prior to commit 90, the pipeline processed the entire dataset as a single unit, which was inefficient. It now splits the combined review data into chunks sized to the worker pool so that all workers stay busy in parallel.
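A chunk-splitting helper along these lines might look like this (a hypothetical sketch; the few-chunks-per-worker heuristic is an assumption, not the repository's exact sizing logic):

```python
import math

import pandas as pd

def make_chunks(df: pd.DataFrame, n_workers: int, chunks_per_worker: int = 4):
    """Split a DataFrame into roughly equal chunks.
    A few chunks per worker (hypothetical heuristic) keeps every worker busy
    even when some chunks finish faster than others."""
    n_chunks = max(1, n_workers * chunks_per_worker)
    chunk_size = math.ceil(len(df) / n_chunks)
    return [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
```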
