SteamLensAI
Game analytics platform that converts Steam review data into actionable development insights, reducing feedback analysis time by 95% for game studios.
╔═══════════════════════════════════════════════════════════════════════════════════════════╗
║ ███████╗████████╗███████╗ █████╗ ███╗ ███╗ ██╗ ███████╗███╗ ██╗███████╗ █████╗ ██╗ ║
║ ██╔════╝╚══██╔══╝██╔════╝██╔══██╗████╗ ████║██║ ██╔════╝████╗ ██║██╔════╝██╔══██╗██║ ║
║ ███████╗ ██║ █████╗ ███████║██╔████╔██║██║ █████╗ ██╔██╗ ██║███████╗███████║██║ ║
║ ╚════██║ ██║ ██╔══╝ ██╔══██║██║╚██╔╝██║██║ ██╔══╝ ██║╚██╗██║╚════██║██╔══██║██║ ║
║ ███████║ ██║ ███████╗██║ ██║██║ ╚═╝ ██║███████╗███████╗██║ ╚████║███████║██║ ██║██║ ║
║ ╚══════╝ ╚═╝ ╚══════╝╚═╝ ╚═╝ ╚═╝ ╚═╝╚══════╝ ╚══════╝╚═╝ ╚═══╝╚══════╝╚═╝ ╚═╝╚═╝ ║
╚═══════════════════════════════════════════════════════════════════════════════════════════╝
The full dataset (~40 GB) is available on Kaggle: https://www.kaggle.com/datasets/rishikeshgharat/steam-games-data-40-gb
steamLensAI Architecture Documentation
This document explains the technical architecture, distributed processing design, and engineering decisions behind steamLensAI's high-performance Steam review analysis system.
Architecture Overview
steamLensAI is built as a distributed processing pipeline that leverages parallel computing and GPU acceleration to analyze large volumes of Steam reviews efficiently. The pipeline runs in two stages:

1. Topic assignment using seed values (theme-based categorization) with sentence-transformers: 1.2M reviews in 2 minutes, 30 seconds
2. Summarization (hierarchical, topic-based) of the data categorized in the previous stage: 1.2M reviews in 8 minutes

The core innovation lies in its distributed computing approach where multiple worker processes share a single GPU through intelligent model distribution, achieving maximum hardware utilization while maintaining processing efficiency.
┌──────────────────┐ ┌─────────────────────┐
│ Distributed │───▶│ Summarization │
│ Processing │ │ Pipeline │
│ (process_files) │ │ (summarization) │
└──────────────────┘ └─────────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────┐
│ Dask LocalCluster │
│ │
│ Multiple Workers → Single GPU Sharing │
│ • Model Distribution via publish_dataset │
│ • Coordinated GPU Memory Management │
│ • Parallel Processing with Shared Models │
└─────────────────────────────────────────────┘
Core Components
0. Models Used
Sentence Transformer Model
- Model: all-MiniLM-L6-v2
- Purpose: Converting review text to numerical embeddings for semantic similarity matching
- Task: Topic assignment and theme categorization
- Provider: Sentence Transformers library
Summarization Model
- Model: sshleifer/distilbart-cnn-12-6
- Purpose: Generating concise summaries of positive and negative reviews
- Task: Hierarchical text summarization with sentiment separation
- Provider: Hugging Face Transformers (DistilBART variant)
Both models support GPU acceleration and are distributed across multiple workers using Dask's publish_dataset() mechanism for efficient parallel processing.
1. Distributed Processing Engine (processing/process_files.py)
- Role: Multi-worker data processing coordination
- Key Technologies: Dask + LocalCluster + Sentence Transformers
- Distributed Computing Features:
- Creates a LocalCluster with multiple worker processes (lighter-weight than running a minikube/Kubernetes cluster)
- Distributes the transformer model (all-MiniLM-L6-v2) across workers using publish_dataset()
- Coordinates parallel processing of review chunks
- Manages shared GPU resources across workers
- Handles worker-to-worker communication and synchronization
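The fan-out/fan-in pattern described above can be sketched without Dask: split the combined reviews into chunks, hand each chunk to a worker, and collect the results. This is a minimal pure-Python stand-in (a thread pool replaces the Dask LocalCluster, and a trivial placeholder replaces the model); the function and variable names are illustrative, not the project's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(reviews, n_chunks):
    """Partition the review list into roughly equal chunks, one per worker."""
    size = max(1, -(-len(reviews) // n_chunks))  # ceiling division
    return [reviews[i:i + size] for i in range(0, len(reviews), size)]

def process_chunk(chunk):
    """Stand-in for worker-side processing (topic assignment on the GPU)."""
    return [review.lower() for review in chunk]  # placeholder transformation

def run_pipeline(reviews, n_workers=4):
    chunks = split_into_chunks(reviews, n_workers)
    # steamLensAI uses a Dask LocalCluster here; the thread pool merely
    # mimics the scatter/gather shape of the real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [item for chunk_result in results for item in chunk_result]
```

In the real engine, each worker additionally retrieves the published model and moves it to the GPU before processing its chunk.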
2. Topic Assignment Module (processing/topic_assignment.py)
- Role: ML-powered review categorization on distributed workers
- Key Technologies: Sentence Transformers + Semantic Similarity
- Worker-Level Responsibilities:
- Retrieves published models from worker dataset storage
- Converts review text to numerical embeddings using shared GPU
- Performs semantic similarity matching against game themes
- Processes data chunks independently across multiple workers
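At its core, topic assignment picks the theme whose embedding is closest (by cosine similarity) to the review's embedding. A minimal sketch with hand-rolled cosine similarity; in the real system the vectors come from all-MiniLM-L6-v2, while here the embeddings are toy placeholders:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def assign_topic(review_embedding, theme_embeddings):
    """Return the theme whose embedding is most similar to the review's."""
    return max(theme_embeddings,
               key=lambda t: cosine_similarity(review_embedding, theme_embeddings[t]))

# Toy 2-D embeddings standing in for 384-dimensional MiniLM vectors
themes = {"performance": [0.9, 0.1], "story": [0.1, 0.9]}
```

Because each worker holds its own model copy and its own chunk, this matching step needs no cross-worker communication.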
3. Summarization Pipeline (processing/summarization.py + summarize_processor.py)
- Role: Distributed text summarization across workers
- Key Technologies: Transformers (DistilBART) + Multi-Worker GPU Processing
- Distributed Features:
- Sets up dedicated Dask cluster for summarization tasks
- Distributes summarization models to all workers via dataset publishing
- Coordinates hierarchical summarization across worker processes
- Manages GPU memory sharing for multiple model instances
- Aggregates results from parallel summarization workers
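The hierarchical structure above boils down to: bucket reviews by theme and sentiment, then summarize each bucket. A minimal sketch with a stub summarizer (the real pipeline runs DistilBART at that point; the tuple layout and names are illustrative):

```python
from collections import defaultdict

def summarize(texts):
    """Stub summarizer: the real pipeline runs DistilBART here."""
    return " ".join(texts)[:60]  # placeholder: truncated concatenation

def hierarchical_summaries(reviews):
    """reviews: iterable of (theme, is_positive, text) tuples.

    Groups reviews by theme and sentiment, then summarizes each bucket,
    mirroring the positive/negative separation described above."""
    buckets = defaultdict(list)
    for theme, is_positive, text in reviews:
        sentiment = "positive" if is_positive else "negative"
        buckets[(theme, sentiment)].append(text)
    return {key: summarize(texts) for key, texts in buckets.items()}
```

In the distributed version, the buckets are spread across workers and each worker summarizes its share before results are gathered.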
Data Flow Architecture
Phase 1: File Processing & Validation
Uploaded Files → Temporary Storage → App ID Extraction → Theme Validation
│ │ │ │
▼ ▼ ▼ ▼
[file1.parquet] [/tmp/uuid/] [extract_appid()] [themes.json]
[file2.parquet] ... [12345, 67890] [lookup]
[file3.parquet] ... [✓/✗]
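The extract_appid() step can be sketched as a simple filename parse. The naming convention assumed here (app ID as the first run of digits, e.g. `12345_reviews.parquet`) is hypothetical; the project's actual convention may differ.

```python
import re

def extract_appid(filename):
    """Pull the Steam app ID out of an uploaded parquet filename.

    Assumes (hypothetically) the app ID is the first run of digits in
    the name; returns None when no ID is found, so the caller can mark
    the file as failing theme validation."""
    match = re.search(r"\d+", filename)
    return int(match.group()) if match else None
```

The extracted IDs are then looked up in themes.json to decide the ✓/✗ validation result shown above.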
Phase 2: Distributed Processing Architecture
┌─────────────────┐
│ Main Process │
│ │
│ 1. Load Files │──┐
│ 2. Combine Data │ │
│ 3. Create Chunks│ │
│ 4. Setup Dask │ │
└─────────────────┘ │
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ Dask LocalCluster │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker 1 │ │ Worker 2 │ │ Worker 3 │ │ Worker 4 │ │
│ │ │ │ │ │ │ │ │ │
│ │ • Get Model │ │ • Get Model │ │ • Get Model │ │ • Get Model │ │
│ │ • Process │ │ • Process │ │ • Process │ │ • Process │ │
│ │ Chunk A │ │ Chunk B │ │ Chunk C │ │ Chunk D │ │
│ │ • Save │ │ • Save │ │ • Save │ │ • Save │ │
│ │ Results │ │ Results │ │ Results │ │ Results │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
└──────────────────────────────────┼────────────────────────────────┘
│
▼
┌─────────────┐
│ RTX 4080 │
│ 16GB GPU │
│ │
│ • 4 Model │
│ Copies │
│ • Shared │
│ Compute │
└─────────────┘
Phase 3: Result Aggregation
Temporary Files → Aggregation → Final Report → CSV Export
│ │ │ │
▼ ▼ ▼ ▼
┌─────────────┐ ┌─────────────┐ ┌───────────┐ ┌──────────┐
│ /tmp/ │ │ Combine │ │ Structured│ │ Download │
│ • pos_revs/ │→│ • Count │→│ DataFrame │→│ CSV File │
│ • neg_revs/ │ │ • Merge │ │ • Metrics │ │ │
│ • agg_data/ │ │ • Calculate │ │ • Reviews │ │ │
└─────────────┘ └─────────────┘ └───────────┘ └──────────┘
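The Combine/Count/Merge step can be sketched as a merge of per-worker partial results into one report row per (app_id, theme). The dictionary layout and the derived positive-ratio metric are illustrative, not the project's exact schema:

```python
def aggregate_partials(partials):
    """Merge per-worker partial counts into final report rows.

    Each partial maps (app_id, theme) -> {'pos': n, 'neg': m}; the merged
    rows then get a simple derived metric for the final DataFrame/CSV."""
    totals = {}
    for partial in partials:
        for key, counts in partial.items():
            row = totals.setdefault(key, {"pos": 0, "neg": 0})
            row["pos"] += counts["pos"]
            row["neg"] += counts["neg"]
    for row in totals.values():
        total = row["pos"] + row["neg"]
        row["positive_ratio"] = row["pos"] / total if total else 0.0
    return totals
```

In the real pipeline the partials live in the temporary pos_revs/, neg_revs/, and agg_data/ directories before this merge runs.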
Performance Optimizations
1. GPU Model Distribution Strategy
Challenge: Distribute ML models to multiple workers on single GPU
The Serialization Problem: Imagine you have a recipe book (ML model) and you try to tear out individual pages to give different pages to different chefs (workers). The problem is that recipes are interconnected - Chef A gets page 5 (which says "add the mixture from step 3"), but Chef B has page 3 (which explains what "the mixture" is). Neither chef can cook properly because they only have fragments of the complete recipe.
Technical Details:
- Dask.scatter() Fragmentation: Traditional scatter tries to break models into pieces and distribute fragments
- Model Interconnectedness: ML model layers reference each other, weights depend on other weights
- Incomplete Models Fail: Workers receiving fragmented models cannot perform inference
- Models Need Integrity: Each worker requires the complete, intact model to function
# This FAILS - Cannot serialize CUDA tensors
model = SentenceTransformer('model', device='cuda') # Model on GPU
client.scatter(model) # ERROR: Cannot serialize CUDA tensors!
Solution: CPU-Serialize-GPU pattern
# Step 1: "Translate" the recipe book to a common language (move to CPU)
model.to('cpu')                       # Move model to CPU memory
for param in model.parameters():
    param.data = param.data.cpu()     # Ensure ALL components are on CPU

# Step 2: "Photocopy and distribute" (serialize and send to workers)
client.publish_dataset(model, name='model')  # Successfully distribute CPU model

# Step 3: Each kitchen "translates back" (workers move to GPU)
# In each worker:
worker_model = get_dataset('model')   # Get CPU copy
worker_model.to('cuda')               # Move to shared GPU

# Result: 4 model instances on 1 GPU (efficient memory sharing)
Why This Works:
- CPU models are just arrays of numbers (easily serializable)
- Each worker gets its own independent copy of the model
- Workers can move their copies to GPU without conflicts
- Multiple model instances can coexist on the same GPU efficiently
2. Optimal Chunk Sizing
Earlier versions (pre commit 90) processed the whole dataset in one pass, which was inefficient; the data is now split into worker-sized chunks.
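A minimal sketch of that chunking: size chunks so every worker gets work, capped so a single chunk cannot exhaust GPU memory. The cap value and function names here are illustrative, not the project's exact formula.

```python
def choose_chunk_size(total_rows, n_workers, max_chunk=50_000):
    """Pick a chunk size: an even share per worker, capped at max_chunk
    so one chunk's embeddings can't exhaust GPU memory (cap is illustrative)."""
    per_worker = -(-total_rows // n_workers)  # ceiling division
    return min(per_worker, max_chunk)

def make_chunks(total_rows, chunk_size):
    """Return (start, stop) row ranges covering the whole dataset."""
    return [(start, min(start + chunk_size, total_rows))
            for start in range(0, total_rows, chunk_size)]
```

With 1.2M rows and 4 workers, the cap keeps chunks small enough that all four model instances fit alongside their batches on the 16 GB GPU.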
