LLM-adaptive embeddings (zero-shot / LoRA) with generative topic modeling and an agent-based workflow for social science text mining.

<div align="center"> <img src="assets/THETA.png" width="40%" alt="THETA Logo"/> <h1>THETA (θ)</h1>

Platform HuggingFace Paper

English | 中文

THETA (θ) is a low-barrier, high-performance LLM-enhanced topic analysis platform for social science research.

</div>

## Table of Contents

  1. Quick Start: 5-Minute Setup
  2. Configuration System: From Hardware to Experiments
  3. Running Modes: Beginner vs Expert
  4. Output Map: Where Are the Results?
  5. Scientific Evaluation Standards
  6. Supported Models
  7. Training Parameters Reference
  8. FAQ

## Quick Start: 5-Minute Setup

### Step 1: Clone the Repository

```bash
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA
```

### Step 2: Environment Isolation (Conda)

```bash
conda create -n theta python=3.10 -y
conda activate theta
```

### Step 3: Install Dependencies + Download Models

```bash
bash scripts/env_setup.sh
```

### Step 4: Configure Environment Variables

```bash
# Copy the configuration template
cp .env.example .env

# Edit the .env file to configure model paths.
# At minimum, set QWEN_MODEL_0_6B and SBERT_MODEL_PATH.
```

### Step 5: Load Environment Variables

```bash
# If you hit a "$'\r': command not found" error, fix Windows line endings first
sed -i 's/\r$//' scripts/env_setup.sh

# Load environment variables into the current shell (required for subsequent scripts)
source scripts/env_setup.sh
```

Model Download Links:

| Model | Purpose | Download Link |
|-------|---------|---------------|
| Qwen3-Embedding-0.6B | THETA document embedding | ModelScope |
| all-MiniLM-L6-v2 | CTM/SBERT embedding | HuggingFace |

Place the downloaded models in the `models/` directory:

```text
models/
├── qwen3_embedding_0.6B/
└── sbert/sentence-transformers/all-MiniLM-L6-v2/
```
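Before training, it can help to confirm the models landed where the layout above expects them. A minimal stdlib sketch (the helper `missing_models` is illustrative, not part of THETA):

```python
from pathlib import Path

# Expected model locations, taken from the directory layout above
EXPECTED = [
    "models/qwen3_embedding_0.6B",
    "models/sbert/sentence-transformers/all-MiniLM-L6-v2",
]

def missing_models(root: str = ".") -> list[str]:
    """Return the expected model directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_dir()]

if __name__ == "__main__":
    gaps = missing_models()
    if gaps:
        print("Missing model directories:", ", ".join(gaps))
    else:
        print("All model directories present.")
```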

## Configuration System: From Hardware to Experiments

THETA uses a layered configuration architecture for flexible control, from hardware paths to experiment parameters.

### Core Configuration File: `.env` (Hardware Paths)

Create a `.env` file (see `.env.example`) with the required settings:

```bash
# Required: Qwen embedding model path
QWEN_MODEL_0_6B=./models/qwen3_embedding_0.6B
# QWEN_MODEL_4B=./models/qwen3_embedding_4B
# QWEN_MODEL_8B=./models/qwen3_embedding_8B

# Required: SBERT model path (needed for CTM/BERTopic)
SBERT_MODEL_PATH=./models/sbert/sentence-transformers/all-MiniLM-L6-v2

# Required: Data and result directories
DATA_DIR=./data
WORKSPACE_DIR=./data/workspace
RESULT_DIR=./result
```
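THETA presumably loads `.env` through its own tooling (e.g. a library like `python-dotenv`); purely to illustrate the `KEY=VALUE` format above, a minimal stdlib parser might look like this (`load_env` is a hypothetical helper):

```python
import os

def load_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# Required: Qwen embedding model path
QWEN_MODEL_0_6B=./models/qwen3_embedding_0.6B
DATA_DIR=./data
"""
os.environ.update(load_env(sample))
```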

### Experiment Parameters: `config/default.yaml` (Default Hyperparameters)

This file stores the default training parameters for all models:

```yaml
# Common training parameters
training:
  epochs: 100
  batch_size: 64
  learning_rate: 0.002

# THETA-specific parameters
theta:
  num_topics: 20
  hidden_dim: 512
  model_size: 0.6B

# Visualization settings
visualization:
  language: en    # English visualization
  dpi: 150
```

### Priority Rule

Parameters are resolved in this priority order:

```text
CLI arguments  >  YAML defaults  >  Code fallback values
```

Example:

- `--num_topics 50` overrides `num_topics: 20` in YAML
- If neither the CLI nor YAML specifies a value, the code default is used
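The precedence rule amounts to a layered dict merge, where only CLI flags that were actually passed count as set. A sketch under that assumption (`resolve_params` is illustrative, not THETA's actual internals):

```python
def resolve_params(cli: dict, yaml_cfg: dict, code_defaults: dict) -> dict:
    """Merge parameter sources: CLI args > YAML defaults > code fallbacks.

    Only CLI keys whose value is not None count as explicitly set.
    """
    merged = dict(code_defaults)           # lowest priority: code fallbacks
    merged.update(yaml_cfg)                # then YAML defaults
    merged.update({k: v for k, v in cli.items() if v is not None})  # CLI wins
    return merged

# --num_topics 50 on the CLI beats num_topics: 20 from YAML;
# epochs was not passed on the CLI, so YAML's 100 survives.
params = resolve_params(
    cli={"num_topics": 50, "epochs": None},
    yaml_cfg={"num_topics": 20, "epochs": 100},
    code_defaults={"num_topics": 10, "epochs": 50, "batch_size": 64},
)
```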

## Running Modes: Beginner vs. Expert

### Beginner Mode: One-Click Automation (Bash Scripts)

Just prepare your data, and the script automatically runs cleaning → preprocessing → training → evaluation → visualization:

```bash
# One-click training (specify the language parameter)
bash scripts/quick_start.sh my_dataset --language chinese
bash scripts/quick_start.sh my_dataset --language english
```

Data Preparation Requirements:

- Place raw documents in the `data/{dataset}/` directory
- Supported formats: `.txt`, `.csv`, `.docx`, `.pdf`
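As a sketch of what the cleaning step presumably scans for (the actual behavior is defined by `src/models/dataclean/main.py`), collecting files with the supported extensions via `pathlib` might look like:

```python
from pathlib import Path

# Formats listed in the data preparation requirements above
SUPPORTED = {".txt", ".csv", ".docx", ".pdf"}

def find_documents(dataset_dir: str) -> list[Path]:
    """Recursively list raw documents with a supported extension."""
    return sorted(
        p for p in Path(dataset_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```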

### Expert Mode: Fine-Grained Tuning (Python CLI)

Call the Python scripts directly for precise control over every parameter:

```bash
# Train LDA, overriding default parameters
python src/models/run_pipeline.py \
    --dataset my_dataset \
    --models lda \
    --num_topics 50 \
    --learning_rate 0.01 \
    --language chinese

# Train THETA with the 4B model
python src/models/run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 4B \
    --num_topics 30 \
    --epochs 200
```

Key Module Entry Points:

| Module | Entry Script | Function |
|--------|--------------|----------|
| Data Cleaning | `src/models/dataclean/main.py` | Text cleaning, tokenization, stopword removal |
| Data Preprocessing | `src/models/prepare_data.py` | Generate BOW matrix and embedding vectors |
| Model Training | `src/models/run_pipeline.py` | Training, evaluation, and visualization in one |


## Output Map: Where Are the Results?

THETA and the baseline models write results to different paths, so note the distinction:

### THETA Model Results

```text
result/{dataset}/{model_size}/theta/exp_{timestamp}/
├── config.json                     # Experiment configuration
├── metrics.json                    # 7 evaluation metrics
├── data/                           # Preprocessed data
│   ├── bow/                        # BOW matrix
│   │   ├── bow_matrix.npy
│   │   ├── vocab.txt
│   │   ├── vocab.json
│   │   └── vocab_embeddings.npy
│   └── embeddings/                 # Qwen document embeddings
│       ├── embeddings.npy
│       └── metadata.json
├── theta/                          # Model parameters (fixed filenames, no timestamp)
│   ├── theta.npy                   # Document-topic distribution (D × K)
│   ├── beta.npy                    # Topic-word distribution (K × V)
│   ├── topic_embeddings.npy        # Topic embedding vectors
│   ├── topic_words.json            # Topic word list
│   ├── training_history.json       # Training history
│   └── etm_model.pt                # PyTorch model
└── {lang}/                         # Visualization output (zh or en)
    ├── global/                     # Global charts
    │   ├── topic_table.csv
    │   ├── topic_network.png
    │   ├── topic_similarity.png
    │   ├── topic_wordcloud.png
    │   ├── 7_core_metrics.png
    │   └── ...
    └── topic/                      # Topic details
        ├── topic_1/
        │   └── word_importance.png
        └── ...
```
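`theta.npy` holds the D × K document-topic distribution; in practice you would load it with `numpy`, but the dominant-topic lookup it enables is just a row-wise argmax. A plain-list sketch (the helper `dominant_topics` is illustrative, not a THETA API):

```python
def dominant_topics(theta: list[list[float]]) -> list[int]:
    """For each document (row of a D x K distribution), return its argmax topic index."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

# Three documents over K = 3 topics; each row sums to 1
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
```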

### Baseline Model Results (LDA, CTM, BTM, etc.)

```text
result/{dataset}/{user_id}/{model}/exp_{timestamp}/
├── config.json                     # Experiment configuration
├── metrics_k{K}.json               # 7 evaluation metrics
├── {model}/                        # Model parameters
│   ├── theta_k{K}.npy              # Document-topic distribution
│   ├── beta_k{K}.npy               # Topic-word distribution
│   ├── model_k{K}.pkl              # Model file
│   └── topic_words_k{K}.json
├── {lang}/                         # Visualization directory (zh or en)
│   ├── global/                     # Global comparison charts
│   │   ├── topic_network.png
│   │   ├── topic_similarity.png
│   │   └── ...
│   └── topic/                      # Topic details
│       ├── topic_0/
│       │   ├── wordcloud.png
│       │   └── word_distribution.png
│       └── ...
└── README.md                       # Experiment summary
```

### Path Summary

| Model Type | Result Path |
|------------|-------------|
| THETA | `result/{dataset}/{model_size}/theta/exp_{timestamp}/` |
| Baseline Models | `result/{dataset}/{user_id}/{model}/exp_{timestamp}/` |


## Scientific Evaluation Standards

THETA enforces 7 Gold Standard Metrics to ensure evaluation alignment across all models (THETA and 12 baselines):

| Metric | Full Name | Description | Ideal Value |
|--------|-----------|-------------|-------------|
| TD | Topic Diversity | Uniqueness of top words across topics | ↑ Higher is better |
| iRBO | Inverted Rank-Biased Overlap | Inter-topic difference via rank-biased overlap | ↑ Higher is better |
| NPMI | Normalized PMI | Normalized pointwise mutual information of topic word co-occurrence | ↑ Higher is better |
| C_V | C_V Coherence | Sliding-window-based coherence | ↑ Higher is better |
| UMass | UMass Coherence | Document co-occurrence-based coherence | ↑ Higher is better (values are negative; closer to 0 is better) |
| Exclusivity | Topic Exclusivity | Degree to which top words belong to a single topic | ↑ Higher is better |
| PPL | Perplexity | Model fit on held-out text | ↓ Lower is better |

Note: Significance data is only used for visualization, not included in core evaluation metrics.
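For intuition on the first metric: Topic Diversity (TD) is commonly computed as the fraction of unique words among the top-N words of all topics, so TD = 1.0 means no topic repeats another's top words. A sketch under that common definition (which may differ in detail from THETA's implementation):

```python
def topic_diversity(topic_words: list[list[str]]) -> float:
    """TD = unique top words / total top words across topics (1.0 = fully distinct)."""
    all_words = [w for topic in topic_words for w in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0

topics = [
    ["economy", "market", "trade"],
    ["health", "hospital", "market"],  # "market" repeats, so TD drops below 1
]
```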


## Supported Models

### Model Overview

| Model | Type | Description | Auto Topics | Best Use Case |
|-------|------|-------------|-------------|---------------|
| theta | Neural | THETA model with Qwen embeddings | No | General purpose, high quality |
| lda | Traditional | Latent Dirichlet Allocation | No | Fast baseline, highly interpretable |
| hdp | Traditional | Hierarchical Dirichlet Process | Yes | Unknown topic count |
| stm | Traditional | Structural Topic Model | No | Requires covariates |
| btm | Traditional | Biterm Topic Model | No | Short texts (tweets, titles) |
| etm | Neural | Embedded Topic Model | No | Word embedding integration |
| ctm | Neural | Contextualized Topic Model | No | Semantic understanding |
| dtm | Neural | Dynamic Topic Model | … | … |
