THETA
LLM-adaptive embeddings (Zero-shot / LoRA) with Generative Topic Modeling & Agent-based workflow for social science text mining
THETA (θ) is a low-barrier, high-performance LLM-enhanced topic analysis platform for social science research.
Table of Contents
- Quick Start: 5-Minute Setup
- Configuration System: From Hardware to Experiments
- Running Modes: Beginner vs Expert
- Output Map: Where Are the Results?
- Scientific Evaluation Standards
- Supported Models
- Training Parameters Reference
- FAQ
Quick Start: 5-Minute Setup
Step 1: Clone Repository
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA
Step 2: Environment Isolation (Conda)
conda create -n theta python=3.10 -y
conda activate theta
Step 3: Install Dependencies + Download Models
bash scripts/env_setup.sh
Step 4: Configure Environment Variables
# Copy configuration template
cp .env.example .env
# Edit .env file to configure model paths
# At minimum, configure: QWEN_MODEL_0_6B and SBERT_MODEL_PATH
Step 5: Load Environment Variables
# If you encounter "$'\r': command not found" error, fix Windows line endings first
sed -i 's/\r$//' scripts/env_setup.sh
# Load environment variables to current shell (required for subsequent scripts)
source scripts/env_setup.sh
Model Download Links:
| Model | Purpose | Download Link |
|-------|---------|---------------|
| Qwen3-Embedding-0.6B | THETA document embedding | ModelScope |
| all-MiniLM-L6-v2 | CTM/SBERT embedding | HuggingFace |
Place downloaded models in the models/ directory:
models/
├── qwen3_embedding_0.6B/
└── sbert/sentence-transformers/all-MiniLM-L6-v2/
Configuration System: From Hardware to Experiments
THETA uses a layered configuration architecture for flexible control from hardware paths to experiment parameters.
Core Configuration File .env (Hardware Physical Paths)
Create a .env file (refer to .env.example) with required settings:
# Required: Qwen embedding model path
QWEN_MODEL_0_6B=./models/qwen3_embedding_0.6B
# QWEN_MODEL_4B=./models/qwen3_embedding_4B
# QWEN_MODEL_8B=./models/qwen3_embedding_8B
# Required: SBERT model path (needed for CTM/BERTopic)
SBERT_MODEL_PATH=./models/sbert/sentence-transformers/all-MiniLM-L6-v2
# Required: Data and result directories
DATA_DIR=./data
WORKSPACE_DIR=./data/workspace
RESULT_DIR=./result
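THETA's scripts read these variables from the shell environment. If you want to sanity-check your `.env` from Python without extra dependencies, a minimal sketch is shown below; the `load_env_file` helper is hypothetical (THETA's own loader may differ) and only handles simple `KEY=VALUE` lines:

```python
import os

def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file (comments and blanks skipped)."""
    env = {}
    if not os.path.exists(path):
        return env
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

env = load_env_file()
for name in ("QWEN_MODEL_0_6B", "SBERT_MODEL_PATH", "DATA_DIR", "RESULT_DIR"):
    print(f"{name} -> {env.get(name, '(not set)')}")
```

Running this from the repository root prints each required variable, making a missing or mistyped path obvious before you launch training.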
Experiment Parameters config/default.yaml (Default Hyperparameters)
This file stores default training parameters for all models:
# Common training parameters
training:
epochs: 100
batch_size: 64
learning_rate: 0.002
# THETA-specific parameters
theta:
num_topics: 20
hidden_dim: 512
model_size: 0.6B
# Visualization settings
visualization:
language: en # English visualization
dpi: 150
Priority Rule
Parameter priority order:
CLI arguments > YAML defaults > Code fallback values
Example:
- `--num_topics 50` overrides `num_topics: 20` in YAML
- If neither CLI nor YAML specifies a value, code defaults are used
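The precedence rule can be sketched as a layered dictionary merge. This is an illustration only; `resolve_config` and the `CODE_DEFAULTS` values are hypothetical, not THETA's actual loader:

```python
# Hypothetical code-level fallback values (illustrative, not THETA's real defaults)
CODE_DEFAULTS = {"num_topics": 10, "epochs": 50, "learning_rate": 0.001}

def resolve_config(cli_args, yaml_config):
    """Merge layers so CLI beats YAML, and YAML beats code defaults."""
    config = dict(CODE_DEFAULTS)                                   # lowest priority
    config.update({k: v for k, v in yaml_config.items() if v is not None})
    config.update({k: v for k, v in cli_args.items() if v is not None})
    return config

# YAML sets num_topics=20 and epochs=100; the CLI overrides num_topics to 50
resolved = resolve_config(cli_args={"num_topics": 50},
                          yaml_config={"num_topics": 20, "epochs": 100})
# num_topics comes from CLI, epochs from YAML, learning_rate from code defaults
print(resolved)
```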
Running Modes: Beginner vs Expert
Beginner Mode: One-Click Automation (Bash Scripts)
Just prepare your data, and the script will automatically complete cleaning → preprocessing → training → evaluation → visualization:
# One-click training (specify language parameter)
bash scripts/quick_start.sh my_dataset --language chinese
bash scripts/quick_start.sh my_dataset --language english
Data Preparation Requirements:
- Place raw documents in the `data/{dataset}/` directory
- Supported formats: `.txt`, `.csv`, `.docx`, `.pdf`
Expert Mode: Surgical-Level Tuning (Python CLI)
Directly call Python scripts with precise control over every parameter:
# Train LDA, override default parameters
python src/models/run_pipeline.py \
--dataset my_dataset \
--models lda \
--num_topics 50 \
--learning_rate 0.01 \
--language chinese
# Train THETA with 4B model
python src/models/run_pipeline.py \
--dataset my_dataset \
--models theta \
--model_size 4B \
--num_topics 30 \
--epochs 200
Key Module Entry Points:
| Module | Entry Script | Function |
|--------|--------------|----------|
| Data Cleaning | src/models/dataclean/main.py | Text cleaning, tokenization, stopword removal |
| Data Preprocessing | src/models/prepare_data.py | Generate BOW matrix and embedding vectors |
| Model Training | src/models/run_pipeline.py | Training, evaluation, visualization all-in-one |
Output Map: Where Are the Results?
THETA and baseline models write their results to different paths, so note the distinction:
THETA Model Results
result/{dataset}/{model_size}/theta/exp_{timestamp}/
├── config.json # Experiment configuration
├── metrics.json # 7 evaluation metrics
├── data/ # Preprocessed data
│ ├── bow/ # BOW matrix
│ │ ├── bow_matrix.npy
│ │ ├── vocab.txt
│ │ ├── vocab.json
│ │ └── vocab_embeddings.npy
│ └── embeddings/ # Qwen document embeddings
│ ├── embeddings.npy
│ └── metadata.json
├── theta/ # Model parameters (fixed filenames, no timestamp)
│ ├── theta.npy # Document-topic distribution (D × K)
│ ├── beta.npy # Topic-word distribution (K × V)
│ ├── topic_embeddings.npy # Topic embedding vectors
│ ├── topic_words.json # Topic word list
│ ├── training_history.json # Training history
│ └── etm_model.pt # PyTorch model
└── {lang}/ # Visualization output (zh or en)
├── global/ # Global charts
│ ├── topic_table.csv
│ ├── topic_network.png
│ ├── topic_similarity.png
│ ├── topic_wordcloud.png
│ ├── 7_core_metrics.png
│ └── ...
└── topic/ # Topic details
├── topic_1/
│ └── word_importance.png
└── ...
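For downstream analysis, the saved arrays can be loaded with NumPy (`theta.npy` is documents × topics, `beta.npy` is topics × vocabulary, as noted above). The sketch below uses a hypothetical `dominant_topic` helper and a tiny synthetic matrix in place of a real `theta.npy`:

```python
def dominant_topic(row):
    """Index of the highest-probability topic in one document's theta row."""
    return max(range(len(row)), key=row.__getitem__)

# In a real run you would load the saved arrays with numpy, e.g.:
#   theta = np.load("result/my_dataset/0.6B/theta/exp_<timestamp>/theta/theta.npy")
#   beta  = np.load("result/my_dataset/0.6B/theta/exp_<timestamp>/theta/beta.npy")
# Here a tiny synthetic theta (3 documents, 2 topics) stands in for illustration:
theta = [[0.9, 0.1],
         [0.2, 0.8],
         [0.5, 0.5]]
print([dominant_topic(row) for row in theta])  # -> [0, 1, 0] (first index wins ties)
```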
Baseline Model Results (LDA, CTM, BTM, etc.)
result/{dataset}/{user_id}/{model}/exp_{timestamp}/
├── config.json # Experiment configuration
├── metrics_k{K}.json # 7 evaluation metrics
├── {model}/ # Model parameters
│ ├── theta_k{K}.npy # Document-topic distribution
│ ├── beta_k{K}.npy # Topic-word distribution
│ ├── model_k{K}.pkl # Model file
│ └── topic_words_k{K}.json
├── {lang}/ # Visualization directory (zh or en)
│ ├── global/ # Global comparison charts
│ │ ├── topic_network.png
│ │ ├── topic_similarity.png
│ │ └── ...
│ └── topic/ # Topic details
│ ├── topic_0/
│ │ ├── wordcloud.png
│ │ └── word_distribution.png
│ └── ...
└── README.md # Experiment summary
Path Summary
| Model Type | Result Path |
|------------|-------------|
| THETA | result/{dataset}/{model_size}/theta/exp_{timestamp}/ |
| Baseline Models | result/{dataset}/{user_id}/{model}/exp_{timestamp}/ |
Scientific Evaluation Standards
THETA enforces 7 Gold Standard Metrics to ensure evaluation alignment across all models (THETA and 12 baselines):
| Metric | Full Name | Description | Ideal Value |
|--------|-----------|-------------|-------------|
| TD | Topic Diversity | Uniqueness of topic words across topics | ↑ Higher is better |
| iRBO | Inverse Rank-Biased Overlap | Rank-based difference between topic word lists | ↑ Higher is better |
| NPMI | Normalized PMI | Normalized pointwise mutual information of topic word co-occurrence | ↑ Higher is better |
| C_V | C_V Coherence | Sliding-window-based coherence | ↑ Higher is better |
| UMass | UMass Coherence | Document co-occurrence-based coherence | ↑ Higher is better (values are negative) |
| Exclusivity | Topic Exclusivity | Degree to which top words belong to a single topic | ↑ Higher is better |
| PPL | Perplexity | Model fitting ability | ↓ Lower is better |
Note: Significance data is only used for visualization, not included in core evaluation metrics.
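As an illustration, Topic Diversity (TD) is commonly defined as the fraction of unique words among the top-N words of all topics. The sketch below implements that standard definition; it is not necessarily THETA's exact implementation:

```python
def topic_diversity(topic_words, top_n=10):
    """Fraction of unique words across the top-N words of every topic (0-1, higher is better)."""
    top = [words[:top_n] for words in topic_words]
    all_words = [w for words in top for w in words]
    return len(set(all_words)) / len(all_words)

# Two topics sharing one word ("market") out of 2 x 3 -> 5 unique / 6 total
topics = [["economy", "market", "trade"],
          ["school", "student", "market"]]
print(topic_diversity(topics, top_n=3))  # -> 0.8333...
```

A TD of 1.0 means no word is repeated across topics; values near 0 indicate heavily overlapping topics.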
Supported Models
Model Overview
| Model | Type | Description | Auto Topics | Best Use Case |
|-------|------|-------------|-------------|---------------|
| theta | Neural | THETA model with Qwen embeddings | No | General purpose, high quality |
| lda | Traditional | Latent Dirichlet Allocation | No | Fast baseline, highly interpretable |
| hdp | Traditional | Hierarchical Dirichlet Process | Yes | Unknown topic count |
| stm | Traditional | Structural Topic Model | No | Requires covariates |
| btm | Traditional | Biterm Topic Model | No | Short texts (tweets, titles) |
| etm | Neural | Embedded Topic Model | No | Word embedding integration |
| ctm | Neural | Contextualized Topic Model | No | Semantic understanding |
| dtm | Neural | Dynamic Topic Model | No | Time-stamped corpora (topic evolution) |