LLM-adaptive embeddings (zero-shot / LoRA) with generative topic modeling and an agent-based workflow for social science text mining.

<div align="center"> <img src="assets/THETA.png" width="40%" alt="THETA Logo"/> <h1>THETA (θ)</h1>

Platform HuggingFace Paper

English | 中文

THETA (θ) is a low-barrier, high-performance LLM-enhanced topic analysis platform for social science research.

</div>

## Table of Contents

  1. Quick Start: 5-Minute Setup
  2. Configuration System: From Hardware to Experiments
  3. Running Modes: Beginner vs Expert
  4. Output Map: Where Are the Results?
  5. Scientific Evaluation Standards
  6. Supported Models
  7. Training Parameters Reference
  8. FAQ

## Quick Start: 5-Minute Setup

### Step 1: Clone the Repository

```bash
git clone https://github.com/CodeSoul-co/THETA.git
cd THETA
```

### Step 2: Environment Isolation (Conda)

```bash
conda create -n theta python=3.10 -y
conda activate theta
```

### Step 3: Install Dependencies + Download Models

```bash
bash scripts/env_setup.sh
```

### Step 4: Configure Environment Variables

```bash
# Copy the configuration template
cp .env.example .env

# Edit the .env file to configure model paths.
# At minimum, set QWEN_MODEL_0_6B and SBERT_MODEL_PATH.
```

### Step 5: Load Environment Variables

```bash
# If you hit a "$'\r': command not found" error, fix Windows line endings first
sed -i 's/\r$//' scripts/env_setup.sh

# Load environment variables into the current shell (required for subsequent scripts)
source scripts/env_setup.sh
```

Model Download Links:

| Model | Purpose | Download Link |
|-------|---------|---------------|
| Qwen3-Embedding-0.6B | THETA document embedding | ModelScope |
| all-MiniLM-L6-v2 | CTM/SBERT embedding | HuggingFace |

Place the downloaded models in the `models/` directory:

```text
models/
├── qwen3_embedding_0.6B/
└── sbert/sentence-transformers/all-MiniLM-L6-v2/
```
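Before training, it can help to confirm the models landed where the layout above expects them. A minimal stdlib sketch (the helper `missing_models` is illustrative, not part of THETA):

```python
from pathlib import Path

# Expected model locations, taken from the directory layout above
EXPECTED = [
    "models/qwen3_embedding_0.6B",
    "models/sbert/sentence-transformers/all-MiniLM-L6-v2",
]

def missing_models(root: str = ".") -> list[str]:
    """Return the expected model directories that do not exist under root."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).is_dir()]

if __name__ == "__main__":
    gaps = missing_models()
    if gaps:
        print("Missing model directories:", ", ".join(gaps))
    else:
        print("All model directories present.")
```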

## Configuration System: From Hardware to Experiments

THETA uses a layered configuration architecture for flexible control, from hardware paths to experiment parameters.

### Core Configuration File: `.env` (Hardware Paths)

Create a `.env` file (see `.env.example`) with the required settings:

```bash
# Required: Qwen embedding model path
QWEN_MODEL_0_6B=./models/qwen3_embedding_0.6B
# QWEN_MODEL_4B=./models/qwen3_embedding_4B
# QWEN_MODEL_8B=./models/qwen3_embedding_8B

# Required: SBERT model path (needed for CTM/BERTopic)
SBERT_MODEL_PATH=./models/sbert/sentence-transformers/all-MiniLM-L6-v2

# Required: Data and result directories
DATA_DIR=./data
WORKSPACE_DIR=./data/workspace
RESULT_DIR=./result
```
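THETA presumably loads `.env` through its own tooling (e.g. a library like `python-dotenv`); purely to illustrate the `KEY=VALUE` format above, a minimal stdlib parser might look like this (`load_env` is a hypothetical helper):

```python
import os

def load_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

sample = """
# Required: Qwen embedding model path
QWEN_MODEL_0_6B=./models/qwen3_embedding_0.6B
DATA_DIR=./data
"""
os.environ.update(load_env(sample))
```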

### Experiment Parameters: `config/default.yaml` (Default Hyperparameters)

This file stores the default training parameters for all models:

```yaml
# Common training parameters
training:
  epochs: 100
  batch_size: 64
  learning_rate: 0.002

# THETA-specific parameters
theta:
  num_topics: 20
  hidden_dim: 512
  model_size: 0.6B

# Visualization settings
visualization:
  language: en    # English visualization
  dpi: 150
```

### Priority Rule

Parameters are resolved in this priority order:

```text
CLI arguments  >  YAML defaults  >  Code fallback values
```

Example:

- `--num_topics 50` overrides `num_topics: 20` in YAML
- If neither the CLI nor YAML specifies a value, the code default is used
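The precedence rule amounts to a layered dict merge, where only CLI flags that were actually passed count as set. A sketch under that assumption (`resolve_params` is illustrative, not THETA's actual internals):

```python
def resolve_params(cli: dict, yaml_cfg: dict, code_defaults: dict) -> dict:
    """Merge parameter sources: CLI args > YAML defaults > code fallbacks.

    Only CLI keys whose value is not None count as explicitly set.
    """
    merged = dict(code_defaults)           # lowest priority: code fallbacks
    merged.update(yaml_cfg)                # then YAML defaults
    merged.update({k: v for k, v in cli.items() if v is not None})  # CLI wins
    return merged

# --num_topics 50 on the CLI beats num_topics: 20 from YAML;
# epochs was not passed on the CLI, so YAML's 100 survives.
params = resolve_params(
    cli={"num_topics": 50, "epochs": None},
    yaml_cfg={"num_topics": 20, "epochs": 100},
    code_defaults={"num_topics": 10, "epochs": 50, "batch_size": 64},
)
```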

## Running Modes: Beginner vs. Expert

### Beginner Mode: One-Click Automation (Bash Scripts)

Just prepare your data, and the script automatically runs cleaning → preprocessing → training → evaluation → visualization:

```bash
# One-click training (specify the language parameter)
bash scripts/quick_start.sh my_dataset --language chinese
bash scripts/quick_start.sh my_dataset --language english
```

Data Preparation Requirements:

- Place raw documents in the `data/{dataset}/` directory
- Supported formats: `.txt`, `.csv`, `.docx`, `.pdf`
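As a sketch of what the cleaning step presumably scans for (the actual behavior is defined by `src/models/dataclean/main.py`), collecting files with the supported extensions via `pathlib` might look like:

```python
from pathlib import Path

# Formats listed in the data preparation requirements above
SUPPORTED = {".txt", ".csv", ".docx", ".pdf"}

def find_documents(dataset_dir: str) -> list[Path]:
    """Recursively list raw documents with a supported extension."""
    return sorted(
        p for p in Path(dataset_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```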

### Expert Mode: Fine-Grained Tuning (Python CLI)

Call the Python scripts directly for precise control over every parameter:

```bash
# Train LDA, overriding default parameters
python src/models/run_pipeline.py \
    --dataset my_dataset \
    --models lda \
    --num_topics 50 \
    --learning_rate 0.01 \
    --language chinese

# Train THETA with the 4B model
python src/models/run_pipeline.py \
    --dataset my_dataset \
    --models theta \
    --model_size 4B \
    --num_topics 30 \
    --epochs 200
```

Key Module Entry Points:

| Module | Entry Script | Function |
|--------|--------------|----------|
| Data Cleaning | `src/models/dataclean/main.py` | Text cleaning, tokenization, stopword removal |
| Data Preprocessing | `src/models/prepare_data.py` | Generate BOW matrix and embedding vectors |
| Model Training | `src/models/run_pipeline.py` | Training, evaluation, and visualization in one |


## Output Map: Where Are the Results?

THETA and the baseline models write results to different paths, so note the distinction:

### THETA Model Results

```text
result/{dataset}/{model_size}/theta/exp_{timestamp}/
├── config.json                     # Experiment configuration
├── metrics.json                    # 7 evaluation metrics
├── data/                           # Preprocessed data
│   ├── bow/                        # BOW matrix
│   │   ├── bow_matrix.npy
│   │   ├── vocab.txt
│   │   ├── vocab.json
│   │   └── vocab_embeddings.npy
│   └── embeddings/                 # Qwen document embeddings
│       ├── embeddings.npy
│       └── metadata.json
├── theta/                          # Model parameters (fixed filenames, no timestamp)
│   ├── theta.npy                   # Document-topic distribution (D × K)
│   ├── beta.npy                    # Topic-word distribution (K × V)
│   ├── topic_embeddings.npy        # Topic embedding vectors
│   ├── topic_words.json            # Topic word list
│   ├── training_history.json       # Training history
│   └── etm_model.pt                # PyTorch model
└── {lang}/                         # Visualization output (zh or en)
    ├── global/                     # Global charts
    │   ├── topic_table.csv
    │   ├── topic_network.png
    │   ├── topic_similarity.png
    │   ├── topic_wordcloud.png
    │   ├── 7_core_metrics.png
    │   └── ...
    └── topic/                      # Topic details
        ├── topic_1/
        │   └── word_importance.png
        └── ...
```
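`theta.npy` holds the D × K document-topic distribution; in practice you would load it with `numpy`, but the dominant-topic lookup it enables is just a row-wise argmax. A plain-list sketch (the helper `dominant_topics` is illustrative, not a THETA API):

```python
def dominant_topics(theta: list[list[float]]) -> list[int]:
    """For each document (row of a D x K distribution), return its argmax topic index."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]

# Three documents over K = 3 topics; each row sums to 1
theta = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.3, 0.4, 0.3],
]
```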

### Baseline Model Results (LDA, CTM, BTM, etc.)

```text
result/{dataset}/{user_id}/{model}/exp_{timestamp}/
├── config.json                     # Experiment configuration
├── metrics_k{K}.json               # 7 evaluation metrics
├── {model}/                        # Model parameters
│   ├── theta_k{K}.npy              # Document-topic distribution
│   ├── beta_k{K}.npy               # Topic-word distribution
│   ├── model_k{K}.pkl              # Model file
│   └── topic_words_k{K}.json
├── {lang}/                         # Visualization directory (zh or en)
│   ├── global/                     # Global comparison charts
│   │   ├── topic_network.png
│   │   ├── topic_similarity.png
│   │   └── ...
│   └── topic/                      # Topic details
│       ├── topic_0/
│       │   ├── wordcloud.png
│       │   └── word_distribution.png
│       └── ...
└── README.md                       # Experiment summary
```

### Path Summary

| Model Type | Result Path |
|------------|-------------|
| THETA | `result/{dataset}/{model_size}/theta/exp_{timestamp}/` |
| Baseline Models | `result/{dataset}/{user_id}/{model}/exp_{timestamp}/` |


## Scientific Evaluation Standards

THETA enforces 7 Gold Standard Metrics to ensure evaluation alignment across all models (THETA and 12 baselines):

| Metric | Full Name | Description | Ideal Value |
|--------|-----------|-------------|-------------|
| TD | Topic Diversity | Uniqueness of top words across topics | ↑ Higher is better |
| iRBO | Inverted Rank-Biased Overlap | Inter-topic difference via rank-biased overlap | ↑ Higher is better |
| NPMI | Normalized PMI | Normalized pointwise mutual information of topic word co-occurrence | ↑ Higher is better |
| C_V | C_V Coherence | Sliding-window-based coherence | ↑ Higher is better |
| UMass | UMass Coherence | Document co-occurrence-based coherence | ↑ Higher is better (values are negative; closer to 0 is better) |
| Exclusivity | Topic Exclusivity | Degree to which top words belong to a single topic | ↑ Higher is better |
| PPL | Perplexity | Model fit on held-out text | ↓ Lower is better |

Note: Significance data is only used for visualization, not included in core evaluation metrics.
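For intuition on the first metric: Topic Diversity (TD) is commonly computed as the fraction of unique words among the top-N words of all topics, so TD = 1.0 means no topic repeats another's top words. A sketch under that common definition (which may differ in detail from THETA's implementation):

```python
def topic_diversity(topic_words: list[list[str]]) -> float:
    """TD = unique top words / total top words across topics (1.0 = fully distinct)."""
    all_words = [w for topic in topic_words for w in topic]
    return len(set(all_words)) / len(all_words) if all_words else 0.0

topics = [
    ["economy", "market", "trade"],
    ["health", "hospital", "market"],  # "market" repeats, so TD drops below 1
]
```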


## Supported Models

### Model Overview

| Model | Type | Description | Auto Topics | Best Use Case |
|-------|------|-------------|-------------|---------------|
| theta | Neural | THETA model with Qwen embeddings | No | General purpose, high quality |
| lda | Traditional | Latent Dirichlet Allocation | No | Fast baseline, highly interpretable |
| hdp | Traditional | Hierarchical Dirichlet Process | Yes | Unknown topic count |
| stm | Traditional | Structural Topic Model | No | Requires covariates |
| btm | Traditional | Biterm Topic Model | No | Short texts (tweets, titles) |
| etm | Neural | Embedded Topic Model | No | Word embedding integration |
| ctm | Neural | Contextualized Topic Model | No | Semantic understanding |
| dtm | Neural | Dynamic Topic Model | … | … |
