JATE — Just Automatic Term Extraction
A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 classical ATE algorithms, corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.
JATE v3 is a complete rewrite of the original Java JATE library (84+ GitHub stars), which was built on Apache Solr and used in academic and industry settings for over a decade. The Python version preserves all 13 original classical algorithms from the Java codebase — with every formula verified line-by-line against the original source — while removing the Solr dependency in favour of a self-contained, pip-installable package. It also adds ensemble voting via reciprocal rank fusion when comparing multiple algorithms. The original Java library is preserved on the legacy/java branch.
Sneak Peek
Try it now — no installation needed
Launch the live demo on Hugging Face Spaces — paste any text, pick from 14 algorithms, and see extracted terms instantly in your browser.
<p align="center"> <a href="https://huggingface.co/spaces/ziqizhang2026/jate-demo"> <img src="docs/assets/demo-huggingface.png" alt="JATE online demo on Hugging Face Spaces" width="700"> </a> </p>

Clone the repo for full features locally
The local UI gives you everything the online demo has and more — corpus-level extraction across entire directories, multi-algorithm comparison with a shared NLP pipeline, real-time progress streaming, and full CSV/JSON export. All processing happens on your machine, so there are no size limits and your data stays private.
pip install "jate[server]"
jate ui
<p align="center">
<img src="docs/assets/ui-corpus-results.png" alt="JATE local UI — side-by-side multi-algorithm corpus comparison" width="700">
</p>
Installation
pip install jate
Or from source:
git clone https://github.com/ziqizhang/jate.git
cd jate
pip install .
Requires Python 3.11+ and a spaCy model:
python -m spacy download en_core_web_sm
Quick start
Single document
import jate
# Extract terms from text (default: C-Value + POS pattern extraction)
result = jate.extract("Your document text here...")
for term in result:
    print(f"{term.string:30s} score={term.score:.4f} surfaces={term.surface_forms}")
Corpus-level extraction
import jate
# From a list of texts
result = jate.extract_corpus(
    ["First document...", "Second document..."],
    algorithm="tfidf",
)
# From a directory of text files
result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")
# Export results
df = result.to_dataframe()
print(result.to_csv())
Compare algorithms
import jate
results = jate.compare(
    ["Doc one...", "Doc two..."],
    algorithms=["cvalue", "tfidf", "rake", "weirdness"],
)
for algo_name, result in results.items():
    print(f"\n{algo_name}: {len(result)} terms")
    for term in list(result)[:5]:
        print(f" {term.string:30s} {term.score:.4f}")
For large corpora, NLP processing (spaCy) uses multi-threaded C-level batching, and feature building (adjacent word computation) uses multi-process parallelism automatically.
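JATE handles this parallelism automatically, but the multi-process feature-building pattern is easy to picture. The sketch below is a generic illustration using only the standard library — `adjacent_words` and `build_features` are hypothetical names, not JATE's internals:

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor

def adjacent_words(tokens: list[str]) -> Counter:
    """Count (left, right) neighbour pairs within one tokenised document."""
    return Counter(zip(tokens, tokens[1:]))

def build_features(docs: list[list[str]]) -> Counter:
    """Aggregate adjacency counts across documents using worker processes."""
    total: Counter = Counter()
    with ProcessPoolExecutor() as pool:
        # Each document is counted independently, then merged in the parent.
        for counts in pool.map(adjacent_words, docs):
            total.update(counts)
    return total
```

Because each document's counts are independent, the per-document work parallelises cleanly and only the cheap merge step runs serially.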
Evaluation against a gold standard
import jate
result = jate.extract_corpus(docs, algorithm="cvalue")
evaluator = jate.Evaluator({"machine learning", "neural network", ...})
eval_result = evaluator.evaluate(result)
print(eval_result.summary())
# P=0.2800 R=0.0644 F1=0.1047 TP=28 FP=72 FN=407 predicted=100 gold=435
# Evaluate top-k
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
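The numbers in the summary above follow the standard set-based definitions of precision, recall, and F1. A minimal sketch (the `prf` helper is hypothetical, not JATE's `Evaluator`):

```python
def prf(predicted: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for term evaluation."""
    tp = len(predicted & gold)                  # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0  # harmonic mean of P and R
    return p, r, f1
```

Plugging in the counts from the summary line (TP=28, predicted=100, gold=435) reproduces P=0.2800, R=0.0644, F1=0.1047.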
CLI
# Extract terms from text
jate extract "Your text here" --algorithm cvalue --top 20
# Extract from a corpus directory
jate corpus path/to/docs/ --algorithm tfidf --output csv
# Compare algorithms on a corpus
jate compare path/to/docs/ --algorithms cvalue tfidf rake
# Run benchmark on built-in dataset (use --list-datasets to see all options)
jate benchmark --dataset acl_rdtec_mini --top 100
REST API (thin server)
JATE now ships a thin JSON API server on top of the core extraction API.
Install server dependencies:
pip install "jate[server]"
Start the server:
jate-api
Or with Python module execution:
python -m uvicorn jate.server:app --host 0.0.0.0 --port 8000
Extract terms over HTTP:
curl --header "Content-Type: application/json" \
--request POST \
--data '{"text":"text to process","algorithm":"cvalue"}' \
http://localhost:8000/jate/api/v1/extract
Health checks:
curl http://localhost:8000/health/live
curl http://localhost:8000/health/ready
Docker / Containerization
Build the image from repo root:
docker build -t jate:latest .
Run modes:
# 1) CLI mode (default)
docker run --rm jate:latest jate extract "local post office" --algorithm cvalue --top 20
# Corpus mode with local volume mount (recommended for local files)
docker run --rm -v "/path/to/local/folder:/data" jate:latest \
jate corpus /data --algorithm cvalue --top 20
# 2) API mode (explicit)
docker run --rm -d -p 8000:8000 --name jate-api-test jate:latest jate-api
# 3) Interactive mode with local corpus volume
docker run -it --rm -v "$(pwd)/path/to/docs:/data" jate:latest sh
# inside container:
# jate corpus /data --algorithm tfidf --output csv
Test API endpoints (when running API mode):
# Liveness
curl -s http://localhost:8000/health/live
# Readiness (validates spaCy model availability)
curl -s http://localhost:8000/health/ready
# Capabilities
curl -s http://localhost:8000/jate/api/v1/capabilities
# Extract terms
curl -s -X POST http://localhost:8000/jate/api/v1/extract \
-H "Content-Type: application/json" \
-d '{"text":"Russia says its consulate in Isfahan, Iran was damaged over the weekend as a result of strikes on the local governor'\''s office.","algorithm":"cvalue","top":6}'
Stop API mode container:
docker stop jate-api-test
Run dual-mode Docker smoke checks (CLI + API) with one build:
bash scripts/docker_smoke_test.sh
Expected extract response shape:
{
  "algorithm": "cvalue",
  "extractor": "pos_pattern",
  "model": "en_core_web_sm",
  "top": 6,
  "terms": [
    {
      "rank": 1,
      "term": "local governors office",
      "score": 1.6323,
      "frequency": 1,
      "surface_forms": ["local governors office"],
      "metadata": {}
    }
  ]
}
Algorithms
| Algorithm | Description | Reference |
|-----------|-------------|-----------|
| tfidf | TF-IDF at corpus level | — |
| cvalue | Multi-word term extraction via nested term frequency | Frantzi et al. 2000 |
| ncvalue | C-Value extended with context word information | Frantzi et al. 2000 |
| basic | Frequency + containment scoring | Bordea et al. 2013 |
| combobasic | Basic with parent and child containment | Bordea et al. 2013 |
| attf | Average total term frequency (TTF / DF) | — |
| ttf | Raw total term frequency | — |
| ridf | Residual IDF (deviation from Poisson) | Church & Gale 1995 |
| rake | Rapid Automatic Keyword Extraction | Rose et al. 2010 |
| chi_square | Chi-square test for term independence | Matsuo & Ishizuka 2003 |
| weirdness | Target vs reference corpus frequency ratio | Ahmad et al. 1999 |
| termex | Domain pertinence + context + lexical cohesion | Sclano et al. 2007 |
| glossex | Domain specificity via glossary comparison | Park et al. 2002 |
| nmf | Topic modelling via Non-negative Matrix Factorisation | — |
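To give a flavour of how these scores are computed, here is an illustrative sketch of C-Value (Frantzi et al. 2000): the log of a candidate's length in words, times its frequency, discounted by the average frequency of longer candidates that contain it. This is a simplification (string containment for nesting, and a floor of two words on the length factor so unigrams are not zeroed out), not JATE's implementation:

```python
import math
from collections import defaultdict

def c_value(freq: dict[str, int]) -> dict[str, float]:
    """Illustrative C-Value; freq maps candidate term -> corpus frequency."""
    # For each candidate, find the longer candidates that contain it.
    nested_in: dict[str, list[str]] = defaultdict(list)
    for a in freq:
        for b in freq:
            if a != b and f" {a} " in f" {b} ":
                nested_in[a].append(b)
    scores = {}
    for a, f_a in freq.items():
        # Length factor, floored at 2 words so unigrams score non-zero.
        length_factor = math.log2(max(len(a.split()), 2))
        containers = nested_in[a]
        if containers:
            # Discount by the average frequency of containing candidates.
            scores[a] = length_factor * (f_a - sum(freq[b] for b in containers) / len(containers))
        else:
            scores[a] = length_factor * f_a
    return scores
```

The discount is what makes C-Value favour genuine multi-word terms over fragments that mostly occur nested inside longer terms.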
Multi-algorithm comparison is available via jate.compare(), which also supports ensemble voting via reciprocal rank fusion (voting=True).
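Reciprocal rank fusion itself is simple: each term earns 1/(k + rank) from every algorithm's ranking, with k conventionally set to 60. A generic sketch of the standard formula (not JATE's exact code):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked term lists into one consensus ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            # Terms ranked highly by several algorithms accumulate the most score.
            scores[term] = scores.get(term, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because only ranks (not raw scores) are used, RRF needs no score normalisation across algorithms with very different scales.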
Neural taggers (optional)
JATE also supports transformer-based term taggers that extract terms per-document using BIO sequence labelling. Install with pip install "jate[neural]".
| Tagger | Description | Reference |
|--------|-------------|-----------|
| xlmr-tagger | XLM-RoBERTa token classifier, multilingual (100 languages) | Lang et al. 2021 |
| roberta-tagger | RoBERTa token classifier, English only, faster | — |
from jate.algorithms.bert_tagger import XLMRTagger
tagger = XLMRTagger() # auto-downloads from HuggingFace on first use
result = tagger.tag("Corruption in public procurement is a major challenge.")
for term in result:
    print(f"{term.string:30s} confidence={term.score:.4f}")
Pre-trained model: ziqizhang2026/jate-ate-xlmr (trained on ACTER). Train your own: open examples/train_bert_tagger.ipynb in Google Colab.
Try the demo: python examples/tagger_demo.py
Candidate extractors
| Extractor | Description |
|-----------|-------------|
| pos_pattern (default) | Regex over Universal POS tags (default: (ADJ\|NOUN\|PROPN)*(NOUN\|PROPN), configurable via pattern presets) |
| ngram | Contiguous token n-grams (configurable min/max n) |
| noun_phrase | spaCy noun chunk detection |
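The default pos_pattern extractor matches a regex over a document's sequence of Universal POS tags. The idea can be sketched as follows (a hypothetical helper, not JATE's extractor, taking pre-tagged tokens):

```python
import re

def pos_pattern_candidates(tagged: list[tuple[str, str]]) -> list[str]:
    """Match the default (ADJ|NOUN|PROPN)*(NOUN|PROPN) pattern over POS tags."""
    # Encode each tag as one character so regex spans map to token indices.
    code = {"ADJ": "A", "NOUN": "N", "PROPN": "P"}
    tag_str = "".join(code.get(tag, "x") for _, tag in tagged)
    candidates = []
    for m in re.finditer(r"[ANP]*[NP]", tag_str):
        candidates.append(" ".join(tok for tok, _ in tagged[m.start():m.end()]))
    return candidates
```

Greedy matching means the longest tag run ending in a noun wins, so "deep neural network" is extracted whole rather than as separate fragments.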
How it works
- Candidate extraction — identifies potential terms using POS patterns, n-grams, or noun phrases
- Lemmatisation — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
- Sentence context (automatic) — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
- Corpus statistics — builds frequency and co-occurrence counts (in-memory or SQLite-backed)
- Scoring — applies the chosen algorithm to rank candidates
- Output — returns ranked terms with scores and surface forms
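For example, the lemmatisation step above amounts to grouping surface forms under a lemma key. A minimal sketch with hypothetical names (JATE derives the lemmas via spaCy):

```python
from collections import defaultdict

def merge_surface_forms(candidates: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (surface, lemma) pairs so each lemma keeps its distinct surfaces."""
    merged: dict[str, list[str]] = defaultdict(list)
    for surface, lemma in candidates:
        if surface not in merged[lemma]:
            merged[lemma].append(surface)
    return dict(merged)
```

This is why a result term carries both a canonical string and a list of surface forms, as shown in the quick start.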