<p align="center"> <img src="docs/assets/jate-logo.png" alt="JATE — Term Extraction" width="300"> </p>

JATE — Just Automatic Term Extraction

A Python library for automatic term extraction (ATE) from text corpora. JATE provides 14 classical ATE algorithms, corpus-level statistics, built-in evaluation, and a CLI — all pip-installable with no external services required.

JATE v3 is a complete rewrite of the original Java JATE library (84+ GitHub stars), which was built on Apache Solr and used in academic and industry settings for over a decade. The Python version preserves all 13 original classical algorithms from the Java codebase — with every formula verified line-by-line against the original source — while removing the Solr dependency in favour of a self-contained, pip-installable package. It also adds ensemble voting via reciprocal rank fusion when comparing multiple algorithms. The original Java library is preserved on the legacy/java branch.

Sneak Peek

Try it now — no installation needed

Launch the live demo on Hugging Face Spaces — paste any text, pick from 14 algorithms, and see extracted terms instantly in your browser.

<p align="center"> <a href="https://huggingface.co/spaces/ziqizhang2026/jate-demo"> <img src="docs/assets/demo-huggingface.png" alt="JATE online demo on Hugging Face Spaces" width="700"> </a> </p>

Clone the repo for full features locally

The local UI gives you everything the online demo has and more — corpus-level extraction across entire directories, multi-algorithm comparison with a shared NLP pipeline, real-time progress streaming, and full CSV/JSON export. All processing happens on your machine, so there are no size limits and your data stays private.

```shell
pip install "jate[server]"
jate ui
```
<p align="center"> <img src="docs/assets/ui-corpus-results.png" alt="JATE local UI — side-by-side multi-algorithm corpus comparison" width="700"> </p>

Installation

```shell
pip install jate
```

Or from source:

```shell
git clone https://github.com/ziqizhang/jate.git
cd jate
pip install .
```

Requires Python 3.11+ and a spaCy model:

```shell
python -m spacy download en_core_web_sm
```

Quick start

Single document

```python
import jate

# Extract terms from text (default: C-Value + POS pattern extraction)
result = jate.extract("Your document text here...")

for term in result:
    print(f"{term.string:30s}  score={term.score:.4f}  surfaces={term.surface_forms}")
```

Corpus-level extraction

```python
import jate

# From a list of texts
result = jate.extract_corpus(
    ["First document...", "Second document..."],
    algorithm="tfidf",
)

# From a directory of text files
result = jate.extract_corpus("path/to/corpus/", algorithm="cvalue")

# Export results
df = result.to_dataframe()
print(result.to_csv())
```
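The tfidf algorithm used above follows the standard scheme: corpus-level term frequency weighted by inverse document frequency. A stdlib-only sketch of one common variant (illustrative only; JATE's exact weighting and smoothing may differ):

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Corpus-level TF-IDF: total term frequency times smoothed IDF."""
    n_docs = len(docs)
    tf = Counter(t for doc in docs for t in doc)       # total term frequency
    df = Counter(t for doc in docs for t in set(doc))  # document frequency
    return {t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf}

# Terms appearing often across the corpus still rank highly,
# while the IDF factor rewards terms concentrated in few documents.
scores = tfidf_scores([["neural", "network"], ["neural", "model"]])
```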

Compare algorithms

```python
import jate

results = jate.compare(
    ["Doc one...", "Doc two..."],
    algorithms=["cvalue", "tfidf", "rake", "weirdness"],
)

for algo_name, result in results.items():
    print(f"\n{algo_name}: {len(result)} terms")
    for term in list(result)[:5]:
        print(f"  {term.string:30s}  {term.score:.4f}")
```

For large corpora, NLP processing (spaCy) uses multi-threaded C-level batching, and feature building (adjacent word computation) uses multi-process parallelism automatically.

Evaluation against a gold standard

```python
import jate

result = jate.extract_corpus(docs, algorithm="cvalue")

evaluator = jate.Evaluator({"machine learning", "neural network", ...})
eval_result = evaluator.evaluate(result)
print(eval_result.summary())
# P=0.2800  R=0.0644  F1=0.1047  TP=28  FP=72  FN=407  predicted=100  gold=435
```

```python
# Evaluate top-k
eval_at_50 = evaluator.evaluate_at_k(result, k=50)
```
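The summary line above reports standard set-based metrics over the predicted and gold term sets. A minimal sketch of how precision, recall, and F1 fall out of the overlap (a hypothetical helper, not JATE's Evaluator):

```python
def prf(predicted, gold):
    """Set-based precision, recall and F1 over extracted vs gold terms."""
    tp = len(predicted & gold)            # true positives: terms in both sets
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

p, r, f1 = prf({"machine learning", "deep net"}, {"machine learning", "neural network"})
```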

CLI

```shell
# Extract terms from text
jate extract "Your text here" --algorithm cvalue --top 20

# Extract from a corpus directory
jate corpus path/to/docs/ --algorithm tfidf --output csv

# Compare algorithms on a corpus
jate compare path/to/docs/ --algorithms cvalue tfidf rake

# Run benchmark on built-in dataset (use --list-datasets to see all options)
jate benchmark --dataset acl_rdtec_mini --top 100
```

REST API (thin server)

JATE now ships a thin JSON API server on top of the core extraction API.

Install server dependencies:

```shell
pip install "jate[server]"
```

Start the server:

```shell
jate-api
```

Or with Python module execution:

```shell
python -m uvicorn jate.server:app --host 0.0.0.0 --port 8000
```

Extract terms over HTTP:

```shell
curl --header "Content-Type: application/json" \
    --request POST \
    --data '{"text":"text to process","algorithm":"cvalue"}' \
    http://localhost:8000/jate/api/v1/extract
```

Health checks:

```shell
curl http://localhost:8000/health/live
curl http://localhost:8000/health/ready
```

Docker / Containerization

Build the image from repo root:

```shell
docker build -t jate:latest .
```

Run modes:

```shell
# 1) CLI mode (default)
docker run --rm jate:latest jate extract "local post office" --algorithm cvalue --top 20

# Corpus mode with local volume mount (recommended for local files)
docker run --rm -v "/path/to/local/folder:/data" jate:latest \
    jate corpus /data --algorithm cvalue --top 20

# 2) API mode (explicit)
docker run --rm -d -p 8000:8000 --name jate-api-test jate:latest jate-api

# 3) Interactive mode with local corpus volume
docker run -it --rm -v "$(pwd)/path/to/docs:/data" jate:latest sh
# inside container:
# jate corpus /data --algorithm tfidf --output csv
```

Test API endpoints (when running API mode):

```shell
# Liveness
curl -s http://localhost:8000/health/live

# Readiness (validates spaCy model availability)
curl -s http://localhost:8000/health/ready

# Capabilities
curl -s http://localhost:8000/jate/api/v1/capabilities

# Extract terms
curl -s -X POST http://localhost:8000/jate/api/v1/extract \
    -H "Content-Type: application/json" \
    -d '{"text":"Russia says its consulate in Isfahan, Iran was damaged over the weekend as a result of strikes on the local governor'\''s office.","algorithm":"cvalue","top":6}'
```

Stop API mode container:

```shell
docker stop jate-api-test
```

Run dual-mode Docker smoke checks (CLI + API) with one build:

```shell
bash scripts/docker_smoke_test.sh
```

Expected extract response shape:

```json
{
    "algorithm": "cvalue",
    "extractor": "pos_pattern",
    "model": "en_core_web_sm",
    "top": 6,
    "terms": [
        {
            "rank": 1,
            "term": "local governors office",
            "score": 1.6323,
            "frequency": 1,
            "surface_forms": ["local governors office"],
            "metadata": {}
        }
    ]
}
```

Algorithms

| Algorithm | Description | Reference |
|-----------|-------------|-----------|
| tfidf | TF-IDF at corpus level | — |
| cvalue | Multi-word term extraction via nested term frequency | Frantzi et al. 2000 |
| ncvalue | C-Value extended with context word information | Frantzi et al. 2000 |
| basic | Frequency + containment scoring | Bordea et al. 2013 |
| combobasic | Basic with parent and child containment | Bordea et al. 2013 |
| attf | Average total term frequency (TTF / DF) | — |
| ttf | Raw total term frequency | — |
| ridf | Residual IDF (deviation from Poisson) | Church & Gale 1995 |
| rake | Rapid Automatic Keyword Extraction | Rose et al. 2010 |
| chi_square | Chi-square test for term independence | Matsuo & Ishizuka 2003 |
| weirdness | Target vs reference corpus frequency ratio | Ahmad et al. 1999 |
| termex | Domain pertinence + context + lexical cohesion | Sclano et al. 2007 |
| glossex | Domain specificity via glossary comparison | Park et al. 2002 |
| nmf | Topic modelling via Non-negative Matrix Factorisation | — |

Multi-algorithm comparison is available via jate.compare(), which also supports ensemble voting via reciprocal rank fusion (voting=True).
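Reciprocal rank fusion scores each term by summing 1/(k + rank) over the per-algorithm rankings, so terms ranked highly by several algorithms rise to the top regardless of incompatible score scales. A minimal sketch (k = 60 is a conventional constant from the IR literature; JATE's parameters may differ):

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Fuse several ranked term lists into one by reciprocal rank voting."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, term in enumerate(ranking, start=1):
            scores[term] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# "b" is ranked 2nd and 1st, beating "a" (1st and 3rd)
fused = rrf([["a", "b", "c"], ["b", "c", "a"]])
```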

Neural taggers (optional)

JATE also supports transformer-based term taggers that extract terms per-document using BIO sequence labelling. Install with pip install "jate[neural]".

| Tagger | Description | Reference |
|--------|-------------|-----------|
| xlmr-tagger | XLM-RoBERTa token classifier, multilingual (100 languages) | Lang et al. 2021 |
| roberta-tagger | RoBERTa token classifier, English only, faster | — |

```python
from jate.algorithms.bert_tagger import XLMRTagger

tagger = XLMRTagger()  # auto-downloads from HuggingFace on first use
result = tagger.tag("Corruption in public procurement is a major challenge.")

for term in result:
    print(f"{term.string:30s}  confidence={term.score:.4f}")
```

Pre-trained model: ziqizhang2026/jate-ate-xlmr (trained on ACTER). Train your own: run the examples/train_bert_tagger.ipynb notebook on Google Colab.

Try the demo: python examples/tagger_demo.py

Candidate extractors

| Extractor | Description |
|-----------|-------------|
| pos_pattern (default) | Regex over Universal POS tags (default: (ADJ\|NOUN\|PROPN)*(NOUN\|PROPN), configurable via pattern presets) |
| ngram | Contiguous token n-grams (configurable min/max n) |
| noun_phrase | spaCy noun chunk detection |
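The pos_pattern approach, a regex matched against the sequence of POS tags rather than the text itself, can be sketched with a toy one-letter tag encoding (illustrative only; JATE's extractor operates on spaCy tokens and full tag names):

```python
import re

# Toy encoding: one letter per Universal POS tag of interest
TAG_CODE = {"ADJ": "A", "NOUN": "N", "PROPN": "P", "VERB": "V", "DET": "D"}

def pos_pattern_candidates(tokens, tags, pattern=r"[ANP]*[NP]"):
    """Return token spans whose POS tag sequence matches the pattern,
    i.e. the default (ADJ|NOUN|PROPN)*(NOUN|PROPN) shape."""
    code = "".join(TAG_CODE.get(t, "X") for t in tags)
    return [
        " ".join(tokens[m.start():m.end()])
        for m in re.finditer(pattern, code)
        if m.end() > m.start()
    ]

cands = pos_pattern_candidates(
    ["the", "deep", "neural", "network", "learns"],
    ["DET", "ADJ", "NOUN", "NOUN", "VERB"],
)
```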

How it works

  1. Candidate extraction — identifies potential terms using POS patterns, n-grams, or noun phrases
  2. Lemmatisation — normalises candidates to their lemmatised form (e.g. "neural networks" and "neural network" become one entry)
  3. Sentence context (automatic) — builds sentence co-occurrence and adjacency features for algorithms that use them (Chi-Square, NC-Value)
  4. Corpus statistics — builds frequency and co-occurrence counts (in-memory or SQLite-backed)
  5. Scoring — applies the chosen algorithm to rank candidates
  6. Output — returns ranked terms with scores, iterable and exportable to CSV, JSON, or a DataFrame
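As one example of the scoring step, C-value (Frantzi et al. 2000) weights a candidate's frequency by the log of its length and discounts frequency inherited from longer candidates that nest it. A simplified stdlib sketch (not JATE's implementation; real C-value also tracks how many distinct longer terms nest each candidate across the corpus):

```python
import math

def c_value(freq):
    """Simplified C-value over candidate frequencies (keys are space-joined terms)."""
    scores = {}
    for a in freq:
        length = len(a.split())
        # Longer candidates containing `a` as a nested sub-term
        nesting = [b for b in freq if b != a and f" {a} " in f" {b} "]
        f_a = freq[a]
        if nesting:
            # Discount the frequency `a` inherits from its nesting terms
            f_a -= sum(freq[b] for b in nesting) / len(nesting)
        scores[a] = math.log2(length) * f_a if length > 1 else f_a
    return scores

scores = c_value({"neural network": 5, "deep neural network": 3})
```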
