KWJA: Kyoto-Waseda Japanese Analyzer[^1]

[^1]: Pronunciation: /kuʒa/

[Paper (ja)] [Paper (en)] [Slides]

KWJA is an integrated Japanese text analyzer based on foundation models. KWJA performs many text analysis tasks, including:

Typo correction
Sentence segmentation
Word segmentation
Word normalization
Morphological analysis
Word feature tagging
Base phrase feature tagging
NER (Named Entity Recognition)
Dependency parsing
Predicate-argument structure (PAS) analysis
Bridging reference resolution
Coreference resolution
Discourse relation analysis

Requirements

Python: 3.9+
Dependencies: See pyproject.toml.
GPUs with CUDA (optional)
GPUs with MPS (optional)

Getting Started

Install KWJA with pip:

$ pip install kwja

Perform language analysis with the kwja command (the result is in the KNP format):

# Analyze a text
$ kwja --text "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"

# Analyze text files and write the result to a file
$ kwja --filename path/to/file1.txt --filename path/to/file2.txt > path/to/analyzed.knp

# Analyze texts interactively
$ kwja
Please end your input with a new line and type "EOD"
KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。
EOD

If you use Windows with PowerShell, you need to set the PYTHONUTF8 environment variable to 1:

> $env:PYTHONUTF8 = "1"
> kwja ...
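
When kwja is launched from a Python script instead of an interactive PowerShell session, the same variable can be set for the child process only. The following is a minimal sketch using the standard library; the helper name utf8_env is hypothetical and not part of kwja.

```python
import os


def utf8_env(base_env=None):
    """Return a copy of the environment with Python's UTF-8 mode enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # Same effect as `$env:PYTHONUTF8 = "1"` in PowerShell, but scoped to the copy.
    env["PYTHONUTF8"] = "1"
    return env


# Example (not executed here): pass the environment to a child process.
# import subprocess
# subprocess.run(["kwja", "--text", "こんにちは"], env=utf8_env(), check=True)
```

Because the function copies the environment, the parent process keeps its own settings.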

The output is in the KNP format, which looks like the following:

# S-ID:202210010000-0-0 kwja:1.0.2
* 2D
+ 5D <rel type="=" target="ツール" sid="202210011918-0-0" id="5"/><体言><NE:ARTIFACT:KWJA>
KWJA ＫＷＪＡ KWJA 名詞 6 固有名詞 3 * 0 * 0 <基本句-主辞>
は は は 助詞 9 副助詞 2 * 0 * 0 "代表表記:は/は" <代表表記:は/は>
* 2D
+ 2D <体言>
日本 にほん 日本 名詞 6 地名 4 * 0 * 0 "代表表記:日本/にほん 地名:国" <代表表記:日本/にほん><地名:国><基本句-主辞>
+ 4D <体言><係:ノ格>
語 ご 語 名詞 6 普通名詞 1 * 0 * 0 "代表表記:語/ご 漢字読み:音 カテゴリ:抽象物" <代表表記:語/ご><漢字読み:音><カテゴリ:抽象物><基本句-主辞>
の の の 助詞 9 接続助詞 3 * 0 * 0 "代表表記:の/の" <代表表記:の/の>
...
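
To make the layout of this output easier to read, here is a small sketch that classifies each line by its leading marker and splits a morpheme line into its space-separated fields. The field names are labels chosen for this illustration, and the sketch is deliberately simplified (for instance, it would misclassify a morpheme whose surface form is itself "*", "+", or "#"). For real use, a dedicated parser such as rhoknp is the safer choice.

```python
# Labels for the 11 leading fields of a morpheme line (names are illustrative).
MORPHEME_FIELDS = (
    "surface", "reading", "lemma",
    "pos", "pos_id", "subpos", "subpos_id",
    "conjtype", "conjtype_id", "conjform", "conjform_id",
)


def classify_knp_line(line):
    """Return the kind of a single line of KNP-format output."""
    if line.startswith("# "):
        return "comment"          # sentence header, e.g. "# S-ID:..."
    if line.startswith("* "):
        return "phrase"           # phrase (bunsetsu) line, e.g. "* 2D"
    if line.startswith("+ "):
        return "base_phrase"      # base phrase line, e.g. "+ 5D <体言>"
    if line.strip() == "EOS":
        return "end_of_sentence"  # sentence terminator
    return "morpheme"


def parse_morpheme(line):
    """Split a morpheme line into its 11 leading fields plus the remainder."""
    parts = line.split(" ", len(MORPHEME_FIELDS))
    fields = dict(zip(MORPHEME_FIELDS, parts))
    # Everything after the 11th field (semantic info and <feature> tags) is kept as-is.
    has_rest = len(parts) > len(MORPHEME_FIELDS)
    fields["rest"] = parts[len(MORPHEME_FIELDS)] if has_rest else ""
    return fields
```

In the sample above, the "* 2D" and "+ 5D" lines carry the index of the head that the phrase or base phrase depends on, followed by the dependency type.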

Here are the options for the kwja command:

--text: Text to be analyzed.
--filename: Path to a text file to be analyzed. You can specify this option multiple times.
--model-size: Model size to be used. Specify one of tiny, base (default), or large.
--device: Device to be used. Specify cpu, cuda, or mps. If not specified, the device is automatically selected.
--typo-batch-size: Batch size for typo module.
--char-batch-size: Batch size for character module.
--seq2seq-batch-size: Batch size for seq2seq module.
--word-batch-size: Batch size for word module.
--tasks: Tasks to be performed. Specify one or more of the following values separated by commas:
- typo: Typo correction
- char: Sentence segmentation, Word segmentation, and Word normalization
- seq2seq: Word segmentation, Word normalization, Reading prediction, Lemmatization, and Canonicalization
- word: Morphological analysis, Named entity recognition, Word feature tagging, Dependency parsing, PAS analysis, Bridging reference resolution, and Coreference resolution

--config-file: Path to a custom configuration file.
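
The options above can also be assembled programmatically. The sketch below builds an argument list for the kwja command and checks values against the choices listed above. The function names build_kwja_command and run_kwja are hypothetical helpers for this example, and the sketch does not check whether a given combination of tasks is meaningful.

```python
import subprocess

VALID_MODEL_SIZES = ("tiny", "base", "large")
VALID_DEVICES = ("cpu", "cuda", "mps")
VALID_TASKS = ("typo", "char", "seq2seq", "word")


def build_kwja_command(text=None, filenames=(), model_size=None, device=None, tasks=()):
    """Build an argument list for the kwja CLI from a subset of its options."""
    if model_size is not None and model_size not in VALID_MODEL_SIZES:
        raise ValueError(f"unknown model size: {model_size!r}")
    if device is not None and device not in VALID_DEVICES:
        raise ValueError(f"unknown device: {device!r}")
    unknown = [task for task in tasks if task not in VALID_TASKS]
    if unknown:
        raise ValueError(f"unknown tasks: {unknown!r}")

    command = ["kwja"]
    if text is not None:
        command += ["--text", text]
    for filename in filenames:  # --filename may be given multiple times
        command += ["--filename", str(filename)]
    if model_size is not None:
        command += ["--model-size", model_size]
    if device is not None:
        command += ["--device", device]
    if tasks:
        command += ["--tasks", ",".join(tasks)]  # comma-separated task names
    return command


def run_kwja(**options):
    """Run kwja and return its standard output (KNP format) as a string."""
    result = subprocess.run(
        build_kwja_command(**options),
        capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout
```

For example, build_kwja_command(text="こんにちは", model_size="tiny", tasks=("char", "word")) yields the arguments for a run restricted to the character and word modules with the tiny model.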

You can read a KNP format file with rhoknp.

from rhoknp import Document

# Read the file as UTF-8 regardless of the platform's default encoding
with open("analyzed.knp", encoding="utf-8") as f:
    parsed_document = Document.from_knp(f.read())
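
Once a document is loaded, its contents can be walked sentence by sentence. The sketch below collects the surface form, reading, and part of speech of each morpheme. It assumes the object exposes sentences, morphemes, text, reading, and pos attributes, which is my understanding of rhoknp's interface; please check the rhoknp documentation before relying on these names. The function name morpheme_table is hypothetical.

```python
def morpheme_table(document):
    """Collect (surface, reading, part of speech) for every morpheme in a document."""
    rows = []
    for sentence in document.sentences:
        for morpheme in sentence.morphemes:
            rows.append((morpheme.text, morpheme.reading, morpheme.pos))
    return rows
```

With the snippet above, this would be called as morpheme_table(parsed_document).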

For more details about the KNP format, see Reference.

Usage from Python

Make sure the kwja command is on your PATH:

$ which kwja
/path/to/kwja

Install rhoknp:

$ pip install rhoknp

Perform language analysis with a KWJA instance:

from rhoknp import KWJA
kwja = KWJA()
analyzed_document = kwja.apply(
    "KWJAは日本語の統合解析ツールです。汎用言語モデルを利用し、様々な言語解析を統一的な方法で解いています。"
)
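
Since this usage depends on the kwja command being on your PATH, it can help to verify that from Python before creating the KWJA instance. A minimal sketch with the standard library follows; the function name require_executable is hypothetical.

```python
import shutil


def require_executable(name="kwja"):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)  # Python equivalent of `which kwja`
    if path is None:
        raise FileNotFoundError(
            f"{name!r} was not found on PATH; install it with `pip install kwja`"
        )
    return path
```

Calling require_executable() first turns a missing installation into an explicit error message instead of a failure later in the analysis.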

Configuration

kwja can be configured with a configuration file that sets the default options. See Config file example for details.

Config file location

On non-Windows systems, kwja follows the XDG Base Directory Specification for the location of its configuration file. It looks for a file named config.yaml in a configuration directory named kwja, so for most people it is enough to put the config file at ~/.config/kwja/config.yaml. You can also point to a configuration file in a non-standard location with the environment variable KWJA_CONFIG_FILE or the command-line option --config-file.
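
The lookup described above can be sketched as follows for non-Windows systems. The function name resolve_config_path is hypothetical, and the precedence shown (command-line option first, then KWJA_CONFIG_FILE, then the XDG location) is an assumption for illustration; the text above does not state which of the two overrides wins when both are given.

```python
import os
from pathlib import Path


def resolve_config_path(cli_option=None, env=None):
    """Return the config file path kwja would plausibly use (assumed precedence)."""
    env = os.environ if env is None else env
    if cli_option:                      # value of --config-file, if given
        return Path(cli_option)
    if env.get("KWJA_CONFIG_FILE"):     # environment variable override
        return Path(env["KWJA_CONFIG_FILE"])
    # XDG Base Directory Specification: fall back to ~/.config when
    # XDG_CONFIG_HOME is unset or empty.
    xdg_home = env.get("XDG_CONFIG_HOME")
    base = Path(xdg_home) if xdg_home else Path.home() / ".config"
    return base / "kwja" / "config.yaml"
```

With no overrides and no XDG_CONFIG_HOME set, this resolves to ~/.config/kwja/config.yaml, matching the default location mentioned above.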

Config file example

model_size: base
device: cpu
num_workers: 0
torch_compile: false
typo_batch_size: 1
char_batch_size: 1
seq2seq_batch_size: 1
word_batch_size: 1
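
If you generate this file from a script, a small typed structure can catch invalid values before they reach kwja. The sketch below mirrors the keys of the example above. The class name KwjaConfig is hypothetical, the defaults are simply the values from the example (not necessarily kwja's built-in defaults), and the rule that batch sizes must be at least 1 is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass
class KwjaConfig:
    model_size: str = "base"
    device: str = "cpu"
    num_workers: int = 0
    torch_compile: bool = False
    typo_batch_size: int = 1
    char_batch_size: int = 1
    seq2seq_batch_size: int = 1
    word_batch_size: int = 1

    def __post_init__(self):
        # Allowed values follow the --model-size and --device option descriptions.
        if self.model_size not in ("tiny", "base", "large"):
            raise ValueError(f"unknown model_size: {self.model_size!r}")
        if self.device not in ("cpu", "cuda", "mps"):
            raise ValueError(f"unknown device: {self.device!r}")
        for name in ("typo_batch_size", "char_batch_size",
                     "seq2seq_batch_size", "word_batch_size"):
            if getattr(self, name) < 1:  # assumed constraint
                raise ValueError(f"{name} must be at least 1")

    def to_yaml(self):
        """Serialize as flat `key: value` lines, like the example above."""
        lines = []
        for key, value in asdict(self).items():
            if isinstance(value, bool):
                value = "true" if value else "false"  # YAML booleans are lowercase
            lines.append(f"{key}: {value}")
        return "\n".join(lines) + "\n"
```

KwjaConfig().to_yaml() reproduces the example file, and KwjaConfig(model_size="large", device="cuda").to_yaml() gives a variant for GPU use.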

Performance Table

- typo, character, and word modules
  - The performance on each task except typo correction and discourse relation analysis is the mean over all the corpora (KC, KWDLC, Fuman, and WAC) and over three runs with different random seeds.
  - We set the learning rate of RoBERTa<sub>LARGE</sub> (word) to 2e-5 because we failed to fine-tune it with a higher learning rate. Other hyperparameters are the same as those described in configs, which are tuned for DeBERTa<sub>BASE</sub>.
- seq2seq module
  - The performance on each task is the mean over all the corpora (KC, KWDLC, Fuman, and WAC).
    - * denotes results of a single run.
  - Scores are calculated using a separate script from the one used for the character and word modules.

<table>
  <thead>
    <tr>
      <th rowspan="2" colspan="2">Task</th>
      <th colspan="6">Model</th>
    </tr>
    <tr>
      <th>v1.x base<br>(<a href="https://huggingface.co/ku-nlp/roberta-base-japanese-char-wwm">char</a>, <a href="https://huggingface.co/nlp-waseda/roberta-base-japanese">word</a>)</th>
      <th>v2.x base<br>(<a href="https://huggingface.co/ku-nlp/deberta-v2-base-japanese-char-wwm">char</a>, <a href="https://huggingface.co/ku-nlp/deberta-v2-base-japanese">word</a> / <a href="https://huggingface.co/retrieva-jp/t5-base-long">seq2seq</a>)</th>
      <th>v1.x large<br>(<a href="https://huggingface.co/ku-nlp/roberta-large-japanese-char-wwm">char</a>, <a href="https://huggingface.co/nlp-waseda/roberta-large-japanese-seq512">word</a>)</th>
      <th>v2.x large<br>(<a href="https://huggingface.co/ku-nlp/deberta-v2-large-japanese-char-wwm">char</a>, <a href="https://huggingface.co/ku-nlp/deberta-v2-large-japanese">word</a> / <a href="https://huggingface.co/retrieva-jp/t5-large-long">seq2seq</a>)</th>
    </tr>
  </thead>
  <tbody>
    <tr> <th colspan="2">Typo Correction</th> <td>79.0</td> <td>76.7</td> <td>80.8</td> <td>83.1</td> </tr>
    <tr> <th colspan="2">Sentence Segmentation</th> <td>-</td> <td>98.4</td> <td>-</td> <td>98.6</td> </tr>
    <tr> <th colspan="2">Word Segmentation</th> <td>98.5</td> <td>98.1 / 98.2*</td> <td>98.7</td> <td>98.4 / 98.4*</td> </tr>
    <tr> <th colspan="2">Word Normalization</th> <td>44.0</td> <td>15.3</td> <td>39.8</td> <td>48.6</td> </tr>
    <tr> <th rowspan="7">Morphological Analysis</th> <th>POS</th> <td>99.3</td> <td>99.4</td> <td>99.3</td> <td>99.4</td> </tr>
    <tr> <th>sub-POS</th> <td>98.1</td> <td>98.5</td> <td>98.2</td> <td>98.5</td> </tr>
    <tr> <th>conjtype</th> <td>99.4</td> <td>99.6</td> <td>99.2</td> <td>99.6</td> </tr>
    <tr> <th>conjform</th> <td>99.5</td> <td>99.7</td> <td>99.4</td> <td>99.7</td> </tr>
    <tr> <th>reading</th> <td>95.5</td> <td>95.4 / 96.2*</td> <td>90.8</td> <td>95.6 / 96.8*</td> </tr>
    <tr> <th>lemma</th> <td>-</td> <td>- / 97.8*</td> <td>-</td> <td>- / 98.1*</td> </tr>
    <tr> <th>canon</th> <td>-</td> <td>- / 95.2*</td> <td>-</td> <td>- / 95.9*</td> </tr>
    <tr> <th colspan="2">Named Entity Recognition</th> <td>83.0</td> <td>84.6</td> <td>82.1</td> <td>85.9</td> </tr>
    <tr> <th rowspan="2">Linguistic Feature Tagging</th> <th>word</th> <td>98.3</td> <td>98.6</td> <td>98.5</td> <td>98.6</td> </tr>
    <tr> <th>base phrase</th> <td>86.6</td> <td>93.6</td> <td>86.4</td> <td>93.4</td> </tr>
    <tr> <th colspan="2">Dependency Parsing</th> <td>92.9</td> <td>93.5</td> <td>93.8</td> <td>93.6</td> </tr>
  </tbody>
</table>