GiNZA NLP Library
An Open Source Japanese NLP Library, based on Universal Dependencies
Please read the Important changes before you upgrade GiNZA.
License
GiNZA NLP Library and the GiNZA Japanese Universal Dependencies Models are distributed under the MIT License. You must agree to and follow the MIT License to use GiNZA NLP Library and the GiNZA Japanese Universal Dependencies Models.
Explosion / spaCy
spaCy is the key framework of GiNZA.
Works Applications Enterprise / Sudachi/SudachiPy - SudachiDict - chiVe
SudachiPy provides high accuracy for tokenization and POS tagging.
Sudachi LICENSE PAGE, SudachiPy LICENSE PAGE, SudachiDict LEGAL PAGE, chiVe LICENSE PAGE
Hugging Face / transformers
The GiNZA v5 Transformers model (ja_ginza_electra) is trained by using Hugging Face Transformers as a framework for pretrained models.
Training Datasets
UD Japanese BCCWJ r2.8
The parsing model of GiNZA v5 is trained on a part of UD Japanese BCCWJ r2.8 (Omura and Asahara:2018). This model is developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.
GSK2014-A (2019) BCCWJ edition
The named entity recognition model of GiNZA v5 is trained on a part of GSK2014-A (2019) BCCWJ edition (Hashimoto, Inui, and Murakami:2008). We use two named entity label systems: Sekine's Extended Named Entity Hierarchy and an extended version of OntoNotes 5. This model is developed by the National Institute for Japanese Language and Linguistics and Megagon Labs.
mC4
The GiNZA v5 Transformers model (ja_ginza_electra) is trained by using transformers-ud-japanese-electra-base-discriminator which is pretrained on more than 200 million Japanese sentences extracted from mC4.
Contains information from mC4 which is made available under the ODC Attribution License.
@article{2019t5,
author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
journal = {arXiv e-prints},
year = {2019},
archivePrefix = {arXiv},
eprint = {1910.10683},
}
Runtime Environment
This project is developed with Python >= 3.8 and uses pip for installation. We do not recommend using an Anaconda environment, because the pip install step may not work properly.
Please also see the Development Environment section below.
Runtime set up
1. Install GiNZA NLP Library with Transformer-based Model
Uninstall the previous versions of the ginza and ja_ginza_electra packages:
$ pip uninstall ginza ja_ginza_electra
Then, install the latest version of ginza and ja_ginza_electra:
$ pip install -U ginza ja_ginza_electra
The ja_ginza_electra package does not include pytorch_model.bin because of PyPI's archive size limit.
This large model file is downloaded automatically on the first run, and the locally cached file is used for subsequent runs.
If you need to install ja_ginza_electra together with pytorch_model.bin at install time, you can specify the direct link to the GitHub release archive as follows:
$ pip install -U ginza https://github.com/megagonlabs/ginza/releases/download/latest/ja_ginza_electra-latest-with-model.tar.gz
If you want to accelerate the transformers-based models with CUDA-capable GPUs, install spacy with the matching CUDA extra, for example:
$ pip install -U "spacy[cuda117]"
You also need to install a version of PyTorch that is consistent with that CUDA version.
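For example, with CUDA 11.7 the matching PyTorch build is typically installed from the official PyTorch wheel index; the exact index URL and CUDA tag depend on your driver and PyTorch version, so treat this as an illustrative command only:
$ pip install torch --index-url https://download.pytorch.org/whl/cu117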
2. Install GiNZA NLP Library with Standard Model
Uninstall the previous versions of the ginza and ja_ginza packages:
$ pip uninstall ginza ja_ginza
Then, install the latest version of ginza and ja_ginza:
$ pip install -U ginza ja_ginza
When using Apple Silicon such as M1 or M2, you can accelerate the analysis process by installing thinc-apple-ops:
$ pip install torch thinc-apple-ops
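Once either model is installed, you can verify the setup from Python by loading it with spaCy. A minimal sketch, assuming the standard ja_ginza model; replace the name with ja_ginza_electra if you installed the Transformers model:

import spacy

# Load the installed GiNZA model; ja_ginza_electra downloads
# pytorch_model.bin on the first load if it is not cached yet.
nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
print([token.text for token in doc])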
Execute ginza command
Run the ginza command from the console, then input some Japanese text.
After pressing the Enter key, you will get the parsed results in CoNLL-U syntactic annotation format.
$ ginza
銀座でランチをご一緒しましょう。
# text = 銀座でランチをご一緒しましょう。
1 銀座 銀座 PROPN 名詞-固有名詞-地名-一般 _ 6 nmod _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City|ClauseHead=6
2 で で ADP 助詞-格助詞 _ 1 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ|ClauseHead=6
3 ランチ ランチ NOUN 名詞-普通名詞-一般 _ 6 obj _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ|ClauseHead=6
4 を を ADP 助詞-格助詞 _ 3 case _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ|ClauseHead=6
5 ご ご NOUN 接頭辞 _ 6 compound _ SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|NP_B|Reading=ゴ|ClauseHead=6
6 一緒 一緒 NOUN 名詞-普通名詞-サ変可能 _ 0 root _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|NP_I|Reading=イッショ|ClauseHead=6
7 し する AUX 動詞-非自立可能 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ|ClauseHead=6
8 ましょう ます AUX 助動詞 _ 6 aux _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ|ClauseHead=6
9 。 。 PUNCT 補助記号-句点 _ 6 punct _ SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。|ClauseHead=6
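The same analysis is available from the Python API; the CoNLL-U columns above map onto standard spaCy token attributes. A minimal sketch, assuming the ja_ginza model is installed (ja_ginza_electra works the same way):

import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
for sent in doc.sents:
    for token in sent:
        # token.i is 0-based over the whole doc; CoNLL-U IDs above are 1-based.
        # In CoNLL-U the root's head is 0, while spaCy points the root at itself.
        head = 0 if token.head.i == token.i else token.head.i + 1
        print(token.i + 1, token.orth_, token.lemma_, token.pos_,
              token.tag_, token.dep_, head, sep="\t")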
The ginzame command provides MeCab-like tokenization.
Its output format is almost the same as mecab's, but the last pronunciation field is always '*'.
$ ginzame
銀座でランチをご一緒しましょう。
銀座 名詞,固有名詞,地名,一般,*,*,銀座,ギンザ,*
で 助詞,格助詞,*,*,*,*,で,デ,*
ランチ 名詞,普通名詞,一般,*,*,*,ランチ,ランチ,*
を 助詞,格助詞,*,*,*,*,を,ヲ,*
ご 接頭辞,*,*,*,*,*,御,ゴ,*
一緒 名詞,普通名詞,サ変可能,*,*,*,一緒,イッショ,*
し 動詞,非自立可能,*,*,サ行変格,連用形-一般,為る,シ,*
ましょう 助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,*
。 補助記号,句点,*,*,*,*,。,。,*
EOS
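A rough approximation of this MeCab-style listing can also be produced from the Python API. This is only a sketch built from generic token attributes; the real ginzame command relies on SudachiPy's own fields, so columns such as the inflection and pronunciation fields will differ:

import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
for token in doc:
    # Surface form, UniDic-style POS (tag_ uses "-" separators), and lemma.
    print(f"{token.orth_}\t{token.tag_.replace('-', ',')},{token.lemma_}")
print("EOS")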
spaCy's JSON format is available by specifying -f 3 or -f json with the ginza command.
$ ginza -f json
銀座でランチをご一緒しましょう。
[
{
"paragraphs": [
{
"raw": "銀座でランチをご一緒しましょう。",
"sentences": [
{
"tokens": [
{"id": 1, "orth": "銀座", "tag": "名詞-固有名詞-地名-一般", "pos": "PROPN", "lemma": "銀座", "head": 5, "dep": "obl", "ner": "B-City"},
{"id": 2, "orth": "で", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "で", "head": -1, "dep": "case", "ner": "O"},
{"id": 3, "orth": "ランチ", "tag": "名詞-普通名詞-一般", "pos": "NOUN", "lemma": "ランチ", "head": 3, "dep": "obj", "ner": "O"},
{"id": 4, "orth": "を", "tag": "助詞-格助詞", "pos": "ADP", "lemma": "を", "head": -1, "dep": "case", "ner": "O"},
{"id": 5, "orth": "ご", "tag": "接頭辞", "pos": "NOUN", "lemma": "ご", "head": 1, "dep": "compound", "ner": "O"},
{"id": 6, "orth": "一緒", "tag": "名詞-普通名詞-サ変可能", "pos": "VERB", "lemma": "一緒", "head": 0, "dep": "ROOT", "ner": "O"},
{"id": 7, "orth": "し", "tag": "動詞-非自立可能", "pos": "AUX", "lemma": "する", "head": -1, "dep": "advcl", "ner": "O"},
{"id": 8, "orth": "ましょう", "tag": "助動詞", "pos": "AUX", "lemma": "ます", "head": -2, "dep": "aux", "ner": "O"},
{"id": 9, "orth": "。", "tag": "補助記号-句点", "pos": "PUNCT", "lemma": "。", "head": -3, "dep": "punct", "ner": "O"}
]
}
]
}
]
}
]
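Because this output is plain JSON, it is easy to post-process from a script. A minimal sketch that pipes text through the ginza command and reads the result back with the standard library, assuming ginza is on PATH and that the whole output is a single JSON array as shown above:

import json
import subprocess

text = "銀座でランチをご一緒しましょう。"
# Run the ginza CLI in JSON mode and capture its stdout.
result = subprocess.run(
    ["ginza", "-f", "json"],
    input=text,
    capture_output=True,
    text=True,
    check=True,
)
docs = json.loads(result.stdout)
for doc in docs:
    for paragraph in doc["paragraphs"]:
        for sentence in paragraph["sentences"]:
            for token in sentence["tokens"]:
                print(token["id"], token["orth"], token["pos"], token["dep"])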
If you want cabocha -f1 (lattice style) like output, add the -f 1 or -f cabocha option to the ginza command.
This option's format is almost the same as cabocha -f1, but the func_index field (after the slash) is slightly different.
Our func_index field indicates the boundary where the content words (自立語) end in each bunsetsu (文節), and the function words (機能語) may start from there.
The functional token filter also differs slightly between cabocha -f1 and ginza -f cabocha.
$ ginza -f cabocha
銀座でランチをご一緒しましょう。
* 0 2D 0/1 0.000000
銀座 名詞,固有名詞,地名,一般,,銀座,ギンザ,* B-City
で 助詞,格助詞,*,*,,で,デ,* O
* 1 2D 0/1 0.000000
ランチ 名詞,普通名詞,一般,*,,ランチ,ランチ,* O
を 助詞,格助詞,*,*,,を,ヲ,* O
* 2 -1D 0/2 0.000000
ご 接頭辞,*,*,*,,ご,ゴ,* O
一緒 名詞,普通名詞,サ変可能,*,,一緒,イッショ,* O
し 動詞,非自立可能,*,*,サ行変格,連用形-一般,する,シ,* O
ましょう 助動詞,*,*,*,助動詞-マス,意志推量形,ます,マショウ,* O
。 補助記号,句点,*,*,,。,。,* O
EOS
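The bunsetsu structure behind this lattice output can also be inspected from Python. A minimal sketch, assuming the bunsetu_spans and bunsetu_phrase_spans helpers exported by recent versions of the ginza package are available in your installation:

import spacy
from ginza import bunsetu_spans, bunsetu_phrase_spans

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
for span in bunsetu_spans(doc):
    # Each span covers one bunsetsu (文節).
    print(span.text)
# Phrase spans cover the content-word part of each bunsetsu.
print([span.text for span in bunsetu_phrase_spans(doc)])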