Sudachi

日本語 README

Sudachi is a Japanese morphological analyzer. Morphological analysis consists mainly of the following tasks.

Segmentation
Part-of-speech tagging
Normalization

Tutorial

For a tutorial on installation, please refer to the tutorial page.

For a tutorial on the plugin, please refer to the plugin tutorial page.

For information on building Sudachi from source or on development, see the Development page.

Features

Sudachi has the following features.

Multiple-length segmentation
- You can change the segmentation mode
- Extract morphemes and named entities at once
Large lexicon
- Based on UniDic and NEologd
Plugins
- You can change the processing behavior
Works closely with the synonym dictionary
- We will release the synonym dictionary at a later date

Dictionaries

Sudachi has three types of dictionaries.

Small: includes only the vocabulary of UniDic
Core: includes basic vocabulary (default)
Full: includes miscellaneous proper nouns

Click here for pre-built dictionaries. For more details, see SudachiDict.

How to use the small / full dictionary

Run the command line tool with a configuration string:

$ java -jar sudachi-XX.jar -s '{"systemDict":"system_small.dic"}'

Use on the command line

$ java -jar sudachi-XX.jar [-r conf] [-s json] [-m mode] [-a] [-d] [-f] [-o output] [file...]

Options

-r conf specifies the setting file (overrides -s)
-s json additional settings (overrides -r)
-p directory root directory of resources
-m {A|B|C} specifies the mode of splitting
-a outputs the dictionary form, the reading form, the dictionary id, the synonym group id list, and OOV flag.
-d dumps debug output
-o file specifies output file (default: the standard output)
-t separate words with spaces
-ts separate words with spaces, and break line for each sentence
-f ignore errors
--systemDict file specify path to the system dictionary. Will override other settings.
--userDict file add a user dictionary. Will not override other settings, but add another user dictionary.
--format class use the provided class for formatting output instead of default configuration

Examples

$ echo 東京都へ行く | java -jar target/sudachi.jar
東京都  名詞,固有名詞,地名,一般,*,*     東京都
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -a
東京都  名詞,固有名詞,地名,一般,*,*     東京都  東京都  トウキョウト    0       []
へ      助詞,格助詞,*,*,*,*     へ      へ      ヘ      0       []
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く    行く    イク    0       []
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -m A
東京    名詞,固有名詞,地名,一般,*,*     東京
都      名詞,普通名詞,一般,*,*,*        都
へ      助詞,格助詞,*,*,*,*     へ
行く    動詞,非自立可能,*,*,五段-カ行,終止形-一般       行く
EOS

$ echo 東京都へ行く | java -jar target/sudachi.jar -t
東京都 へ 行く
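
The default output shown above can be consumed by other programs. The following is a minimal sketch of a parser for it, not part of Sudachi. It assumes that the columns are separated by tab characters (they are rendered as spaces above), that the first three columns are the surface, the part-of-speech fields and the normalized form, and that each sentence ends with an `EOS` line. The names `SudachiOutputParser` and `Token` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SudachiOutputParser {
    /** One output line: surface, part-of-speech fields, normalized form. */
    static final class Token {
        final String surface;
        final List<String> partOfSpeech;
        final String normalizedForm;

        Token(String surface, List<String> partOfSpeech, String normalizedForm) {
            this.surface = surface;
            this.partOfSpeech = partOfSpeech;
            this.normalizedForm = normalizedForm;
        }
    }

    /** Parses the lines of one sentence and stops at the EOS marker. */
    static List<Token> parseSentence(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            if (line.equals("EOS")) {
                break;
            }
            String[] cols = line.split("\t");
            if (cols.length < 3) {
                continue; // skip lines that do not fit the assumed layout
            }
            tokens.add(new Token(cols[0], Arrays.asList(cols[1].split(",")), cols[2]));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "東京都\t名詞,固有名詞,地名,一般,*,*\t東京都",
                "へ\t助詞,格助詞,*,*,*,*\tへ",
                "行く\t動詞,非自立可能,*,*,五段-カ行,終止形-一般\t行く",
                "EOS");
        for (Token t : parseSentence(lines)) {
            System.out.println(t.surface + " / " + t.partOfSpeech.get(0) + " / " + t.normalizedForm);
        }
    }
}
```

With the `-a` option the lines carry additional columns, which this sketch simply ignores.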

How to use the API

You can find details in the Javadoc.

To compile an application with the Sudachi API, declare a dependency on Sudachi in your Maven project.

<dependency>
  <groupId>com.worksap.nlp</groupId>
  <artifactId>sudachi</artifactId>
  <version>0.5.3</version>
</dependency>

The modes of splitting

Sudachi provides three modes of splitting. Mode A divides text into the shortest units, equivalent to the UniDic short unit. Mode B divides text into middle units. Mode C extracts named entities.

The following are examples using the core dictionary.

A：選挙/管理/委員/会
B：選挙/管理/委員会
C：選挙管理委員会

A：客室/乗務/員
B：客室/乗務員
C：客室乗務員

A：労働/者/協同/組合
B：労働者/協同/組合
C：労働者協同組合

A：機能/性/食品
B：機能性/食品
C：機能性食品

The following are examples using the full dictionary.

A：医薬/品/安全/管理/責任/者
B：医薬品/安全/管理/責任者
C：医薬品安全管理責任者

A：消費/者/安全/調査/委員/会
B：消費者/安全/調査/委員会
C：消費者安全調査委員会

A：さっぽろ/テレビ/塔
B：さっぽろ/テレビ塔
C：さっぽろテレビ塔

A：カンヌ/国際/映画/祭
B：カンヌ/国際/映画祭
C：カンヌ国際映画祭

In full-text search, using both A and B units can improve precision and recall.
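
To illustrate why indexing more than one unit length can help, the sketch below builds a tiny inverted index from the A and B segmentations listed above. The segmentations are hard-coded rather than produced by Sudachi, and the class `MultiGranularityIndex` and its methods are hypothetical names used only for this illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class MultiGranularityIndex {
    // Unit -> documents that contain the unit in at least one segmentation.
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes a document under every unit of every given "/"-separated segmentation. */
    void add(String document, String... segmentations) {
        for (String segmentation : segmentations) {
            for (String unit : segmentation.split("/")) {
                postings.computeIfAbsent(unit, k -> new TreeSet<>()).add(document);
            }
        }
    }

    /** Returns the documents indexed under the given unit. */
    Set<String> search(String unit) {
        return postings.getOrDefault(unit, Collections.emptySet());
    }

    public static void main(String[] args) {
        MultiGranularityIndex index = new MultiGranularityIndex();
        // A and B segmentations copied from the core dictionary examples.
        index.add("選挙管理委員会", "選挙/管理/委員/会", "選挙/管理/委員会");
        index.add("客室乗務員", "客室/乗務/員", "客室/乗務員");

        System.out.println(index.search("委員"));   // found through an A unit
        System.out.println(index.search("委員会")); // found through a B unit
    }
}
```

In this sketch a query for the short unit 委員 and a query for the middle unit 委員会 both find 選挙管理委員会, whereas an index built from only one of the two segmentations would miss one of them.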

Plugins

You can use or make plugins which modify the behavior of Sudachi.

| Type of Plugins   | Example                                      |
|:------------------|:---------------------------------------------|
| Modify the Inputs | Character normalization                      |
| Make OOVs         | Considering script styles                    |
| Connect Words     | Inhibition, Overwrite costs                  |
| Modify the Path   | Fix Person names, Equalization of splitting  |

Prepared Plugins

The following plugins are provided.

| Type of Plugins   | Plugin                           |                                     |
|:------------------|:---------------------------------|:------------------------------------|
| Modify the Inputs | Character normalization          | Full/half-width, Cases, Variants    |
|                   | Normalization of prolong symbols | Normalize "~", "ー"s                |
|                   | Remove yomigana                  | Remove yomigana in parentheses      |
| Make OOVs         | Make one character OOVs          | Use as the fallback                 |
|                   | MeCab compatible OOVs            |                                     |
| Connect Words     | Inhibition                       | Specified by part-of-speech         |
| Modify the Path   | Join Katakana OOVs               |                                     |
|                   | Join numerics                    |                                     |
|                   | Equalization of splitting*       | Smoothing of OOVs and non-OOVs      |
|                   | Normalize numerics               | Normalize Kanji numerics and scales |
|                   | Estimate person names*           |                                     |

* will be released at a later date.

Normalized Form

Sudachi normalizes the following variations.

Okurigana
- e.g. 打込む → 打ち込む
Script
- e.g. かつ丼 → カツ丼
Variant
- e.g. 附属 → 付属
Misspelling
- e.g. シュミレーション → シミュレーション
Contracted form
- e.g. ちゃあ → ては
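
One common use of the normalized form is to treat spelling variants as the same term, for example when matching search queries. The sketch below is only an illustration of that idea: it uses a hard-coded map filled with the example pairs above in place of Sudachi's dictionary, and the class and method names are hypothetical. In a real application the normalized form would come from the analyzer output.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NormalizedFormSketch {
    // Surface form -> normalized form, copied from the examples above.
    static final Map<String, String> EXAMPLES = new LinkedHashMap<>();
    static {
        EXAMPLES.put("打込む", "打ち込む");                 // okurigana
        EXAMPLES.put("かつ丼", "カツ丼");                   // script
        EXAMPLES.put("附属", "付属");                       // variant
        EXAMPLES.put("シュミレーション", "シミュレーション"); // misspelling
        EXAMPLES.put("ちゃあ", "ては");                     // contracted form
    }

    /** Returns the normalized form if one is known, otherwise the surface unchanged. */
    static String normalize(String surface) {
        return EXAMPLES.getOrDefault(surface, surface);
    }

    /** Two surfaces count as the same term when their normalized forms are equal. */
    static boolean sameTerm(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    public static void main(String[] args) {
        System.out.println(sameTerm("附属", "付属"));
        System.out.println(sameTerm("かつ丼", "カツ丼"));
        System.out.println(sameTerm("附属", "カツ丼"));
    }
}
```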

Character Normalization

DefaultInputTextPlugin normalizes an input text in the following order.

To lower case by Character.toLowerCase()
Unicode normalization by NFKC
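
The two steps above can be reproduced with the standard Java library. This is a minimal sketch of the order of operations only, assuming the whole string is lower-cased before NFKC is applied. It is not the code of DefaultInputTextPlugin, which also has to handle rewrite.def and keep track of offsets in the original text. The class name `CharNormalizationSketch` is hypothetical.

```java
import java.text.Normalizer;

class CharNormalizationSketch {
    /** Lower-cases every code point, then applies Unicode NFKC normalization. */
    static String normalize(String text) {
        StringBuilder lowered = new StringBuilder(text.length());
        // Step 1: lower case by Character.toLowerCase(), one code point at a time.
        text.codePoints().forEach(cp -> lowered.appendCodePoint(Character.toLowerCase(cp)));
        // Step 2: Unicode normalization by NFKC.
        return Normalizer.normalize(lowered, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        System.out.println(normalize("ＳＵＤＡＣＨＩ")); // full-width capitals
        System.out.println(normalize("ｶﾞ"));            // half-width katakana with voiced mark
    }
}
```

Here the full-width capitals ＳＵＤＡＣＨＩ become the ASCII string sudachi, and the two half-width characters ｶﾞ become the single character ガ.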

When rewrite.def contains the following descriptions, DefaultInputTextPlugin stops the above processing and applies them instead.

Ignore

# single code point: this character is skipped in character normalization
髙

Replace

# rewrite rule: <target> <replacement>
A' Ā
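
The two kinds of entries can be told apart by the number of fields on a line. The following is a minimal sketch of reading such a file, assuming that `#` starts a comment line and that fields are separated by whitespace. Sudachi's own parser may differ, and the class `RewriteDefSketch` with its fields is a hypothetical name used only here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RewriteDefSketch {
    /** Characters that are skipped in character normalization. */
    final Set<String> ignored = new HashSet<>();
    /** Rewrite rules: target -> replacement. */
    final Map<String, String> replacements = new HashMap<>();

    /** Reads ignore entries (one field) and rewrite rules (two fields). */
    static RewriteDefSketch parse(List<String> lines) {
        RewriteDefSketch def = new RewriteDefSketch();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // blank line or comment (assumed syntax)
            }
            String[] cols = line.split("\\s+");
            if (cols.length == 1) {
                def.ignored.add(cols[0]);
            } else if (cols.length == 2) {
                def.replacements.put(cols[0], cols[1]);
            } else {
                throw new IllegalArgumentException("unexpected rule: " + line);
            }
        }
        return def;
    }

    public static void main(String[] args) {
        RewriteDefSketch def = parse(Arrays.asList(
                "# single code point: this character is skipped in character normalization",
                "髙",
                "# rewrite rule: <target> <replacement>",
                "A' Ā"));
        System.out.println("ignored: " + def.ignored);
        System.out.println("replacements: " + def.replacements);
    }
}
```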

If the number of characters increases as a result of character normalization (for example, NFKC expands the single character ㍿ into the four characters 株式会社), Sudachi may output morphemes whose length is 0 in the original input text.

User Dictionary

To create and use your own dictionaries, please refer to docs/user_dict.md.

Comparison with MeCab and Kuromoji

|                              | Sudachi | MeCab     | kuromoji   |
|:-----------------------------|:--------|:----------|:-----------|
| Multiple Segmentation        | Yes     | No        | Limited ^a |
| Normalization                | Yes     | No        | Limited ^b |
| Joining, Correction          | Yes     | No        | Limited ^b |
| Use multiple user dictionary | Yes     | Yes       | No         |
| Saving Memory                | Good ^c | Poor      | Good       |
| Accuracy                     | Good    | Good      | Good       |
| Speed                        | Good    | Excellent | Good       |

^a: approximation with n-best
^b: with Lucene filters
^c: memory sharing with multiple Java VMs

Future Releases

Speeding up
Releasing plugins
Improving the accuracy
Adding more split information
Adding more normalized forms
Fixing reading forms (pronunciation -> Furigana)
Coordinating segmentations with the synonym dictionary

Licenses

Sudachi

Sudachi by Works Applications Co., Ltd. is licensed under the Apache License, Version 2.0.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Logo

Sudachi Logo

This logo or a modified version may be used by anyone to refer to the morphological analyzer Sudachi, but does not indicate endorsement by Works Applications Co., Ltd.

Elasticsearch

We provide a plugin for Elasticsearch.

https://github.com/WorksApplications/elasticsearch-sudachi

Python / Rust

An implementation of Sudachi in Python and Rust

https://github.com/WorksApplications/sudachi.rs

Slack

We have a Slack workspace for developers and users to ask quest
