Sudachi
A Japanese Tokenizer for Business
<p align="center"><a href="https://nlp.worksap.co.jp/"><img width="70" src="./docs/Sudachi.png" alt="Sudachi logo"></a></p>
<p align="center"> <a href="https://github.com/WorksApplications/Sudachi/actions/workflows/build.yml"><img src="https://github.com/WorksApplications/Sudachi/actions/workflows/build.yml/badge.svg" alt="Build"></a> <a href="https://sonarcloud.io/dashboard/index/com.worksap.nlp.sudachi"><img src="https://sonarcloud.io/api/project_badges/measure?project=com.worksap.nlp.sudachi&metric=alert_status" alt="Quality Gate"></a> </p>

Sudachi is a Japanese morphological analyzer. Morphological analysis consists mainly of the following tasks.
- Segmentation
- Part-of-speech tagging
- Normalization
Tutorial
For a tutorial on installation, please refer to the tutorial page.
For a tutorial on the plugin, please refer to the plugin tutorial page.
For information on building Sudachi from source or on development, see the Development page.
Features
Sudachi has the following features.
- Multiple-length segmentation
  - You can change the segmentation mode
  - Extracts morphemes and named entities at once
- Large lexicon
  - Based on UniDic and NEologd
- Plugins
  - You can change the processing behavior
- Works closely with the synonym dictionary
  - We will release the synonym dictionary at a later date
Dictionaries
Sudachi has three types of dictionaries.
- Small: includes only the vocabulary of UniDic
- Core: includes basic vocabulary (default)
- Full: includes miscellaneous proper nouns
Click here for pre-built dictionaries. For more details, see SudachiDict.
How to use the small / full dictionary
Run the command-line tool with a configuration string that names the dictionary file:
$ java -jar sudachi-XX.jar -s '{"systemDict":"system_small.dic"}'
Use on the command line
$ java -jar sudachi-XX.jar [-r conf] [-s json] [-m mode] [-a] [-d] [-f] [-o output] [file...]
Options
- -r conf: specifies the setting file (overrides -s)
- -s json: additional settings (overrides -r)
- -p directory: root directory of resources
- -m {A|B|C}: specifies the mode of splitting
- -a: outputs the dictionary form, the reading form, the dictionary id, the synonym group id list, and the OOV flag
- -d: dumps the debug output
- -o file: specifies the output file (default: the standard output)
- -t: separates words with spaces
- -ts: separates words with spaces, and breaks the line after each sentence
- -f: ignores errors
- --systemDict file: specifies the path to the system dictionary. Overrides other settings.
- --userDict file: adds a user dictionary. Does not override other settings, but adds another user dictionary.
- --format class: uses the provided class for formatting output instead of the default configuration
Examples
$ echo 東京都へ行く | java -jar target/sudachi.jar
東京都 名詞,固有名詞,地名,一般,*,* 東京都
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
$ echo 東京都へ行く | java -jar target/sudachi.jar -a
東京都 名詞,固有名詞,地名,一般,*,* 東京都 東京都 トウキョウト 0 []
へ 助詞,格助詞,*,*,*,* へ へ ヘ 0 []
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く 行く イク 0 []
EOS
$ echo 東京都へ行く | java -jar target/sudachi.jar -m A
東京 名詞,固有名詞,地名,一般,*,* 東京
都 名詞,普通名詞,一般,*,*,* 都
へ 助詞,格助詞,*,*,*,* へ
行く 動詞,非自立可能,*,*,五段-カ行,終止形-一般 行く
EOS
$ echo 東京都へ行く | java -jar target/sudachi.jar -t
東京都 へ 行く
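When the -a output is consumed by another program, each line can be split into fields. The following is a minimal sketch and not part of Sudachi. It assumes that fields are separated by tabs (which may not be visible in the examples above), that they appear in the order shown for -a, and that the OOV flag is a trailing (OOV) field. The class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class SudachiLineParser {
    /** One parsed line of "-a" output. Field names are illustrative only. */
    static final class Row {
        final String surface;
        final List<String> partOfSpeech;
        final String normalizedForm;
        final String dictionaryForm;
        final String readingForm;
        final boolean oov;

        Row(String surface, List<String> partOfSpeech, String normalizedForm,
            String dictionaryForm, String readingForm, boolean oov) {
            this.surface = surface;
            this.partOfSpeech = partOfSpeech;
            this.normalizedForm = normalizedForm;
            this.dictionaryForm = dictionaryForm;
            this.readingForm = readingForm;
            this.oov = oov;
        }
    }

    /** Parses one output line; returns empty for the EOS sentence marker. */
    static Optional<Row> parse(String line) {
        if (line.equals("EOS")) {
            return Optional.empty();
        }
        // Assumption: fields are tab-separated.
        String[] f = line.split("\t", -1);
        if (f.length < 5) {
            throw new IllegalArgumentException("expected at least 5 fields: " + line);
        }
        // Assumption: an out-of-vocabulary word carries a trailing "(OOV)" field.
        boolean oov = f[f.length - 1].equals("(OOV)");
        List<String> pos = Arrays.asList(f[1].split(","));
        return Optional.of(new Row(f[0], pos, f[2], f[3], f[4], oov));
    }

    public static void main(String[] args) {
        String line = "東京都\t名詞,固有名詞,地名,一般,*,*\t東京都\t東京都\tトウキョウト\t0\t[]";
        Row row = parse(line).get();
        System.out.println(row.surface + " -> " + row.readingForm);
    }
}
```

Returning an empty Optional for EOS lets a caller use the marker as a sentence boundary without treating it as a word.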
How to use the API
You can find details in the Javadoc.
To compile an application with the Sudachi API, declare a dependency on Sudachi in your Maven project.
<dependency>
    <groupId>com.worksap.nlp</groupId>
    <artifactId>sudachi</artifactId>
    <version>0.5.3</version>
</dependency>
The modes of splitting
Sudachi provides three modes of splitting. In A mode, text is divided into the shortest units, equivalent to the UniDic short unit. In B mode, text is divided into middle-length units. In C mode, named entities are extracted.
The following are examples in the core dictionary.
A:選挙/管理/委員/会
B:選挙/管理/委員会
C:選挙管理委員会
A:客室/乗務/員
B:客室/乗務員
C:客室乗務員
A:労働/者/協同/組合
B:労働者/協同/組合
C:労働者協同組合
A:機能/性/食品
B:機能性/食品
C:機能性食品
The following are examples in the full dictionary.
A:医薬/品/安全/管理/責任/者
B:医薬品/安全/管理/責任者
C:医薬品安全管理責任者
A:消費/者/安全/調査/委員/会
B:消費者/安全/調査/委員会
C:消費者安全調査委員会
A:さっぽろ/テレビ/塔
B:さっぽろ/テレビ塔
C:さっぽろテレビ塔
A:カンヌ/国際/映画/祭
B:カンヌ/国際/映画祭
C:カンヌ国際映画祭
In full-text search, using both A and B units can improve precision and recall.
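One way to read this: if a compound is indexed under both its A and B units, a query for either 委員 or 委員会 can match a document containing 選挙管理委員会. The following is a minimal sketch with the segmentations hard-coded from the core dictionary example above. It does not call Sudachi, and the class and method names are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class MultiModeIndexSketch {
    /** Collects the distinct units from several segmentations of the same text. */
    static Set<String> indexTerms(List<List<String>> segmentations) {
        Set<String> terms = new LinkedHashSet<>();
        for (List<String> units : segmentations) {
            terms.addAll(units);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Segmentations copied from the A and B examples for 選挙管理委員会.
        List<String> modeA = List.of("選挙", "管理", "委員", "会");
        List<String> modeB = List.of("選挙", "管理", "委員会");

        Set<String> terms = indexTerms(List.of(modeA, modeB));
        System.out.println(terms); // [選挙, 管理, 委員, 会, 委員会]
    }
}
```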
Plugins
You can use or make plugins which modify the behavior of Sudachi.
| Type of Plugins   | Example                                      |
|:------------------|:---------------------------------------------|
| Modify the Inputs | Character normalization                      |
| Make OOVs         | Considering script styles                    |
| Connect Words     | Inhibition, Overwrite costs                  |
| Modify the Path   | Fix Person names, Equalization of splitting  |
Prepared Plugins
We provide the following plugins.
| Type of Plugins   | Plugin                           |                                     |
|:------------------|:---------------------------------|:------------------------------------|
| Modify the Inputs | Character normalization          | Full/half-width, Cases, Variants    |
|                   | Normalization of prolong symbols | Normalize "~", "ー"s                |
|                   | Remove yomigana                  | Remove yomigana in parentheses      |
| Make OOVs         | Make one character OOVs          | Use as the fallback                 |
|                   | MeCab compatible OOVs            |                                     |
| Connect Words     | Inhibition                       | Specified by part-of-speech         |
| Modify the Path   | Join Katakana OOVs               |                                     |
|                   | Join numerics                    |                                     |
|                   | Equalization of splitting*       | Smooth of OOVs and not OOVs         |
|                   | Normalize numerics               | Normalize Kanji numerics and scales |
|                   | Estimate person names*           |                                     |
* will be released at a later date.
Normalized Form
Sudachi normalizes the following variations.
- Okurigana
  - e.g. 打込む → 打ち込む
- Script
  - e.g. かつ丼 → カツ丼
- Variant
  - e.g. 附属 → 付属
- Misspelling
  - e.g. シュミレーション → シミュレーション
- Contracted form
  - e.g. ちゃあ → ては
Character Normalization
DefaultInputTextPlugin normalizes an input text in the following order.
- To lower case by Character.toLowerCase()
- Unicode normalization by NFKC
When rewrite.def contains the following kinds of entries, DefaultInputTextPlugin skips the above processing for the matching text and applies the following instead.
- Ignore
# single code point: this character is skipped in character normalization
髙
- Replace
# rewrite rule: <target> <replacement>
A' Ā
If the number of characters increases as a result of character normalization, Sudachi may output morphemes whose length is 0 in the original input text.
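The default order (lower-casing, then NFKC) and the two rule types can be approximated with the JDK alone. The sketch below is not the implementation of DefaultInputTextPlugin. It normalizes one code point at a time, so it does not compose sequences that span several code points, and the class name, method name, and in-memory rule tables are hypothetical stand-ins for rewrite.def. It also shows how a single input character can expand into several, which is the situation in which zero-length morphemes can arise.

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.Set;

class InputNormalizationSketch {
    /**
     * Applies replace rules first (longest match), leaves ignored code points
     * untouched, and otherwise lower-cases and then applies NFKC.
     */
    static String normalize(String input, Set<Integer> ignore, Map<String, String> replace) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            // Replace rule: pick the longest rule target that starts at position i.
            String target = null;
            for (String key : replace.keySet()) {
                if (input.startsWith(key, i) && (target == null || key.length() > target.length())) {
                    target = key;
                }
            }
            if (target != null && !target.isEmpty()) {
                out.append(replace.get(target));
                i += target.length();
                continue;
            }
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (ignore.contains(cp)) {
                // Ignore rule: copy the code point through unchanged.
                out.appendCodePoint(cp);
                continue;
            }
            // Default processing: lower case first, then NFKC.
            String lower = new String(Character.toChars(Character.toLowerCase(cp)));
            out.append(Normalizer.normalize(lower, Normalizer.Form.NFKC));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+FF21 U+FF22 are full-width "AB"; U+337F is the single character "㍿".
        String input = "\uFF21\uFF22\u337F";
        String result = normalize(input, Set.of(), Map.of());
        // 3 input characters become 6 output characters.
        System.out.println(result + " (" + input.length() + " -> " + result.length() + ")");
    }
}
```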
User Dictionary
To create and use your own dictionaries, please refer to docs/user_dict.md.
Comparison with MeCab and Kuromoji
|                              | Sudachi | MeCab     | kuromoji   |
|:-----------------------------|:--------|:----------|:-----------|
| Multiple Segmentation        | Yes     | No        | Limited ^a |
| Normalization                | Yes     | No        | Limited ^b |
| Joining, Correction          | Yes     | No        | Limited ^b |
| Use multiple user dictionary | Yes     | Yes       | No         |
| Saving Memory                | Good ^c | Poor      | Good       |
| Accuracy                     | Good    | Good      | Good       |
| Speed                        | Good    | Excellent | Good       |
- ^a: approximation with n-best
- ^b: with Lucene filters
- ^c: memory sharing with multiple Java VMs
Future Releases
- Speeding up
- Releasing plugins
- Improving the accuracy
- Adding more split information
- Adding more normalized forms
- Fixing reading forms (pronunciation -> Furigana)
- Coordinating segmentations with the synonym dictionary
Licenses
Sudachi
Sudachi by Works Applications Co., Ltd. is licensed under the Apache License, Version 2.0.
Copyright (c) 2017 Works Applications Co., Ltd.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Logo

This logo or a modified version may be used by anyone to refer to the morphological analyzer Sudachi, but does not indicate endorsement by Works Applications Co., Ltd.
Copyright (c) 2017 Works Applications Co., Ltd.
Elasticsearch
We provide a plugin for Elasticsearch.
- https://github.com/WorksApplications/elasticsearch-sudachi
Python / Rust
An implementation of Sudachi in Python and Rust
- https://github.com/WorksApplications/sudachi.rs
Slack
We have a Slack workspace for developers and users to ask quest
