<h1 align="center"> KSS: Korean String processing Suite </h1> <p align="center"> <a href="https://github.com/hyunwoongko/kss/releases"><img alt="GitHub release" src="https://img.shields.io/github/release/hyunwoongko/kss.svg"></a> <a href="https://github.com/hyunwoongko/kss/issues"><img alt="Issues" src="https://img.shields.io/github/issues/hyunwoongko/kss"></a> <a href="https://github.com/hyunwoongko/kss/actions"><img alt="Tests on Ubuntu" src="https://github.com/hyunwoongko/kss/actions/workflows/test_ubuntu.yaml/badge.svg"></a> <a href="https://github.com/hyunwoongko/kss/actions"><img alt="Tests on MacOS" src="https://github.com/hyunwoongko/kss/actions/workflows/test_macos.yaml/badge.svg"></a> <a href="https://github.com/hyunwoongko/kss/actions"><img alt="Tests on Windows" src="https://github.com/hyunwoongko/kss/actions/workflows/test_windows.yaml/badge.svg"></a> </p>

Kss is a Korean string processing suite that provides a range of functions for working with Korean text. It is designed to be simple to use and is suited to fields such as natural language processing, data preprocessing, and data analysis.

What's New:

April 27, 2024 Released Kss 6.0 Python.
March 31, 2024 Released Kss 5.0 Python.
December 19, 2022 Released Kss 4.0 Python.
May 5, 2022 Released Kss Flutter.
August 25, 2021 Released Kss Java.
August 18, 2021 Released Kss 3.0 Python.
December 21, 2020 Released Kss 2.0 Python.
August 16, 2019 Released Kss 1.0 C++.

Installation

⚠️ Note (2025-08-01):

The automatic installation of mecab and related Python packages has been removed from setup.py to prevent password prompts and privilege escalation during installation.

If you require mecab functionality, please install mecab and its Python bindings manually before installing or using KSS. See below for instructions.

Install Kss

Kss can be easily installed using the pip package manager.

pip install kss

Install Mecab (Optional)

Installing mecab or konlpy.tag.Mecab is optional, but Kss runs much faster with either one.

mecab (Linux/MacOS): https://github.com/hyunwoongko/python-mecab-kor
mecab (Windows): https://cleancode-ws.tistory.com/97
konlpy.tag.Mecab (Linux/MacOS): https://konlpy.org/en/latest/api/konlpy.tag/#mecab-class
konlpy.tag.Mecab (Windows): https://uwgdqo.tistory.com/363
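
Before picking a backend, it can help to check which of the optional analyzers is importable in the current environment. The following is a minimal sketch and not part of Kss: `find_installed_backends` is a hypothetical helper, and it assumes the packages linked above are importable under the top-level names `mecab` and `konlpy`.

```python
import importlib.util

# Assumed top-level import names for the optional packages linked above.
CANDIDATE_PACKAGES = ("mecab", "konlpy")


def find_installed_backends(candidates=CANDIDATE_PACKAGES):
    """Return the candidate package names that can be located for import."""
    return [
        name
        for name in candidates
        if importlib.util.find_spec(name) is not None
    ]


if __name__ == "__main__":
    found = find_installed_backends()
    print("Optional analyzers found:", found or "none")
```

Note that this only confirms the Python package can be located. A package that wraps a native mecab installation may still fail at run time if the underlying binary or dictionary is missing.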

Usage

1. Basic Usage

All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.

from kss import Kss

module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)

2. Available Modules

If you want to check the available modules, you can use the available() function.

from kss import Kss

Kss.available()

['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']
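
With this many modules, a mistyped name is easy to produce. The sketch below shows one way to suggest close matches from the list returned by available(). It is an illustration, not a Kss feature: `suggest_modules` is a hypothetical helper built on the standard library's `difflib`, and the `known` list is a short excerpt copied from the output above.

```python
import difflib


def suggest_modules(name, available, limit=3):
    """Return up to `limit` names from `available` that look close to `name`."""
    return difflib.get_close_matches(name, available, n=limit, cutoff=0.6)


# A few names copied from the output above; in practice pass Kss.available().
known = [
    "split_sentences",
    "split_morphemes",
    "summarize_sentences",
    "romanize",
    "normalize",
]
print(suggest_modules("split_sentence", known))
```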

3. Checking the usage of each module

If you want to check the usage of each module, you can use the help() function.

from kss import Kss

module = Kss("split_sentences")
module.help()

Split texts into sentences.

Args:
    text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
    backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct', 'fast' are supported
    num_workers (Union[int, str]): the number of multiprocessing workers
    strip (bool): strip all sentences or not
    return_morphemes (bool): whether to return morphemes or not
    ignores (List[str]): list of strings to ignore

Returns:
    Union[List[str], List[List[str]]]: outputs of sentence splitting

Examples:
    >>> from kss import Kss
    >>> split_sentences = Kss("split_sentences")
    >>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
    >>> split_sentences(text)
    ['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']

4. Multiprocessing

If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel. You can set the number of processes with the num_workers parameter. If num_workers is less than 2, Kss does not use multiprocessing.

from kss import Kss

module = Kss("MODULE_NAME")

# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
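
When the input size varies, you may want to derive num_workers from the input instead of hard-coding it. The sketch below is one possible policy, not something Kss does internally: `pick_num_workers` and its `cap` parameter are hypothetical names, and the rule (never more workers than inputs or cores) is an assumption about what is sensible.

```python
import os


def pick_num_workers(inputs, cap=None):
    """Choose a worker count: 1 for a single string, else at most one per input."""
    if isinstance(inputs, str):
        return 1  # a single string has nothing to parallelize
    cores = os.cpu_count() or 1
    limit = cores if cap is None else min(cores, cap)
    return max(1, min(len(inputs), limit))


texts = ["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2"]
num_workers = pick_num_workers(texts, cap=4)
# The value can then be passed on, for example:
# output = module(texts, num_workers=num_workers)  # module = Kss("MODULE_NAME")
```

A result of 1 corresponds to the "no multiprocessing" case described above.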

5. Backward Compatibility

Older versions of Kss exposed each module as a plain function. Kss still supports this style for backward compatibility.

from kss import split_sentences

output = split_sentences("YOUR_INPUT_STRING", **kwargs)

6. Alias of module names

Because Kss has so many modules, users may have difficulty remembering each module's name. Kss provides aliases for some modules to make them easier to use.

from kss import Kss

module_1 = Kss("split_morphemes")
module_2 = Kss("tokenize")
# For example, 'split_morphemes' module can be loaded by using the alias named 'tokenize'.

You can list every alias and the module it maps to by using the alias() function.

from kss import Kss

Kss.alias()

{'aug': 'augment', 'augmentation': 'augment', 'collocation': 'collocate', 'hangulization': 'hangulize', 'hangulisation': 'hangulize', 'hangulise': 'hangulize', 'hanja': 'hanja2hangul', 'hangul2jamo': 'h2j', 'hangul2hcj': 'h2hcj', 'jamo2hangul': 'j2h', 'jamo2hcj': 'j2hcj', 'hcj2hangul': 'hcj2h', 'hcj2jamo': 'hcj2j', 'josa': 'select_josa', 'keyword': 'extract_keywords', 'keywords': 'extract_keywords', 'morpheme': 'split_morphemes', 'morphemes': 'split_morphemes', 'annonymization': 'anonymize', 'news_cleaning': 'clean_news', 'news': 'clean_news', 'completed_form': 'is_completed_form', 'completed': 'is_completed_form', 'filter': 'filter_out', 'reduce_repeats': 'reduce_char_repeats', 'reduce_char': 'reduce_char_repeats', 'reduce_chars': 'reduce_char_repeats', 'reduce_emoticon': 'reduce_emoticon_repeats', 'reduce_emoticons': 'reduce_emoticon_repeats', 'reduce_emo': 'reduce_emoticon_repeats', 'remove_invisible': 'remove_invisible_chars', 'invisible_chars': 'remove_invisible_chars', 'invisible': 'remove_invisible_chars', 'normalization': 'normalize', 'normalisation': 'normalize', 'normalise': 'normalize', 'preprocessing': 'preprocess', 'prep': 'preprocess', 'romanization': 'romanize', 'romanisation': 'romanize', 'romanise': 'romanize', 'safety': 'is_unsafe', 'check_safety': 'is_unsafe', 'sentence': 'split_sentences', 'sentences': 'split_sentences', 'sent_split': 'split_sentences', 'sent_splits': 'split_sentences', 'sents_split': 'split_sentences', 'split_sent': 'split_sentences', 'split_sents': 'split_sentences', 'spacing': 'correct_spacing', 'space': 'correct_spacing', 'spaces': 'correct_spacing', 'summarization': 'summarize_sentences', 'summarize': 'summarize_sentences', 'summ': 'summarize_sentences', 'morph': 'split_morphemes', 'morphs': 'split_morphemes', 'tokenize': 'split_morphemes', 'tokenization': 'split_morphemes', 'split_morph': 'split_morphemes', 'split_morphs': 'split_morphemes', 'morph_split': 'split_morphemes', 'morph_splits': 'split_morphemes', 
'morphs_split': 'split_morphemes'}
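
The mapping above goes from alias to module name, which makes it hard to see all aliases of a given module at a glance. The sketch below inverts such a mapping. `group_aliases` is a hypothetical helper, not part of Kss, and `excerpt` is a small hand-copied subset of the output above.

```python
from collections import defaultdict


def group_aliases(alias_map):
    """Invert an alias -> module mapping into module -> sorted list of aliases."""
    grouped = defaultdict(list)
    for alias, module_name in alias_map.items():
        grouped[module_name].append(alias)
    return {name: sorted(aliases) for name, aliases in grouped.items()}


# A small excerpt of the mapping printed above; in practice pass Kss.alias().
excerpt = {
    "tokenize": "split_morphemes",
    "morph": "split_morphemes",
    "sentences": "split_sentences",
    "prep": "preprocess",
}
print(group_aliases(excerpt))
```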

Supported Modules

Kss supports the following modules. The sections below show simple usage for each one.

Because there are so many modules, I apologize for not being able to explain each one in detail.

<details> <summary>1. augment</summary>

This augments text using synonym replacement and optionally postprocesses the result by correcting josa. For this, Kss uses the Korean wordnet from KAIST.

Args:

text (Union[str, List[str], Tuple[str]]): single text or list of texts
replacement_ratio (float): ratio of words to be replaced
josa_correction (bool): whether to correct josa or not
num_workers (Union[int, str]): the number of multiprocessing workers
backend (str): morpheme analyzer backend. 'mecab', 'pecab' are supported
verbose (bool): whether to print verbose outputs or not

Returns:

Union[str, List[str]]: augmented text or list of augmented texts

Examples:

>>> from kss import Kss
>>> augment = Kss("augment")
>>> text = "앞서 지난해 11월, 보이저 1호는 명령을 수신하고 수행하는 데엔 문제가 없었지만 통신 장치에 문제가 생겨 과학·엔지니어링 데이터가 지구로 전송되지 않았던 바 있다. 당시 그들은 컴퓨터 시스템을 재시작하고 문제의 근본적인 원인을 파악하기 위해 명령을 보내려고 시도했고, 이달 1일 '포크'라는 명령을 보냈다."
>>> output = augment(text)
>>> print(output)
"앞서 지난해 11월, 보이저 1호는 명령을 수신하고 시행하는 데엔 문제가 없었지만 송신 장비에 문제가 생겨 과학·엔지니어링 데이터가 지구로 전송되지 않았던 바 있다. 당시 그들은 컴퓨터 시스템을 재시작하고 문제의 근본적인 원인을 파악하기 위해 명령을 보내려고 시도했고, 이달 1일 '포크'라는 명령을 보냈다."

References:

This was copied from KoEDA and modified by Kss
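
To make the idea of ratio-controlled synonym replacement concrete, here is a toy sketch. It is not the Kss or KoEDA implementation: it splits on whitespace instead of using a morpheme analyzer, it uses a tiny hand-written synonym table (`TOY_SYNONYMS`, hypothetical) instead of the KAIST wordnet, it does no josa correction, and its reading of the ratio (a fraction of all tokens, with a minimum of one) is an assumption.

```python
import random

# Hypothetical toy synonym table; Kss itself relies on the KAIST Korean wordnet.
TOY_SYNONYMS = {"수행하는": "시행하는", "통신": "송신"}


def toy_synonym_replace(text, replacement_ratio=0.3, seed=0):
    """Replace a ratio-bounded number of whitespace tokens with toy synonyms."""
    tokens = text.split()
    # Only tokens that appear in the toy table can be replaced.
    candidates = [i for i, tok in enumerate(tokens) if tok in TOY_SYNONYMS]
    # Budget: a fraction of all tokens (at least one), capped by the candidates.
    budget = min(len(candidates), max(1, int(len(tokens) * replacement_ratio)))
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    for i in rng.sample(candidates, budget):
        tokens[i] = TOY_SYNONYMS[tokens[i]]
    return " ".join(tokens)


sample = "명령을 수신하고 수행하는 데엔 문제가 없었지만 통신 장치에 문제가 생겨"
print(toy_synonym_replace(sample, replacement_ratio=1.0))
```

Because the toy version matches whole whitespace tokens, a word with an attached josa such as "장치에" is never replaced. A morpheme analyzer backend, as used by the real module, would address that.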

</details> <details> <summary>2. collocate</summary>

This returns collocation (연어) of given
