Kss
KSS: Korean String processing Suite
Kss is a Korean string processing suite that provides a range of functions for processing Korean strings. It is designed to be simple to use and is intended for fields such as natural language processing, data preprocessing, and data analysis.
What's New:
- April 27, 2024 Released Kss 6.0 Python.
- March 31, 2024 Released Kss 5.0 Python.
- December 19, 2022 Released Kss 4.0 Python.
- May 5, 2022 Released Kss Flutter.
- August 25, 2021 Released Kss Java.
- August 18, 2021 Released Kss 3.0 Python.
- December 21, 2020 Released Kss 2.0 Python.
- August 16, 2019 Released Kss 1.0 C++.
Installation
⚠️ Note (2025-08-01):
The automatic installation of mecab and related Python packages has been removed from setup.py to prevent password prompts and privilege escalation during installation. If you require mecab functionality, please install mecab and its Python bindings manually before installing or using Kss. See below for instructions.
Install Kss
Kss can be easily installed using the pip package manager.
pip install kss
Install Mecab (Optional)
Installing mecab or konlpy.tag.Mecab is optional, but Kss runs much faster when one of them is available.
- mecab (Linux/MacOS): https://github.com/hyunwoongko/python-mecab-kor
- mecab (Windows): https://cleancode-ws.tistory.com/97
- konlpy.tag.Mecab (Linux/MacOS): https://konlpy.org/en/latest/api/konlpy.tag/#mecab-class
- konlpy.tag.Mecab (Windows): https://uwgdqo.tistory.com/363
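Before relying on the faster backend, it can help to confirm which optional packages Python can actually find. The following is a minimal sketch using only the standard library. The top-level import name `mecab` for python-mecab-kor is an assumption, and finding `konlpy` does not by itself prove that `konlpy.tag.Mecab` has a working mecab installation behind it.

```python
from importlib.util import find_spec

# Top-level import names assumed for each optional backend.
OPTIONAL_BACKENDS = {
    "mecab": "mecab",    # python-mecab-kor (assumed import name)
    "konlpy": "konlpy",  # package that provides konlpy.tag.Mecab
}


def installed_backends():
    """Return {label: bool} telling whether each package can be located."""
    return {
        label: find_spec(module) is not None
        for label, module in OPTIONAL_BACKENDS.items()
    }


if __name__ == "__main__":
    for label, found in installed_backends().items():
        status = "installed" if found else "missing"
        print(f"{label}: {status}")
```

If both entries are reported as missing, Kss can still be installed and used, just without the mecab speedup.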
Usage
1. Basic Usage
All functions can be used by creating an instance of the Kss class and calling the instance with the inputs.
from kss import Kss
module = Kss("MODULE_NAME")
output = module("YOUR_INPUT_STRING", **kwargs)
2. Available Modules
If you want to check the available modules, you can use the available() function.
from kss import Kss
Kss.available()
['augment', 'collocate', 'g2p', 'hangulize', 'split_hanja', 'is_hanja', 'hanja2hangul', 'h2j', 'h2hcj', 'j2h', 'j2hcj', 'hcj2h', 'hcj2j', 'is_jamo', 'is_jamo_modern', 'is_hcj', 'is_hcj_modern', 'is_hangul_char', 'select_josa', 'combine_josa', 'extract_keywords', 'split_morphemes', 'paradigm', 'anonymize', 'clean_news', 'is_completed_form', 'get_all_completed_form_hangul_chars', 'get_all_incompleted_form_hangul_chars', 'filter_out', 'half2full', 'reduce_char_repeats', 'reduce_emoticon_repeats', 'remove_invisible_chars', 'normalize', 'preprocess', 'qwerty', 'romanize', 'is_unsafe', 'split_sentences', 'correct_spacing', 'summarize_sentences']
3. Checking the usage of each module
If you want to check the usage of each module, you can use the help() function.
from kss import Kss
module = Kss("split_sentences")
module.help()
Split texts into sentences.
Args:
text (Union[str, List[str], Tuple[str]]): single text or list/tuple of texts
backend (str): morpheme analyzer backend. 'mecab', 'pecab', 'punct', 'fast' are supported
num_workers (Union[int, str]): the number of multiprocessing workers
strip (bool): strip all sentences or not
return_morphemes (bool): whether to return morphemes or not
ignores (List[str]): list of strings to ignore
Returns:
Union[List[str], List[List[str]]]: outputs of sentence splitting
Examples:
>>> from kss import Kss
>>> split_sentences = Kss("split_sentences")
>>> text = "회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요 다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다 강남역 맛집 토끼정의 외부 모습."
>>> split_sentences(text)
['회사 동료 분들과 다녀왔는데 분위기도 좋고 음식도 맛있었어요', '다만, 강남 토끼정이 강남 쉑쉑버거 골목길로 쭉 올라가야 하는데 다들 쉑쉑버거의 유혹에 넘어갈 뻔 했답니다', '강남역 맛집 토끼정의 외부 모습.']
4. Multiprocessing
If you pass a list of strings, Kss automatically uses multiprocessing to process them in parallel.
You can set the number of worker processes with the num_workers parameter.
If num_workers is less than 2, Kss does not use multiprocessing.
from kss import Kss
module = Kss("MODULE_NAME")
# using all cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], **kwargs)
# using 4 cores
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=4, **kwargs)
# using 1 core (no multiprocessing)
output = module(["YOUR_INPUT_STRING1", "YOUR_INPUT_STRING2", ...], num_workers=1, **kwargs)
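The rules above can be summarized as a small decision function. This is only an illustrative sketch, not Kss's internal code: `effective_workers` is a hypothetical helper, the `"auto"` sentinel for "use all cores" is an assumption, and treating a single string as never parallelized is inferred from the description rather than confirmed.

```python
import os
from typing import List, Union


def effective_workers(
    inputs: Union[str, List[str]],
    num_workers: Union[int, str] = "auto",
) -> int:
    """Illustrative only: how many worker processes the rules above imply."""
    if isinstance(inputs, str):
        # A single string leaves nothing to parallelize (inferred behavior).
        return 1
    if num_workers == "auto":
        # Assumed sentinel meaning "use all available cores".
        num_workers = os.cpu_count() or 1
    num_workers = int(num_workers)
    if num_workers < 2:
        # Fewer than 2 workers means no multiprocessing.
        return 1
    return num_workers


print(effective_workers(["a", "b", "c"], num_workers=4))  # 4
print(effective_workers(["a", "b", "c"], num_workers=1))  # 1
print(effective_workers("a single string", num_workers=4))  # 1
```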
5. Backward Compatibility
Older versions of Kss exposed each module as a plain function. Kss still supports this functional style for backward compatibility.
from kss import split_sentences
output = split_sentences("YOUR_INPUT_STRING", **kwargs)
6. Alias of module names
Because Kss has so many modules, users may have difficulty remembering the name of each one. Kss provides aliases for some modules to make them easier to load.
from kss import Kss
module_1 = Kss("split_morphemes")
module_2 = Kss("tokenize")
# For example, 'split_morphemes' module can be loaded by using the alias named 'tokenize'.
You can check the alias of each module by using the alias() function.
from kss import Kss
Kss.alias()
{'aug': 'augment', 'augmentation': 'augment', 'collocation': 'collocate', 'hangulization': 'hangulize', 'hangulisation': 'hangulize', 'hangulise': 'hangulize', 'hanja': 'hanja2hangul', 'hangul2jamo': 'h2j', 'hangul2hcj': 'h2hcj', 'jamo2hangul': 'j2h', 'jamo2hcj': 'j2hcj', 'hcj2hangul': 'hcj2h', 'hcj2jamo': 'hcj2j', 'josa': 'select_josa', 'keyword': 'extract_keywords', 'keywords': 'extract_keywords', 'morpheme': 'split_morphemes', 'morphemes': 'split_morphemes', 'annonymization': 'anonymize', 'news_cleaning': 'clean_news', 'news': 'clean_news', 'completed_form': 'is_completed_form', 'completed': 'is_completed_form', 'filter': 'filter_out', 'reduce_repeats': 'reduce_char_repeats', 'reduce_char': 'reduce_char_repeats', 'reduce_chars': 'reduce_char_repeats', 'reduce_emoticon': 'reduce_emoticon_repeats', 'reduce_emoticons': 'reduce_emoticon_repeats', 'reduce_emo': 'reduce_emoticon_repeats', 'remove_invisible': 'remove_invisible_chars', 'invisible_chars': 'remove_invisible_chars', 'invisible': 'remove_invisible_chars', 'normalization': 'normalize', 'normalisation': 'normalize', 'normalise': 'normalize', 'preprocessing': 'preprocess', 'prep': 'preprocess', 'romanization': 'romanize', 'romanisation': 'romanize', 'romanise': 'romanize', 'safety': 'is_unsafe', 'check_safety': 'is_unsafe', 'sentence': 'split_sentences', 'sentences': 'split_sentences', 'sent_split': 'split_sentences', 'sent_splits': 'split_sentences', 'sents_split': 'split_sentences', 'split_sent': 'split_sentences', 'split_sents': 'split_sentences', 'spacing': 'correct_spacing', 'space': 'correct_spacing', 'spaces': 'correct_spacing', 'summarization': 'summarize_sentences', 'summarize': 'summarize_sentences', 'summ': 'summarize_sentences', 'morph': 'split_morphemes', 'morphs': 'split_morphemes', 'tokenize': 'split_morphemes', 'tokenization': 'split_morphemes', 'split_morph': 'split_morphemes', 'split_morphs': 'split_morphemes', 'morph_split': 'split_morphemes', 'morph_splits': 'split_morphemes', 
'morphs_split': 'split_morphemes'}
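Conceptually, an alias is just a lookup from a short name to a canonical module name. The sketch below illustrates that idea with a small excerpt of the table printed above. It is not Kss's actual implementation, and `resolve` is a hypothetical helper introduced only for this example.

```python
# A small excerpt of the alias table printed above.
ALIASES = {
    "tokenize": "split_morphemes",
    "morphs": "split_morphemes",
    "sents_split": "split_sentences",
    "spacing": "correct_spacing",
    "summ": "summarize_sentences",
}

# A small excerpt of the canonical names returned by Kss.available().
MODULES = {
    "split_morphemes",
    "split_sentences",
    "correct_spacing",
    "summarize_sentences",
}


def resolve(name: str) -> str:
    """Map a canonical name or an alias to the canonical module name."""
    if name in MODULES:
        return name
    if name in ALIASES:
        return ALIASES[name]
    raise ValueError(f"unknown module or alias: {name!r}")


print(resolve("tokenize"))         # split_morphemes
print(resolve("split_sentences"))  # split_sentences
```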
Supported Modules
Kss supports the following modules. The sections below show simple usage of each one.
Because there are so many modules, I apologize for not being able to explain each one in detail.
<details> <summary>1. augment</summary>This augments text using synonym replacement and can optionally postprocess the result by correcting josa. For this, Kss uses the Korean wordnet from KAIST.
Args:
- text (Union[str, List[str], Tuple[str]]): single text or list of texts
- replacement_ratio (float): ratio of words to be replaced
- josa_correction (bool): whether to correct josa or not
- num_workers (Union[int, str]): the number of multiprocessing workers
- backend (str): morpheme analyzer backend. 'mecab', 'pecab' are supported
- verbose (bool): whether to print verbose outputs or not
Returns:
Union[str, List[str]]: augmented text or list of augmented texts
Examples:
>>> from kss import Kss
>>> augment = Kss("augment")
>>> text = "앞서 지난해 11월, 보이저 1호는 명령을 수신하고 수행하는 데엔 문제가 없었지만 통신 장치에 문제가 생겨 과학·엔지니어링 데이터가 지구로 전송되지 않았던 바 있다. 당시 그들은 컴퓨터 시스템을 재시작하고 문제의 근본적인 원인을 파악하기 위해 명령을 보내려고 시도했고, 이달 1일 '포크'라는 명령을 보냈다."
>>> output = augment(text)
>>> print(output)
"앞서 지난해 11월, 보이저 1호는 명령을 수신하고 시행하는 데엔 문제가 없었지만 송신 장비에 문제가 생겨 과학·엔지니어링 데이터가 지구로 전송되지 않았던 바 있다. 당시 그들은 컴퓨터 시스템을 재시작하고 문제의 근본적인 원인을 파악하기 위해 명령을 보내려고 시도했고, 이달 1일 '포크'라는 명령을 보냈다."
References:
- This was copied from KoEDA and modified by Kss
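To make the idea of synonym replacement concrete, here is a toy sketch. It is not the KoEDA or Kss implementation: `TOY_SYNONYMS` and `toy_augment` are hypothetical, the text is split on whitespace instead of being analyzed into morphemes, no josa correction is performed, and the ratio is applied only to words that have an entry in the toy table, which is an assumption about how `replacement_ratio` is interpreted.

```python
import random
from typing import Dict, List, Optional

# Hypothetical toy synonym table; Kss itself uses the KAIST Korean wordnet.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "통신": ["송신"],
    "수행하는": ["시행하는"],
}


def toy_augment(
    text: str,
    replacement_ratio: float = 0.3,
    seed: Optional[int] = None,
) -> str:
    """Replace a share of the replaceable words with a synonym."""
    rng = random.Random(seed)
    tokens = text.split()
    # Positions of words that have at least one synonym in the toy table.
    candidates = [i for i, tok in enumerate(tokens) if tok in TOY_SYNONYMS]
    # Number of words to replace, clamped to the valid range.
    k = round(len(candidates) * replacement_ratio)
    k = max(0, min(len(candidates), k))
    for i in rng.sample(candidates, k):
        tokens[i] = rng.choice(TOY_SYNONYMS[tokens[i]])
    return " ".join(tokens)


print(toy_augment("통신 장치에 문제가 생겨", replacement_ratio=1.0))
# 송신 장치에 문제가 생겨
```

A real implementation needs morpheme analysis so that a word with an attached josa (for example "장치에") can still be matched, which is one reason the module takes a backend argument.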
This returns collocation (연어) of given
