
WordSegment

Chinese word segmentation implemented with several algorithms: maximum matching (forward, backward, bidirectional), HMM, and n-gram (maximum-probability n-gram, bidirectional n-gram), with a performance comparison.

Install / Use

/learn @liuhuanyong/WordSegment

README


Project Overview

1. MaxMatch:
dict.txt: dictionary used for segmentation
max_forward_cut: forward maximum matching
max_backward_cut: backward maximum matching
max_biward_cut: bidirectional maximum matching
result:
Input: 我们在野生动物园玩
Output:
forward_cutlist: ['我们', '在野', '生动', '物', '园', '玩']
backward_cutlist: ['我们', '在', '野生', '动物园', '玩']
biward_seglist: ['我们', '在', '野生', '动物园', '玩']
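Forward maximum matching can be sketched in a few lines: scan left to right and greedily take the longest dictionary word at each position. The `demo_dict` and `max_len` below are illustrative; the project loads its real dictionary from dict.txt.

```python
def max_forward_cut(sentence, word_dict, max_len=5):
    """Forward maximum matching: at each position, take the longest
    dictionary word (up to max_len characters), else a single char."""
    tokens = []
    i = 0
    while i < len(sentence):
        matched = sentence[i]  # fall back to a single character
        for j in range(min(max_len, len(sentence) - i), 1, -1):
            candidate = sentence[i:i + j]
            if candidate in word_dict:
                matched = candidate
                break
        tokens.append(matched)
        i += len(matched)
    return tokens

# Toy dictionary, for illustration only.
demo_dict = {'我们', '在野', '野生', '生动', '动物园'}
print(max_forward_cut('我们在野生动物园玩', demo_dict))
# → ['我们', '在野', '生动', '物', '园', '玩']
```

Note how the greedy left-to-right scan commits to '在野' and '生动' early, reproducing the forward-matching error shown above; the backward pass avoids it, which is why the bidirectional variant exists.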

2. HMM:
hmm_train.py: trains the initial-state, emission, and transition probabilities on the People's Daily corpus (290K sentences)
data: training corpus, placed at ./data/train.txt
model: stores the trained probability models, which can be loaded directly after training
trans_path = './model/prob_trans.model'
emit_path = './model/prob_emit.model'
start_path = './model/prob_start.model'

hmm_cut.py: segments text with the Viterbi algorithm using the trained model
Input: 我们在野生动物园玩
Output: ['我们', '在', '野', '生动', '物园', '玩']
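The decoding step in hmm_cut.py can be sketched as standard Viterbi decoding over BMES character tags (Begin/Middle/End/Single), followed by cutting word boundaries after every E or S tag. The hand-set probabilities below are illustrative stand-ins for the models trained by hmm_train.py, not the project's actual parameters.

```python
import math

STATES = ['B', 'M', 'E', 'S']  # BMES character tags

def viterbi(obs, start_p, trans_p, emit_p):
    """Standard Viterbi decoding in log space; zero probabilities
    are replaced by a large negative log score."""
    def logp(p):
        return math.log(p) if p > 0 else -1e9

    V = [{s: logp(start_p.get(s, 0)) + logp(emit_p[s].get(obs[0], 0))
          for s in STATES}]
    path = {s: [s] for s in STATES}
    for t in range(1, len(obs)):
        V.append({})
        new_path = {}
        for s in STATES:
            score, prev = max(
                (V[t - 1][p] + logp(trans_p[p].get(s, 0))
                 + logp(emit_p[s].get(obs[t], 0)), p)
                for p in STATES)
            V[t][s] = score
            new_path[s] = path[prev] + [s]
        path = new_path
    best = max(STATES, key=lambda s: V[-1][s])
    return path[best]

def tags_to_words(sentence, tags):
    """Cut a word boundary after every S (single) or E (end) tag."""
    words, buf = [], ''
    for ch, tag in zip(sentence, tags):
        buf += ch
        if tag in ('S', 'E'):
            words.append(buf)
            buf = ''
    if buf:
        words.append(buf)
    return words

# Tiny hand-set model, for illustration only.
start_p = {'B': 0.6, 'S': 0.4}
trans_p = {'B': {'M': 0.3, 'E': 0.7}, 'M': {'M': 0.3, 'E': 0.7},
           'E': {'B': 0.5, 'S': 0.5}, 'S': {'B': 0.5, 'S': 0.5}}
emit_p = {'B': {'我': 0.5}, 'M': {}, 'E': {'们': 0.5}, 'S': {'玩': 0.5}}

tags = viterbi('我们玩', start_p, trans_p, emit_p)
print(tags_to_words('我们玩', tags))  # → ['我们', '玩']
```

Because the HMM scores character tags rather than looking words up in a dictionary, it can split unseen strings, but it also produces dictionary-free errors such as the '野 / 生动 / 物园' output above.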

3. N-gram
train_ngram.py: trains word occurrence probabilities and 2-gram conditional probabilities on the People's Daily corpus (290K sentences)
data: training corpus, placed at ./data/train.txt
model: stores the trained probability models, which can be loaded directly after training
word_path = './model/word_dict.model' (word occurrence probabilities)
trans_path = './model/trans_dict.model' (2-gram conditional probabilities)
max_ngram.py: maximum-probability 2-gram segmentation
biward_ngram.py: n-gram-based forward-backward maximum matching
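A maximum-probability 2-gram segmenter can be sketched as dynamic programming over all dictionary segmentations, scoring each word by its bigram probability given the previous word. The backoff scheme (scaled unigram for unseen bigrams) and the toy probabilities are assumptions for illustration; the project estimates its probabilities from ./data/train.txt.

```python
import math

def max_prob_bigram_cut(sentence, word_prob, bigram_prob, max_len=5):
    """DP over all dictionary segmentations, maximizing the sum of
    log 2-gram probabilities; unseen bigrams back off to a scaled
    unigram probability."""
    n = len(sentence)
    best = {0: (0.0, [])}  # best[i]: (score, words) for sentence[:i]
    for i in range(1, n + 1):
        cands = []
        for j in range(max(0, i - max_len), i):
            if j not in best:
                continue
            w = sentence[j:i]
            if w not in word_prob and len(w) > 1:
                continue  # only single characters may be out-of-vocabulary
            prev = best[j][1][-1] if best[j][1] else '<s>'
            p = bigram_prob.get((prev, w), 0.1 * word_prob.get(w, 1e-6))
            cands.append((best[j][0] + math.log(p), best[j][1] + [w]))
        if cands:
            best[i] = max(cands)
    return best[n][1]

# Toy probabilities, for illustration only.
word_prob = {'我们': 0.1, '在': 0.2, '在野': 0.05, '野生': 0.05,
             '生动': 0.05, '动物园': 0.05, '玩': 0.1}
bigram_prob = {('<s>', '我们'): 0.2, ('我们', '在'): 0.3,
               ('在', '野生'): 0.2, ('野生', '动物园'): 0.4,
               ('动物园', '玩'): 0.3}
print(max_prob_bigram_cut('我们在野生动物园玩', word_prob, bigram_prob))
# → ['我们', '在', '野生', '动物园', '玩']
```

Unlike greedy maximum matching, the DP compares whole segmentations, so the globally likely '在 / 野生 / 动物园' path beats the locally longest '在野 / 生动' path.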

4. Algorithm Comparison
1. Evaluation corpus: the Microsoft evaluation corpus, 3,985 sentences
2. Performance comparison

| Algorithm | Precision | Recall | F1-score | Cost-Time |
| --- | :---: | --- | --- | --- |
| HMM | 0.65 | 0.75 | 0.70 | 4.87 |
| MaxForward | 0.76 | 0.87 | 0.81 | 244.14 |
| MaxBackward | 0.76 | 0.87 | 0.81 | 280.61 |
| MaxBiWard | 0.76 | 0.87 | 0.81 | 443.23 |
| MaxProbNgram | 0.76 | 0.87 | 0.81 | 8.99 |
| MaxBiwardNgram | 0.74 | 0.86 | 0.80 | 3.96 |
