Pytextclassifier
pytextclassifier is a toolkit for text classification. It implements LR, Xgboost, TextCNN, FastText, TextRNN, BERT and other classification models, ready to use out of the box.
PyTextClassifier: Python Text Classifier
Introduction
PyTextClassifier can be applied to sentiment polarity analysis, text risk classification and similar tasks, and it supports multiple classification and clustering algorithms.
pytextclassifier is an open-source Python toolkit for text classification. Its goal is to implement text analysis algorithms that are usable in production environments.
It provides a range of text classification and clustering algorithms, supports sentence-level and document-level classification, and covers binary, multi-class, multi-label and hierarchical classification as well as Kmeans clustering. It works out of the box and is developed in Python 3.
Guide
Feature
pytextclassifier offers clear algorithm implementations, high performance and support for custom corpora.
Functions:
Classifier
- [x] LogisticRegression
- [x] Random Forest
- [x] Decision Tree
- [x] K-Nearest Neighbours
- [x] Naive Bayes
- [x] Xgboost
- [x] Support Vector Machine(SVM)
- [x] TextCNN
- [x] TextRNN
- [x] FastText
- [x] BERT
Cluster
- [x] MiniBatchKmeans
While providing rich functionality, pytextclassifier keeps its internal modules loosely coupled, loads models lazily, publishes its dictionaries, and stays easy to use.
Install
- Requirements and Installation
pip3 install torch # conda install pytorch
pip3 install pytextclassifier
or
git clone https://github.com/shibing624/pytextclassifier.git
cd pytextclassifier
python3 setup.py install
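After either install route, it can be useful to confirm that the packages are visible to the interpreter before running the examples. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical, and it assumes the import names are `torch` and `pytextclassifier`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Assumed import names; adjust if your environment differs.
    for name in ("torch", "pytextclassifier"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

Looking up the module spec avoids the cost of actually importing torch just to check that it is present.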
Usage
Text Classifier
English Text Classifier
The example examples/lr_en_classification_demo.py covers model training, saving, prediction and evaluation:
import sys
sys.path.append('..')
from pytextclassifier import ClassicClassifier
if __name__ == '__main__':
    m = ClassicClassifier(output_dir='models/lr', model_name_or_model='lr')
    # ClassicClassifier supports model_name: lr, random_forest, decision_tree, knn, bayes, svm, xgboost
    print(m)
    data = [
        ('education', 'Student debt to cost Britain billions within decades'),
        ('education', 'Chinese education for TV experiment'),
        ('sports', 'Middle East and Asia boost investment in top level sports'),
        ('sports', 'Summit Series look launches HBO Canada sports doc series: Mudhar')
    ]
    # train and save best model
    m.train(data)
    # load best model from model_dir
    m.load_model()
    predict_label, predict_proba = m.predict([
        'Abbott government spends $8 million on higher education media blitz'])
    print(f'predict_label: {predict_label}, predict_proba: {predict_proba}')
    test_data = [
        ('education', 'Abbott government spends $8 million on higher education media blitz'),
        ('sports', 'Middle East and Asia boost investment in top level sports'),
    ]
    acc_score = m.evaluate_model(test_data)
    print(f'acc_score: {acc_score}')
output:
ClassicClassifier instance (LogisticRegression(fit_intercept=False), stopwords size: 2438)
predict_label: ['education'], predict_proba: [0.5378236358492112]
acc_score: 1.0
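The `acc_score` of 1.0 means both test sentences received the expected label. This README does not state how `evaluate_model` computes its score; assuming it is plain accuracy (the fraction of test items whose predicted label equals the gold label), as the variable name suggests, the calculation can be sketched as below. The function `accuracy` is illustrative and not part of pytextclassifier.

```python
from typing import Sequence


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


if __name__ == "__main__":
    test_data = [
        ("education", "Abbott government spends $8 million on higher education media blitz"),
        ("sports", "Middle East and Asia boost investment in top level sports"),
    ]
    gold_labels = [label for label, _ in test_data]
    # Hypothetical predictions standing in for the labels returned by m.predict(...)
    predicted_labels = ["education", "sports"]
    print(accuracy(gold_labels, predicted_labels))  # 1.0
```

With only two test items the score can only be 0.0, 0.5 or 1.0, so a toy test set like this one says little about real model quality.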
Chinese Text Classifier
Text classification works with both Chinese and English corpora.
Example: examples/lr_classification_demo.py
import sys
sys.path.append('..')
from pytextclassifier import ClassicClassifier
if __name__ == '__main__':
    m = ClassicClassifier(output_dir='models/lr-toy', model_name_or_model='lr')
    # Classic classification methods; supported models: lr, random_forest, decision_tree, knn, bayes, svm, xgboost
    data = [
        ('education', '名师指导托福语法技巧:名词的复数形式'),
        ('education', '中国高考成绩海外认可 是“狼来了”吗?'),
        ('education', '公务员考虑越来越吃香,这是怎么回事?'),
        ('sports', '图文:法网孟菲尔斯苦战进16强 孟菲尔斯怒吼'),
        ('sports', '四川丹棱举行全国长距登山挑战赛 近万人参与'),
        ('sports', '米兰客场8战不败国米10年连胜'),
    ]
    m.train(data)
    print(m)
    # load best model from model_dir
    m.load_model()
    predict_label, predict_proba = m.predict(['福建春季公务员考试报名18日截止 2月6日考试',
                                              '意甲首轮补赛交战记录:米兰客场8战不败国米10年连胜'])
    print(f'predict_label: {predict_label}, predict_proba: {predict_proba}')
    test_data = [
        ('education', '福建春季公务员考试报名18日截止 2月6日考试'),
        ('sports', '意甲首轮补赛交战记录:米兰客场8战不败国米10年连胜'),
    ]
    acc_score = m.evaluate_model(test_data)
    print(f'acc_score: {acc_score}')  # 1.0
    # train model with 10,000 samples
    print('-' * 42)
    m = ClassicClassifier(output_dir='models/lr', model_name_or_model='lr')
    data_file = 'thucnews_train_1w.txt'
    m.train(data_file)
    m.load_model()
    predict_label, predict_proba = m.predict(
        ['顺义北京苏活88平米起精装房在售',
         '美EB-5项目“15日快速移民”将推迟'])
    print(f'predict_label: {predict_label}, predict_proba: {predict_proba}')
output:
ClassicClassifier instance (LogisticRegression(fit_intercept=False), stopwords size: 2438)
predict_label: ['education' 'sports'], predict_proba: [0.5, 0.598941806741534]
acc_score: 1.0
------------------------------------------
predict_label: ['realty' 'education'], predict_proba: [0.7302956923617372, 0.2565005445322923]
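The example passes `m.train` either a list of `(label, text)` tuples or a file path such as `thucnews_train_1w.txt`. The layout of that file is not described in this README. The sketch below assumes one example per line, with the label and the text separated by a tab, and shows how such a file would map onto the tuple format used in the in-memory examples. `read_labeled_file` is a hypothetical helper, not a pytextclassifier function.

```python
import os
import tempfile
from typing import List, Tuple


def read_labeled_file(path: str, sep: str = "\t") -> List[Tuple[str, str]]:
    """Read 'label<sep>text' lines into a list of (label, text) tuples."""
    pairs: List[Tuple[str, str]] = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # skip blank lines
            label, found, text = line.partition(sep)
            if not found:
                raise ValueError(f"line {line_no}: separator {sep!r} not found")
            pairs.append((label.strip(), text.strip()))
    return pairs


if __name__ == "__main__":
    # Toy file written to a temporary directory; the real corpus is not reproduced here.
    sample = "education\t名师指导托福语法技巧\nsports\t米兰客场8战不败\n"
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "toy_train.txt")
        with open(sample_path, "w", encoding="utf-8") as out:
            out.write(sample)
        print(read_labeled_file(sample_path))
```

Splitting on the first separator only keeps any later tab characters inside the text field intact.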
Visual Feature Importance
Shows the model's feature weights and the per-word weights behind a prediction. Example: examples/visual_feature_importance.ipynb
import sys
sys.path.append('..')
from pytextclassifier import ClassicClassifier
import jieba
tc = ClassicClassifier(output_dir='models/lr-toy', model_name_or_model='lr')
data = [
('education', '名师指导托福语法技巧:名词的复数形式'),
('education', '中国高考成绩海外认可 是“狼来了”吗?'),
('sports', '图文:法网孟菲尔斯苦战进16强 孟菲尔斯怒吼'),
('sports', '四川丹棱举行全国长距登山挑战赛 近万人参与'),
('sports', '米兰客场8战不败国米10年连胜')
]
tc.train(data)
import eli5
infer_data = ['高考指导托福语法技巧国际认可',
'意甲首轮补赛交战记录:米兰客场8战不败国米10年连胜']
eli5.show_weights(tc.model, vec=tc.feature)
seg_infer_data = [' '.join(jieba.lcut(i)) for i in infer_data]
eli5.show_prediction(tc.model, seg_infer_data[0], vec=tc.feature,
target_names=['education', 'sports'])
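For a linear model such as logistic regression, a per-word explanation boils down to multiplying each feature's value by its learned weight and ranking the results. The sketch below illustrates that idea with plain dictionaries. The weights are made-up toy values, not output from a trained model, and it assumes simple word counts as feature values, whereas the vectorizer used by the library may weight features differently. `token_contributions` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def token_contributions(
    tokens: List[str], weights: Dict[str, float], bias: float = 0.0
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the linear score and per-token contributions (count * weight)."""
    contributions: Dict[str, float] = {}
    for tok in tokens:
        if tok in weights:  # tokens without a weight contribute nothing
            contributions[tok] = contributions.get(tok, 0.0) + weights[tok]
    score = bias + sum(contributions.values())
    # Largest absolute contribution first, as in a typical explanation table
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return score, ranked


if __name__ == "__main__":
    # Toy weights for an 'education' vs. 'sports' score; positive favours 'education'.
    toy_weights = {"高考": 1.5, "托福": 0.5, "米兰": -1.0}
    tokens = ["高考", "指导", "托福", "语法", "技巧"]
    score, ranked = token_contributions(tokens, toy_weights)
    print(score)
    for tok, value in ranked:
        print(tok, value)
```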

Deep Classification model
This project supports the following deep classification models: FastText, TextCNN, TextRNN and BERT. Import the corresponding class to use each model:
from pytextclassifier import FastTextClassifier, TextCNNClassifier, TextRNNClassifier, BertClassifier
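The FastText model is used as the example below; the other models are used in a similar way.
FastText Model
Example of training and predicting with a FastText model: examples/fasttext_classification_demo.py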
import sys
sys.path.append('..')
from pytextclassifier import FastTextClassifier, load_data
if __name__ == '__main__':
    m = FastTextClassifier(output_dir='models/fasttext-toy')
    data = [
        ('education', '名师指导托福语法技巧:名词的复数形式'),
        ('education', '中国高考成绩海外认可 是“狼来了”吗?'),
        ('education', '公务员考虑越来越吃香,这是怎么回事?'),
        ('sports', '图文:法网孟菲尔斯苦战进16强 孟菲尔斯怒吼'),
        ('sports', '四川丹棱举行全国长距登山挑战赛 近万人参与'),
        ('sports', '米兰客场8战不败保持连胜'),
    ]
    m.train(data, num_epochs=3)
    print(m)
    # load trained best model
    m.load_model()
    predict_label, predict_proba = m.predict(['福建春季公务员考试报名18日截止 2月6日考试',
                                              '意甲首轮补赛交战记录:米兰客场8战不败国米10年连胜'])
    print(f'predict_label: {predict_label}, predict_proba: {predict_proba}')
    test_data = [
        ('education', '福建春季公务员考试报名18日截止 2月6日考试'),
        ('sports', '意甲首轮补赛交战记录:米兰客场8战不败国米10年连胜'),
    ]
    acc_score = m.evaluate_model(test_data)
    print(f'acc_score: {acc_score}')  # 1.0
    # train model with 10,000 samples
    print('-' * 42)
    data_file = 'thucnews_train_1w.txt'
    m = FastTextClassifier(output_dir='models/fasttext')
    m.train(data_file, names=('labels', 'text'), num_epochs=3)
    # load best trained model from model_dir
    m.load_model()
    predict_label, predict_proba = m.predict(
        ['顺义北京苏活88平米起精装房在售',
         '美EB-5项目“15日快速移民”将推迟']
    )
    print(f'predict_label: {predict_label}, predict_proba: {predict_proba}')
    x, y, df = load_data(data_file)
    test_data = df[:100]
    acc_score = m.evaluate_model(test_data)
    print(f'acc_score: {acc_score}')
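In the examples above, `predict` returns two parallel sequences: one label and one probability per input text. When consuming these results it is often convenient to zip them back together with the inputs and flag low-confidence predictions. The following is a minimal sketch of that post-processing step. The helper `pair_predictions` and the 0.5 threshold are illustrative choices, and the code assumes each probability refers to the predicted label, as the sample outputs in this README suggest.

```python
from typing import List, Sequence, Tuple


def pair_predictions(
    texts: Sequence[str],
    labels: Sequence[str],
    probas: Sequence[float],
    min_proba: float = 0.5,
) -> List[Tuple[str, str, float, bool]]:
    """Combine inputs with predictions; the last field marks confident rows."""
    if not (len(texts) == len(labels) == len(probas)):
        raise ValueError("texts, labels and probas must have the same length")
    return [
        (text, str(label), float(proba), float(proba) >= min_proba)
        for text, label, proba in zip(texts, labels, probas)
    ]


if __name__ == "__main__":
    texts = ["顺义北京苏活88平米起精装房在售", "美EB-5项目“15日快速移民”将推迟"]
    # Values copied from the sample output of the logistic regression example
    labels = ["realty", "education"]
    probas = [0.7302956923617372, 0.2565005445322923]
    for text, label, proba, confident in pair_predictions(texts, labels, probas):
        note = "" if confident else " (low confidence)"
        print(f"{label}\t{proba:.3f}\t{text}{note}")
```

Converting each label with `str` and each probability with `float` keeps the rows plain Python values even if the inputs are array elements.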
BERT Models
Multi-class Model
Example of training and predicting with a BERT multi-class model: examples/bert_classification_zh_demo.py
import sys
sys.path.append('..')
from pytextclassifier import BertClassifier
if __name__ == '__main__':
    m = BertClassifier(output_dir='models/bert-chinese-toy', num_classes=2,
                       model_type='bert', model_name='bert-base-chinese', num_epochs=2)
    # model_type: supports 'bert', 'albert', 'roberta', 'xlnet'
    # model_name: supports 'bert-base-chinese', 'bert-base-cased', 'bert-base-multilingual-cased' ...
data
