Cntext

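cntext is a Chinese text analysis library for Python, built for empirical social science research. Beyond conventional word-frequency statistics and sentiment analysis, it supports advanced features such as word-embedding training and semantic projection, helping researchers measure abstract constructs (attitudes, cognition, cultural beliefs, and psychological states) from large-scale unstructured text.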

Table of Contents

cntext: A Chinese Text Analysis Toolkit for Social Science Research
Installing cntext
Modules
QuickStart
1. IO module
2. Stats module
3. Plot module
4. Model module
5. Mind module
6. LLM module
- 6.1 ct.llm()
- 6.2 Built-in prompts
Citation
- apalike
- bibtex
- endnote

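cntext: A Chinese Text Analysis Toolkit for Social Science Research

cntext is a Chinese text analysis library for Python, designed for empirical social science researchers. It goes beyond traditional sentiment analysis based on word counts: it also offers word-embedding training and semantic projection, so that abstract constructs such as attitudes, cognition, cultural beliefs, and psychological states can be measured from large-scale unstructured text.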

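🎯 What you can do with it

Build structured research datasets
- Combine multiple text files (txt/pdf/docx/csv) into a DataFrame: ct.read_files()
- Extract the "Management Discussion and Analysis" (MD&A) section from listed companies' annual reports: ct.extract_mda()
- Compute text readability metrics (such as the Flesch index): ct.readability()
Basic text analysis (traditional methods)
- Word-frequency statistics and keyword extraction: ct.word_count()
- Sentiment analysis (with a choice of built-in dictionaries such as hownet and dutir): ct.sentiment()
- Text similarity (cosine similarity): ct.cosine_sim()
Measure implicit attitudes and cultural change
- Train domain-specific word vectors (Word2Vec/GloVe) in two lines of code: ct.Word2Vec()
- Build a concept semantic axis (such as "innovative vs. conservative"): ct.generate_concept_axis()
- Quantify stereotypes and shifts in organizational culture through semantic projection: ct.project_text()
Combine large language models for structured analysis
- Call an LLM to parse text semantically and return structured results (such as emotion dimensions or intent classes): ct.llm()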

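cntext does not aim at black-box prediction. Its goal is to turn text into a theory-driven instrument for scientific measurement. It is free and open source, and the research community is welcome to use, validate, and help build it.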
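To make the "concept axis" and "semantic projection" idea from the list above more concrete, here is a minimal sketch in plain Python. It does not call cntext and is not its implementation: the helper names (`concept_axis`, `project`) and the toy 2-D vectors are hypothetical, and one common construction is assumed, namely the axis as the difference between the mean vectors of two poles.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_axis(pos_vectors, neg_vectors):
    """Hypothetical helper: unit vector pointing from the negative pole to the positive pole."""
    pos = mean_vector(pos_vectors)
    neg = mean_vector(neg_vectors)
    diff = [p - q for p, q in zip(pos, neg)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two poles have identical mean vectors")
    return [d / norm for d in diff]

def project(vector, axis):
    """Scalar projection of a vector onto a unit-length axis (dot product)."""
    return sum(v * a for v, a in zip(vector, axis))

# Toy 2-D "embeddings", invented for illustration only.
toy = {
    "innovate": [1.0, 0.0],
    "pioneer": [0.8, 0.2],
    "conservative": [-1.0, 0.0],
    "traditional": [-0.8, 0.2],
    "startup": [0.5, 0.5],
    "bureaucracy": [-0.5, 0.5],
}

axis = concept_axis(
    [toy["innovate"], toy["pioneer"]],
    [toy["conservative"], toy["traditional"]],
)

# A positive score leans toward the "innovative" pole, a negative one toward "conservative".
for word in ("startup", "bureaucracy"):
    print(word, round(project(toy[word], axis), 3))
```

With real embeddings the vectors would come from a trained Word2Vec or GloVe model, and a document score would typically aggregate the projections of its words.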

Installing cntext

pip3 install cntext --upgrade

Note that cntext runs on Python 3.9 to 3.12. If installation fails, the cause may be an unsupported Python version.
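A quick way to rule out the version issue before installing is to check the interpreter yourself. The sketch below uses only the standard library; the function name `python_supported` is hypothetical, and the bounds simply restate the 3.9 to 3.12 range given above.

```python
import sys

# Supported range as stated in this README (inclusive on both ends).
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 12)

def python_supported(version_info=None):
    """Return True if the (major, minor) version lies within the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX

if python_supported():
    print("Python version is within the range cntext states it supports.")
else:
    print("Python %d.%d is outside 3.9 to 3.12; installation may fail." % tuple(sys.version_info[:2]))
```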

Modules

import cntext as ct
ct.hello()

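cntext contains six modules: io, model, stats, plot, mind, and llm.

io: import data
model: train models and expand dictionaries
stats: word frequency, sentiment analysis, similarity, and related statistics
plot: visualization
mind: attitudes, cognition, and cultural change
llm: large language models

Functions shown in bold are the commonly used ones.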

| Module | Function | Description |
| --- | --- | --- |
| io | ct.get_cntext_path() | Show the cntext installation path |
| io | ct.get_dict_list() | List the dictionaries built into cntext |
| io | ct.get_files(fformat) | List all files matching the path pattern fformat |
| io | ct.detect_encoding(file, num_lines=100) | Detect the encoding of a txt or csv file |
| io | ct.read_yaml_dict(yfile) | Read a built-in yaml dictionary |
| io | ct.read_pdf(file) | Read a PDF file |
| io | ct.read_docx(file) | Read a docx file |
| io | ct.read_file(file, encodings) | Read a file |
| io | ct.read_files(fformat, encoding) | Read all files matching the path pattern fformat; returns a df |
| io | ct.extract_mda(text, kws_pattern) | Extract the MD&A text from an A-share annual report. A return value of '' means extraction failed. |
| io | ct.traditional2simple(text) | Convert Traditional Chinese to Simplified Chinese |
| io | ct.fix_text(text) | Repair abnormal or mis-encoded text, for example converting full-width characters to half-width |
| io | ct.fix_contractions(text) | Expand English contractions (including slang), e.g. you're -> you are |
| io | ct.clean_text(text, lang='chinese') | Clean Chinese or English text |
| model | ct.Word2Vec(corpus_file, encoding, lang='chinese', ...) | Train Word2Vec |
| model | ct.GloVe(corpus_file, encoding, lang='chinese', ...) | GloVe; uses stanfordnlp/GloVe under the hood |
| model | ct.evaluate_similarity(wv, file=None) | Evaluate a model with the word-similarity method; uses built-in data by default |
| model | ct.evaluate_analogy(wv, file=None) | Evaluate a model with the analogy method; uses built-in data by default |
| model | ct.glove2word2vec(glove_file, word2vec_file) | Convert a GloVe model .txt file into a Word2Vec model .txt file; rarely needed |
| model | ct.load_w2v(wv_path) | Load a Word2Vec/GloVe model file trained with cntext2.x |
| model | ct.expand_dictionary(wv, seeddict, topn=100) | Expand a dictionary; results are saved under [output/Word2Vec] |
| model | ct.SoPmi(corpus_file, seed_file, lang='chinese') | Expand a dictionary using the co-occurrence method |
| stats | ct.word_count(text, lang='chinese') | Word-frequency statistics |
| stats | ct.readability(text, lang='chinese', syllables=3) | Text readability |
| stats | ct.sentiment(text, diction, lang='chinese') | Sentiment analysis with an unweighted (equal-weight) dictionary |
| stats | ct.sentiment_by_valence(text, diction, lang='chinese') | Sentiment analysis with a weighted dictionary |
| stats | ct.word_in_context(text, keywords, window=3, lang='chinese') | Find the context (within window) in which keywords appear in text; returns a df |
| stats | ct.epu() | Compute Economic Policy Uncertainty (EPU) from news text; returns a df |
| stats | ct.fepu(text, ep_pattern='', u_pattern='') | Compute Firm-level Perceived Economic Policy Uncertainty (FEPU) from MD&A text |
| stats | **_ct
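The table describes ct.sentiment as sentiment analysis with an unweighted (equal-weight) dictionary. As a rough illustration of that general technique, and not of cntext's own code, the sketch below counts how many tokens fall into each dictionary category. It assumes the text is already tokenized; the function name `dictionary_sentiment`, the output keys, and the toy dictionary are all hypothetical.

```python
from collections import Counter

def dictionary_sentiment(tokens, diction):
    """Count tokens per dictionary category, giving every dictionary word equal weight.

    tokens  : list of already-segmented words
    diction : mapping from category name to a list of words in that category
    """
    counts = Counter(tokens)
    result = {}
    for category, words in diction.items():
        # set() guards against a word being listed twice in the same category.
        result[category + "_num"] = sum(counts[word] for word in set(words))
    result["word_num"] = len(tokens)
    return result

# Toy example: "I am very happy today but a little tired", pre-segmented by hand.
tokens = ["我", "今天", "很", "开心", "但是", "有点", "累"]
toy_diction = {
    "pos": ["开心", "高兴"],  # happy, glad
    "neg": ["累", "难过"],    # tired, sad
}

scores = dictionary_sentiment(tokens, toy_diction)
print(scores)
```

A weighted variant, in the spirit of the ct.sentiment_by_valence row, would multiply each match by a per-word score instead of counting it as 1.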
