TWTM

This code is a demo implementation for the paper "Tag-Weighted Topic Model for Mining Semi-Structured Documents".

The paper is available at http://dl.acm.org/citation.cfm?id=2540540

Authors: Shuangyin Li, Jiefei Li, Rong Pan (Sun Yat-sen University)

For any questions about the code, please contact us by email:

shuangyinli AT cse.ust.hk
lijiefei AT mail2.sysu.edu.cn
panr AT sysu.edu.cn

License

Copyright 2013 Shuangyin Li, Jiefei Li, Rong Pan

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Easy Way:

./example.sh

Install

cd src/ && make

Usage

### Input file format

DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...
DocNumLabels label1 label2 ... @ DocNumWords word1 word2 ...

Each row represents one document together with its labels. DocNumLabels is the number of labels in the document, and DocNumWords is the number of words in the document. Each label is an integer that identifies one label, and each word is an integer that identifies one word.

demo/twtm.demo.input is a simple demo input file. demo/label.txt is the label dictionary file; the word in row 1 corresponds to label 0. demo/words.dic is the word dictionary file.

### Training
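Before training, it can help to confirm that the input file follows the row format described above. The following is a minimal sketch of such a check, not part of the TWTM code: `parse_row` and `parse_file` are hypothetical helper names, and the sketch assumes that fields are separated by whitespace and that each leading count equals the number of integer IDs listed after it.

```python
from typing import List, Tuple


def parse_row(row: str) -> Tuple[List[int], List[int]]:
    """Parse one 'DocNumLabels labels... @ DocNumWords words...' row.

    Assumption: fields are whitespace-separated integers and '@' splits
    the label part from the word part.
    """
    label_part, sep, word_part = row.partition("@")
    if not sep:
        raise ValueError("row has no '@' separator")
    label_fields = [int(tok) for tok in label_part.split()]
    word_fields = [int(tok) for tok in word_part.split()]
    if not label_fields or not word_fields:
        raise ValueError("row is missing a count field")
    num_labels, labels = label_fields[0], label_fields[1:]
    num_words, words = word_fields[0], word_fields[1:]
    # Assumption: each leading count matches the number of IDs after it.
    if num_labels != len(labels):
        raise ValueError(f"expected {num_labels} labels, found {len(labels)}")
    if num_words != len(words):
        raise ValueError(f"expected {num_words} words, found {len(words)}")
    return labels, words


def parse_file(path: str) -> List[Tuple[List[int], List[int]]]:
    """Parse every non-empty row of an input file."""
    with open(path, encoding="utf-8") as handle:
        return [parse_row(line) for line in handle if line.strip()]
```

For example, `parse_row("2 0 3 @ 4 10 11 10 7")` returns the label IDs `[0, 3]` and the word IDs `[10, 11, 10, 7]`, while a row whose counts do not match raises `ValueError`.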

./twtm est <input data file> <setting.txt> <num_topics> <model save dir>

Example:

./src/twtm est demo/twtm.demo.input src/setting.txt 10 demo/model

Some model training parameters are set in the file "setting.txt".

### Inference

There are two ways to infer the topic distribution of a new document. The first uses the labels of the new document during inference.

./twtm inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/

The doc-topics-dis.txt file is written to the output directory. It gives the topic distribution of the documents in the input data file. Apply exp(.) to each value in the file to obtain the actual probability.
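The exp(.) step can be sketched in a few lines of Python. This is only an illustration under an assumption about the file layout that the README does not state: one row per document, with whitespace-separated values. The helper names `log_row_to_probs` and `load_topic_distributions` are hypothetical.

```python
import math
from typing import List


def log_row_to_probs(line: str) -> List[float]:
    """Apply exp(.) to every value on one row of the output file."""
    return [math.exp(float(tok)) for tok in line.split()]


def load_topic_distributions(path: str) -> List[List[float]]:
    """Read the file and convert each non-empty row to probabilities.

    Assumption: one document per row, values separated by whitespace.
    """
    with open(path, encoding="utf-8") as handle:
        return [log_row_to_probs(line) for line in handle if line.strip()]
```

With the demo paths used above, this would be called as `load_topic_distributions("demo/output/doc-topics-dis.txt")`, giving one list of per-topic probabilities for each document.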

The second method uses only the words of the new document. With the TWTM model, we can therefore infer the topics of new documents that have no labels, just as with an LDA model.

./twtm lda-inf <input data file> <setting.txt> <model dir> <prefix> <output dir>

Example:

./src/twtm lda-inf demo/twtm.demo.input src/setting.txt demo/model/ final demo/output/
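The three command forms above (est, inf, lda-inf) can also be driven from a script. The sketch below only assembles the argument lists in the order shown in this README and hands them to `subprocess.run`. The function names and the `TWTM_BIN` constant are hypothetical, and the binary path assumes the code was built with `cd src/ && make` and is run from the repository root.

```python
import subprocess
from typing import List

# Assumption: path of the compiled binary relative to the repository root.
TWTM_BIN = "./src/twtm"


def est_command(data: str, setting: str, num_topics: int, model_dir: str) -> List[str]:
    """Build: twtm est <input data file> <setting.txt> <num_topics> <model save dir>."""
    return [TWTM_BIN, "est", data, setting, str(num_topics), model_dir]


def inf_command(data: str, setting: str, model_dir: str, prefix: str,
                output_dir: str, use_labels: bool = True) -> List[str]:
    """Build the inference command; 'lda-inf' ignores the document labels."""
    mode = "inf" if use_labels else "lda-inf"
    return [TWTM_BIN, mode, data, setting, model_dir, prefix, output_dir]


def run_demo() -> None:
    """Train on the demo data, then infer with labels (not called on import)."""
    data, setting = "demo/twtm.demo.input", "src/setting.txt"
    subprocess.run(est_command(data, setting, 10, "demo/model"), check=True)
    subprocess.run(
        inf_command(data, setting, "demo/model/", "final", "demo/output/"),
        check=True,
    )
```

Calling `run_demo()` from the repository root runs the training and labeled-inference examples in sequence; passing `use_labels=False` to `inf_command` builds the `lda-inf` form instead.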
