PCPM

Presenting Corpus of Pretrained Models. Links to pretrained models in NLP and voice with training script.

With rapid progress in NLP, it is becoming easier to bootstrap a machine learning project involving text. Instead of starting from base code, one can now start from a base pretrained model and reach SOTA performance within a few iterations. This repository is built with the view that pretrained models minimize collective human effort and the cost of resources, thus accelerating development in the field.

Models listed are curated for either PyTorch or TensorFlow because of their wide usage.

Note: pytorch-transformers is an awesome library which can be used to quickly infer from or fine-tune many pretrained models in NLP. The pretrained models from that library are not included here.
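As a minimal sketch of the workflow the note above describes, the following loads a pretrained model with the `transformers` library (the successor to pytorch-transformers); the model name `bert-base-uncased` is one example checkpoint, and weights are downloaded on first use:

```python
# Minimal sketch: encode a sentence with a pretrained model via the
# transformers library (successor to pytorch-transformers).
# Assumes `pip install transformers torch`; weights download on first use.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pretrained models save effort.", return_tensors="pt")
outputs = model(**inputs)
# last_hidden_state has shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```

From here, fine-tuning typically means putting a task-specific head on top of these hidden states and training on labeled data.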

Contents

Text ML

Language Models

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
Transformer-xl | https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models | enwik8, lm1b, wt103, text8 | https://github.com/kimiyoung/transformer-xl
GPT-2 | https://github.com/openai/gpt-2/blob/master/download_model.py | webtext | https://github.com/nshepperd/gpt-2/
Adaptive Inputs (fairseq) | https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md#pre-trained-models | lm1b | https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md

Permutation Language Modelling Based - XLNet

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
XLNet | https://github.com/zihangdai/xlnet/#released-models | BooksCorpus + English Wikipedia + Giga5 + ClueWeb 2012-B + Common Crawl | https://github.com/zihangdai/xlnet/

Masked Language Modelling Based - BERT

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
RoBERTa | https://github.com/pytorch/fairseq/tree/master/examples/roberta#pre-trained-models | BooksCorpus + CC-News + OpenWebText + CommonCrawl-Stories | https://github.com/huggingface/transformers
BERT | https://github.com/google-research/bert/ | BooksCorpus + English Wikipedia | https://github.com/huggingface/transformers
MT-DNN | https://mrc.blob.core.windows.net/mt-dnn-model/mt_dnn_base.pt (https://github.com/namisan/mt-dnn/blob/master/download.sh) | glue | https://github.com/namisan/mt-dnn

Machine Translation

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
OpenNMT | http://opennmt.net/Models-py/ (pytorch), http://opennmt.net/Models-tf/ (tensorflow) | English-German | https://github.com/OpenNMT/OpenNMT-py
Fairseq (multiple models) | https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models | WMT14 English-French, WMT16 English-German | https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md

Sentiment

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
Nvidia sentiment-discovery | https://github.com/NVIDIA/sentiment-discovery#pretrained-models | SST, imdb, Semeval-2018-tweet-emotion | https://github.com/NVIDIA/sentiment-discovery
MT-DNN Sentiment | https://drive.google.com/open?id=1-ld8_WpdQVDjPeYhb3AK8XYLGlZEbs-l | SST | https://github.com/namisan/mt-dnn

Reading Comprehension

SQuAD 1.1

Rank | Name | Link | Training script
-----|------|------|:--------------:
49 | BiDAF | https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz | https://github.com/allenai/allennlp

Summarization

Model for English summarization

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
OpenNMT | http://opennmt.net/Models-py/ | Gigaword standard | https://github.com/OpenNMT/OpenNMT-py

Speech to Text

Name | Link | Trained On | Training script
-----|------|:----------:|---------------:
NeMo-quartznet | https://ngc.nvidia.com/catalog/models/nvidia:quartznet15x5 | librispeech, mozilla-common-voice | https://github.com/NVIDIA/NeMo
OpenSeq2Seq-Jasper | https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#models | librispeech | https://github.com/NVIDIA/OpenSeq2Seq
Espnet | https://github.com/espnet/espnet#asr-results | librispeech, Aishell, HKUST, TEDLIUM2 | https://github.com/espnet/espnet
wav2letter++ | https://talonvoice.com/research/ | librispeech | https://github.com/facebookresearch/wav2letter
Deepspeech2 pytorch | https://github.com/SeanNaren/deepspeech.pytorch/issues/299#issuecomment-394658265 | librispeech | https://github.com/SeanNaren/deepspeech.pytorch
Deepspeech | https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model | mozilla-common-voice, librispeech, fisher, switchboard | https://github.com/mozilla/DeepSpeech
speech-to-text-wavenet | https://github.com/buriburisuri/speech-to-text-wavenet#pre-trained-models | vctk | https://github.com/buriburisuri/speech-to-text-wavenet
at16k | https://github.com/at16k/at16k#download-models | NA | NA

Datasets

Datasets referenced in this document

Language Model data

Common crawl

http://commoncrawl.org/

enwik8

Wikipedia data dump (Large text compression benchmark) http://mattmahoney.net/dc/textdata.html

text8

Wikipedia cleaned text (Large text compression benchmark) http://mattmahoney.net/dc/textdata.html

lm1b

1 Billion Word Language Model Benchmark https://www.statmt.org/lm-benchmark/

wt103

Wikitext 103 https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

webtext

The original dataset was not released by the authors. An open-source recreation is available at https://skylion007.github.io/OpenWebTextCorpus/

English wikipedia

https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

BooksCorpus

https://yknzhu.wixsite.com/mbweb https://github.com/soskek/bookcorpus

Sentiment

SST

Stanford Sentiment Treebank https://nlp.stanford.edu/sentiment/index.html. One of the GLUE tasks.

IMDB

IMDB movie review dataset used for sentiment classification http://ai.stanford.edu/~amaas/data/sentiment

Semeval2018te

Semeval 2018 tweet emotion dataset https://competitions.codalab.org/competitions/17751

Glue

GLUE is a collection of resources for benchmarking natural language systems. https://gluebenchmark.com/ It contains datasets on natural language inference, sentiment classification, paraphrase detection, similarity matching and linguistic acceptability.

Speech to text data

fisher

https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf

librispeech

www.danielpovey.com/files/2015_icassp_librispeech.pdf

switchboard

https://ieeexplore.ieee.org/document/225858/

Mozilla common voice

https://github.com/mozilla/voice-web

vctk

https://datashare.is.ed.ac.uk/handle/10283/2651

Hall of Shame

High quality research which doesn't include pretrained models and/or code for public use.

  • KERMIT https://arxiv.org/abs/1906.01604 Generative Insertion-Based Modeling for Sequences. No code.

Non English

Other Collections

Allen NLP

Built on PyTorch, AllenNLP has produced SOTA models and open-sourced them. https://github.com/allenai/allennlp/blob/master/MODELS.md

They have a neat interactive demo of various tasks at https://demo.allennlp.org/

GluonNLP

Based on MXNet, this library has an extensive list of pretrained models for various NLP tasks. http://gluon-nlp.mxnet.io/master/index.html#model-zoo
