
BERT

***** New March 11th, 2020: Smaller BERT Models *****

This is a release of 24 smaller BERT models (English only, uncased, trained with WordPiece masking) referenced in Well-Read Students Learn Better: On the Importance of Pre-training Compact Models.

We have shown that the standard BERT recipe (including model architecture and training objective) is effective on a wide range of model sizes, beyond BERT-Base and BERT-Large. The smaller BERT models are intended for environments with restricted computational resources. They can be fine-tuned in the same manner as the original BERT models. However, they are most effective in the context of knowledge distillation, where the fine-tuning labels are produced by a larger and more accurate teacher.
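In rough outline, distillation here is two steps: the large teacher labels a transfer set, and the compact student is fine-tuned on those labels. The sketch below is schematic; `teacher` and `student`, with their `predict_logits` and `train` methods, are hypothetical stand-ins, not part of this repository:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the last axis.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill(teacher, student, transfer_texts, temperature=1.0):
    # Step 1: the larger, more accurate teacher labels the transfer set.
    soft_targets = softmax(teacher.predict_logits(transfer_texts), temperature)
    # Step 2: the compact student is fine-tuned against those labels.
    student.train(transfer_texts, soft_targets)
    return student
```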

Our goal is to enable research in institutions with fewer computational resources and encourage the community to seek directions of innovation alternative to increasing model capacity.

You can download all 24 from here, or individually from the table below:

| |H=128|H=256|H=512|H=768|
|---|:---:|:---:|:---:|:---:|
|L=2|2/128 (BERT-Tiny)|2/256|2/512|2/768|
|L=4|4/128|4/256 (BERT-Mini)|4/512 (BERT-Small)|4/768|
|L=6|6/128|6/256|6/512|6/768|
|L=8|8/128|8/256|8/512 (BERT-Medium)|8/768|
|L=10|10/128|10/256|10/512|10/768|
|L=12|12/128|12/256|12/512|12/768 (BERT-Base)|

Note that the BERT-Base model in this release is included for completeness only; it was re-trained under the same regime as the original model.

Here are the corresponding GLUE scores on the test set:

|Model|Score|CoLA|SST-2|MRPC|STS-B|QQP|MNLI-m|MNLI-mm|QNLI(v2)|RTE|WNLI|AX|
|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
|BERT-Tiny|64.2|0.0|83.2|81.1/71.1|74.3/73.6|62.2/83.4|70.2|70.3|81.5|57.2|62.3|21.0|
|BERT-Mini|65.8|0.0|85.9|81.1/71.8|75.4/73.3|66.4/86.2|74.8|74.3|84.1|57.9|62.3|26.1|
|BERT-Small|71.2|27.8|89.7|83.4/76.2|78.8/77.0|68.1/87.0|77.6|77.0|86.4|61.8|62.3|28.6|
|BERT-Medium|73.5|38.0|89.6|86.6/81.6|80.4/78.4|69.6/87.9|80.0|79.1|87.7|62.2|62.3|30.5|

For each task, we selected the best fine-tuning hyperparameters from the lists below and trained for 4 epochs (see the sketch after the lists):

  • batch sizes: 8, 16, 32, 64, 128
  • learning rates: 3e-4, 1e-4, 5e-5, 3e-5
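A minimal sketch of such a sweep, assuming this repository's run_classifier.py and its standard flags (the task, checkpoint paths, and output directories are placeholders, and required flags such as --data_dir, --vocab_file, and --max_seq_length are omitted for brevity):

```python
import itertools
import subprocess

# Grid from the lists above; each configuration trains for 4 epochs.
BATCH_SIZES = [8, 16, 32, 64, 128]
LEARNING_RATES = [3e-4, 1e-4, 5e-5, 3e-5]

for batch_size, lr in itertools.product(BATCH_SIZES, LEARNING_RATES):
    subprocess.run([
        "python", "run_classifier.py",
        "--task_name=MRPC",                     # placeholder GLUE task
        "--do_train=true",
        "--do_eval=true",
        "--bert_config_file=bert_config.json",  # config of the compact model
        "--init_checkpoint=bert_model.ckpt",    # pre-trained compact checkpoint
        "--num_train_epochs=4",
        f"--train_batch_size={batch_size}",
        f"--learning_rate={lr}",
        f"--output_dir=/tmp/sweep_{batch_size}_{lr}",
    ], check=True)

# The best (batch size, learning rate) pair is then picked per task by dev-set score.
```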

If you use these models, please cite the following paper:

@article{turc2019,
  title={Well-Read Students Learn Better: On the Importance of Pre-training Compact Models},
  author={Turc, Iulia and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina},
  journal={arXiv preprint arXiv:1908.08962v2},
  year={2019}
}

***** New May 31st, 2019: Whole Word Masking Models *****

This is a release of several new models which were the result of an improvement to the pre-processing code.

In the original pre-processing code, we randomly select WordPiece tokens to mask. For example:

Input Text: the man jumped up , put his basket on phil ##am ##mon ' s head

Original Masked Input: [MASK] man [MASK] up , put his [MASK] on phil [MASK] ##mon ' s head

The new technique is called Whole Word Masking. In this case, we always mask all of the tokens corresponding to a word at once. The overall masking rate remains the same.

Whole Word Masked Input: the man [MASK] up , put his basket on [MASK] [MASK] [MASK] ' s head

The training is identical -- we still predict each masked WordPiece token independently. The improvement comes from the fact that the original prediction task was too 'easy' for words that had been split into multiple WordPieces.

This can be enabled during data generation by passing the flag --do_whole_word_mask=True to create_pretraining_data.py.
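A simplified sketch of the grouping logic: WordPiece tokens beginning with ## are joined to the preceding token to form whole words, and a word is always masked in full. (The actual implementation in create_pretraining_data.py additionally handles special tokens and the 80/10/10 mask/random/keep split.)

```python
import random

def whole_word_mask(tokens, masked_lm_prob=0.15, rng=random):
    """Mask whole words in a WordPiece token sequence (simplified)."""
    # Group token indexes into words: a piece starting with "##"
    # continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    num_to_mask = max(1, int(round(len(tokens) * masked_lm_prob)))
    rng.shuffle(words)

    output = list(tokens)
    masked = 0
    for word in words:
        if masked + len(word) > num_to_mask:
            continue
        for i in word:  # mask every piece of the chosen word at once
            output[i] = "[MASK]"
        masked += len(word)
    return output

tokens = "the man jumped up , put his basket on phil ##am ##mon ' s head".split()
print(whole_word_mask(tokens))
```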

Pre-trained models with Whole Word Masking are linked below. The data and training were otherwise identical, and the models have identical structure and vocab to the original models. We only include BERT-Large models. When using these models, please make it clear in the paper that you are using the Whole Word Masking variant of BERT-Large.

|Model|SQuAD 1.1 F1/EM|Multi NLI Accuracy|
|---|:---:|:---:|
|BERT-Large, Uncased (Original)|91.0/84.3|86.05|
|BERT-Large, Uncased (Whole Word Masking)|92.8/86.7|87.07|
|BERT-Large, Cased (Original)|91.5/84.8|86.09|
|BERT-Large, Cased (Whole Word Masking)|92.9/86.7|86.46|

***** New February 7th, 2019: TfHub Module *****

BERT has been uploaded to TensorFlow Hub. See run_classifier_with_tfhub.py for an example of how to use the TF Hub module, or run an example in the browser on Colab.
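The gist of that script, assuming the TF1-era hub.Module API this repository targets (the module handle below is believed correct for uncased BERT-Base, but check TensorFlow Hub for the current one):

```python
import tensorflow as tf        # TF1-style API, as used by this repo
import tensorflow_hub as hub

BERT_MODULE_URL = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"
MAX_SEQ_LENGTH = 128

# Integer feature tensors produced by the usual BERT input pipeline.
input_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
input_mask = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])
segment_ids = tf.placeholder(tf.int32, [None, MAX_SEQ_LENGTH])

bert_module = hub.Module(BERT_MODULE_URL, trainable=True)
outputs = bert_module(
    inputs=dict(input_ids=input_ids,
                input_mask=input_mask,
                segment_ids=segment_ids),
    signature="tokens",
    as_dict=True)

pooled_output = outputs["pooled_output"]      # [batch, hidden], for classification
sequence_output = outputs["sequence_output"]  # [batch, seq_len, hidden], per token
```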

***** New November 23rd, 2018: Un-normalized multilingual model + Thai + Mongolian *****

We uploaded a new multilingual model which does not perform any normalization on the input (no lower casing, accent stripping, or Unicode normalization), and which additionally includes Thai and Mongolian.

It is recommended to use this version for developing multilingual models, especially on languages with non-Latin alphabets.

This does not require any code changes, and the model can be downloaded here.

***** New November 15th, 2018: SOTA SQuAD 2.0 System *****

We released code changes to reproduce our 83% F1 SQuAD 2.0 system, which is currently 1st place on the leaderboard by 3%. See the SQuAD 2.0 section of the README for details.

***** New November 5th, 2018: Third-party PyTorch and Chainer versions of BERT available *****

NLP researchers from HuggingFace made a PyTorch version of BERT available which is compatible with our pre-trained checkpoints and is able to reproduce our results. Sosuke Kobayashi also made a Chainer version of BERT available (thanks!). We were not involved in the creation or maintenance of the PyTorch implementation, so please direct any questions towards the authors of that repository.

***** New November 3rd, 2018: Multilingual and Chinese models available *****

We have made two new BERT models available:

  • BERT-Base, Multilingual (Not recommended, use Multilingual Cased instead): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

We use character-based tokenization for Chinese, and WordPiece tokenization for all other languages.
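Concretely, the basic tokenizer surrounds each CJK character with whitespace so that every character becomes its own token before WordPiece runs. A simplified sketch of that step (the repository's tokenization.py checks additional Unicode ranges):

```python
def _is_cjk(codepoint):
    # Simplified check: CJK Unified Ideographs only; tokenization.py
    # covers several more CJK blocks.
    return 0x4E00 <= codepoint <= 0x9FFF

def tokenize_chinese_chars(text):
    # Pad every CJK character with spaces so it splits out on its own;
    # all other text is left for WordPiece to segment.
    out = []
    for ch in text:
        if _is_cjk(ord(ch)):
            out.append(" " + ch + " ")
        else:
            out.append(ch)
    return "".join(out)

print(tokenize_chinese_chars("BERT模型").split())  # ['BERT', '模', '型']
```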
