XLNet: Generalized Autoregressive Pretraining for Language Understanding
Introduction
XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.
For a detailed description of technical details and experimental results, please refer to our paper:
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Zhilin Yang*, Zihang Dai*, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
(*: equal contribution)
Preprint 2019
Release Notes
- July 16, 2019: XLNet-Base.
- June 19, 2019: initial release with XLNet-Large and code.
Results
As of June 19, 2019, XLNet outperforms BERT on 20 tasks and achieves state-of-the-art results on 18 tasks. Below are some comparisons between XLNet-Large and BERT-Large, which have similar model sizes:
Results on Reading Comprehension
| Model | RACE accuracy | SQuAD1.1 EM | SQuAD2.0 EM |
| --- | --- | --- | --- |
| BERT-Large | 72.0 | 84.1 | 78.98 |
| XLNet-Base | | | 80.18 |
| XLNet-Large | 81.75 | 88.95 | 86.12 |
We use SQuAD dev results in the table to exclude other factors such as using additional training data or other data augmentation techniques. See SQuAD leaderboard for test numbers.
Results on Text Classification
| Model | IMDB | Yelp-2 | Yelp-5 | DBpedia | Amazon-2 | Amazon-5 |
| --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 4.51 | 1.89 | 29.32 | 0.64 | 2.63 | 34.17 |
| XLNet-Large | 3.79 | 1.55 | 27.80 | 0.62 | 2.40 | 32.26 |
The above numbers are error rates.
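Since the table reports error rates, accuracy is simply 100 minus the listed number; for example, using the IMDB column above:

```python
# Convert classification error rates (%) to accuracies (%).
# Values are the IMDB error rates from the table above.
bert_large_err = 4.51
xlnet_large_err = 3.79

bert_large_acc = 100.0 - bert_large_err
xlnet_large_acc = 100.0 - xlnet_large_err

print(f"BERT-Large IMDB accuracy:  {bert_large_acc:.2f}%")   # 95.49%
print(f"XLNet-Large IMDB accuracy: {xlnet_large_acc:.2f}%")  # 96.21%
```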
Results on GLUE
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT-Large | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet-Base | 86.8 | 91.7 | 91.4 | 74.0 | 94.7 | 88.2 | 60.2 | 89.5 |
| XLNet-Large | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
We use single-task dev results in the table to exclude other factors such as multi-task learning or using ensembles.
Pre-trained models
Released Models
As of <u>July 16, 2019</u>, the following models have been made available:
- XLNet-Large, Cased: 24-layer, 1024-hidden, 16-heads
- XLNet-Base, Cased: 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).
We only release cased models for now because on the tasks we consider, we found: (1) for the base setting, cased and uncased models have similar performance; (2) for the large setting, cased models are a bit better in some tasks.
Each .zip file contains three items:
- A TensorFlow checkpoint (`xlnet_model.ckpt`) containing the pre-trained weights (which is actually 3 files).
- A SentencePiece model (`spiece.model`) used for (de)tokenization.
- A config file (`xlnet_config.json`) which specifies the hyperparameters of the model.
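As a quick sanity check after unpacking, one might verify that a model directory contains all three items. This is a hypothetical helper (not part of the repo), demonstrated on a stand-in directory; note the checkpoint is split into several files, so we look for its `.index` file:

```python
import os
import tempfile

# The three items every released XLNet .zip should contain (assumption:
# the multi-file checkpoint is identified by its ".index" file).
EXPECTED = ["xlnet_model.ckpt.index", "spiece.model", "xlnet_config.json"]

def missing_items(model_dir):
    """Return the expected files that are absent from model_dir."""
    return [f for f in EXPECTED if not os.path.exists(os.path.join(model_dir, f))]

# Demonstrate on a stand-in directory populated with empty placeholder files.
with tempfile.TemporaryDirectory() as d:
    for name in EXPECTED:
        open(os.path.join(d, name), "w").close()
    print(missing_items(d))  # [] -> the directory looks complete
```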
Future Release Plan
We also plan to continuously release more pretrained models under different settings, including:
- A pretrained model that is finetuned on Wikipedia. This can be used for tasks with Wikipedia text such as SQuAD and HotpotQA.
- Pretrained models with other hyperparameter configurations, targeting specific downstream tasks.
- Pretrained models that benefit from new techniques.
Subscribing to XLNet on Google Groups
To receive notifications about updates, announcements and new releases, we recommend subscribing to XLNet on Google Groups.
Fine-tuning with XLNet
As of <u>June 19, 2019</u>, this code base has been tested with TensorFlow 1.13.1 under Python 2.
Memory Issue during Finetuning
- Most of the SOTA results in our paper were produced on TPUs, which generally have more RAM than common GPUs. As a result, it is currently very difficult (costly) to reproduce most of the `XLNet-Large` SOTA results in the paper using GPUs with 12GB - 16GB of RAM, because a 16GB GPU is only able to hold a <u>single sequence with length 512</u> for `XLNet-Large`. Therefore, a large number (ranging from 32 to 128, equal to `batch_size`) of GPUs are required to reproduce many results in the paper.
- We are experimenting with gradient accumulation to potentially relieve the memory burden, which could be included in a near-future update.
- Alternative methods of finetuning XLNet on constrained hardware have been presented in renatoviolin's repo, which obtained 86.24 F1 on SQuAD2.0 with an 8GB memory GPU.
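Gradient accumulation trades compute time for memory: gradients from several small micro-batches are summed before a single parameter update, emulating a larger batch. A framework-agnostic sketch in plain Python, with a hypothetical toy loss for illustration:

```python
# Gradient accumulation sketch: emulate a batch of size
# micro_batch * accum_steps while only holding one micro-batch
# worth of activations at a time.

def grad(w, x):
    """Gradient of the toy loss 0.5 * (w - x)**2 with respect to w."""
    return w - x

def train_step(w, micro_batches, lr=0.1):
    """One optimizer step using gradients accumulated over micro-batches."""
    accum = 0.0
    n = 0
    for batch in micro_batches:
        for x in batch:            # each micro-batch fits in memory
            accum += grad(w, x)
            n += 1
    return w - lr * accum / n      # single update with the averaged gradient

# Four micro-batches of size 2 produce the same update as one batch of size 8.
w = 0.0
w_accum = train_step(w, [[1, 2], [3, 4], [5, 6], [7, 8]])
w_full = train_step(w, [[1, 2, 3, 4, 5, 6, 7, 8]])
print(w_accum, w_full)  # 0.45 0.45 (identical updates)
```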
Given the memory issue mentioned above, using the default finetuning scripts (run_classifier.py and run_squad.py), we benchmarked the maximum batch size on a single 16GB GPU with TensorFlow 1.13.1:
| System | Seq Length | Max Batch Size |
| ------------- | ---------- | -------------- |
| XLNet-Base | 64 | 120 |
| ... | 128 | 56 |
| ... | 256 | 24 |
| ... | 512 | 8 |
| XLNet-Large | 64 | 16 |
| ... | 128 | 8 |
| ... | 256 | 2 |
| ... | 512 | 1 |
In most cases, it is possible to reduce the batch size `train_batch_size` or the maximum sequence length `max_seq_length` to fit in the given hardware. The decrease in performance depends on the task and the available resources.
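For quick planning, the benchmarked limits can be encoded as a lookup. This helper is hypothetical (not part of the repo), with the numbers copied from the table above:

```python
# Max batch sizes benchmarked on a single 16GB GPU (TensorFlow 1.13.1),
# copied from the table above.
MAX_BATCH = {
    ("XLNet-Base", 64): 120,
    ("XLNet-Base", 128): 56,
    ("XLNet-Base", 256): 24,
    ("XLNet-Base", 512): 8,
    ("XLNet-Large", 64): 16,
    ("XLNet-Large", 128): 8,
    ("XLNet-Large", 256): 2,
    ("XLNet-Large", 512): 1,
}

def max_batch_size(model, seq_len):
    """Benchmarked max batch size, or None for unbenchmarked settings."""
    return MAX_BATCH.get((model, seq_len))

print(max_batch_size("XLNet-Large", 128))  # 8
```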
Text Classification/Regression
The code used to perform classification/regression finetuning is in run_classifier.py. It also contains examples for standard one-document classification, one-document regression, and document pair classification. Here, we provide two concrete examples of how run_classifier.py can be used.
From here on, we assume XLNet-Large and XLNet-Base have been downloaded to $LARGE_DIR and $BASE_DIR respectively.
(1) STS-B: sentence pair relevance regression (with GPUs)
- Download the GLUE data by running this script and unpack it to some directory `$GLUE_DIR`.

- Perform multi-GPU (4 V100 GPUs) finetuning with XLNet-Large by running

```shell
CUDA_VISIBLE_DEVICES=0,1,2,3 python run_classifier.py \
  --do_train=True \
  --do_eval=False \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --init_checkpoint=${LARGE_DIR}/xlnet_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=4 \
  --learning_rate=5e-5 \
  --train_steps=1200 \
  --warmup_steps=120 \
  --save_steps=600 \
  --is_regression=True
```

- Evaluate the finetuning results with a single GPU by

```shell
CUDA_VISIBLE_DEVICES=0 python run_classifier.py \
  --do_train=False \
  --do_eval=True \
  --task_name=sts-b \
  --data_dir=${GLUE_DIR}/STS-B \
  --output_dir=proc_data/sts-b \
  --model_dir=exp/sts-b \
  --uncased=False \
  --spiece_model_file=${LARGE_DIR}/spiece.model \
  --model_config_path=${LARGE_DIR}/xlnet_config.json \
  --max_seq_length=128 \
  --eval_batch_size=8 \
  --num_hosts=1 \
  --num_core_per_host=1 \
  --eval_all_ckpt=True \
  --is_regression=True

# Expected performance: "eval_pearsonr 0.916+ "
```
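STS-B is scored with the Pearson correlation between predicted and gold similarity scores (the `eval_pearsonr` metric). For readers who want to score predictions offline, a minimal reference implementation:

```python
import math

def pearsonr(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly linearly related scores give r close to 1.0.
print(pearsonr([0.0, 2.5, 5.0], [1.0, 2.0, 3.0]))  # approximately 1.0
```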
Notes:
- In the context of GPU training, `num_core_per_host` denotes the number of GPUs to use.
- In the multi-GPU setting, `train_batch_size` refers to the <u>per-GPU batch size</u>.
- `eval_all_ckpt` allows one to evaluate all saved checkpoints (save frequency is controlled by `save_steps`) after training finishes and choose the best model based on dev performance.
- `data_dir` and `output_dir` refer to the directories of the "raw data" and "preprocessed tfrecords" respectively, while `model_dir` is the working directory for saving checkpoints and tensorflow events. `model_dir` should be set as a separate folder to `init_checkpoint`.
- To try out <u>XLNet-base</u>, one can simply set `--train_batch_size=32` and `--num_core_per_host=1`, along with corresponding changes in `init_checkpoint` and `model_config_path`.
- For GPUs with smaller RAM, please proportionally decrease the `train_batch_size` and increase `num_core_per_host` to use the same training setting.
- Important: we separate the training and evaluation into "two phases", as using multiple GPUs to perform evaluation is tricky (one has to correctly separate the data across GPUs). To ensure correctness, we only support single-GPU evaluation for now.
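Because `train_batch_size` is the per-GPU batch size, the effective (global) batch size is `train_batch_size * num_core_per_host * num_hosts`. A small sketch of the trade-off described in the notes above:

```python
def effective_batch_size(train_batch_size, num_core_per_host, num_hosts=1):
    """Global batch size when train_batch_size is the per-GPU batch size."""
    return train_batch_size * num_core_per_host * num_hosts

# The STS-B example above: per-GPU batch of 8 on 4 GPUs.
print(effective_batch_size(8, 4))  # 32

# A smaller-RAM setting keeps the same effective batch with more GPUs.
print(effective_batch_size(4, 8))  # 32
```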
(2) IMDB: movie review sentiment classification (with TPU V3-8)
- Download and unpack the IMDB dataset by running

```shell
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar zxvf aclImdb_v1.tar.gz
```

- Launch a Google cloud TPU V3-8 instance (see the Google Cloud TPU tutorial for how to set up Cloud TPUs).

- Set up your Google storage bucket path `$GS_ROOT` and move the IMDB dataset and pretrained checkpoint into your Google storage bucket.
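The archive unpacks into `aclImdb/train/{pos,neg}` and `aclImdb/test/{pos,neg}`, one review per text file. A hypothetical loader sketch (not part of the repo), demonstrated on a tiny stand-in tree:

```python
import os
import tempfile

def load_split(root, split):
    """Load (text, label) pairs from an aclImdb-style <root>/<split>/{pos,neg} tree."""
    examples = []
    for label in ("pos", "neg"):
        folder = os.path.join(root, split, label)
        for name in sorted(os.listdir(folder)):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                examples.append((f.read(), label))
    return examples

# Demonstrate on a tiny stand-in directory tree.
with tempfile.TemporaryDirectory() as root:
    for label, text in (("pos", "great movie"), ("neg", "terrible plot")):
        d = os.path.join(root, "train", label)
        os.makedirs(d)
        with open(os.path.join(d, "0_1.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    print(load_split(root, "train"))
    # [('great movie', 'pos'), ('terrible plot', 'neg')]
```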