Cappy

NeurIPS 2023 - Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer

This repo contains the code for the following paper:

Cappy: Outperforming and Boosting Large Multi-Task LMs with a Small Scorer
Bowen Tan, Yun Zhu, Lijuan Liu, Eric Xing, Zhiting Hu, Jindong Chen
NeurIPS 2023
[arXiv] [Model Card (btan2/cappy-large)]

Getting Started

Cappy is a pretrained small scorer designed to enhance the performance and efficiency of multi-task LLMs.
Cappy takes in an instruction and a candidate response as input, and produces a score between 0 and 1, indicating an estimated correctness of the response with respect to the instruction.
With merely 360 million parameters, Cappy functions either independently on classification tasks or serves as an auxiliary component for LLMs, boosting their performance.
Also, Cappy enables efficient integration of downstream supervision without requiring LLM finetuning or access to the LLM's parameters.
Furthermore, Cappy can be combined with other LLM adaptations, including finetuning, in-context learning, and prompt tuning, offering additional performance enhancement.

Now, Cappy can be loaded with transformers either as a Jax/Flax model or a PyTorch model.

Jax/Flax

from transformers import AutoTokenizer, FlaxAutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = FlaxAutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

# Raw string: the article text contains literal backslashes
instruction = r"""
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

# Flax models take NumPy (or JAX) arrays rather than PyTorch tensors
inputs = tokenizer([(instruction, response), ], return_tensors='np')
score = cappy(**inputs).logits[0][0].item()

PyTorch

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('btan2/cappy-large')
cappy = AutoModelForSequenceClassification.from_pretrained('btan2/cappy-large')

# Raw string: the article text contains literal backslashes
instruction = r"""
What label best describes this news article?
Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
"""
response = 'Business'

inputs = tokenizer([(instruction, response), ], return_tensors='pt')
score = cappy(**inputs).logits[0][0].item()
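
Because Cappy scores one (instruction, response) pair at a time, using it on a classification task amounts to scoring every candidate label and keeping the highest-scoring one. The sketch below illustrates only that selection step. `pick_best` and `toy_score_fn` are hypothetical names that are not part of this repo; in practice `score_fn` would wrap the `cappy(**inputs).logits` call shown above.

```python
from typing import Callable, Sequence, Tuple


def pick_best(
    score_fn: Callable[[str, str], float],
    instruction: str,
    candidates: Sequence[str],
) -> Tuple[str, float]:
    """Return the highest-scoring candidate response and its score."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    scored = [(score_fn(instruction, c), c) for c in candidates]
    # max() keeps the first candidate when scores tie
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def toy_score_fn(instruction: str, response: str) -> float:
    """Toy stand-in for Cappy (illustration only): keyword overlap in [0, 1)."""
    hits = instruction.lower().count(response.lower())
    return hits / (1 + hits)


label, score = pick_best(
    toy_score_fn,
    "What label best describes this business article about a business deal?",
    ["World", "Sports", "Business", "Science"],
)
print(label, score)
```

The same selection step applies when the candidates are sampled outputs from an LLM rather than a fixed label set.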

Below are the scripts to reproduce the experiments in the paper.

Requirements

Cappy's pretraining and finetuning are both based on Redco, a lightweight tool automating distributed training on both GPUs and TPUs.

To install Redco,

pip install redco==0.4.13

Sometimes the Jax version needs to be adjusted based on your device and environment. Here are some instructions.

To install other requirements,

pip install -r requirements.txt

Pretraining Cappy

Cappy's pretraining uses the code from this example in Redco. We will release Cappy's pretraining data soon.

Evaluating Cappy on PromptSource (zero-shot)

Download Test Data

Following the setting from the OPT-IML paper (Section 5.2), we conduct zero-shot evaluation on 11 held-out classification tasks from PromptSource.

bash scripts/download_promptsource_test_data.sh

Running Cappy

python cappy_promptsource.py --model_name_or_path btan2/cappy-large

Results

| | OPT 30B | OPT-IML 30B | OPT 175B | OPT-IML 175B | T0 11B | Cappy (ours, 0.36B) |
|------------:|:-------:|:-----------:|:--------:|:------------:|:------:|:-------------------:|
| ANLI R1 | 33.7 | 37.1 | 34.1 | 42.2 | 42.1 | 34.3 |
| ANLI R2 | 34.1 | 35.4 | 34.1 | 38.5 | 37.9 | 33.9 |
| ANLI R3 | 34.7 | 36.6 | 34.7 | 39.6 | 39.7 | 34.7 |
| CB | 24.6 | 43.2 | 38.9 | 56.4 | 58.5 | 59.4 |
| RTE | 56.4 | 67.8 | 54.0 | 73.4 | 80.2 | 71.9 |
| StoryCloze | 55.5 | 90.7 | 57.0 | 95.0 | 96.7 | 93.7 |
| WSC | 43.5 | 58.2 | 51.0 | 59.2 | 58.6 | 63.8 |
| WiC | 50.8 | 54.7 | 49.7 | 53.6 | 56.0 | 51.9 |
| Winogrande | 50.2 | 53.4 | 50.1 | 56.6 | 62.5 | 51.7 |
| WinoGender | 54.9 | 64.6 | 53.9 | 72.7 | 83.8 | 68.9 |
| Crows-Pairs | 85.5 | 22.3 | 85.5 | 34.4 | 24.0 | 57.8 |
| Average | 47.6 | 51.3 | 49.3 | 56.5 | 58.2 | 56.6 |

Baseline results come from the OPT-IML paper (Section 5.2).

Boosting FLAN-T5 with Cappy on Big-Bench Tasks

Getting Big-Bench Tasks

We use all 45 generative tasks from Big-Bench in our experiment. The command below processes the tasks into .jsonl format.

python scripts/get_bigbench_data.py

The processed datasets can be found in ./bigbench_data, where ./bigbench_data/subset_names.json records all the task names.
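
To iterate over the processed tasks, the task names can be read back from subset_names.json. This is a minimal sketch that assumes the file holds a flat JSON list of task-name strings, which is not confirmed here; `load_subset_names` is a hypothetical helper, not a function from this repo.

```python
import json
from pathlib import Path
from typing import List


def load_subset_names(data_dir: str = "./bigbench_data") -> List[str]:
    """Read subset_names.json, assumed to be a JSON list of task names."""
    path = Path(data_dir) / "subset_names.json"
    with open(path, encoding="utf-8") as f:
        names = json.load(f)
    if not isinstance(names, list):
        # The flat-list layout is an assumption; fail loudly if it differs
        raise ValueError(f"expected a JSON list in {path}")
    return [str(name) for name in names]
```

Each returned name is a value that can be passed as --bigbench_subset_name in the commands further down.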

Getting FLAN-T5 Outputs

We collect generated outputs (as well as log-likelihoods on evaluation sets) from FLAN-T5 models (from -small to -xxl). They can be downloaded with

bash scripts/download_bigbench_flan_gens.sh

If you want to generate the outputs yourself and/or adjust the generation settings, we provide the generation code below, which supports distributed inference across multiple GPUs (in case the model is too large to fit on a single GPU, e.g., FLAN-T5-XXL (11B)).

python scripts/bigbench_flan_generate.py \
  --model_name_or_path google/flan-t5-xl \
  --n_model_shards 4

where --n_model_shards is the number of shards to split the model into (usually the number of GPUs on your machine, if you have more than one).

Adapting Cappy to boost FLAN-T5

XLA_PYTHON_CLIENT_MEM_FRACTION=.95 python cappy_bigbench.py \
  --model_name_or_path btan2/cappy-large \
  --bigbench_subset_name auto_categorization \
  --bigbench_gen_model flan-t5-xxl \
  --train_size 102400

XLA_PYTHON_CLIENT_MEM_FRACTION=.95: adjusts the fraction of GPU memory pre-allocated to Jax (useful if you run out of GPU memory), see here for more details.
--bigbench_subset_name: the name of subset from Big-Bench (see ./bigbench_data/subset_names.json for all of them).
--bigbench_gen_model: the FLAN model to be boosted.
--train_size: the target data size to construct for Cappy's finetuning on the task (collect FLAN outputs, and then truncate or repeat).

See def main(...) in cappy_bigbench.py for all the arguments.
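
The --train_size description says the collected FLAN outputs are truncated or repeated to reach the target size. One way to read that is sketched below, as an assumption rather than the actual logic in cappy_bigbench.py: cycle through the examples in order and stop once `train_size` items have been taken. The helper name `resize_to_target` is hypothetical.

```python
from itertools import cycle, islice
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def resize_to_target(examples: Sequence[T], train_size: int) -> List[T]:
    """Truncate or cyclically repeat `examples` to exactly `train_size` items."""
    if train_size < 0:
        raise ValueError("train_size must be non-negative")
    if train_size > 0 and len(examples) == 0:
        raise ValueError("cannot repeat an empty dataset")
    # islice stops after train_size items; cycle restarts from the first example
    return list(islice(cycle(examples), train_size))


print(resize_to_target(["a", "b", "c"], 2))  # fewer than available: truncate
print(resize_to_target(["a", "b", "c"], 7))  # more than available: repeat
```

A shuffle before or after resizing may also be appropriate; whether the script does so is not stated here.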

Each sub-task takes 40 minutes to run on a single A10G GPU. The result is logged to ./bigbench_cappy_results/{flan_model}/{subset_name}.json.

To run all the Big-Bench subsets at once,

python scripts/run_cappy_bigbench.py --cuda_idx 0

Results

To present baseline results,

python scripts/present_bigbench_baselines.py

To present Cappy results on all 45 Big-Bench subtasks,

python scripts/present_cappy_bigbench_results.py --gen_model_name flan-t5-xxl

The numbers reported in the paper were produced on TPU machines. Here we provide our reproduction results on A10G GPUs in ./bigbench_cappy_results. The gap between them is slight (ΔrougeL <= 0.8).

| | flan-t5-small | flan-t5-base | flan-t5-large | flan-t5-xl | flan-t5-xxl |
|----------------------|---------------|--------------|---------------|-------------|-------------|
| Beam Search (beam=4) | 16.4025 | 19.8594 | 23.4802 | 26.1177 | 29.6608 |
| Sampling | 11.4317 | 15.7909 | 19.6248 | 23.2191 | 25.7273 |
| Temperature (t=0.9) | 12.0126 | 17.0571 | 20.0481 | 24.2702 | 27.0985 |
| Topk (k=40) | 11.5157 | 15.7481 | 19.7634 | 22.6692 | 25.8226 |
| Nucleus (p=0.95) | 11.9171 | 16.6174 | 20.1986 | 24.1654 | 26.9036 |
| Self-Score (sum) | 15.0806 | 20.711 | 24.1224 | 28.4665 | 32.0156 |
| Self-Score (mean) | 16.4223 | 20.1317 | 23.7828 | 26.7694 | 30.246 |
| Cappy (ours) | 23.6543 | 27.6178 | 30.3802 | 33.2775 | 37.1678 |
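
The two Self-Score rows can be read as ranking candidates by the generating model's own log-likelihood, aggregated over tokens either by sum or by mean. The sketch below shows how the two aggregations can pick different candidates. It assumes that reading of the baseline and that per-token log-probabilities are available for each candidate; the function name and the toy numbers are hypothetical and not taken from the repo.

```python
from typing import Dict, List


def self_score_select(token_logprobs: Dict[str, List[float]], how: str = "sum") -> str:
    """Pick the candidate whose token log-probs have the largest sum or mean."""
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")

    def aggregate(logprobs: List[float]) -> float:
        total = sum(logprobs)
        return total if how == "sum" else total / len(logprobs)

    return max(token_logprobs, key=lambda cand: aggregate(token_logprobs[cand]))


# Toy per-token log-probs (made up for illustration)
candidates = {
    "yes": [-1.2],                     # sum -1.2, mean -1.2
    "yes, it is": [-0.5, -0.4, -0.6],  # sum -1.5, mean -0.5
}
by_sum = self_score_select(candidates, how="sum")
by_mean = self_score_select(candidates, how="mean")
print(by_sum, "|", by_mean)
```

Summing penalizes every extra token, so it tends to favor shorter outputs, while the mean normalizes for length.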

Acknowledgement

Cappy is Mario's ally throughout Super Mario Odyssey and assists him in various ways. We thank Nintendo for the nice game!
