LEval
[ACL'24 Outstanding] Data and code for L-Eval, a comprehensive long context language models evaluation benchmark
L-Eval: Instituting Standardized Evaluation for Long Context Language Models
Data Collection: L-Eval (preview on 🤗 HuggingFace Datasets • check our 📃 paper) is a comprehensive Long Context Language Models (LCLMs) evaluation suite with 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input lengths (3k~200k tokens). L-Eval has two groups: closed-ended tasks and open-ended tasks. The closed-ended group primarily tests reasoning and understanding over a longer context, while the open-ended group consists mostly of summarization tasks that require aggregating information from long documents (download the data).
Long Context LLMs Evaluation: Closed-ended tasks typically do not present issues with evaluation fairness. However, in real-world long-context scenarios, open-ended tasks tend to be more common. We have found that n-gram metrics such as ROUGE and F1 cannot accurately reflect the abilities of LCLMs. As such, L-Eval does not rely solely on metrics used in previous text generation benchmarks. Instead, L-Eval primarily utilizes Length-Instruction-Enhanced (LIE) evaluation and LLM judges (battling with Turbo-16k or Llama2). Please refer to the open-ended tasks evaluation section.
We hope L-Eval could help researchers and developers track the progress of long-context language models (LCLMs) and understand the strengths/shortcomings of different methods. We will also keep up with the latest releases of instruction-following LCLMs.
Other features of this repo:
- 🧭️ Handle CUDA OOM with memory-efficient inference
- 🖇️ Build a retrieval-based baseline with Langchain
- ✏️ Flask web client for editing local jsonl files
- 🔖 View the Leaderboard
- 📨 How to submit your results
- Previous long sequence datasets used in L-Eval
Long context abilities of LLMs on closed/open-ended tasks:
<div align="center"> <img src="figs/overall.png" border="0" width=850px/> </div>

🔥 Updates of L-Eval
- [2024-4-25] We add the results for Llama3 8b/70b.
| Model | TOEFL | QuALITY | Coursera | SFiction | GSM | CodeU |
|--------|------|------|-------|-------|-------|-------|
| Llama3-8b-Instruct | 82.89 | 64.85 | 53.77 | 69.53 | 79.00 | 2.22 |
| Llama3-70b-Instruct | 84.75 | 80.19 | 75.87 | 72.65 | 90.00 | 6.67 |
| GPT4-32k (2023) | 84.38 | 82.17 | 75.58 | 74.99 | 96.00 | 25.55 |
- [2023-10-7] Final version of our paper can be found here.
- [2023-8-30] We have annotated two new closed-ended tasks: (i) A scientific fiction dataset to test the loyalty to input and (ii) a code understanding dataset. 📢 L-Eval has been supported by OpenCompass. You can test L-Eval together with other benchmarks for foundation models here.
Folders
The repository is structured as follows:
├── Baselines/ # scripts to generate the prediction files with baseline models
├── Baselines-light/ # scripts to generate the prediction files with 24G gpus
├── Evaluation/ # evaluation scripts
├── LEval-data/ # test samples
│ ├── Closed-ended-tasks/ # exact match tasks (like multiple-choice)
│ │ ├── test_file.jsonl
│ │ └── ...
│ ├── Open-ended-tasks/ # generation tasks
│ │ ├── test_file.jsonl
│ │ └── ...
├── Predictions/ # output of models
│ ├── exam_eval/turbo-16k-0613
│ │ ├── <task_name>.pred.jsonl
│ │ └── ...
│ ├── llm_gpt4_eval
│ │ ├── <model_name>.pred.jsonl
│ │ └── ...
│ ├── ngram_eval
│ │ ├── <model_name>
│ │ └── <task_name>.pred.jsonl
│ ├── ...
└── Tools/ # useful scripts
<a name="use"></a>
Quick use
Step 1. Download the data
It is easy to load the 20 test sets in one line with huggingface datasets, and we give an example script:

```python
from datasets import load_dataset, disable_caching

datasets = ["coursera", "gsm100", "quality", "topic_retrieval_longchat", "tpo", "codeU", "sci_fi", "financial_qa", "gov_report_summ", "legal_contract_qa", "meeting_summ", "multidoc_qa", "narrative_qa", "natural_question", "news_summ", "paper_assistant", "patent_summ", "review_summ", "scientific_qa", "tv_show_summ"]
# The corresponding NAMEs in the paper:
# "Coursera", "GSM(16-shot)", "QuALITY", "TopicRet", "TOEFL", "CodeU", "SFiction", "LongFQA", "GovReport", "CUAD", "QMSum", "MultiDoc2Dial", "NarrativeQA", "NQ", "Multi-news", "Openreview", "BigPatent", "SPACE", "Qasper", "SummScreen"
for testset in datasets:
    # disable_caching()  # uncomment this if you cannot download codeU and sci_fi
    data = load_dataset('L4NLP/LEval', testset, split='test')
    # evaluate your model
```
You can also directly clone this repo:
git clone https://github.com/OpenLMLab/LEval.git
The test data is in LEval-data.
Each long document has multiple queries and corresponding responses. The format of each sample is as follows:
{
"instructions": ["What is the main goal of data science?\nA. Analyze and predict future trends\nB. Generate massive amounts of data\nC. Answer questions using data\nD. Increase the use of technology", "..."], // a list of instructions (questions need LLMs to answer)
"outputs": ["C","A", "..."], // the ground truth or reference of corresponding instructions
"input": "A very long document", // LLMs need to respond to instructions based on this long document.
"source": "domain the document belongs to", // meeting, narrative_qa, etc.
"evaluation": "Metrics used for evaluation" // e.g., exam, human, LLM, ROUGE, F1, etc.
}
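Given the format above, iterating over one sample is straightforward. The sketch below builds the sample in memory from the documented fields (the prompt template is our own illustration, not the one used by the baseline scripts):

```python
import json

# One line of a LEval-data jsonl file, following the documented format
line = json.dumps({
    "instructions": ["What is the main goal of data science?", "Second question"],
    "outputs": ["C", "A"],
    "input": "A very long document",
    "source": "coursera",
    "evaluation": "exam",
})

sample = json.loads(line)
doc = sample["input"]  # every query in the sample shares this long document
# Illustrative prompt template; the official Baselines scripts define their own
prompts = [f"{doc}\n\nQuestion: {q}\nAnswer:" for q in sample["instructions"]]
# zip(prompts, sample["outputs"]) pairs each model input with its reference
```

Note that a single long document carries several instruction/output pairs, so the document only needs to be encoded once per sample.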
Step 2. Generate your prediction results (Closed-ended tasks)
Examples of closed-ended tasks
- Multiple choice question (single correct option). Example predicted answer: A
- Multiple-answer question (multiple correct options). Example predicted answer: BCD
- Math word problem. Example predicted answer: 3
We test all the baselines with a single 80G A800 GPU. If you encounter an OOM problem, please refer to multiple GPUs inference. To generate the output files, add a new script to the Baselines folder and replace the model name with your own model. An example of testing gpt-3.5-turbo-16k on closed-ended tasks:
python Baselines/turbo16k-test.py --metric exam_eval (for closed-ended group) --task_name quality [Optional, if you only want to test one task]
The script will save the prediction results to a local file. You need to press enter to confirm the path. Details about open-ended tasks can be found in the next section.
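The saved prediction file is itself jsonl, one record per query. A minimal sketch of writing one by hand (the field names `query`, `gt`, and `pred` are assumptions for illustration; check an existing file under Predictions/ for the exact schema):

```python
import json

# Hypothetical predictions for a closed-ended task; field names are assumptions
records = [
    {"query": "Which option is correct?", "gt": "C", "pred": "C"},
    {"query": "Which options are correct?", "gt": "BD", "pred": "BC"},
]
with open("quality.pred.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")  # one JSON object per line
```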
Step 3. Evaluate the prediction file
Given the prediction file generated in Step 2, please run the following command to calculate the metric:
python Evaluation/auto_eval.py --pred_file Predictions/exam_eval/turbo-16k-0613/quality.pred.jsonl
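For closed-ended tasks the reported score is exam-style accuracy. The helper below is a simplified stand-in for what `auto_eval.py` computes with exam_eval, not the repo's exact logic (the real script also parses option letters out of free-form model output):

```python
def exam_accuracy(predictions, references):
    """Exact-match accuracy over option letters, e.g. 'C' or 'BCD'.

    Simplified sketch of the exam_eval metric; assumes answers are
    already normalized to bare option letters.
    """
    correct = sum(
        p.strip().upper() == r.strip().upper()
        for p, r in zip(predictions, references)
    )
    return 100.0 * correct / len(references)

print(exam_accuracy(["C", "A", "bd"], ["C", "B", "BD"]))  # ≈ 66.7
```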
Evaluating LCLMs on open-ended tasks
In this part, we mainly introduce how to evaluate LCLMs on open-ended tasks.
<a name="eval"></a>
Examples of open-ended tasks
- Summarization. Example predicted answer: This paper proposes a new method for ...
- Abstractive question answering. Example predicted answer: The main goal of data science is to answer questions using data.
Generate prediction results on open-ended tasks:
CMD: python Baselines/turbo16k-test.py --metric ngram_eval (for open-ended group) --task_name narrative_qa [Optional, if you only want to test one task]
Generate prediction results on the 96-question subset (GPT-4 evaluation subset):
CMD: python Baselines/turbo16k-test.py --metric llm_gpt4_eval
Generate prediction results on the 85-question subset (human evaluation subset):
CMD: python Baselines/turbo16k-test.py --metric human_eval
Generate prediction results on the 2 subsets (181 questions) :
CMD: python Baselines/turbo16k-test.py --metric llm_turbo_eval
Automatic Metrics
We use the following automatic metrics to evaluate the performance of generation tasks:
- GPT-4/3.5 Evaluation. We suggest using GPT-4 as the judge, battling with turbo-16k-0613. We report the win-rate in our paper. Turbo-16k serves as a strong baseline, and you could also opt for Llama2-4k to directly demonstrate the extent of your improvements.
python Evaluation/llm_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl --judge_model gpt-4 (or gpt-3.5-turbo) --battle_with Predictions/ngram_eval/turbo-16k-0613 (or llama2-13b-chat)/narrative_qa.pred.jsonl
Please add the following judgment prompt in long-context settings:
Additional details or information that are not mentioned in the reference answer cannot be considered as advantages and do not let them sway your judgment.
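A sketch of how such a pairwise battle prompt can be assembled, with the long-context caveat appended; the wording of the framing text is illustrative, and the actual template lives in Evaluation/llm_eval.py:

```python
def build_judge_prompt(question, reference, answer_a, answer_b, long_context=True):
    """Illustrative pairwise judge prompt; not the repo's exact template."""
    prompt = (
        "Please act as an impartial judge and decide which assistant's answer "
        "to the question is better, using the reference answer as ground truth.\n"
        f"[Question]\n{question}\n"
        f"[Reference answer]\n{reference}\n"
        f"[Assistant A]\n{answer_a}\n"
        f"[Assistant B]\n{answer_b}\n"
    )
    if long_context:
        # The extra instruction recommended above for long-context settings
        prompt += (
            "Additional details or information that are not mentioned in the "
            "reference answer cannot be considered as advantages and do not "
            "let them sway your judgment.\n"
        )
    return prompt
```

The caveat matters because long-context models often surface correct but unreferenced details from the document, which would otherwise inflate the judge's preference.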
- N-gram Match Evaluation (biased). Traditional automatic metrics such as F1 and ROUGE are very cheap and efficient to calculate, but they are biased toward the length of the predicted answer.
python Evaluation/auto_eval.py --pred_file Predictions/ngram_eval/vicuna-13b-16k/narrative_qa.pred.jsonl
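To see where the length bias comes from, here is a token-level F1 in the standard SQuAD style (a sketch of the kind of n-gram metric used here, not the repo's exact implementation): a short prediction that covers the reference scores perfect precision, while recall is capped by the reference length, so verbosity and truncation shift the score independently of answer quality.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 (SQuAD-style); sketch of an n-gram match metric."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# Perfect precision (4/4) but partial recall (4/6) -> F1 = 0.8
print(token_f1("data science answers questions",
               "data science answers questions using data"))
```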
❗ Length-Instruction-Enhanced Evaluation
For open-ended tasks, models are informed of the ground-truth length via a length instruction, e.g., "We need a 20 words summary", where 20 is the length of the reference answer, to reduce the length bias in automatic metrics. The figure below shows the improvement in Kendall-Tau correlation with human judgments after adding the length instruction.
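A minimal sketch of adding such a length instruction to a query, assuming the target length is simply the word count of the reference answer; the exact instruction wording used by the official scripts may differ:

```python
def add_length_instruction(question, reference_answer):
    """Length-Instruction-Enhanced (LIE) evaluation sketch.

    Tells the model the ground-truth length so that n-gram metrics are
    less biased by verbosity. Wording is illustrative.
    """
    n_words = len(reference_answer.split())
    return f"{question} We need a {n_words} words summary."

print(add_length_instruction(
    "Summarize the government report.",
    "The report reviews budget allocations across twenty agencies."
))
```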