LOFT: A 1 Million+ Token Long-Context Benchmark
This repository houses the resources for LOFT, the Long Context Frontiers benchmark, introduced in the research paper Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?. LOFT consists of 6 long-context task categories spanning retrieval, multi-hop compositional reasoning, and more, totaling 35 datasets and 4 modalities.
Installation
$ git clone git@github.com:google-deepmind/loft.git
$ cd loft/
$ pip install -r requirements.txt
Download Datasets and Prompts
The script below downloads all the LOFT datasets under BASE_DIR.
$ BASE_DIR=your-choice-of-directory
$ sh download.sh $BASE_DIR
Each dataset is also available from the links in the Datasets table.
For a small subset, download.sh will additionally run preprocess.py, which
infills the missing fields in the queries and corpus files.
Once the download completes, you will see the following file structure:
$BASE_DIR
└── data
    ├── retrieval
    │   ├── arguana
    │   │   ├── 128k
    │   │   │   ├── corpus.jsonl
    │   │   │   ├── dev_queries.jsonl
    │   │   │   ├── few_shot_queries.jsonl
    │   │   │   └── test_queries.jsonl
    │   │   ├── 1m
    │   │   └── 32k
    │   ├── fever
    │   │   ├── ...
    │   ├── ...
    ├── rag
    ├── sql
    ├── icl
    └── mm
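The corpus and queries files are JSON Lines: one JSON object per line. A minimal sketch of reading such a file is below; the field names (`qid`, `query_text`) are illustrative assumptions, not necessarily LOFT's exact schema.

```python
import json
import os
import tempfile

def read_jsonl(path):
    """Read one JSON object per line, the format used by the corpus/queries files."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

# Illustrative sample data; real LOFT files may use different field names.
sample = [
    {"qid": "q1", "query_text": "What is the capital of France?"},
    {"qid": "q2", "query_text": "Who wrote Hamlet?"},
]

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "dev_queries.jsonl")
    with open(path, "w") as f:
        for obj in sample:
            f.write(json.dumps(obj) + "\n")
    queries = read_jsonl(path)

print(len(queries))  # 2
```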
We also provide an example prompt in PROMPT_EXAMPLE.txt showing how
Corpus-in-Context (CiC) prompting can be done for the text retrieval task.
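The idea behind CiC prompting is to place the entire corpus in the prompt, each entry tagged with an ID, and then ask the model to answer with the relevant ID(s). The sketch below illustrates the structure only; PROMPT_EXAMPLE.txt contains the actual template, and the wording and ID format here are assumptions.

```python
# Minimal Corpus-in-Context (CiC) sketch: number every corpus entry inline,
# then append the query. Illustrative only; see PROMPT_EXAMPLE.txt for the
# real LOFT prompt format.
corpus = {
    "doc1": "Paris is the capital of France.",
    "doc2": "Berlin is the capital of Germany.",
}
query = "What is the capital of France?"

lines = ["You will be given a corpus; answer with the ID of the relevant document.", ""]
for doc_id, text in corpus.items():
    lines.append(f"ID: {doc_id} | TEXT: {text}")
lines += ["", f"Query: {query}", "Answer:"]
prompt = "\n".join(lines)
print(prompt)
```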
Inference and Evaluation
We currently support using Gemini (e.g., gemini-1.5-flash-002) from VertexAI
for inference.
Set PROJECT_ID to the ID of your Google Cloud project.
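A typical setup looks like the snippet below. The project ID `my-gcp-project` is a placeholder; substitute your own, and run the commented `gcloud` line once to set up application-default credentials.

```shell
# One-time credential setup for Vertex AI (uncomment to run interactively):
# gcloud auth application-default login

# Placeholder project ID; replace with your own Google Cloud project.
export PROJECT_ID="my-gcp-project"
echo "Using project: ${PROJECT_ID}"
```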
To run the inference with gemini-1.5-flash-002 and evaluate predictions:
BASE_DIR=$1
DATASET=$2
LENGTH="128k"
TASK_TYPE="retrieval"
SPLIT="dev"
PROMPT_TYPE="few_shot_with_cot"
PROMPT="${TASK_TYPE}_${DATASET}_${LENGTH}_${SPLIT}:${PROMPT_TYPE}"
echo "Prompt: ${PROMPT}"
mkdir -p ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}
answer_file_extension="jsonl"
python run_inference.py \
--prompt_name ${PROMPT} \
--task_type ${TASK_TYPE} \
--base_dir ${BASE_DIR} \
--data_dir ${TASK_TYPE}/${DATASET}/${LENGTH} \
--split ${SPLIT} \
--context_length ${LENGTH} \
--output_path ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}/${SPLIT}_predictions.jsonl \
--project_id ${PROJECT_ID} \
--overwrite
python run_evaluation.py \
--answer_file_path ${BASE_DIR}/data/${TASK_TYPE}/${DATASET}/${LENGTH}/dev_queries.${answer_file_extension} \
--pred_file_path ${BASE_DIR}/outputs/${TASK_TYPE}/${DATASET}/${LENGTH}/${SPLIT}_predictions.jsonl \
--task_type ${TASK_TYPE}
The same script is available as infer_eval.sh.
We provide example queries and predictions files in evaluation/example_predictions/.
Each task_type outputs several metric scores.
To find the task_type to use for each dataset, and the primary evaluation metric reported in the paper, see the Datasets table.
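As a rough guide to the retrieval metrics named in the Datasets table: recall@1 checks whether the top prediction is a gold document, and mRecall@k (used for the multi-hop and multi-target datasets) checks whether all gold documents appear in the top-k predictions. run_evaluation.py is the authoritative implementation; the sketch below is illustrative only.

```python
# Illustrative metric sketches; run_evaluation.py is the source of truth.

def recall_at_1(preds, golds):
    """1.0 if the top-ranked prediction is a gold document, else 0.0."""
    return 1.0 if preds and preds[0] in golds else 0.0

def mrecall_at_k(preds, golds, k):
    """1.0 if every gold document appears in the top-k predictions, else 0.0."""
    topk = set(preds[:k])
    return 1.0 if all(g in topk for g in golds) else 0.0

print(recall_at_1(["d3", "d1"], {"d3"}))                  # 1.0
print(mrecall_at_k(["d1", "d2", "d4"], {"d1", "d2"}, 2))  # 1.0
```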
Get Prompts for 3P Evaluation
You can use the following command to dump prompts for specific datasets. For instance, the prompts for LOFT-hard are obtained as follows:
TASK="retrieval"
DATASET="qampari"
LENGTH="128k"
SPLIT="test"
PROMPT_NAME="${TASK}_${DATASET}_${LENGTH}_${SPLIT}:few_shot_with_cot"
python3 dump_prompts.py \
--prompt_name="${PROMPT_NAME}" \
--base_dir="${HOME}" \
--output_dir="${HOME}/prompts/${PROMPT_NAME}" \
--output_format=text
Datasets
| Task | Dataset | Description | Task Type | Primary Metric | Infilling Needed? | Download |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Text Retrieval | ArguAna | Argument Retrieval | retrieval | recall@1 | - | Link |
| Text Retrieval | FEVER | Fact Checking | retrieval | recall@1 | - | Link |
| Text Retrieval | FIQA | Question Answering | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | MS MARCO | Web Search | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | NQ | Question Answering | retrieval | recall@1 | - | Link |
| Text Retrieval | Quora | Duplication Detection | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | SciFact | Citation Prediction | retrieval | recall@1 | - | Link |
| Text Retrieval | Touché-2020 | Argument Retrieval | retrieval | recall@1 | ✅ | Link |
| Text Retrieval | TopiOCQA | Multi-turn QA | retrieval | recall@1 | - | Link |
| Text Retrieval | HotPotQA | Multi-hop QA | retrieval | mrecall@2 | - | Link |
| Text Retrieval | MuSiQue | Multi-hop QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QAMPARI | Multi-target QA | retrieval | mrecall@5 | - | Link |
| Text Retrieval | QUEST | Multi-target QA | retrieval | mrecall@3 | - | Link |
| Visual Retrieval | Flickr30k | Image Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | MS COCO | Image Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | OVEN | Image-text Retrieval | retrieval | recall@1 | - | Link |
| Visual Retrieval | MSR-VTT | Video Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-en | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-es | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-fr | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-hi | Audio Retrieval | retrieval | recall@1 | - | Link |
| Audio Retrieval | FLEURS-zh | Audio Retrieval | retrieval | recall@1 | - | Link |
| RAG | NQ | Question Answering | rag | subspan_em | - | Link |
| RAG | TopiOCQA | Multi-turn QA | rag | subspan_em | - | Link |
| RAG | HotPotQA | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | MuSiQue | Multi-hop QA | rag | subspan_em | - | Link |
| RAG | QAMPARI | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| RAG | QUEST | Multi-target QA | multi_value_rag | subspan_em | - | Link |
| SQL | Spider | Single-turn SQL | sql | exec_acc | - | Link |
| SQL | SParC | Multi-turn SQL | sql | exec_acc | - | Link |
| Many-Shot ICL | BBH-date | Multiple-choice QA | icl | em | - | Link |
| Many-Shot ICL | BBH-salient | Multiple-choice QA | icl | em | - | Link |