# MoreDocsSameLen
This repository contains code and datasets for our paper on the effect of document multiplicity under a fixed context size in Retrieval-Augmented Generation (RAG) systems. For detailed methodology, experiments, and analysis, please refer to the full paper 📰
## :bulb: High-Level Conclusions
Our results show that adding more retrieved documents can hurt performance by up to 10% in fixed-context setups, making document-rich retrieval tasks harder. Llama-3.1 and Gemma-2 declined, Qwen-2 stayed steady, and smaller LLMs (7–9B) followed the trend less strongly. This suggests that RAG systems need to balance relevance and variety to reduce conflicts, and that future models might improve by filtering out contradictory details while still exploiting the range of retrieved documents.
## 🔬 Our Methodology
<div style="max-width: 400px; margin: 0 auto;"> Starting with a Wikipedia-derived dataset, we created different sets with the same total token count but fewer documents by adjusting the length of the key documents for each question. Our sets use the same multi-hop questions and the same supporting documents containing the key information <b>(pink)</b>, while varying the distractor documents <b>(blue)</b>. We began with 20 documents, then omitted redundant ones while lengthening the remaining ones to match the original size. </div>
<br>
<div align="center"> <img src="/Main_Fig_Horizontal.png" alt="Dataset construction: same token budget, fewer documents" width="800"> </div>

## :desktop_computer: Reproduction Instructions
### Download the different benchmark datasets
Our custom benchmark datasets include a control set, the original dataset, and variants with replaced distractors for varying document multiplicity. You can download them from here, or from Hugging Face.
Alternatively, regenerate them using scripts/create_various_sets.py.
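The set construction described in the methodology (fewer documents, same total token budget) can be sketched as follows. This is a simplified illustration, not the repository's actual script: whitespace tokens stand in for model tokens, and generic filler text stands in for the Wikipedia text used to lengthen the kept documents.

```python
def count_tokens(text: str) -> int:
    # Simplifying assumption: whitespace tokens stand in for model tokens.
    return len(text.split())

def reduce_multiplicity(docs: list[str], n_keep: int, filler: str) -> list[str]:
    """Keep the first n_keep documents and pad them with filler words,
    round-robin, until the set matches the original total token count.

    Sketch only: the paper instead extends the kept documents with
    real Wikipedia text rather than generic filler.
    """
    budget = sum(count_tokens(d) for d in docs)  # original token budget
    kept = list(docs[:n_keep])
    pad = filler.split()
    i = 0
    while sum(count_tokens(d) for d in kept) < budget:
        kept[i % len(kept)] += " " + pad[i % len(pad)]
        i += 1
    return kept
```

The invariant is that every variant of the dataset occupies the same number of tokens, so any performance change is attributable to document multiplicity alone.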
### Prepare the environment
To set up the running environment, run the following commands:

```bash
gh repo clone shaharl6000/MoreDocsSameLen
cd MoreDocsSameLen
export PYTHONPATH=./
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### Run predictions
To run inference on a chosen benchmark dataset, define a config file for each benchmark dataset under the configuration folder: files/configuration/predict.json.
The predict.json file contains the path to the benchmark generated in the previous step, the batch size, and the decoding temperature for the LLMs.
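For reference, a predict.json along those lines might look like the following; the exact field names and the dataset path here are illustrative assumptions, so check them against the actual file under files/configuration/:

```json
{
  "dataset_path": "data/benchmarks/musique_variant.json",
  "batch_size": 8,
  "temperature": 0.0
}
```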
We supply two options for running the code: with small models the code runs locally, and with large models it runs via the Together platform.
To run prediction with the small models, run the following command:

```bash
python scripts/run_model_predictions.py --config <PATH_TO_CONFIG> --model_name <MODEL_NAME>
```
For the large models, add `together_api_key.py` under the repository root, define `API_KEY = "XXXXX"` in it (your Together API key), and then run the following command:

```bash
python scripts/run_model_predictions.py --config <PATH_TO_CONFIG> --model_name <MODEL_NAME> --run_together
```
### Evaluate the predictions
To evaluate the predictions, use scripts/evaluate_dataset.py, providing the path to the predictions from the previous step and an output path where all results will be saved:

```bash
python scripts/evaluate_dataset.py --predictions_dir <OUTPUT_PATH_FROM_PREV_STEP> --output_path <RESULT_OUTPUT> --ds_name MusiQue
```
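Multi-hop QA answers are commonly scored with exact match and token-level F1. The sketch below shows token-level F1 under a minimal lowercase-and-whitespace normalization; this is an assumption for illustration, and the repository's evaluate_dataset.py may normalize answers differently (e.g., stripping articles and punctuation).

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer.

    Sketch only: assumes lowercasing and whitespace splitting as the
    sole normalization, which may differ from the repository's script.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset overlap between prediction and gold tokens.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```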
## :newspaper: Citation
If you use this code or the datasets in your research, please cite:
```bibtex
@misc{levy2025documentslengthisolatingchallenge,
  title={More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG},
  author={Shahar Levy and Nir Mazor and Lihi Shalmon and Michael Hassid and Gabriel Stanovsky},
  year={2025},
  eprint={2503.04388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.04388},
}
```