# MoreDocsSameLen
This repository contains code and datasets for our paper on the effect of document multiplicity under a fixed context size in Retrieval-Augmented Generation (RAG) systems. For detailed methodology, experiments, and analysis, please refer to the full paper 📰
## :bulb: High-Level Conclusions
Our results show that adding more retrieved documents can hurt performance by up to 10% in fixed-context setups, making document-rich retrieval tasks harder. Llama-3.1 and Gemma-2 declined, Qwen-2 stayed steady, and smaller LLMs (7–9B) followed the trend less strongly. This suggests that RAG systems need to balance relevance and variety to reduce conflicts, and that future models might improve by filtering out contradictory details while still exploiting the range of retrieved documents.
## 🔬 Our Methodology
<div style="max-width: 400px; margin: 0 auto;"> Starting with a Wikipedia-derived dataset, we created different sets with the same total token count but fewer documents by adjusting the length of the key documents for each question. Our sets use the same multi-hop questions and the same supporting documents containing the key information <b>(pink)</b>, while varying the distractor documents <b>(blue)</b>. We began with 20 documents, then omitted redundant ones while lengthening the remaining ones to match the original size. </div>
<br>
<div align="center"> <img src="/Main_Fig_Horizontal.png" alt="Dataset construction: same token budget, fewer documents" width="800"> </div>

## :desktop_computer: Reproduction Instructions
### Download the different benchmark datasets
Our custom benchmark datasets include a control set, the original dataset, and variants with replaced distractors for varying document multiplicity. You can download them from here, or from Hugging Face.
Alternatively, regenerate them using scripts/create_various_sets.py.
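The set construction described in the methodology (fewer documents, same total token budget) can be sketched as follows. This is a simplified illustration, not the repository's actual script: whitespace tokens stand in for model tokens, and generic filler text stands in for the Wikipedia text used to lengthen the kept documents.

```python
def count_tokens(text: str) -> int:
    # Simplifying assumption: whitespace tokens stand in for model tokens.
    return len(text.split())

def reduce_multiplicity(docs: list[str], n_keep: int, filler: str) -> list[str]:
    """Keep the first n_keep documents and pad them with filler words,
    round-robin, until the set matches the original total token count.

    Sketch only: the paper instead extends the kept documents with
    real Wikipedia text rather than generic filler.
    """
    budget = sum(count_tokens(d) for d in docs)  # original token budget
    kept = list(docs[:n_keep])
    pad = filler.split()
    i = 0
    while sum(count_tokens(d) for d in kept) < budget:
        kept[i % len(kept)] += " " + pad[i % len(pad)]
        i += 1
    return kept
```

The invariant is that every variant of the dataset occupies the same number of tokens, so any performance change is attributable to document multiplicity alone.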
### Prepare the environment
To set up the running environment, run the following commands:

```bash
gh repo clone shaharl6000/MoreDocsSameLen
cd MoreDocsSameLen
export PYTHONPATH=./
python3.11 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
### Run predictions
To run inference on a chosen benchmark dataset, define a config file for each benchmark dataset under the configuration folder: files/configuration/predict.json.
The predict.json file contains the path to the benchmark generated in the previous step, the batch size, and the decoding temperature for the LLMs.
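For reference, a predict.json along those lines might look like the following; the exact field names and the dataset path here are illustrative assumptions, so check them against the actual file under files/configuration/:

```json
{
  "dataset_path": "data/benchmarks/musique_variant.json",
  "batch_size": 8,
  "temperature": 0.0
}
```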
We supply two options for running the code: with small models the code runs locally, and with large models it runs via the Together platform.
To run prediction with the small models, run the following command:

```bash
python scripts/run_model_predictions.py --config <PATH_TO_CONFIG> --model_name <MODEL_NAME>
```
For the large models, add `together_api_key.py` under the repository root, define `API_KEY = "XXXXX"` in it (your Together API key), and then run the following command:

```bash
python scripts/run_model_predictions.py --config <PATH_TO_CONFIG> --model_name <MODEL_NAME> --run_together
```
### Evaluate the predictions
To evaluate the predictions, use scripts/evaluate_dataset.py, providing the path to the predictions from the previous step and an output path where all results will be saved:

```bash
python scripts/evaluate_dataset.py --predictions_dir <OUTPUT_PATH_FROM_PREV_STEP> --output_path <RESULT_OUTPUT> --ds_name MusiQue
```
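Multi-hop QA answers are commonly scored with exact match and token-level F1. The sketch below shows token-level F1 under a minimal lowercase-and-whitespace normalization; this is an assumption for illustration, and the repository's evaluate_dataset.py may normalize answers differently (e.g., stripping articles and punctuation).

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted and a gold answer.

    Sketch only: assumes lowercasing and whitespace splitting as the
    sole normalization, which may differ from the repository's script.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset overlap between prediction and gold tokens.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```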
## :newspaper: Citation
If you use this code or the datasets in your research, please cite:
```bibtex
@misc{levy2025documentslengthisolatingchallenge,
  title={More Documents, Same Length: Isolating the Challenge of Multiple Documents in RAG},
  author={Shahar Levy and Nir Mazor and Lihi Shalmon and Michael Hassid and Gabriel Stanovsky},
  year={2025},
  eprint={2503.04388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.04388},
}
```