MIRAI

Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"

MIRAI: Evaluating LLM Agents for Event Forecasting

This repository contains the code and data for the paper "MIRAI: Evaluating LLM Agents for Event Forecasting".

Authors (*Equal Contribution): Chenchen Ye*, Ziniu Hu*, Yihe Deng*, Zijie Huang, Mingyu Derek Ma, Yanqiao Zhu, Wei Wang.

Homepage Data Demo Notebook Demo Video

Citation: If you find this repo useful for your research, please consider citing the paper:

@misc{ye2024miraievaluatingllmagents,
      title={MIRAI: Evaluating LLM Agents for Event Forecasting}, 
      author={Chenchen Ye and Ziniu Hu and Yihe Deng and Zijie Huang and Mingyu Derek Ma and Yanqiao Zhu and Wei Wang},
      year={2024},
      eprint={2407.01231},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.01231}, 
}

https://github.com/yecchen/MIRAI/assets/109169855/f2c44af4-b2e1-4b5f-bc2a-e0ac6a74e58a

🔔 News

  • [07/01/2024] Our paper is released on arXiv: https://arxiv.org/abs/2407.01231.

About MIRAI

We introduce MIRAI, a benchmark crafted for evaluating LLM agents for temporal forecasting in the realm of international events, with tool-use and complex reasoning. We consider forecasting as the process of collecting essential historical data and performing temporal reasoning to anticipate the outcomes of future events.

Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents’ abilities from short-term to long-term forecasting.

We further implement APIs that let LLM agents use different tools via a code-based interface. In summary, MIRAI evaluates the agents' capabilities in three dimensions: 1) autonomously sourcing and integrating critical information from large global databases; 2) writing code that uses domain-specific APIs and libraries for tool use; and 3) jointly reasoning over historical knowledge from diverse formats and times to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relations analysis.

<p align="center"> <img src="images/MIRAI_task.png" width="80%"> <br> Task Figure: An example of forecasting the relations between Australia and China on Nov. 18, 2023. The database contains query-related historical relations and news articles, while the agent fails to predict the change of relation and makes a wrong forecast. </p>

<p align="center"> <img src="images/MIRAI_data_stats.png" width="80%"> <br> Database Figure: MIRAI comprehensively covers global event data. (a) The circular chart shows the relation hierarchy and distribution in MIRAI. (b) The heatmap visualizes the intensity of these events globally, distinguishing between areas of conflict (red) and mediation (blue). (c) The heatmap illustrates the frequency of these events, highlighting regions with the most occurrences. </p>

<p align="center"> <img src="images/MIRAI_agent.png" width="80%"> <br> Agent Figure: Overview of the LLM agent’s interaction with the multi-source environment in MIRAI using the ReAct strategy for forecasting a query event. The framework consists of three main steps: (1) Think: The agent analyzes the current status and plans the next action based on the query and the provided API specifications. (2) Act: The agent generates a Single Function call or a Code Block to retrieve and analyze relevant data from the database. (3) Execute: The Python interpreter runs the generated code with the API implementation and database and produces observations. These steps are iteratively performed until the agent reaches a final forecast for the future relation. </p>
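
The Think / Act / Execute loop described in the Agent Figure can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `react_forecast`, `execute`, `scripted_policy`, the `FINAL:` marker, and the toy `count_events` function are all hypothetical names, and a scripted policy stands in for the LLM.

```python
import contextlib
import io
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, str, str]  # (thought, action, observation)


def execute(code: str, namespace: dict) -> str:
    """Execute step: run generated Python and return what it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:  # errors are fed back to the agent as observations
        return f"Error: {exc!r}"
    return buffer.getvalue().strip()


def react_forecast(
    query: str,
    policy: Callable[[str, List[Step]], Tuple[str, str]],
    namespace: dict,
    max_steps: int = 5,
) -> Optional[str]:
    """Iterate Think -> Act -> Execute until the policy emits a final answer."""
    history: List[Step] = []
    for _ in range(max_steps):
        thought, action = policy(query, history)  # Think and Act
        if action.startswith("FINAL:"):  # hypothetical stop marker
            return action[len("FINAL:"):].strip()
        observation = execute(action, namespace)  # Execute
        history.append((thought, action, observation))
    return None  # no forecast within the step budget


def scripted_policy(query: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the LLM: one lookup, then a placeholder answer."""
    if not history:
        return "Look up past events first.", "print(count_events('AUS', 'CHN'))"
    return f"Saw {history[-1][2]} past events.", "FINAL: placeholder_relation"


if __name__ == "__main__":
    toy_api = {"count_events": lambda subject, obj: 3}  # toy tool, not a MIRAI API
    print(react_forecast("AUS -> CHN on 2023-11-18?", scripted_policy, toy_api))
```

Returning exceptions as observations, rather than raising them, lets the agent see its own coding errors and retry within the step budget.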

Demos and Examples

We provide example reasoning and forecasting outputs of the agent. The examples were produced by the GPT-4o-based agent with the ReAct strategy, using either Code Block or Single Function as the action type. The raw outputs are stored in examples/outputs_raw, and the Markdown-formatted outputs are stored in examples/outputs_md.

We also provide an interactive Google Colab demo notebook for running the example forecasting with the GPT-4o-based agent.

Setup

Environment

The following steps provide the necessary environment setup.

  1. Create a Python virtual environment with Conda:
     conda create -n mirai python=3.9
     conda activate mirai
  2. Install the Python dependencies required to run the code:
     pip install -r requirements.txt
     pip install flash-attn --no-build-isolation
  3. Set up the necessary environment variables:
     export OPENAI_API_KEY="your_openai_api_key"
     huggingface-cli login --token "${your_access_token}"
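
Before launching an agent, it can help to confirm that the environment variable from step 3 is actually visible to Python. The following is a small sketch, not part of the repository; `missing_env_vars` is a hypothetical helper name. It checks only `OPENAI_API_KEY`, since `huggingface-cli login` stores its token on disk rather than in an environment variable.

```python
import os
from typing import List, Mapping, Optional, Sequence


def missing_env_vars(
    required: Sequence[str] = ("OPENAI_API_KEY",),
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    print("Environment looks ready.")
```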

Data

Download the data from the MIRAI Data link and extract the contents into the data directory.

Dataset Construction

To construct the above data from scratch, we also provide detailed scripts for dataset construction. The scripts are contained in the dataset_construction directory, and include the following files and commands:

cd dataset_construction
  1. 1_download_kg_data.py: Download the GDELT raw data from the official website.
     python 1_download_kg_data.py
  2. 2_clean_kg.py: Clean the raw data and standardize the event data.
     python 2_clean_kg.py
  3. 3_filter_kg_by_source.py: Filter the event data by the source news articles, in particular by the number of daily mentions.
     python 3_filter_kg_by_source.py
  4. 4_distribute_download_text.py: Download source news articles for each event.
     python 4_distribute_download_text.py --hosts "host1,host2" --username "your_username" --password "your_password" \
         --project_dir "/remote/project/directory" --conda_path "/remote/conda/path" --env_name "remote_conda_environment" \
         --script_path "/remote/script/path.py" --output_directory "/path/to/output" --log_directory "/path/to/logs"
  5. 5_clean_text.py: Clean the downloaded news articles. We follow part of the web document cleaning process from OBELICS. This process uses the SentencePiece tokenizer model and the FastText language identification model, lid.176.bin, which must be downloaded and placed in the obelics/models directory.
     python 5_clean_text.py
  6. 6_generate_final_data.py: Generate the final dataset for MIRAI, including data_kg.csv and data_news.csv.
     python 6_generate_final_data.py
  7. 7_generate_test_set.py: Generate the test set for MIRAI, which is built on the November 2023 data from the final dataset.
     python 7_generate_test_set.py
  8. 8_generate_test_subset.py: Generate the test subset for MIRAI, which samples a balanced subset from the test set.
     python 8_generate_test_subset.py
  9. 9_generate_relation_query.py: Generate the relation queries for MIRAI, which are used for agent forecasting and evaluation.
     python 9_generate_relation_query.py --dataset test
     python 9_generate_relation_query.py --dataset test_subset
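
The nine steps above run strictly in order, so they can be chained from a small driver. The sketch below is an illustration, not a script shipped with the repository: `build_commands` and `run_pipeline` are hypothetical names, step 4 uses the script name given in its list entry, and the host-specific arguments for step 4 must be supplied by the caller. It defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from typing import List, Sequence


def build_commands(step4_args: Sequence[str], python: str = sys.executable) -> List[List[str]]:
    """Return the construction commands in the order listed above."""
    steps = [
        ["1_download_kg_data.py"],
        ["2_clean_kg.py"],
        ["3_filter_kg_by_source.py"],
        ["4_distribute_download_text.py", *step4_args],  # needs hosts, paths, etc.
        ["5_clean_text.py"],
        ["6_generate_final_data.py"],
        ["7_generate_test_set.py"],
        ["8_generate_test_subset.py"],
        ["9_generate_relation_query.py", "--dataset", "test"],
        ["9_generate_relation_query.py", "--dataset", "test_subset"],
    ]
    return [[python, *step] for step in steps]


def run_pipeline(
    commands: Sequence[Sequence[str]],
    cwd: str = "dataset_construction",
    dry_run: bool = True,
) -> None:
    """Print each command; when dry_run is False, run it and stop on failure."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(list(command), cwd=cwd, check=True)


if __name__ == "__main__":
    example_step4 = ["--hosts", "host1,host2", "--username", "your_username"]
    run_pipeline(build_commands(example_step4))  # dry run: prints only
```

Using `check=True` makes the driver stop at the first failing step, which matters here because each script consumes the output of the one before it.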

Getting Started

cd agents

Setting Arguments

The following arguments are used for running the code scripts react_agent.py and direct_agent.py:

  • --dataset: Selects the dataset to be used. Available options are:
    • test: Full test dataset.
    • test_subset: A balanced subset of the test dataset. Default is test_subset.
  • --timediff: Specifies the date difference between the current date and the query date, i.e., the temporal distance of the forecasting target. This is an integer value. Default is 1.
  • --model_name: Chooses the model for execution. Opt
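
As a rough illustration of how these arguments fit together, the sketch below declares them with `argparse`. It is not the parser used by react_agent.py or direct_agent.py: `build_parser` and `current_date` are hypothetical names, `--model_name` is left as a free-form string because its options are not enumerated here, and `current_date` assumes that `--timediff` is counted in days back from the query date.

```python
import argparse
from datetime import date, timedelta


def build_parser() -> argparse.ArgumentParser:
    """Declare the three arguments described above with their stated defaults."""
    parser = argparse.ArgumentParser(description="Illustrative MIRAI-style agent arguments")
    parser.add_argument("--dataset", choices=["test", "test_subset"], default="test_subset",
                        help="full test set or the balanced subset")
    parser.add_argument("--timediff", type=int, default=1,
                        help="date difference between the current date and the query date")
    parser.add_argument("--model_name", type=str, default=None,
                        help="model used for execution (valid options not listed here)")
    return parser


def current_date(query_date: date, timediff: int) -> date:
    """Assumed reading: the agent stands `timediff` days before the query date."""
    return query_date - timedelta(days=timediff)


if __name__ == "__main__":
    args = build_parser().parse_args(["--dataset", "test", "--timediff", "7"])
    print(args.dataset, args.timediff, current_date(date(2023, 11, 18), args.timediff))
```

Under that assumption, a larger `--timediff` moves the agent's vantage point further from the event it is asked to forecast, which corresponds to a longer forecasting horizon.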
