WorldCuisines: Multilingual Multicultural VQA Benchmark
Best Theme Paper, NAACL 2025.
Introducing WorldCuisines, a massive-scale multilingual and multicultural VQA benchmark that challenges Vision-Language Models (VLMs) to understand cultural food diversity in over 30 languages and dialects, across 9 language families, with over 1 million data points generated from 2.4k dishes and 6k images. The benchmark provides three sets:
- Training Data (1M). We are preparing a comprehensive dataset for training purposes. This set was not used to enhance the evaluated models; we are organizing it to support future research.
- Test Small (12k). Intended for compute-efficient evaluation.
- Test Large (60k). The 12k test set is a subset of this 60k set.

Table of Contents
- Benchmark
- Paper
- Leaderboard and Results
- Environment Setup
- Run Experiments
- Aggregate Experiment Result
- Visualize Results
- Supported Models
- VQA Dataset Generation
- How to Contribute?
- On Progress
Benchmark
WorldCuisines comprises a balanced proportion of its two supported tasks. We provide over 1M training instances and 60k evaluation instances. Our benchmark evaluates VLMs on two tasks: dish name prediction and dish location prediction, each under three prompt settings: no-context, contextualized, and adversarial.
Our dataset is available at Hugging Face Dataset. The supporting KB data can be found at Hugging Face Dataset.

Paper
This repository contains the source code for the paper [Arxiv], written in Python. If you use any code or datasets from this toolkit in your research, please cite the associated paper.
@article{winata2024worldcuisines,
title={WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines},
author={Winata, Genta Indra and Hudi, Frederikus and Irawan, Patrick Amadeus and Anugraha, David and Putri, Rifki Afina and Wang, Yutong and Nohejl, Adam and Prathama, Ubaidillah Ariq and Ousidhoum, Nedjma and Amriani, Afifa and others},
journal={arXiv preprint arXiv:2410.12705},
year={2024}
}
Leaderboard and Results
For the final results of all evaluated VLMs, please refer to the leaderboard for a summary. The raw results are placed in the evaluation/score/json directory.
Environment Setup
Run the following command to install the libraries required to reproduce the benchmark results.
Via pip
pip install -r requirements.txt
Via conda
conda env create -f env.yml
For Pangea, please run the following:
pip install -e "git+https://github.com/gentaiscool/LLaVA-NeXT@79ef45a6d8b89b92d7a8525f077c3a3a9894a87d#egg=llava[train]"
Run Experiments
All experiment results will be stored in the evaluation/result/ directory. All tasks are evaluated using accuracy; for the open-ended questions (OEQ) specifically, accuracy is computed against multiple references. You can execute each experiment using the following commands:
cd evaluation/
python run.py --model_path {model_path} --task {task} --type {type}
Main Arguments
| Argument | Description | Example / Default |
|------------------|---------------------------------------------------|---------------------------------------|
| --task | Task number to evaluate (1 or 2) | 1 (default), 2 |
| --type | Type of question to evaluate (oe or mc) | mc (default), oe |
| --model_path | Path to the model | Qwen/Qwen2-VL-72B-Instruct (default) + others |
| --fp32 | Use float32 instead of float16/bfloat16 | False (default) |
| --multi_gpu | Use multiple GPUs | False (default) |
| -n, --chunk_num | Number of chunks to split the data into | 1 (default) |
| -k, --chunk_id | Chunk ID (0-based) | 0 (default) |
| -s, --st_idx | Start index for slicing data (inclusive) | None (default) |
| -e, --ed_idx | End index for slicing data (exclusive) | None (default) |
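The chunking and slicing flags let separate processes cover disjoint portions of the data. A minimal sketch of the intended partitioning, assuming even ceil-division chunks (the actual flag handling inside run.py may differ; `chunk_data` is an illustrative helper, not part of the codebase):

```python
def chunk_data(data, chunk_num=1, chunk_id=0, st_idx=None, ed_idx=None):
    """Return the slice of `data` this worker should evaluate.

    Mirrors the documented semantics: optional [st_idx, ed_idx) slicing
    first, then an even split into `chunk_num` chunks, of which the
    0-based `chunk_id`-th chunk is returned.
    """
    data = data[st_idx:ed_idx]              # None bounds keep the full range
    per_chunk = -(-len(data) // chunk_num)  # ceiling division
    return data[chunk_id * per_chunk:(chunk_id + 1) * per_chunk]

# Example: 10 items split across 4 workers (chunk_id = 0..3)
items = list(range(10))
parts = [chunk_data(items, chunk_num=4, chunk_id=k) for k in range(4)]
```

Together the chunks cover every item exactly once, so each worker can run the same command with a different `--chunk_id`.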
Supported Models
We support the following models (you can modify our code to run evaluation with other models).
- rhymes-ai/Aria
- meta-llama/Llama-3.2-11B-Vision-Instruct
- meta-llama/Llama-3.2-90B-Vision-Instruct
- llava-hf/llava-v1.6-vicuna-7b-hf
- llava-hf/llava-v1.6-vicuna-13b-hf
- allenai/MolmoE-1B-0924
- allenai/Molmo-7B-D-0924
- allenai/Molmo-7B-O-0924
- microsoft/Phi-3.5-vision-instruct
- Qwen/Qwen2-VL-2B-Instruct
- Qwen/Qwen2-VL-7B-Instruct
- Qwen/Qwen2-VL-72B-Instruct
- mistralai/Pixtral-12B-2409
- neulab/Pangea-7B (please install LLaVA as mentioned in Environment Setup)
- WIP: Proprietary Models
Aggregate Experiment Result
Edit evaluation/score/score.yml to determine scoring mode, evaluation set, and evaluated VLMs. Note that mc means multiple-choice and oe means open-ended.
mode: all # {all, mc, oe} all = mc + oe
oe_mode: multi # {single, dual, multi}
subset: large # {large, small}
models:
- llava-1.6-7b
- llava-1.6-13b
- qwen-vl-2b
- qwen2-vl-7b-instruct
- qwen2-vl-72b
- llama-3.2-11b
- llama-3.2-90b
- molmoe-1b
- molmo-7b-d
- molmo-7b-o
- aria-25B-moe-4B
- Phi-3.5-vision-instruct
- pixtral-12b
- nvlm
- pangea-7b
- gpt-4o-2024-08-06
- gpt-4o-mini-2024-07-18
- gemini-1.5-flash
In addition to the multi mode for generating the oe score, which compares the answer to the golden labels across all languages, we also support other golden-label referencing settings:
- single reference: compares the answer only to the golden label in the original language.
- dual reference: compares the answer to the golden labels in the original language and English.
Once set, run this command:
cd evaluation/score/
python score.py
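Conceptually, the single, dual, and multi settings differ only in which golden labels an answer is checked against. A simplified sketch of that matching, assuming exact case-insensitive comparison (the actual normalization and matching in score.py may be stricter; the function and field names here are illustrative):

```python
def reference_correct(answer, gold_by_lang, mode="multi", orig_lang="th"):
    """Check an open-ended answer against golden labels.

    mode="single": only the original-language label;
    mode="dual":   original language plus English;
    mode="multi":  labels in all languages.
    `gold_by_lang` maps a language code to its golden label.
    """
    if mode == "single":
        refs = [gold_by_lang[orig_lang]]
    elif mode == "dual":
        refs = [gold_by_lang[orig_lang], gold_by_lang["en"]]
    else:  # "multi"
        refs = list(gold_by_lang.values())
    norm = answer.strip().lower()
    return any(norm == ref.strip().lower() for ref in refs)

gold = {"en": "Pad Thai", "th": "ผัดไทย"}
reference_correct("pad thai", gold, mode="multi")  # True: matches the English label
```

Under this sketch, an English answer to a Thai-language question counts as correct in dual or multi mode but not in single mode, which is the behavioral difference the three settings are meant to capture.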
Visualize Results
We provide radar, scatter, and connected scatter-line plots to visualize scoring results for all VLMs in evaluation/score/plot/.
To generate all radar plots, use:
python evaluation/score/plot/visualization.py
Examples of Radar Plots

You can also modify evaluation/score/score.yml to select which VLMs to visualize and adjust plot labels in plot_mapper.yml.
Examples of Other Plots
<img src="assets/model_params.png" width="60%"> <img src="assets/model_scatter.png" width="60%"> <img src="assets/scatterplot.png" width="60%">
Other plot generation scripts are available in the *.ipynb files within the same directory.
Supported Models
Our codebase supports multiple models for the experiments, providing the flexibility to customize the list shown below:
Generative VLMs:
Open-Source
- LLaVA-1.6 Vicuna: llava-hf/llava-v1.6-vicuna-7b-hf, llava-hf/llava-v1.6-vicuna-13b-hf
- Qwen2-VL Instruct: Qwen/Qwen2-VL-2B-Instruct, Qwen/Qwen2-VL-7B-Instruct, Qwen/Qwen2-VL-72B-Instruct
- Llama 3.2 Instruct: meta-llama/Llama-3.2-11B-Vision-Instruct, meta-llama/Llama-3.2-90B-Vision-Instruct
- MolmoE 1B: allenai/MolmoE-1B-0924
- Molmo-D 7B: allenai/Molmo-7B-D-0924
- Molmo-O 7B: allenai/Molmo-7B-O-0924
- Aria 25B: rhymes-ai/Aria
- Phi-3.5 Vision 4B: microsoft/Phi-3.5-vision-instruct
- Pixtral 12B: mistralai/Pixtral-12B-2409
- Pangea 7B: neulab/Pangea-7B
- NVLM-D 72B: nvidia/NVLM-D-72B
Proprietary
(last tested as of October 2024)
- GPT-4o
- GPT-4o Mini
- Gemini 1.5 Flash
VQA Dataset Generation
To generate a VQA dataset from the knowledge base, refer to the generate_vqa/sampling.py script. This script generates the dataset for the various tasks in both the training and test sets.
Example commands: to generate datasets for the Test Small, Test Large, and Train sets, run the following:
cd generate_vqa
mkdir -p generated_data
# Test Small Ta
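At a high level, each generated VQA instance pairs a dish image with a question template filled from the knowledge base. A toy sketch of that assembly, assuming hypothetical KB field names and templates (see generate_vqa/sampling.py for the real pipeline):

```python
import random

# Hypothetical KB entries; the real KB has many more dishes and richer fields.
KB = [
    {"dish": "Rendang", "location": "Indonesia", "image": "rendang_01.jpg"},
    {"dish": "Pad Thai", "location": "Thailand", "image": "padthai_01.jpg"},
]

# Illustrative question templates for the two benchmark tasks.
TEMPLATES = {
    1: "What is the name of the dish in the image?",  # Task 1: dish name
    2: "Which country does this dish come from?",     # Task 2: dish location
}

def make_vqa_instance(entry, task, rng):
    """Build one multiple-choice VQA instance from a KB entry."""
    field = "dish" if task == 1 else "location"
    answer = entry[field]
    distractors = [e[field] for e in KB if e is not entry]
    options = distractors + [answer]
    rng.shuffle(options)  # randomize answer position
    return {"image": entry["image"],
            "question": TEMPLATES[task],
            "options": options,
            "answer": answer}

rng = random.Random(0)  # seeded for reproducible sampling
instance = make_vqa_instance(KB[0], task=2, rng=rng)
```

The real script additionally handles multilingual templates, the no-context/contextualized/adversarial settings, and open-ended variants, but the dish-image-template pairing above is the core idea.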