# FERMAT

A vLLM-based pipeline for benchmarking various VLMs on the HMER dataset of AI4Bharat.
## FERMAT: Can Vision-Language Models Evaluate Handwritten Math?
We present FERMAT, a benchmark designed to assess VLMs’ ability to detect, localize and correct errors in handwritten mathematical content. Please refer to our paper for more details.
<p align="center" width="100%"> <img src="FERMAT.png" alt="We present FERMAT, a benchmark designed to assess VLMs’ ability to detect, localize and correct errors in handwritten mathematical content." style="width: 75%; min-width: 200px; display: block; margin: auto;"> </p>

## Loading Data
Follow the steps below to download the data, storing the images in `benchmark_images` and the CSV in `benchmark_csv`. Separate steps are provided for downloading the data in the oikantik format.
## Setup
To run evaluation of VLMs against the FERMAT dataset, install the required packages:

```bash
pip install -r requirements.txt
```
We self-hosted [Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409), Pixtral-Large-Instruct-2411, LLaMa-3.2-11B-Vision-Instruct, LLaMa-3.2-90B-Vision-Instruct, and Phi-3.5-Vision-Instruct using [vLLM](https://github.com/vllm-project/vllm). For the GPT and Gemini model families, we used their hosted services.
For self-hosted models:

1. Set up environment variables:

   ```bash
   export OPENAI_API_BASE=[ADD_THE_ENDPOINT_URL_OF_HOSTED_MODEL]
   # Example: "http://localhost:8004/v1"
   ```
2. Start evaluations:

   ```bash
   python main.py --model [MODEL_NAME] --dir_name [DATA_DIR]
   ```

   - `MODEL_NAME`: Name of the model to be evaluated. Choices: `['pixtral', 'pixtral_large', 'phi', 'llama_large', 'llama']`
   - `DATA_DIR`: Path to the directory where the benchmark images are stored.
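Since the self-hosted endpoint is OpenAI-compatible, a request for one benchmark image can be sketched as below. This is a minimal illustration, not the repository's actual prompt or evaluation code; the prompt text, image path, and model name are placeholders:

```python
import base64
import os


def build_request(image_path: str, prompt: str, model: str) -> dict:
    """Build an OpenAI-style chat-completions payload with an inline image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # vLLM's OpenAI-compatible server accepts base64 data URLs
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


# The endpoint comes from the OPENAI_API_BASE variable set above.
base_url = os.environ.get("OPENAI_API_BASE", "http://localhost:8004/v1")
# payload = build_request("benchmark_images/sample.png",
#                         "Is the handwritten solution correct?", "pixtral")
# POST the payload to f"{base_url}/chat/completions" with any HTTP client.
```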
## Fill-in CSV

Once the evaluation is done, the results are stored in a JSON file named `state_<MODEL_NAME>.json`. You can convert this JSON file to a CSV file using the following command:

```bash
python fill_in_csv.py --model [MODEL_NAME] --csv-file [CSV_FILE] --json-file [JSON_FILE]
```

- `MODEL_NAME`: Name of the model to be evaluated. Choices: `['pixtral', 'pixtral_large', 'phi', 'llama_large', 'llama']`
- `CSV_FILE`: Path to the CSV file where the results need to be filled in.
- `JSON_FILE`: Path to the JSON file where the results are stored.
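The fill-in step can be sketched roughly as follows. The actual state-file schema and CSV columns are not documented here, so the `image_id` key and the per-model answer column are assumptions for illustration only:

```python
import csv
import json


def fill_in_csv(csv_path: str, json_path: str, model: str, out_path: str) -> None:
    """Merge per-image model answers from the state JSON into the benchmark CSV.

    Assumes the state file maps an image identifier to the model's answer and
    that the CSV has an `image_id` column; the real schema may differ.
    """
    with open(json_path) as f:
        answers = json.load(f)
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Add one column holding this model's answer for each benchmark row.
    for row in rows:
        row[model] = answers.get(row["image_id"], "")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Rows with no entry in the state file are left blank rather than dropped, so the output CSV stays aligned with the benchmark.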
## Citation

If you use this repository or our models, please cite our work:

```bibtex
@article{nath2025vision1language,
  title   = {Can Vision-Language Models Evaluate Handwritten Math?},
  author  = {Oikantik Nath and Hanani Bathina and Mohammed Safi Ur Rahman Khan and Mitesh M. Khapra},
  year    = {2025},
  journal = {arXiv preprint arXiv:2501.07244}
}
```
