# FERMAT

A vLLM-based pipeline for benchmarking various VLMs on the HMER dataset of AI4Bharat.
## FERMAT: Can Vision-Language Models Evaluate Handwritten Math?
We present FERMAT, a benchmark designed to assess VLMs’ ability to detect, localize and correct errors in handwritten mathematical content. Please refer to our paper for more details.
<p align="center" width="100%"> <img src="FERMAT.png" alt="We present FERMAT, a benchmark designed to assess VLMs’ ability to detect, localize and correct errors in handwritten mathematical content." style="width: 75%; min-width: 200px; display: block; margin: auto;"> </p>

## Loading Data
Follow the steps below to download the data, storing the images in `benchmark_images` and the CSV in `benchmark_csv`. Separate steps are provided for downloading the data in the oikantik format.
## Setup
To run evaluation of VLMs against the FERMAT dataset, install the required packages:

```bash
pip install -r requirements.txt
```
We self-hosted [Pixtral-12B-2409](https://huggingface.co/mistralai/Pixtral-12B-2409), Pixtral-Large-Instruct-2411, LLaMa-3.2-11B-Vision-Instruct, LLaMa-3.2-90B-Vision-Instruct, and Phi-3.5-Vision-Instruct using [vLLM](https://github.com/vllm-project/vllm). For the GPT and Gemini model families, we used their hosted services.
For self-hosted models:

1. Set up environment variables:

   ```bash
   export OPENAI_API_BASE=[ADD_THE_ENDPOINT_URL_OF_HOSTED_MODEL]
   # Example: "http://localhost:8004/v1"
   ```
2. Start evaluations:

   ```bash
   python main.py --model [MODEL_NAME] --dir_name [DATA_DIR]
   ```

   - `MODEL_NAME`: Name of the model to be evaluated. Choices: `['pixtral', 'pixtral_large', 'phi', 'llama_large', 'llama']`
   - `DATA_DIR`: Path to the directory where the benchmark images are stored.
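Since the self-hosted endpoint is OpenAI-compatible, a request for one benchmark image can be sketched as below. This is a minimal illustration, not the repository's actual prompt or evaluation code; the prompt text, image path, and model name are placeholders:

```python
import base64
import os


def build_request(image_path: str, prompt: str, model: str) -> dict:
    """Build an OpenAI-style chat-completions payload with an inline image."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                # vLLM's OpenAI-compatible server accepts base64 data URLs
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


# The endpoint comes from the OPENAI_API_BASE variable set above.
base_url = os.environ.get("OPENAI_API_BASE", "http://localhost:8004/v1")
# payload = build_request("benchmark_images/sample.png",
#                         "Is the handwritten solution correct?", "pixtral")
# POST the payload to f"{base_url}/chat/completions" with any HTTP client.
```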
## Fill-in CSV

Once the evaluation is done, the results are stored in a JSON file named `state_<MODEL_NAME>.json`. You can convert this JSON file to a CSV file using the following command:

```bash
python fill_in_csv.py --model [MODEL_NAME] --csv-file [CSV_FILE] --json-file [JSON_FILE]
```

- `MODEL_NAME`: Name of the model to be evaluated. Choices: `['pixtral', 'pixtral_large', 'phi', 'llama_large', 'llama']`
- `CSV_FILE`: Path to the CSV file where the results need to be filled in.
- `JSON_FILE`: Path to the JSON file where the results are stored.
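The fill-in step can be sketched roughly as follows. The actual state-file schema and CSV columns are not documented here, so the `image_id` key and the per-model answer column are assumptions for illustration only:

```python
import csv
import json


def fill_in_csv(csv_path: str, json_path: str, model: str, out_path: str) -> None:
    """Merge per-image model answers from the state JSON into the benchmark CSV.

    Assumes the state file maps an image identifier to the model's answer and
    that the CSV has an `image_id` column; the real schema may differ.
    """
    with open(json_path) as f:
        answers = json.load(f)
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    # Add one column holding this model's answer for each benchmark row.
    for row in rows:
        row[model] = answers.get(row["image_id"], "")
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
```

Rows with no entry in the state file are left blank rather than dropped, so the output CSV stays aligned with the benchmark.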
## Citation

If you use this repository or our models, please cite our work:

```bibtex
@article{nath2025vision1language,
  title   = {Can Vision-Language Models Evaluate Handwritten Math?},
  author  = {Oikantik Nath and Hanani Bathina and Mohammed Safi Ur Rahman Khan and Mitesh M. Khapra},
  year    = {2025},
  journal = {arXiv preprint arXiv:2501.07244}
}
```
