Taqyim تقييم
<p align="center"> <img width = "150px" src="https://github.com/ARBML/Taqyim/assets/15667714/6710535a-4d0b-4c1a-8c35-49b2e2110600"></img> </p>

A library for evaluating Arabic NLP datasets on ChatGPT models.
Installation
```bash
git clone https://github.com/ARBML/Taqyim.git
cd Taqyim
pip install -e .
```
Example
```python
import taqyim as tq

pipeline = tq.Pipeline(
    eval_name="ajgt-test",
    dataset_name="arbml/ajgt_ubc_split",
    task_class="classification",
    task_description="Sentiment Analysis",
    input_column_name="content",
    target_column_name="label",
    prompt="Predict the sentiment",
    api_key="<openai-key>",
    train_split="train",
    test_split="test",
    model_name="gpt-3.5-turbo-0301",
    max_samples=1,
)

# run the evaluation
pipeline.run()

# show the output data frame
pipeline.show_results()

# show the eval metrics
pipeline.get_final_report()
```
Run on a custom dataset
custom_dataset.ipynb contains a complete example of how to run an evaluation on a custom dataset.
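As a minimal illustration of the shape such a dataset takes, the sketch below writes tiny train/test CSV files with the same `content`/`label` columns used in the example above. The file names and rows are hypothetical, not part of the library; the notebook shows how to load files like these (e.g. with the Hugging Face `datasets` loaders) and pass them to a `Pipeline`:

```python
import csv

# Hypothetical two-row sentiment dataset; a real custom dataset would
# follow the same layout: one input column and one target column.
rows = [
    {"content": "good service", "label": "Positive"},
    {"content": "bad service", "label": "Negative"},
]

for split in ("train", "test"):
    with open(f"{split}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["content", "label"])
        writer.writeheader()
        writer.writerows(rows)
```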
Parameters

| Parameter | Description |
| :--- | :--- |
| `eval_name` | a name for the eval run |
| `task_class` | class name, from the supported class names below |
| `task_description` | short description of the task |
| `dataset_name` | name of the dataset to evaluate on |
| `subset` | subset name, if the dataset has one |
| `train_split` | train split name in the dataset |
| `test_split` | test split name in the dataset |
| `input_column_name` | input column name in the dataset |
| `target_column_name` | target column name in the dataset |
| `prompt` | the prompt fed to the model |
| `api_key` | OpenAI API key |
| `preprocessing_fn` | function used to preprocess inputs and targets |
| `threads` | number of threads used to call the API |
| `threads_timeout` | thread timeout |
| `max_samples` | maximum number of samples drawn from the dataset for evaluation |
| `model_name` | either `gpt-3.5-turbo-0301` or `gpt-4-0314` |
| `temperature` | temperature passed to the model, between 0 and 2; higher values give more random results |
| `num_few_shot` | number of few-shot samples used for evaluation |
| `resume_from_record` | if `True`, resume the run from the first sample that has no result |
| `seed` | seed to reproduce the results |
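For example, `preprocessing_fn` can be used to turn numeric labels into the strings the prompt asks the model to produce. A minimal sketch, assuming the function is applied per example (the function name and label mapping below are hypothetical, not part of the library):

```python
def map_labels(example):
    # Hypothetical mapping from integer labels to the strings the
    # model is expected to output, so predictions and targets match.
    label_names = {0: "Negative", 1: "Positive"}
    example["label"] = label_names[int(example["label"])]
    return example
```

The function receives a single dataset example as a dict and returns it modified, in the style of `datasets.Dataset.map`.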
Supported Classes and Tasks
| Class | Task |
| :--- | :--- |
| `Classification` | classification tasks, see classification.py |
| `Pos_Tagging` | part-of-speech tagging tasks, see pos_tagging.py |
| `Translation` | machine translation, see translation.py |
| `Summarization` | summarization, see summarization.py |
| `MCQ` | multiple-choice question answering, see mcq.py |
| `Rating` | rating multiple LLMs' outputs, see rating.py |
| `Diacritization` | diacritization, see diacritization.py |
Evaluation on Arabic Tasks
| Tasks | Dataset | Size | Metrics | GPT-3.5 | GPT-4 | SoTA |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| Summarization | EASC | 153 | RougeL | 23.5 | 18.25 | 13.3 |
| PoS Tagging | PADT | 680 | Accuracy | 75.91 | 86.29 | 96.83 |
| Classification | AJGT | 360 | Accuracy | 86.94 | 90.30 | 96.11 |
| Transliteration | BOLT Egyptian✢ | 6,653 | BLEU | 13.76 | 27.66 | 65.88 |
| Translation | UN v1 | 4,000 | BLEU | 35.05 | 38.83 | 53.29 |
| Paraphrasing | APB | 1,010 | BLEU | 4.295 | 6.104 | 17.52 |
| Diacritization | WikiNews✢✢ | 393 | WER/DER | 32.74/10.29 | 38.06/11.64 | 4.49/1.21 |
✢ BOLT requires an LDC subscription.

✢✢ WikiNews is not public; contact the authors to access the dataset.
Citation

```bibtex
@misc{alyafeai2023taqyim,
  title={Taqyim: Evaluating Arabic NLP Tasks Using ChatGPT Models},
  author={Zaid Alyafeai and Maged S. Alshaibani and Badr AlKhamissi and Hamzah Luqman and Ebrahim Alareqi and Ali Fadel},
  year={2023},
  eprint={2306.16322},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
