ZhuangBench
[ACL'24 Findings] Teaching Large Language Models an Unseen Language on the Fly
<div align=center> <img src="zhuang_first_fig.jpg" style="width:600px" /> </div>
Data and code for the following papers:
ACL'24 Findings (Full-Length Paper): Teaching Large Language Models an Unseen Language on the Fly
ICLR'24 Tiny Paper: Can LLMs Learn a New Language on the Fly? A Case Study on Zhuang
Dataset
We present ZhuangBench, a collection of NLP resources for Zhuang (壮语), a low-resource language spoken in China.
It consists of a Zhuang-Chinese dictionary, a Zhuang-Chinese parallel corpus, and a Zhuang-Chinese machine translation test set.
Important: Preventing Test Set Contamination
We encrypted the source files of ZhuangBench in data.zip to prevent test set contamination.
The password is zhuangbench.
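As a minimal sketch, the archive can be unpacked from Python with the standard-library `zipfile` module. The helper name `extract_archive` and the output directory are illustrative assumptions, not part of this repository. Note that `zipfile` can only decrypt the legacy ZipCrypto scheme; if the archive turns out to use AES encryption, a tool such as 7-Zip would be needed instead.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, password, dest_dir):
    """Extract a password-protected zip archive and return its member names.

    Hypothetical helper for illustration; not part of the ZhuangBench code.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        # zipfile expects the password as bytes, not str.
        zf.extractall(path=dest, pwd=password.encode("utf-8"))
        return zf.namelist()
```

For example, `extract_archive("data.zip", "zhuangbench", "data")` would unpack the dataset into a `data` directory (the directory name is an assumption).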
List of files:
dictionary_za2zh.jsonl: Zhuang-Chinese dictionary.
dictionary_zh2za.jsonl: Chinese-Zhuang dictionary.
parallel_corpus.json: Zhuang-Chinese parallel corpus.
test_translation_set.json: Zhuang-Chinese machine translation test set.
preprocessed/dictionary_za2zh_web+giza.jsonl: Zhuang-Chinese dictionary augmented with BLI from Giza++.
preprocessed/dictionary_zh2za_web+giza+synonym.jsonl: Chinese-Zhuang dictionary augmented with BLI from Giza++ and synonyms.
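The files above come in two formats, `.jsonl` and `.json`. The following sketch shows one way to read both, assuming each `.jsonl` file holds one JSON value per line and each `.json` file holds a single JSON document. The function names are hypothetical, and no field names are assumed because the record schema is not described here; inspect a few records before relying on specific keys.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_json(path):
    """Read a file containing a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

With the archive extracted to a directory such as `data` (an assumed location), `load_jsonl("data/dictionary_za2zh.jsonl")` would return the dictionary entries and `load_json("data/parallel_corpus.json")` the parallel corpus. Opening the files with `encoding="utf-8"` matters because the entries contain Chinese text.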
Beta Version
Our ICLR'24 Tiny Paper uses a beta version of the dataset, ZhuangBench-Beta. We provide the data in data-beta-version.zip (password: zhuangbench-beta).
This data is for archival purposes only. We recommend using the newer data in data.zip, which is larger and includes typo corrections.
Code
We provide the code for DiPMT++ to reproduce the results in the paper.
Install the dependencies:
pip install -r requirements.txt
Use the scripts in ./scripts to run the LLMs and evaluate the results.
License
The code and data are released under the MIT License.
Citation
@inproceedings{zhang2024teaching,
title={Teaching Large Language Models an Unseen Language on the Fly},
author={Zhang, Chen and Liu, Xiao and Lin, Jiuheng and Feng, Yansong},
booktitle={Findings of the Association for Computational Linguistics ACL 2024},
pages={8783--8800},
year={2024}
}
@inproceedings{zhang2024can,
title={Can {LLM}s Learn a New Language on the Fly? A Case Study on Zhuang},
author={Chen Zhang and Mingxu Tao and Quzhe Huang and Zhibin Chen and Yansong Feng},
booktitle={The Second Tiny Papers Track at ICLR 2024},
year={2024},
}
