Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Cendol is an open-source collection of fine-tuned generative large language models for Indonesian languages, covering decoder-only and encoder-decoder transformer architectures ranging in scale from 300 million to 13 billion parameters. This is the code repository for Cendol. Links to the models and datasets can be found below.
Model Details
Note: Use of Cendol is licensed under the Apache 2.0 license.
Overview
IndoNLP developed and publicly released the Cendol family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 300 million to 13 billion parameters.
Cendol models come in two instruction-tuned versions:
- Cendol-Instruct, which is instruction-tuned on task-specific NLP data such as sentiment analysis, topic modeling, machine translation, summarization, question answering, paraphrasing, etc.
- Cendol-Chat, which is continually instruction-tuned from Cendol-Instruct on general knowledge and human-centric prompts.
Both Cendol-Instruct and Cendol-Chat are designed for single-turn conversations. Cendol outperforms open-source multilingual and region-specific LLMs on most benchmarks we tested by a large margin, with the smaller versions (<1B parameters) of Cendol being highly competitive with other LLMs of 7B parameters.
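For reference, the sketch below shows what a single-turn instruction call to an mT5-based Cendol-Instruct checkpoint might look like with Hugging Face Transformers. The checkpoint ID `indonlp/cendol-mt5-base-inst` and the example prompt are assumptions for illustration; consult the model links for the exact model IDs.

```python
# Minimal sketch: a single-turn instruction through an mT5-based Cendol-Instruct model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "indonlp/cendol-mt5-base-inst"  # assumed checkpoint ID; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Cendol is designed for single-turn instructions, so we pass one prompt at a time.
prompt = "Terjemahkan ke dalam bahasa Inggris: Saya suka makan cendol."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The LLaMA-2-based variants are decoder-only, so they would be loaded with `AutoModelForCausalLM` instead of `AutoModelForSeq2SeqLM`.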
Model Developers: IndoNLP
Variations
Cendol is built on two base models (mT5 and LLaMA-2), each with a range of parameter sizes. mT5-based Cendol comes in 300M (mT5-small), 580M (mT5-base), 1.2B (mT5-large), 3.7B (mT5-XL), and 13B (mT5-XXL) models, while LLaMA-2-based Cendol comes in 7B (LLaMA2-7B) and 13B (LLaMA2-13B) models. Both variants are released in Cendol-Instruct and Cendol-Chat versions. All 13B-parameter models are tuned with LoRA, while the others are fully fine-tuned.
In our paper, we show that adapting region-specific LLMs using LoRA is ineffective and inefficient: the 13B (mT5-XXL) Cendol models perform slightly worse than the 1.2B (mT5-large) Cendol models, while having 3x slower training and 4x slower inference. As an alternative to LoRA, we demonstrate the benefits of vocabulary substitution as an effective and efficient strategy for region-specific adaptation, improving training and inference efficiency by 11.50% and 18.71%, respectively.
In terms of evaluation performance, the vocabulary-substituted model performs on par with the Cendol model trained with the original vocabulary. We also release this Indonesian vocabulary-adapted model, denoted as Indonesian-Vocab Instruct.
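As a rough illustration of what vocabulary substitution involves (not the exact procedure used in the paper), the sketch below trains an Indonesian-specific tokenizer from the base LLaMA-2 tokenizer and resizes the model's embeddings to the new vocabulary before continued training. The corpus path, vocabulary size, and re-initialization scheme are placeholders.

```python
# Illustrative sketch of vocabulary substitution for region-specific adaptation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_id = "meta-llama/Llama-2-7b-hf"  # base model behind the LLaMA-2 Cendol variants

base_tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# Train an Indonesian-specific tokenizer on a raw-text corpus; the corpus file and
# vocabulary size are placeholders, not the values used in the paper.
indonesian_corpus = (line for line in open("id_corpus.txt", encoding="utf-8"))
id_tokenizer = base_tokenizer.train_new_from_iterator(indonesian_corpus, vocab_size=32000)

# Swap the vocabulary: resize the embedding matrix to the new tokenizer's size. The
# resized rows still correspond to the old vocabulary, so they are re-initialized here
# and must be re-learned during continued pretraining / instruction tuning.
model.resize_token_embeddings(len(id_tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight.normal_(mean=0.0, std=0.02)
    model.get_output_embeddings().weight.normal_(mean=0.0, std=0.02)

id_tokenizer.save_pretrained("llama2-7b-indonesian-vocab")
model.save_pretrained("llama2-7b-indonesian-vocab")
```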
Input-Output: Models take text as input and generate text as output.
Model Architecture
|Model|Training Data|Params|Tuning Strategy|Learning Rate|
|---|---|---|---|---|
|Cendol mT5-small Instruct|Cendol Collection v1|300M|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
|Cendol mT5-base Instruct|Cendol Collection v1|580M|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
|Cendol mT5-large Instruct|Cendol Collection v1|1.2B|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
|Cendol mT5-xl Instruct|Cendol Collection v1|3.7B|Fully-Finetuned|3.0 x 10<sup>-4</sup>|
|Cendol mT5-xxl Instruct|Cendol Collection v1|13B|LoRA|2.0 x 10<sup>-4</sup>|
|Cendol LLaMA-2 (7B) Instruct|Cendol Collection v1|7B|Fully-Finetuned|2.0 x 10<sup>-5</sup>|
|Cendol LLaMA-2 (7B) Indonesian-Vocab Instruct|Cendol Collection v1|7B|Fully-Finetuned|2.0 x 10<sup>-5</sup>|
|Cendol LLaMA-2 (13B) Instruct|Cendol Collection v1|13B|LoRA|2.0 x 10<sup>-5</sup>|
|Cendol mT5-small Chat|Cendol Collection v2|300M|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
|Cendol mT5-base Chat|Cendol Collection v2|580M|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
|Cendol mT5-large Chat|Cendol Collection v2|1.2B|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
|Cendol mT5-xl Chat|Cendol Collection v2|3.7B|Fully-Finetuned|3.0 x 10<sup>-5</sup>|
|Cendol mT5-xxl Chat|Cendol Collection v2|13B|LoRA|2.0 x 10<sup>-4</sup>|
|Cendol LLaMA-2 (7B) Chat|Cendol Collection v2|7B|Fully-Finetuned|1.0 x 10<sup>-5</sup>|
|Cendol LLaMA-2 (13B) Chat|Cendol Collection v2|13B|LoRA|2.0 x 10<sup>-4</sup>|
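For the LoRA-tuned 13B variants listed above, a setup along the following lines could be used with the `peft` library. This is a sketch, not the paper's configuration: the rank, alpha, dropout, and target modules are assumptions; only the tuning strategy and learning rates (e.g., 2.0 x 10<sup>-4</sup> for the mT5-xxl models) come from the table.

```python
# Illustrative LoRA configuration for a 13B LLaMA-2 model (hyperparameters assumed).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

lora_config = LoraConfig(
    r=16,                                  # assumed rank, not from the paper
    lora_alpha=32,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed dropout
    target_modules=["q_proj", "v_proj"],   # a common choice for LLaMA-style models
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```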
Model Dates Cendol was trained between October 2023 and January 2024.
License Use of Cendol is licensed under the Apache 2.0 license.
Research Paper "Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages"
Intended Use
Intended Use Cases Cendol is intended for research use, especially on Indonesian languages. Cendol models are intended for single-turn instructions: Cendol-Instruct models can be used for task-specific instructions, while Cendol-Chat models can be used for general-knowledge instructions.
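As an illustration of the intended single-turn, general-knowledge usage of Cendol-Chat, a minimal sketch with a LLaMA-2-based checkpoint might look as follows. The model ID `indonlp/cendol-llama2-7b-chat` and the plain-text prompt format are assumptions; check the model card for the exact checkpoint name and any required prompt template.

```python
# Minimal sketch: a single-turn, general-knowledge prompt to a Cendol-Chat model.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "indonlp/cendol-llama2-7b-chat"  # assumed checkpoint ID; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Single-turn prompt; Cendol is not designed for multi-turn dialogue.
prompt = "Apa manfaat tidur yang cukup bagi kesehatan?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and print only the generated continuation.
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```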
Out-of-scope Uses Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English and Indonesian languages. Use in any other way that is prohibited by the Acceptable Use Policy and Licensing Agreement for Cendol.
Evaluation Results
In this section, we report the results of the Cendol models on large-scale NLU and NLG benchmarks. For all evaluations, we use our internal evaluation library.
NLU Performance
<img width="938" alt="NLU Performance" src="https://github.com/IndoNLP/indo-t0/assets/2826602/7656f005-f261-4982-ad06-f18dc57d5e3b">NLG Performance
<img width="940" alt="NLG Performance" src="https://github.com/IndoNLP/indo-t0/assets/2826602/4942caea-35df-44e1-a95b-53a027c6115f">Human evaluation
<img width="456" alt="Human Evaluation" src="https://github.com/IndoNLP/indo-t0/assets/2826602/6128257f-d36c-4dbb-8f6c-4b936bc2ea66">Ethical Considerations and Limitations
Cendol is a new technology that carries risks with its use. Testing conducted to date has been in Indonesian, and has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Cendol's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of Cendol, developers should perform safety testing and tuning tailored to their specific applications of the model.
Citation
If you use any of these resources, including the Cendol models, code, or data, please cite the following articles:
@misc{cahyawijaya-etal-2024-cendol,
title={Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages},
author={Samuel Cahyawijaya and Holy Lovenia and Fajri Koto and Rifki Afina Putri and Emmanuel Dave and Jhonson Lee and Nuur Shadieq and Wawan Cenggoro and Salsabil Maulana Akbar and Muhammad Ihza Mahendra and Dea Annisayanti Putri and Bryan Wilie and Genta Indra Winata and Alham Fikri Aji and Ayu Purwarianti and Pascale Fung},
year={2024},
eprint={2404.06138},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
@inproceedings{cahyawijaya-etal-2023-nusacrowd,
title = "{N}usa{C}rowd: Open Source Initiative for {I}ndonesian {NLP} Resources",
author = "Cahyawijaya, Samuel and
Lovenia, Holy and
Aji, Alham Fikri and
Winata, Genta and
Wilie, Bryan and
Koto, Fajri and
Mahendra, Rahmad and
Wibisono, Christian and
Romadhony, Ade and
Vincentio, Karissa and
Santoso, Jennifer and
Moeljadi, David and
Wirawan, Cahya and
Hudi, Frederikus and
Wicaksono, Muhammad Satrio and
Parmonangan, Ivan and
Alfina, Ika and
Putra, Ilham Firdausi and
Rahmadani, Samsul and
Oenang, Yulianti and
Septiandri, Ali and
Jaya, James and
Dhole, Kaustubh and
Suryani, Arie and
Putri, Rifki Afina and
Su, Dan and
Stevens, Keith and
Nityasya, Made Nindyatama and
Adilazuarda, Muhammad and
Hadiwijaya, Ryan and
Diandaru, Ryandito and
Yu, Tiezheng and
Ghifari, Vito and
Dai, Wenlian