Kumzari

First comprehensive dataset and translation model for the Kumzari language

The first digital dataset and translation model for the endangered Kumzari language.

Due to copyright restrictions on some source materials, the full dataset cannot be publicly redistributed. The methodology and processing are fully open-source.

About

  • 7,000+ documented entries, of which 3,000+ have been re-processed for improved accuracy.
  • Kumzari is traditionally an unwritten language; the Perso-Arabic (Persian) script has been used to represent it.
  • Built using vision-enabled language models on sources spanning from 1929 to the present.
  • Part of an ongoing cultural and linguistic preservation effort.

Dataset

A public sample of the dataset, derived exclusively from public-domain sources, is included.

The complete dataset incorporates materials that are subject to copyright and is thus not publicly released.
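
The split between the public sample and the full dataset can be pictured as a filter over per-entry source metadata. The following is only a minimal sketch under stated assumptions: the `Entry` record, its field names, and the `public_domain` flag are hypothetical and are not taken from the project's notebooks, and the entry values are placeholders rather than real data.

```python
from dataclasses import dataclass


# Hypothetical record layout; the real dataset's schema is not documented here.
@dataclass(frozen=True)
class Entry:
    kumzari: str         # headword, written in Perso-Arabic script
    english: str         # English gloss
    source_year: int     # publication year of the source material
    public_domain: bool  # assumed flag: is the source in the public domain?


def public_sample(entries):
    """Return only the entries whose source is flagged as public domain."""
    return [e for e in entries if e.public_domain]


# Placeholder entries (not real Kumzari data); the 2010 year is illustrative.
entries = [
    Entry("<kumzari-1>", "<gloss-1>", 1929, True),
    Entry("<kumzari-2>", "<gloss-2>", 2010, False),
]

sample = public_sample(entries)
print(len(sample))
```

Keeping the licensing flag on each entry, rather than on the dataset as a whole, would let a public sample be regenerated whenever the status of a source changes.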

Status

This project is a work in progress.

