
Dakota1890

Using GRPO and a modified compositional reward function to train an opensource model on the 1890 Dakota Dictionary

Install / Use

/learn @HarleyCoops/Dakota1890

README

Dakota1890: Grammar-to-RL for Low-Resource Language Revitalization


Badges: CI · License: Apache-2.0 · Python · Code style: Ruff · Smoke Tests · Python Package

What This Repository Is

Dakota1890 is a proof case for a broader claim: a single historical source can be turned into a reproducible training pipeline for low-resource language revitalization.

The Dakota case matters on its own, but the larger contribution is methodological. This repository asks whether a historical grammar-and-dictionary source can bootstrap a language model, then whether reinforcement learning on executable grammar tasks materially outperforms a supervised fine-tuning baseline built from the same extracted source.

The Main Question

The central experiment in this repository compares two arms:

  • OpenAIFineTune/ is the supervised baseline
  • dakota_rl_training/ plus environments/dakota_grammar_translation/ is the RL intervention
  • both are derived from the same Dakota 1890 source material

The question is not just whether Dakota can be modeled. The question is whether grammar-gym RL provides a meaningful advantage over plain SFT when data is scarce and the source material is historical.

Why Dakota, Why 1890

Stephen Return Riggs' 1890 Dakota grammar and dictionary is the bootstrap source for this repository. The pipeline treats that source as both a lexical resource for synthetic training data and a structural resource for verifiable reward functions.

This is the key move. Grammar rules stop being static documentation and become executable feedback. Instead of asking a model to imitate text alone, the RL pipeline scores whether outputs satisfy orthographic, morphological, and task-level constraints derived from the source.

<div align="center" style="margin: 3rem 0;"> <img src="Public/grammar.jpg" alt="Dakota Grammar - Historical Text Detail" style="width: 100%; max-width: 1400px; height: auto; display: block; margin: 0 auto; border-radius: 4px; box-shadow: 0 8px 24px rgba(0,0,0,0.2);"> </div>

The key advantage is interpretability. Because the reward decomposes by linguistic level, you can see which level is driving each failure. That makes debugging possible: "Oh, the model is failing on ć preservation because the character embedding gradient is being overwhelmed by the semantic gradient."

Where This Goes Next

The future-facing story of this repository is field generalization. The Dakota model is the first proof case. The next phase is to work with descendant communities connected to the linguistic and geographic record represented in the archival materials, keep the first British Columbia target unnamed until that work is ready to be public, and use this Dakota pipeline as the technical base for adaptation.

That gives the project a two-stage structure:

  • historical source to structured model-training environment
  • community-in-the-loop refinement toward contemporary local use

Technical Core

The canonical Dakota path in this repo is:

  1. Dictionary/ plus grammardictionar00riggrich.pdf
  2. dakota_extraction/
  3. data/rl_training_rules and dakota_rl_training/datasets
  4. environments/dakota_grammar_translation/
  5. dakota_rl_training/
  6. local and Hugging Face inference surfaces

The maintained comparison path is:

  1. extracted Dakota data
  2. synthetic conversational examples
  3. OpenAIFineTune/
  4. remote OpenAI SFT job as the baseline arm

A Compact Formal View

The method treats the historical source not just as text, but as a computable specification.

Let $\mathcal{T}$ be the historical source and let the extraction system map it into a structured grammar space $\mathcal{G}$:

$$ \mathcal{G} = \mathcal{E}(\mathcal{T}) $$

Each rule in $\mathcal{G}$ becomes a constraint on generated language rather than a note in a grammar book.
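As a concrete illustration of that idea, here is a minimal, hypothetical sketch of an extracted rule represented as an executable check rather than prose. The class name, rule name, and regex are illustrative assumptions, not the repository's actual schema:

```python
import re
from dataclasses import dataclass

# Hypothetical sketch: an extracted grammar rule as an executable constraint.
# The rule name and pattern below are illustrative, not from the 1890 source.
@dataclass
class GrammarRule:
    name: str
    pattern: str  # regex the generated text must satisfy

    def check(self, text: str) -> bool:
        # True when the candidate output satisfies the constraint
        return re.search(self.pattern, text) is not None

# e.g. a made-up rule: a first-person "wa-" affix must appear on some word
rule = GrammarRule(name="1sg-affix", pattern=r"\bwa\w+")
print(rule.check("waśte"), rule.check("tipi"))  # → True False
```

A bank of such rules is what lets the RL environment score structure rather than surface imitation.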

The RL reward is then decomposed into linguistic primitives:

$$ r(y_i, x) = \lambda_{diff}(x)\left[\alpha \cdot R_{char}(y_i, x) + \beta \cdot R_{morph}(y_i, \mathcal{G}) + \gamma \cdot R_{sem}(y_i, y^*)\right] $$

Where:

  • $R_{char}$ (Orthography): The recall of required special Unicode characters $\mathcal{C}_{spec}$ (e.g., ŋ, š, ć) carried over from the prompt.

    $R_{char} = \frac{|chars(y_i) \cap chars(x) \cap \mathcal{C}_{spec}|}{|chars(x) \cap \mathcal{C}_{spec}|}$

  • $R_{morph}$ (Syntax): A binary or scalar check against specific grammar rules $g_k \in \mathcal{G}$ (e.g., affix presence regex).

    $R_{morph} = \frac{1}{|A|}\sum_{a \in A} \mathbb{I}(a \subset y_i) \quad \text{where } A \text{ are required affixes}$

  • $R_{sem}$ (Semantics): Semantic similarity to ground truth (or Dictionary lookup).

  • Weights: $(\alpha, \beta, \gamma) = (0.4, 0.4, 0.2)$ per the config.

  • $\lambda_{diff}$: The curriculum difficulty multiplier ($1.0 \dots 2.0$).
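Under those definitions, the composite reward can be sketched as follows. Only the weights $(0.4, 0.4, 0.2)$ and the multiplicative $\lambda_{diff}$ come from the text above; the helper logic and the vacuous-satisfaction defaults are assumptions:

```python
import re

# Illustrative subset of the special characters named in the text.
SPECIAL = set("ŋšćḣṡáéíóú")

def r_char(y: str, x: str) -> float:
    """Recall of required special characters from the prompt x."""
    required = set(x) & SPECIAL
    if not required:
        return 1.0  # assumption: vacuously satisfied when none are required
    return len(set(y) & required) / len(required)

def r_morph(y: str, affixes: list[str]) -> float:
    """Fraction of required affixes present in the output."""
    if not affixes:
        return 1.0
    return sum(1 for a in affixes if re.search(re.escape(a), y)) / len(affixes)

def reward(y: str, x: str, affixes: list[str], r_sem: float,
           lam_diff: float = 1.0,
           alpha: float = 0.4, beta: float = 0.4, gamma: float = 0.2) -> float:
    # r = lambda_diff * (alpha*R_char + beta*R_morph + gamma*R_sem)
    return lam_diff * (alpha * r_char(y, x) + beta * r_morph(y, affixes)
                       + gamma * r_sem)

# Toy usage: output preserves the required ŋ and carries the required affix
print(round(reward("waŋka", "ŋ?", ["wa"], r_sem=0.5), 2))  # → 0.9
```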

This is why RL is interesting here: the model gets feedback on structure, not only imitation. The repository keeps the SFT path intact precisely so that claim can be tested rather than asserted.

Training Results: RL Performance Visualizations

This section presents visualizations from the reinforcement learning training runs, demonstrating the effectiveness of the grammar-to-RL methodology on the Dakota language. There have been two successful runs: a 1,000-step run (final) and a 400-step run (initial).

Run 1: 1000-Step Training (Final)

Training Run Details

  • Project: dakota-rl-grammar
  • Entity: christian-cooper-us
  • Trainer Run: 7nikv4vp - dakota-0.6b-rl-trainer
  • Orchestrator Run: 29hn8w98 - dakota-0.6b-rl-orchestrator
  • Model: Qwen3-0.6B-Dakota-Grammar-RL
  • Training Steps: 1,000 steps (998 completed)
  • Total Samples: 256,000 samples processed
  • Training Duration: 1.54 hours (5,537 seconds)

Key Achievements

  • 190% improvement in overall reward (0.120 → 0.349)
  • 97.9% morphological accuracy - exceptional performance in affix application
  • 53.5% character preservation - significant improvement for complex orthography
  • 90% of improvement achieved in first 160 steps (16% of training) - demonstrating rapid learning
  • Stable training with controlled KL divergence throughout

Comprehensive Dashboard

The comprehensive dashboard provides an at-a-glance view of all training metrics, combining reward progression, component performance, loss dynamics, entropy, KL divergence, and throughput metrics into a single visualization.

[Image: Comprehensive Dashboard]

What this shows: This multi-panel dashboard synthesizes all key training signals. The top panel shows reward progression with milestone markers indicating when 25%, 50%, 75%, and 90% of total improvement was achieved. The component comparison bar chart (middle-left) reveals the differential performance: morphological accuracy reached 97.9% while character preservation achieved 53.5%, reflecting the challenge of preserving Dakota's complex orthography (ć, š, ŋ, ḣ, ṡ, á, é, í, ó, ú) with a 0.6B parameter model. The loss and entropy panels demonstrate stable optimization, while the KL divergence metrics show controlled policy adaptation without catastrophic forgetting.

View full run: Trainer Run | Orchestrator Run

Reward Progression

The reward progression visualization demonstrates the learning trajectory over 1,000 training steps, showing both overall composite reward and individual component breakdown.

[Image: Reward Progression]

What this shows: The top panel tracks overall reward progression from 0.120 (step 0) to 0.349 (step 999), representing a 190.1% improvement. Milestone markers highlight key learning efficiency points: 25% improvement at step 49 (4.9% of training), 50% at step 71 (7.1%), 75% at step 109 (10.9%), and 90% at step 160 (16%). The rapid initial learning validates the methodology's efficiency - grammar-based tasks provide dense learning signals compared to general language modeling. The bottom panel shows the component breakdown: Morphological Accuracy (green) achieved near-perfect performance (0.979), Character Preservation (orange) showed substantial improvement from 0.038 to 0.535 (14x increase), while the Overall Composite (blue) reflects the weighted combination including semantic components.

Interpretation: The divergence between component performances demonstrates that the model learned morphological patterns more effectively than orthographic preservation. This suggests potential areas for future improvement through specialized character-focused training or larger model capacity. The semantic component (20% weight) likely contributes to the composite score being lower than individual components, indicating multi-objective optimization challenges.
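The milestone bookkeeping quoted above (the step at which 25%/50%/75%/90% of total improvement is reached) can be recomputed from any reward curve. A minimal sketch on a synthetic curve; the exponential shape and constants are illustrative, not the run's logged data:

```python
# Find the first step at which each fraction of total improvement is reached.
def milestone_steps(rewards, fractions=(0.25, 0.5, 0.75, 0.9)):
    start, end = rewards[0], rewards[-1]
    total = end - start
    out = {}
    for f in fractions:
        target = start + f * total
        out[f] = next(i for i, r in enumerate(rewards) if r >= target)
    return out

# Synthetic fast-then-flat curve, loosely echoing the run's 0.120 -> 0.349 shape
curve = [0.12 + 0.229 * (1 - 0.97 ** i) for i in range(1000)]
print(milestone_steps(curve))
```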

View full run: Orchestrator Run

Training Metrics

This visualization tracks the core training dynamics: policy loss, model entropy (confidence), KL divergence (policy adaptation), and inference probabilities.

[Image: Training Metrics]

What this shows:

  • Policy Loss (top-left): Values ranged from approximately 1e-5 to 1e-3, typical of GRPO training with conservative policy updates.
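For context on what drives those small policy-loss magnitudes: GRPO normalizes each sampled completion's reward against its group's mean and standard deviation. A minimal sketch of that group-relative advantage (illustrative, not the trainer's implementation):

```python
# Group-relative advantage as used in GRPO: z-score each completion's
# reward within its sampled group. Rewards below are made-up examples.
def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

print(group_advantages([0.12, 0.30, 0.35, 0.20]))
```

Advantages within a group sum to (approximately) zero, which is what keeps policy updates small and stable.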
