SkillAgentSearch skills...

NoLanguageLeftWaiting

Simultaneous translation model for 200 languages

Install / Use

/learn @QuentinFuxa/NoLanguageLeftWaiting
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<h1 align="center">NoLanguageLeftWaiting</h1> <p align="center"> <img src="demo.gif"width="730"> </p> <p align="center"> <img src="architecture_NLLW.png"width="730"> </p>

Converts NoLanguageLeftBehind translation model to a SimulMT (Simultaneous Machine Translation) model, optimized for live/streaming use cases.

Based offline models such as NLLB suffer from eos token and punctuation insertion, inconsistent prefix handling and exponentially growing computational overhead as input length increases. This implementation aims at resolving that.

Installation

pip install nllw

The textual frontend is not installed by default.

Quick Start

  1. Demo interface :
python textual_interface.py
  1. Use it as a package
import nllw

model = nllw.load_model(
    src_langs=["fra_Latn"],
    nllb_backend="transformers",
    nllb_size="600M" #Alternative: 1.3B
)
translator = nllw.OnlineTranslation(
    model,
    input_languages=["fra_Latn"],
    output_languages=["eng_Latn"]
)

tokens = [nllw.timed_text.TimedText('Ceci est un test de traduction')]
translator.insert_tokens(tokens)
validated, buffer = translator.process()
print(f"{validated} | {buffer}")

tokens = [nllw.timed_text.TimedText('en temps réel')]
translator.insert_tokens(tokens)
validated, buffer = translator.process()
print(f"{validated} | {buffer}")

Work In Progress : Partial Speculative Decoding

Local Agreement already locks a stable prefix for the committed translation, so we cannot directly adopt Self-Speculative Biased Decoding for Faster Live Translation. Our ongoing prototype instead borrows the speculative idea only for the new tokens that need to be validated by the larger model.

The flow tested in speculative_decoding_v0.py:

  • Run the 600M draft decoder once to obtain the candidate continuation and its cache.
  • Replay the draft tokens through the 1.3B model, but stop the forward pass as soon as the main model reproduces a token emitted by the draft (predicted_tokens matches the draft output). We keep those verified tokens and only continue generation from that point.
  • On mismatch, resume full decoding with the 1.3B model until a match is reached again, instead of discarding the entire draft segment.

This “partial verification” trims the work the main decoder performs after each divergence, while keeping the responsiveness of the draft hypothesis. Early timing experiments from speculative_decoding_v0.py show the verification pass (~0.15 s in the example) is significantly cheaper than recomputing a full decoding step every time.

<p align="center"> <img src="https://raw.githubusercontent.com/QuentinFuxa/NoLanguageLeftWaiting/05b8d868cc74a3f14c67e35bfbe460d8ff78d512/partial_speculative_decoding.png"width="730"> </p>

Input vs Output length:

Succesfully maintain output length, even if stable prefix tends to take time to grow.

<p align="center"> <img src="french_to_english.png"width="730"> </p>
View on GitHub
GitHub Stars395
CategoryDevelopment
Updated17h ago
Forks48

Languages

Python

Security Score

80/100

Audited on Mar 27, 2026

No findings