Langcodes: a library for language codes

langcodes is a Python library for working with and comparing language codes.
langcodes knows what languages are. It knows the standardized codes that
refer to them, such as en for English, es for Spanish and hi for Hindi.
These are IETF language tags. You may know them by their old name, ISO 639 language codes, but the IETF standard adds backward-compatibility rules and support for language variations that you won't find in the ISO standard.
It may sound to you like langcodes solves a pretty boring problem. At one level, that's right. Sometimes you have a boring problem, and it's great when a library solves it for you.
But there's an interesting problem hiding in here. How do you work with language codes? How do you know when two different codes represent the same thing? How should your code represent relationships between codes, like the following?
- eng is equivalent to en.
- fra and fre are both equivalent to fr.
- en-GB might be written as en-gb or en_GB. Or as 'en-UK', which is erroneous, but should be treated as the same.
- en-CA is not exactly equivalent to en-US, but it's really, really close.
- en-Latn-US is equivalent to en-US, because written English must be written in the Latin alphabet to be understood.
- The difference between ar and arb is the difference between "Arabic" and "Modern Standard Arabic", a difference that may not be relevant to you.
- You'll find Mandarin Chinese tagged as cmn on Wiktionary, but many other resources would call the same language zh.
- Chinese is written in different scripts in different territories. Some software distinguishes the script. Other software distinguishes the territory. The result is that zh-CN and zh-Hans are used interchangeably, as are zh-TW and zh-Hant, even though occasionally you'll need something different such as zh-HK or zh-Latn-pinyin.
- The Indonesian (id) and Malaysian (ms or zsm) languages are mutually intelligible.
- jp is not a language code. (The language code for Japanese is ja, but people confuse it with the country code for Japan.)
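These equivalences are why naive string comparison isn't enough. As a minimal sketch (a hypothetical helper, not part of langcodes): lowercasing a tag and unifying separators handles en_GB versus en-gb, but the other equivalences require registry knowledge.

```python
def naive_normalize(tag: str) -> str:
    """Lowercase a tag and unify separators; a deliberately naive baseline."""
    return tag.replace("_", "-").lower()

# This catches purely cosmetic differences...
assert naive_normalize("en_GB") == naive_normalize("en-gb")

# ...but not equivalences that need registry data: these still differ.
assert naive_normalize("eng") != naive_normalize("en")
assert naive_normalize("en-UK") != naive_normalize("en-GB")
```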
One way to get all of this right is to read the IETF standards and Unicode technical reports. Another is to use a library that implements those standards and guidelines for you, which is what langcodes does.
When you're working with these short language codes, you may want to see the
name that the language is called in a language: fr is called "French" in
English. That language doesn't have to be English: fr is called "français" in
French. A supplement to langcodes, language_data, provides
this information.
langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released as free software under the MIT license.
Standards implemented
Although this is not the only reason to use it, langcodes will make you more acronym-compliant.
langcodes implements BCP 47, the IETF Best Current Practices on Tags for Identifying Languages. BCP 47 is also known as RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also implements recommendations from the Unicode CLDR.
langcodes can also refer to a database of language properties and names, built
from Unicode CLDR and the IANA subtag registry, if you install language_data.
In summary, langcodes takes language codes and does the Right Thing with them, and if you want to know exactly what the Right Thing is, there are some documents you can go read.
Documentation
Standardizing language tags
The standardize_tag function standardizes tags, as strings, in several ways.
It replaces overlong tags with their shortest version, and also formats them according to the conventions of BCP 47:
>>> from langcodes import *
>>> standardize_tag('eng_US')
'en-US'
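The casing conventions applied above can be pictured with a simplified sketch (a hypothetical helper, not langcodes' implementation): each kind of subtag gets a characteristic case, which is how a reader tells them apart at a glance.

```python
def bcp47_casing(tag: str) -> str:
    """Apply BCP 47 casing conventions to each subtag.

    A simplified sketch: the real standardize_tag also replaces overlong
    and deprecated subtags using registry data.
    """
    parts = tag.replace("_", "-").split("-")
    out = [parts[0].lower()]                  # language subtag: lowercase
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            out.append(part.title())          # script subtag: Title case
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            out.append(part.upper())          # region subtag: UPPERCASE
        else:
            out.append(part.lower())          # everything else: lowercase
    return "-".join(out)

print(bcp47_casing("zh_hant_tw"))  # zh-Hant-TW
```

Note that this sketch would leave eng_US as eng-US; collapsing overlong subtags to en-US is where the registry data comes in.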
It removes script subtags that are redundant with the language:
>>> standardize_tag('en-Latn')
'en'
It replaces deprecated values with their correct versions, if possible:
>>> standardize_tag('en-uk')
'en-GB'
Sometimes this involves complex substitutions, such as replacing Serbo-Croatian
(sh) with Serbian in Latin script (sr-Latn), or the entire tag sgn-US
with ase (American Sign Language).
>>> standardize_tag('sh-QU')
'sr-Latn-EU'
>>> standardize_tag('sgn-US')
'ase'
If macro is True, it uses macrolanguage codes as a replacement for the most common standardized language within that macrolanguage.
>>> standardize_tag('arb-Arab', macro=True)
'ar'
Even when macro is False, it shortens tags that contain both the macrolanguage and the language:
>>> standardize_tag('zh-cmn-hans-cn')
'zh-Hans-CN'
If the tag can't be parsed according to BCP 47, this will raise a LanguageTagError (a subclass of ValueError):
>>> standardize_tag('spa-latn-mx')
'es-MX'
>>> standardize_tag('spa-mx-latn')
Traceback (most recent call last):
...
langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string.
Language objects
This package defines one class, named Language, which contains the results of parsing a language tag. Language objects have the following fields, any of which may be unspecified:
- language: the code for the language itself.
- script: the 4-letter code for the writing system being used.
- territory: the 2-letter or 3-digit code for the country or similar region whose usage of the language appears in this text.
- extlangs: a list of more specific language codes that follow the language code. (This is allowed by the language code syntax, but deprecated.)
- variants: codes for specific variations of language usage that aren't covered by the script or territory codes.
- extensions: information that's attached to the language code for use in some specific system, such as Unicode collation orders.
- private: a code starting with x- that has no defined meaning.
The Language.get method converts a string to a Language instance, and the
Language.make method makes a Language instance from its fields. These values
are cached so that calling Language.get or Language.make again with the
same values returns the same object, for efficiency.
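That caching behavior can be pictured with a small sketch (a toy Tag class, not langcodes' actual implementation), using functools.lru_cache so that repeated calls with the same arguments return the identical object:

```python
from functools import lru_cache

class Tag:
    """A toy stand-in for Language, illustrating the instance cache."""

    def __init__(self, language=None, script=None, territory=None):
        self.language = language
        self.script = script
        self.territory = territory

    @classmethod
    @lru_cache(maxsize=None)
    def make(cls, language=None, script=None, territory=None):
        # Identical arguments hit the cache and return the same instance.
        return cls(language, script, territory)

a = Tag.make(language="en", territory="US")
b = Tag.make(language="en", territory="US")
assert a is b   # the same cached object, not merely an equal one
```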
By default, it will replace non-standard and overlong tags as it interprets them. To disable this feature and get the codes that literally appear in the language tag, use the normalize=False option.
>>> Language.get('en-Latn-US')
Language.make(language='en', script='Latn', territory='US')
>>> Language.get('sgn-US', normalize=False)
Language.make(language='sgn', territory='US')
>>> Language.get('und')
Language.make()
Here are some examples of replacing non-standard tags:
>>> Language.get('sh-QU')
Language.make(language='sr', script='Latn', territory='EU')
>>> Language.get('sgn-US')
Language.make(language='ase')
>>> Language.get('zh-cmn-Hant')
Language.make(language='zh', script='Hant')
Use the str() function on a Language object to convert it back to its
standard string form:
>>> str(Language.get('sh-QU'))
'sr-Latn-EU'
>>> str(Language.make(territory='IN'))
'und-IN'
Checking validity
A language code is valid when every part of it is assigned a meaning by IANA. That meaning could be "private use".
In langcodes, we check the language subtag, script, territory, and variants for validity. We don't check other parts such as extlangs or Unicode extensions.
For example, ja is a valid language code, and jp is not:
>>> Language.get('ja').is_valid()
True
>>> Language.get('jp').is_valid()
False
The top-level function tag_is_valid(tag) is possibly more convenient to use,
because it can return False even for tags that don't parse:
>>> tag_is_valid('C')
False
If one subtag is invalid, the entire code is invalid:
>>> tag_is_valid('en-000')
False
iw is valid, though it's a deprecated alias for he:
>>> tag_is_valid('iw')
True
The empty language tag (und) is valid:
>>> tag_is_valid('und')
True
Private use codes are valid:
>>> tag_is_valid('x-other')
True
>>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
True
Language tags that are very unlikely are still valid:
>>> tag_is_valid('fr-Cyrl')
True
Tags with non-ASCII characters are invalid, because they don't parse:
>>> tag_is_valid('zh-普通话')
False
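The shape of this check can be sketched without the real registry data. The snippet below is a toy version that checks only the language subtag against a handful of hard-coded codes; langcodes itself consults the full IANA Language Subtag Registry and checks scripts, territories, and variants too.

```python
import re

# Toy snapshot of assigned language subtags (the real registry has thousands).
ASSIGNED_LANGUAGES = {"en", "fr", "ja", "he", "iw", "und"}

def toy_tag_is_valid(tag: str) -> bool:
    # Tags that aren't ASCII letters/digits joined by hyphens don't parse.
    if not re.fullmatch(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", tag):
        return False
    lang = tag.split("-")[0].lower()
    # x-... tags and the qaa..qtz range are private use, hence valid.
    if lang == "x" or re.fullmatch(r"q[a-t][a-z]", lang):
        return True
    return lang in ASSIGNED_LANGUAGES

print(toy_tag_is_valid("ja"))         # True
print(toy_tag_is_valid("jp"))         # False
print(toy_tag_is_valid("zh-普通话"))   # False: doesn't parse
```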
Getting alpha3 codes
Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the variety of human languages by assigning every language a 3-letter code, including the ones that already had 2-letter codes.
Unfortunately, this just led to more confusion. Some languages ended up with two
different 3-letter codes for legacy reasons, such as French, which is fra as a
"terminology" code, and fre as a "biblographic" code. And meanwhile, fr was
still a code that you'd be using if you followed ISO 639-1.
In BCP 47, you should use 2-letter codes whenever they're available, and that's what langcodes does. Fortunately, all the languages that have two different 3-letter codes also have a 2-letter code, so if you prefer the 2-letter code, you don't have to worry about the distinction.
But some applications want the 3-letter code in particular, so langcodes
provides a method for getting those, Language.to_alpha3(). It returns the
'terminology' code by default, and passing variant='B' returns the
bibliographic code.
This method always returns a 3-letter string.
>>> Language.get('fr').to_alpha3()
'fra'
>>> Language.get('fr-CA').to_alpha3()
'fra'
>>> Language.get('fr-CA').to_alpha3(variant='B')
'fre'
>>> Language.get('de').to_alpha3()
'deu'
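Under the hood, the terminology/bibliographic split is just a two-column lookup table. Here is a toy excerpt as a sketch (hypothetical names; langcodes ships the full ISO 639 mapping):

```python
# Toy excerpt of the alpha-2 → alpha-3 table: 'T' is the terminology code,
# 'B' the bibliographic one. Most languages have identical T and B codes.
ALPHA3 = {
    "fr": {"T": "fra", "B": "fre"},
    "de": {"T": "deu", "B": "ger"},
    "en": {"T": "eng", "B": "eng"},
}

def toy_to_alpha3(alpha2: str, variant: str = "T") -> str:
    return ALPHA3[alpha2][variant.upper()]

print(toy_to_alpha3("fr"))               # fra
print(toy_to_alpha3("fr", variant="B"))  # fre
```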
