Langcodes: a library for language codes

langcodes is a Python library for working with and comparing language codes.
langcodes knows what languages are. It knows the standardized codes that
refer to them, such as en for English, es for Spanish and hi for Hindi.
These are IETF language tags. You may know them by their old name, ISO 639 language codes, but the IETF standard adds backward-compatibility rules and support for language variations that you won't find in the ISO standard.
It may sound to you like langcodes solves a pretty boring problem. At one level, that's right. Sometimes you have a boring problem, and it's great when a library solves it for you.
But there's an interesting problem hiding in here. How do you work with language codes? How do you know when two different codes represent the same thing? How should your code represent relationships between codes, like the following?
- eng is equivalent to en.
- fra and fre are both equivalent to fr.
- en-GB might be written as en-gb or en_GB. Or as 'en-UK', which is erroneous, but should be treated as the same.
- en-CA is not exactly equivalent to en-US, but it's really, really close.
- en-Latn-US is equivalent to en-US, because written English must be written in the Latin alphabet to be understood.
- The difference between ar and arb is the difference between "Arabic" and "Modern Standard Arabic", a difference that may not be relevant to you.
- You'll find Mandarin Chinese tagged as cmn on Wiktionary, but many other resources would call the same language zh.
- Chinese is written in different scripts in different territories. Some software distinguishes the script. Other software distinguishes the territory. The result is that zh-CN and zh-Hans are used interchangeably, as are zh-TW and zh-Hant, even though occasionally you'll need something different such as zh-HK or zh-Latn-pinyin.
- The Indonesian (id) and Malaysian (ms or zsm) languages are mutually intelligible.
- jp is not a language code. (The language code for Japanese is ja, but people confuse it with the country code for Japan.)
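These equivalences are why naive string comparison isn't enough. As a minimal sketch (a hypothetical helper, not part of langcodes): lowercasing a tag and unifying separators handles en_GB versus en-gb, but the other equivalences require registry knowledge.

```python
def naive_normalize(tag: str) -> str:
    """Lowercase a tag and unify separators; a deliberately naive baseline."""
    return tag.replace("_", "-").lower()

# This catches purely cosmetic differences...
assert naive_normalize("en_GB") == naive_normalize("en-gb")

# ...but not equivalences that need registry data: these still differ.
assert naive_normalize("eng") != naive_normalize("en")
assert naive_normalize("en-UK") != naive_normalize("en-GB")
```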
One way to get all of this right is to read the IETF standards and Unicode technical reports. Another is to use a library that implements those standards and guidelines for you, which is what langcodes does.
When you're working with these short language codes, you may want to see the
name that the language is called in a language: fr is called "French" in
English. That language doesn't have to be English: fr is called "français" in
French. A supplement to langcodes, language_data, provides
this information.
langcodes is maintained by Elia Robyn Lake a.k.a. Robyn Speer, and is released as free software under the MIT license.
Standards implemented
Although this is not the only reason to use it, langcodes will make you more acronym-compliant.
langcodes implements BCP 47, the IETF Best Current Practices on Tags for Identifying Languages. BCP 47 is also known as RFC 5646. It subsumes ISO 639 and is backward compatible with it, and it also implements recommendations from the Unicode CLDR.
langcodes can also refer to a database of language properties and names, built
from Unicode CLDR and the IANA subtag registry, if you install language_data.
In summary, langcodes takes language codes and does the Right Thing with them, and if you want to know exactly what the Right Thing is, there are some documents you can go read.
Documentation
Standardizing language tags
The standardize_tag function standardizes tags, as strings, in several ways.
It replaces overlong tags with their shortest version, and also formats them according to the conventions of BCP 47:
>>> from langcodes import *
>>> standardize_tag('eng_US')
'en-US'
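The casing conventions applied above can be pictured with a simplified sketch (a hypothetical helper, not langcodes' implementation): each kind of subtag gets a characteristic case, which is how a reader tells them apart at a glance.

```python
def bcp47_casing(tag: str) -> str:
    """Apply BCP 47 casing conventions to each subtag.

    A simplified sketch: the real standardize_tag also replaces overlong
    and deprecated subtags using registry data.
    """
    parts = tag.replace("_", "-").split("-")
    out = [parts[0].lower()]                  # language subtag: lowercase
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            out.append(part.title())          # script subtag: Title case
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            out.append(part.upper())          # region subtag: UPPERCASE
        else:
            out.append(part.lower())          # everything else: lowercase
    return "-".join(out)

print(bcp47_casing("zh_hant_tw"))  # zh-Hant-TW
```

Note that this sketch would leave eng_US as eng-US; collapsing overlong subtags to en-US is where the registry data comes in.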
It removes script subtags that are redundant with the language:
>>> standardize_tag('en-Latn')
'en'
It replaces deprecated values with their correct versions, if possible:
>>> standardize_tag('en-uk')
'en-GB'
Sometimes this involves complex substitutions, such as replacing Serbo-Croatian
(sh) with Serbian in Latin script (sr-Latn), or the entire tag sgn-US
with ase (American Sign Language).
>>> standardize_tag('sh-QU')
'sr-Latn-EU'
>>> standardize_tag('sgn-US')
'ase'
If macro is True, it uses macrolanguage codes as a replacement for the most common standardized language within that macrolanguage.
>>> standardize_tag('arb-Arab', macro=True)
'ar'
Even when macro is False, it shortens tags that contain both the macrolanguage and the language:
>>> standardize_tag('zh-cmn-hans-cn')
'zh-Hans-CN'
If the tag can't be parsed according to BCP 47, this will raise a LanguageTagError (a subclass of ValueError):
>>> standardize_tag('spa-latn-mx')
'es-MX'
>>> standardize_tag('spa-mx-latn')
Traceback (most recent call last):
...
langcodes.tag_parser.LanguageTagError: This script subtag, 'latn', is out of place. Expected variant, extension, or end of string.
Language objects
This package defines one class, named Language, which contains the results of parsing a language tag. Language objects have the following fields, any of which may be unspecified:
- language: the code for the language itself.
- script: the 4-letter code for the writing system being used.
- territory: the 2-letter or 3-digit code for the country or similar region whose usage of the language appears in this text.
- extlangs: a list of more specific language codes that follow the language code. (This is allowed by the language code syntax, but deprecated.)
- variants: codes for specific variations of language usage that aren't covered by the script or territory codes.
- extensions: information that's attached to the language code for use in some specific system, such as Unicode collation orders.
- private: a code starting with x- that has no defined meaning.
The Language.get method converts a string to a Language instance, and the
Language.make method makes a Language instance from its fields. These values
are cached so that calling Language.get or Language.make again with the
same values returns the same object, for efficiency.
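That caching behavior can be pictured with a small sketch (a toy Tag class, not langcodes' actual implementation), using functools.lru_cache so that repeated calls with the same arguments return the identical object:

```python
from functools import lru_cache

class Tag:
    """A toy stand-in for Language, illustrating the instance cache."""

    def __init__(self, language=None, script=None, territory=None):
        self.language = language
        self.script = script
        self.territory = territory

    @classmethod
    @lru_cache(maxsize=None)
    def make(cls, language=None, script=None, territory=None):
        # Identical arguments hit the cache and return the same instance.
        return cls(language, script, territory)

a = Tag.make(language="en", territory="US")
b = Tag.make(language="en", territory="US")
assert a is b   # the same cached object, not merely an equal one
```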
By default, it will replace non-standard and overlong tags as it interprets them. To disable this feature and get the codes that literally appear in the language tag, use the normalize=False option.
>>> Language.get('en-Latn-US')
Language.make(language='en', script='Latn', territory='US')
>>> Language.get('sgn-US', normalize=False)
Language.make(language='sgn', territory='US')
>>> Language.get('und')
Language.make()
Here are some examples of replacing non-standard tags:
>>> Language.get('sh-QU')
Language.make(language='sr', script='Latn', territory='EU')
>>> Language.get('sgn-US')
Language.make(language='ase')
>>> Language.get('zh-cmn-Hant')
Language.make(language='zh', script='Hant')
Use the str() function on a Language object to convert it back to its
standard string form:
>>> str(Language.get('sh-QU'))
'sr-Latn-EU'
>>> str(Language.make(territory='IN'))
'und-IN'
Checking validity
A language code is valid when every part of it is assigned a meaning by IANA. That meaning could be "private use".
In langcodes, we check the language subtag, script, territory, and variants for validity. We don't check other parts such as extlangs or Unicode extensions.
For example, ja is a valid language code, and jp is not:
>>> Language.get('ja').is_valid()
True
>>> Language.get('jp').is_valid()
False
The top-level function tag_is_valid(tag) is possibly more convenient to use,
because it can return False even for tags that don't parse:
>>> tag_is_valid('C')
False
If one subtag is invalid, the entire code is invalid:
>>> tag_is_valid('en-000')
False
iw is valid, though it's a deprecated alias for he:
>>> tag_is_valid('iw')
True
The empty language tag (und) is valid:
>>> tag_is_valid('und')
True
Private use codes are valid:
>>> tag_is_valid('x-other')
True
>>> tag_is_valid('qaa-Qaai-AA-x-what-even-is-this')
True
Language tags that are very unlikely are still valid:
>>> tag_is_valid('fr-Cyrl')
True
Tags with non-ASCII characters are invalid, because they don't parse:
>>> tag_is_valid('zh-普通话')
False
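The shape of this check can be sketched without the real registry data. The snippet below is a toy version that checks only the language subtag against a handful of hard-coded codes; langcodes itself consults the full IANA Language Subtag Registry and checks scripts, territories, and variants too.

```python
import re

# Toy snapshot of assigned language subtags (the real registry has thousands).
ASSIGNED_LANGUAGES = {"en", "fr", "ja", "he", "iw", "und"}

def toy_tag_is_valid(tag: str) -> bool:
    # Tags that aren't ASCII letters/digits joined by hyphens don't parse.
    if not re.fullmatch(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", tag):
        return False
    lang = tag.split("-")[0].lower()
    # x-... tags and the qaa..qtz range are private use, hence valid.
    if lang == "x" or re.fullmatch(r"q[a-t][a-z]", lang):
        return True
    return lang in ASSIGNED_LANGUAGES

print(toy_tag_is_valid("ja"))         # True
print(toy_tag_is_valid("jp"))         # False
print(toy_tag_is_valid("zh-普通话"))   # False: doesn't parse
```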
Getting alpha3 codes
Before there was BCP 47, there was ISO 639-2. The ISO tried to make room for the variety of human languages by assigning every language a 3-letter code, including the ones that already had 2-letter codes.
Unfortunately, this just led to more confusion. Some languages ended up with two
different 3-letter codes for legacy reasons, such as French, which is fra as a
"terminology" code, and fre as a "biblographic" code. And meanwhile, fr was
still a code that you'd be using if you followed ISO 639-1.
In BCP 47, you should use 2-letter codes whenever they're available, and that's what langcodes does. Fortunately, all the languages that have two different 3-letter codes also have a 2-letter code, so if you prefer the 2-letter code, you don't have to worry about the distinction.
But some applications want the 3-letter code in particular, so langcodes
provides a method for getting those, Language.to_alpha3(). It returns the
'terminology' code by default, and passing variant='B' returns the
bibliographic code.
This method always returns a 3-letter string.
>>> Language.get('fr').to_alpha3()
'fra'
>>> Language.get('fr-CA').to_alpha3()
'fra'
>>> Language.get('fr-CA').to_alpha3(variant='B')
'fre'
>>> Language.get('de').to_alpha3()
'deu'
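Under the hood, the terminology/bibliographic split is just a two-column lookup table. Here is a toy excerpt as a sketch (hypothetical names; langcodes ships the full ISO 639 mapping):

```python
# Toy excerpt of the alpha-2 → alpha-3 table: 'T' is the terminology code,
# 'B' the bibliographic one. Most languages have identical T and B codes.
ALPHA3 = {
    "fr": {"T": "fra", "B": "fre"},
    "de": {"T": "deu", "B": "ger"},
    "en": {"T": "eng", "B": "eng"},
}

def toy_to_alpha3(alpha2: str, variant: str = "T") -> str:
    return ALPHA3[alpha2][variant.upper()]

print(toy_to_alpha3("fr"))               # fra
print(toy_to_alpha3("fr", variant="B"))  # fre
```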
