name2nat: a Python package for nationality prediction from a name


name2nat is a Python package that predicts the nationality of a name written in Roman letters. For example, it correctly returns Korean for my name, 'Kyubyong Park'. Needless to say, it is not possible to guess somebody's nationality with 100% accuracy from their name alone. After all, nationality can change. However, names and nationalities do tend to correlate, so statistical classifiers work to some extent for this task. Details are explained below.

Disclaimer

I am aware that this topic may be viewed from a political perspective. That is absolutely AGAINST my motivation.

NaNa Dataset

Construction

I constructed a new dataset for this project because I failed to find any available dataset that is big and comprehensive enough.

  • STEP 1. Downloaded and extracted the 20200601 English wiki dump (enwiki-20200601-pages-articles.xml).
  • STEP 2. Iterated over all pages and collected each title and its nationality. I regarded a title as a person if the Category section at the bottom of the page included ... births (green rectangle), and took the most frequent nationality word in that section as the person's nationality (red rectangles). <img src="wiki.png" >
  • STEP 3. Randomly split the data into train/dev/test in the ratio of 8:1:1 within each nationality group.
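STEP 2 and STEP 3 can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the `NATIONALITY_WORDS` vocabulary, the function names, the word-level matching, and the rounding rule are all assumptions. The rounding (80% floored for train, the remainder halved for dev and test) happens to match several rows of the Stats table, for example Afghan: 973 → 778/97/98, but the actual split code is not shown here.

```python
import random
from collections import Counter

# Hypothetical, tiny vocabulary for illustration only.
NATIONALITY_WORDS = {"Korean", "American", "Japanese"}


def nationality_from_categories(categories):
    """Return the most frequent nationality word, or None for non-person pages."""
    # A page counts as a person only if some category ends with "births".
    if not any(c.endswith(" births") for c in categories):
        return None
    counts = Counter(
        word
        for c in categories
        for word in c.split()
        if word in NATIONALITY_WORDS
    )
    return counts.most_common(1)[0][0] if counts else None


def split_group(names, seed=0):
    """Shuffle one nationality group and split it roughly 8:1:1."""
    names = list(names)
    random.Random(seed).shuffle(names)
    n_train = len(names) * 8 // 10          # assumed rounding: floor
    n_dev = (len(names) - n_train) // 2     # assumed: halve the remainder
    train = names[:n_train]
    dev = names[n_train:n_train + n_dev]
    test = names[n_train + n_dev:]
    return train, dev, test


categories = ["1980 births", "South Korean male singers", "Korean pop singers"]
print(nationality_from_categories(categories))

train, dev, test = split_group(range(973))
print(len(train), len(dev), len(test))
```

Splitting inside each nationality group, rather than over the whole dataset, keeps rare nationalities represented in all three subsets.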

Stats

|Nationality|# Samples|Train|Dev|Test|
|--|--|--|--|--|
|Total|1,112,902|890,248|111,286|111,368|
|Afghan|973|778|97|98|
|Albanian|2,742|2,193|274|275|
|Algerian|1,991|1,592|199|200|
|American|302,215|241,772|30,221|30,222|
|Andorran|236|188|24|24|
|Angolan|630|504|63|63|
|Argentine|11,158|8,926|1,116|1,116|
|Armenian|2,001|1,600|200|201|
|Aruban|117|93|12|12|
|Australian|50,670|40,536|5,067|5,067|
|Austrian|11,490|9,192|1,149|1,149|
|Azerbaijani|1,664|1,331|166|167|
|Bahamian|292|233|29|30|
|Bahraini|297|237|30|30|
|Bangladeshi|2,045|1,636|204|205|
|Barbadian|466|372|47|47|
|Basque|1,202|961|120|121|
|Belarusian|2,923|2,338|292|293|
|Belgian|9,884|7,907|988|989|
|Belizean|186|148|19|19|
|Beninese|249|199|25|25|
|Bermudian|338|270|34|34|
|Bhutanese|180|144|18|18|
|Bolivian|822|657|82|83|
|Bosniak|102|81|10|11|
|Botswana|315|252|31|32|
|Brazilian|14,043|11,234|1,404|1,405|
|Breton|148|118|15|15|
|British|57,403|45,922|5,740|5,741|
|Bruneian|144|115|14|15|
|Bulgarian|4,908|3,926|491|491|
|Burkinabé|362|289|36|37|
|Burmese|1,180|944|118|118|
|Burundian|175|140|17|18|
|Cambodian|451|360|45|46|
|Cameroonian|1,286|1,028|129|129|
|Canadian|42,691|34,152|4,269|4,270|
|Catalan|2,147|1,717|215|215|
|Chadian|174|139|17|18|
|Chilean|3,548|2,838|355|355|
|Chinese|11,868|9,494|1,187|1,187|
|Colombian|3,276|2,620|328|328|
|Comorian|68|54|7|7|
|Congolese|44|35|4|5|
|Cuban|2,423|1,938|242|243|
|Cypriot|1,271|1,016|127|128|
|Czech|9,056|7,244|906|906|
|Dane|41|32|4|5|
|Djiboutian|68|54|7|7|
|Dominican|1,976|1,580|198|198|
|Dutch|18,645|14,916|1,864|1,865|
|Ecuadorian|1,093|874|109|110|
|Egyptian|3,471|2,776|347|348|
|Emirati|777|621|78|78|
|English|96,449|77,159|9,645|9,645|
|Equatoguinean|242|193|24|25|
|Eritrean|167|133|17|17|
|Estonian|2,536|2,028|254|254|
|Ethiopian|917|733|92|92|
|Faroese|355|284|35|36|
|Filipino|4,910|3,928|491|491|
|Finn|85|68|8|9|
|French|51,052|40,841|5,105|5,106|
|Gabonese|226|180|23|23|
|Gambian|276|220|28|28|
|Georgian|328|262|33|33|
|German|52,986|42,388|5,299|5,299|
|Ghanaian|2,546|2,036|255|255|
|Gibraltarian|123|98|12|13|
|Greek|7,469|5,975|747|747|
|Grenadian|174|139|17|18|
|Guatemalan|704|563|70|71|
|Guinean|731|584|73|74|
|Guyanese|448|358|45|45|
|Haitian|702|561|70|71|
|Honduran|626|500|63|63|
|Hungarian|9,026|7,220|903|903|
|I-Kiribati|51|40|5|6|
|Indian|28,365|22,692|2,836|2,837|
|Indonesian|3,525|2,820|352|353|
|Iranian|6,263|5,010|626|627|
|Iraqi|1,566|1,252|157|157|
|Irish|14,806|11,844|1,481|1,481|
|Israeli|6,437|5,149|644|644|
|Italian|36,671|29,336|3,667|3,668|
|Jamaican|1,778|1,422|178|178|
|Japanese|26,520|21,216|2,652|2,652|
|Jordanian|613|490|61|62|
|Kazakh|31|24|3|4|
|Kenyan|2,012|1,609|201|202|
|Korean|9,871|7,896|987|988|
|Kuwaiti|496|396|50|50|
|Kyrgyz|20|16|2|2|
|Lao|33|26|3|4|
|Latvian|2,117|1,693|212|212|
|Lebanese|1,558|1,246|156|156|
|Liberian|368|294|37|37|
|Libyan|339|271|34|34|
|Lithuanian|2,474|1,979|247|248|
|Macedonian|1,374|1,099|137|138|
|Malagasy|290|232|29|29|
|Malawian|274|219|27|28|
|Malaysian|3,228|2,582|323|323|
|Maldivian|191|152|19|20|
|Malian|482|385|48|49|
|Maltese|829|663|83|83|
|Manx|188|150|19|19|
|Marshallese|40|32|4|4|
|Mauritanian|120|96|12|12|
|Mauritian|329|263|33|33|
|Mexican|10,810|8,648|1,081|1,081|
|Moldovan|1,250|1,000|125|125|
|Mongolian|631|504|63|64|
|Montenegrin|1,194|955|119|120|
|Moroccan|1,822|1,457|182|183|
|Mozambican|263|210|26|27|
|Namibian|736|588|74|74|
|Nauruan|40|32|4|4|
|Nepalese|967|773|97|97|
|Nicaraguan|357|285|36|36|
|Nigerian|5,075|4,060|507|508|
|Nigerien|179|143|18|18|
|Norwegian|16,891|13,512|1,689|1,690|
|Omani|247|197|25|25|
|Pakistani|4,703|3,762|470|471|
|Palauan|44|35|4|5|
|Palestinian|660|528|66|66|
|Panamanian|593|474|59|60|
|Paraguayan|1,266|1,012|127|127|
|Peruvian|1,902|1,521|190|191|
|Portuguese|5,918|4,734|592|592|
|Qatari|685|548|68|69|
|Romanian|8,189|6,551|819|819|
|Russian|26,593|21,274|2,659|2,660|
|Rwandan|337|269|34|34|
|Salvadoran|634|507|63|64|
|Sammarinese|248|198|25|25|
|Samoan|746|596|75|75|
|Saudi|1,871|1,496|187|188|
|Senegalese|1,029|823|103|103|
|Serb|56|44|6|6|
|Singaporean|1,646|1,316|165|165|
|Slovak|3,584|2,867|358|359|
|Slovene|111|88|11|12|
|Somali|145|116|14|15|
|Sotho|62|49|6|7|
|Sudanese|436|348|44|44|
|Surinamese|250|200|25|25|
|Swazi|143|114|14|15|
|Syriac|98|78|10|10|
|Syrian|1,309|1,047|131|131|
|Taiwanese|2,433|1,946|243|244|
|Tajik|77|61|8|8|
|Tamil|1,749|1,399|175|175|
|Tanzanian|784|627|78|79|
|Thai|3,434|2,747|343|344|
|Tibetan|332|265|33|34|
|Togolese|264|211|26|27|
|Tongan|570|456|57|57|
|Tunisian|1,340|1,072|134|134|
|Turk|99|79|10|10|
|Tuvaluan|83|66|8|9|
|Ugandan|1,316|1,052|132|132|
|Ukrainian|7,748|6,198|775|775|
|Uruguayan|2,834|2,267|283|284|
|Uzbek|78|62|8|8|
|Vanuatuan|146|116|15|15|
|Venezuelan|2,422|1,937|242|243|
|Vietnamese|1,572|1,257|157|158|
|Vincentian|10|8|1|1|
|Welsh|6,588|5,270|659|659|
|Yemeni|403|322|40|41|
|Zambian|638|510|64|64|

Downloadable Link

  • You can download the dataset here.

name2nat

Installation

pip install name2nat

Usage

>>> from name2nat import Name2nat
>>> my_nanat = Name2nat()
>>> names = ["Donald Trump",     # American
...          "Moon Jae-in",      # Korean
...          "Shinzo Abe",       # Japanese
...          "Xi Jinping",       # Chinese
...          "Joko Widodo",      # Indonesian
...          "Angela Merkel",    # German
...          "Emmanuel Macron",  # French
...          "Kyubyong Park",    # Korean
...          "Yamamoto Yu",      # Japanese
...          "Jing Xu"]          # Chinese
>>> result = my_nanat(names, top_n=3)
>>> print(result)
# (name, [(nationality, prob), ...])
# Note that prob of 1.0 indicates the name exists
# in Wikipedia.
[
('Donald Trump', [('American', 1.0)]),
('Moon Jae-in', [('Korean', 1.0)]),
('Shinzo Abe', [('Japanese', 1.0)]),
('Xi Jinping', [('Chinese', 1.0)]),
('Joko Widodo', [('Indonesian', 1.0)]),
('Angela Merkel', [('German', 1.0)]),
('Emmanuel Macron', [('French', 1.0)]),
('Kyubyong Park', [('Korean', 0.9985014200210571), ('American', 0.000289416522718966), ('Bhutanese', 0.00025851925602182746)]),
('Yamamoto Yu', [('Japanese', 0.7050493359565735), ('Taiwanese', 0.12779785692691803), ('Chinese', 0.04263153299689293)]),
('Jing Xu', [('Chinese', 0.8626819252967834), ('Taiwanese', 0.09901007264852524), ('American', 0.022995812818408012)])
]
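Since the result is a list of `(name, [(nationality, prob), ...])` pairs, it is easy to post-process. The sketch below reduces each entry to its single most probable nationality and drops low-confidence guesses. The helper `top_prediction` and the `min_prob` threshold are hypothetical and not part of name2nat; the sample data is an abridged copy of the output above, so the snippet runs without the package installed.

```python
def top_prediction(result, min_prob=0.5):
    """Map each name to its most probable nationality, or None if below min_prob."""
    best = {}
    for name, candidates in result:
        # Use max() rather than relying on the candidates being sorted.
        nationality, prob = max(candidates, key=lambda pair: pair[1])
        best[name] = nationality if prob >= min_prob else None
    return best


# Abridged copy of the output shown above.
sample = [
    ("Donald Trump", [("American", 1.0)]),
    ("Yamamoto Yu", [("Japanese", 0.7050493359565735),
                     ("Taiwanese", 0.12779785692691803),
                     ("Chinese", 0.04263153299689293)]),
]

print(top_prediction(sample))
print(top_prediction(sample, min_prob=0.8))
```

A stricter threshold trades coverage for reliability: with `min_prob=0.8`, 'Yamamoto Yu' is left unlabeled rather than guessed.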

Training

I use Flair, a powerful NLP library, to train a text classifier. The model employs a bidirectional GRU layer.

python train.py

Evaluation

python predict.py
python eval.py --gt nana/test.tgt --pred test.pred

Results

|K|Precision@K (%)|
|--|--|
|1|61310/111368 = 55.1|
|2|77480/111368 = 69.6|
|3|86703/111368 = 77.9|
|4|92491/111368 = 83.0|
|5|96697/111368 = 86.8|
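For reference, Precision@K is read here as top-K accuracy: a sample counts as a hit if its gold nationality appears among the K highest-ranked predictions, which is what the hits/total fractions above suggest. The sketch below computes it on toy data. The function name and the in-memory inputs are assumptions; it does not reproduce eval.py or the file formats of nana/test.tgt and test.pred.

```python
def precision_at_k(gold, ranked_preds, k):
    """Return (hits, total), where a hit means the gold label is in the top-k predictions."""
    hits = sum(1 for label, preds in zip(gold, ranked_preds) if label in preds[:k])
    return hits, len(gold)


# Toy data, not taken from the NaNa test set.
gold = ["Korean", "Japanese", "Chinese", "French"]
ranked_preds = [
    ["Korean", "American", "Bhutanese"],
    ["Taiwanese", "Japanese", "Chinese"],
    ["Chinese", "Taiwanese", "American"],
    ["German", "Belgian", "Dutch"],
]

for k in (1, 2, 3):
    hits, total = precision_at_k(gold, ranked_preds, k)
    print(f"P@{k}: {hits}/{total}={100 * hits / total:.1f}")
```

By construction the hit count can only grow with K, which is why the reported figures rise from 55.1 at K=1 to 86.8 at K=5.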

Applications

Let's predict the nationalities of the first authors at recent machine learning conferences.

  • Check conferences.py and conferences/lrec2020.md
  • Contributions (PRs) are welcome!

References

If you use this code for research, please cite:

@misc{park2018name2nat,
  author = {Park, Kyubyong},
  title = {name2nat: a Python package for nationality prediction from a name},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/Kyubyong/name2nat}}
}
