Predicting the Human Development Index

Building a supervised random forest machine learning model to predict the Human Development Index (HDI) by utilizing the World Bank's World Development Indicators and UNDP Human Development Data.

Table of Contents

<ol> <li><a href="#Project-Overview">Project Overview</a> <li><a href="#Data-Engineering">Data Engineering</a> <li><a href="#Exploratory-Data-Analysis">Exploratory Data Analysis</a> <li><a href="#Machine-Learning-Prediction-Models">Machine Learning</a> <ul> <li><a href="#random-forest-regression">Random Forest Regression</a> <li><a href="#random-forest-classification">Random Forest Classification</a> </ul> <li><a href="#discussion">Discussion</a> <li><a href="#Using-Actual-Indicators">Using Actual Indicators</a> <li><a href="#conclusion">Conclusion</a> <li><a href="#acknowledgements">Acknowledgements</a> </ol> <hr>

Project Overview

The World Bank has a large database called the World Development Indicators (WDI). According to the World Bank, the WDI are a "compilation of cross-country, relevant, high-quality, and internationally comparable statistics about global development and the fight against poverty." This data is free and open to the public for use. The WDI database contains a vast array of socioeconomic indicators related to population, GDP, education, health, human rights, labor, trade, land use, and so on. The WDI is one of the most significant international databases and contains around 1300 indicators for almost every country in the world, with the earliest indicators starting in 1960 (Van Der Mensbrugghe, 2016).

The United Nations Development Programme (UNDP) collects and stores international data for monitoring and reporting on multiple human development indices, such as poverty, gender equality, sustainability, and so on. This project will focus on predicting the Human Development Index (HDI). According to the UNDP, "the HDI was created to emphasize that people and their capabilities should be the ultimate criteria for assessing the development of a country, not economic growth alone."

The entire project is coded in R and consists of 4 key steps (each in separate R Markdown files):

<ol> <li>Data Engineering: Scraping, merging, cleaning, and transforming data. </li> <li>Exploratory Data Analysis: Analyzing variables for correlation and regression to build final data frame(s). </li> <li>Prediction with Machine Learning: Using the final variables to build 2 random forest models (regression and classification).</li> <li>Bonus: Using true indicators to predict HDI.</li> </ol>

Data Engineering

View the R Markdown file for this step

<details open="open"> <summary>Using the WDI API to scrape indicator data</summary> There are two methods for accessing WDI data. The first is to build a report using the World Bank’s web-based graphical user interface (GUI) and download the query results. The second method uses an Application Programming Interface (API). The API has been integrated into an R package that simplifies the extraction process and allows the data to be downloaded and used directly in R. Each indicator has a vector code that is used by the querying and downloading functions within R. There are several ways to find the vector codes for specific indicators or for indicators containing a keyword. In R, the WDIsearch() function returns every indicator that matches a keyword search. There is also a [metadata glossary](https://databank.worldbank.org/metadataglossary/World-Development-Indicators/series) with detailed information and vector codes for all indicators. </details>

The WDI library is installed and loaded like any standard package:

install.packages("WDI")
library(WDI)

The WDI function to access and download data:

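# download multiple indicators into one data frame
# (replace each "vector code" placeholder with a real indicator code;
# 1990 and 2018 are the start and end years used in this project)
dataframe <- WDI(indicator = c("vector code", "vector code"), country = "all", start = 1990, end = 2018)

# download a single indicator into a data frame
dataframe <- WDI(indicator = "vector code", country = "all", start = 1990, end = 2018)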
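
As a rough illustration of the keyword search mentioned earlier, the sketch below looks up indicators whose names contain a keyword. The keyword is only an example, and the `requireNamespace()` guard is an assumption added here so the snippet does nothing when the WDI package is not installed.

```r
# Sketch only: the keyword is an illustrative assumption, not part of the project
keyword <- "life expectancy"

if (requireNamespace("WDI", quietly = TRUE)) {
  # WDIsearch() matches the keyword against indicator names in the
  # local list of series that ships with the package
  matches <- WDI::WDIsearch(string = keyword, field = "name")

  # The "indicator" column holds the vector codes that are passed to WDI()
  print(head(matches))
}
```

Any code found this way can then be dropped into the `indicator` argument of `WDI()`.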

To download data for this project, I created a separate data frame for each indicator I wanted to analyze, which made it easier to analyze and update each one as needed throughout the process. I included all countries and selected the years 1990 to 2018 because earlier years have more NULL values. The following WDI indicators were downloaded:

<ul> <li>Population <li>GDP per capita (constant 2010 US$) <li>GDP Per capita income <li>Population density (people per sq. km of land area) <li>Greenhouse Gas Emissions (kt) <li>Total CO2 emissions (kt) <li>CO2 emissions (metric tons per capita) <li>PM2.5 air pollution, mean annual exposure (micrograms per cubic meter) <li>Birth rate, crude (per 1,000 people) <li>Fertility rate, total (births per woman) <li>Imports of goods and services (% of GDP) <li>Exports of goods and services (% of GDP) <li>Life expectancy at birth, total (years) <li>Mortality rate, infant (per 1,000 live births) <li>Mortality rate, under-5 (per 1,000 live births) <li>Unemployment, total (% of total labor force) (modeled ILO estimate) <li>Adjusted net enrolment rate, lower secondary <li>Adjusted net enrolment rate, primary <li>Adjusted net enrolment rate, upper secondary <li>Adult literacy rate, population 15+ years, both sexes (%) <li>Initial government funding of education as a percentage of GDP (%) <li>Expected Years Of School </ul>

Cleaning and joining WDI data

The API function downloads each indicator with only the region, the country code, and a column named after the indicator's vector code. To prepare and clean the data, I renamed the indicator column to a recognizable name. The data will need to be joined to country data later.

# example
names(population)[3]="population"
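
Renaming by position works as long as the indicator is always the third column. A slightly more defensive variant, sketched here on a toy data frame, renames the column by matching its vector code instead. The data frame layout and values are made up for illustration; `SP.POP.TOTL` is the WDI code for total population.

```r
# Toy stand-in for a WDI() download; the values are made up for illustration
population <- data.frame(
  iso2c = c("AA", "BB"),
  country = c("Country A", "Country B"),
  SP.POP.TOTL = c(1000, 2000),
  year = c(2018, 2018),
  stringsAsFactors = FALSE
)

# Rename the indicator column by matching its vector code, not its position
names(population)[names(population) == "SP.POP.TOTL"] <- "population"

names(population)
```

This keeps working if the column order of the download differs from what the positional index expects.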

I then calculated the percentage of NULL values for each indicator to determine any that would be eliminated due to sparse data.

# example
print("population")
# percentage of rows where the indicator is missing
mean(is.na(population$population)) * 100

This resulted in the following:

<table> <tr> <th>Indicator</th> <th>% NULL</th> </tr> <tr> <td>population</td> <td>0.61%</td> </tr> <tr> <td>gdp.pc</td> <td>9.8%</td> </tr> <tr> <td>gdp.pc.income</td> <td>12.2%</td> </tr> <tr> <td>pop.density</td> <td>1.8%</td> </tr> <tr> <td>greenhouse.gas</td> <td>31.0%</td> </tr> <tr> <td>co2</td> <td>13.4%</td> </tr> <tr> <td>co2.pc</td> <td>13.5%</td> </tr> <tr> <td>pollution.expose</td> <td>62.4%</td> </tr> <tr> <td>birth.rate</td> <td>5.2%</td> </tr> <tr> <td>fertility.rate</td> <td>7.0%</td> </tr> <tr> <td>imports.gs</td> <td>16.4%</td> </tr> <tr> <td>exports.gs</td> <td>16.4%</td> </tr> <tr> <td>life.exp</td> <td>7.1%</td> </tr> <tr> <td>infant.mort.rate</td> <td>9.5%</td> </tr> <tr> <td>under5.mort.rate</td> <td>9.5%</td> </tr> <tr> <td>unemployment</td> <td>14.8%</td> </tr> <tr> <td>edu.lower</td> <td>69.2%</td> </tr> <tr> <td>edu.primary</td> <td>52.2%</td> </tr> <tr> <td>edu.upper</td> <td>83.8%</td> </tr> <tr> <td>literacy</td> <td>74.6%</td> </tr> <tr> <td>edu.funding</td> <td>64.5%</td> </tr> <tr> <td>edu.years</td> <td>77.0%</td> </tr> </table>
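
A table like this could also be produced in one pass rather than one indicator at a time. The sketch below assumes the indicator data frames are collected in a named list where each name matches the indicator column; the list, the data frames, and their values are toy examples, not the project's data.

```r
# Toy indicator data frames; the values are made up for illustration
population <- data.frame(country = c("A", "B", "C", "D"),
                         population = c(10, 20, NA, 40))
literacy <- data.frame(country = c("A", "B", "C", "D"),
                       literacy = c(NA, NA, NA, 90))

# Named list: each name matches the indicator column in its data frame
indicators <- list(population = population, literacy = literacy)

# Percentage of missing values per indicator
pct_na <- sapply(names(indicators), function(nm) {
  mean(is.na(indicators[[nm]][[nm]])) * 100
})

round(pct_na, 1)
```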

I decided to exclude any indicators with more than 15% NULL values. Unfortunately, this meant I was left without any education indicators. Nonetheless, I joined the individual data frames with less than 15% NULL values to create a single data frame called WDI.key. I then joined this data frame to the country details in WDI in case I wanted or needed to analyze at various levels in the future. This is the resulting data frame structure.
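
The filtering and joining steps described above could look roughly like the following sketch. The join keys (`iso2c` and `year`), the use of a full outer join, and the toy data are all assumptions made for illustration; the project's own R Markdown file may do this differently.

```r
# Toy indicator data frames keyed by country code and year (made-up values)
population <- data.frame(iso2c = c("AA", "BB"), year = 2018,
                         population = c(1000, 2000))
life.exp <- data.frame(iso2c = c("AA", "BB"), year = 2018,
                       life.exp = c(70, 80))
literacy <- data.frame(iso2c = c("AA", "BB"), year = 2018,
                       literacy = c(NA, 95))

indicators <- list(population = population, life.exp = life.exp,
                   literacy = literacy)

# Keep only indicators with at most 15% missing values
pct_na <- sapply(names(indicators), function(nm) {
  mean(is.na(indicators[[nm]][[nm]])) * 100
})
kept <- indicators[pct_na <= 15]

# Join the remaining data frames pairwise on the shared keys
WDI.key <- Reduce(
  function(x, y) merge(x, y, by = c("iso2c", "year"), all = TRUE),
  kept
)

WDI.key
```

Here the toy `literacy` indicator is half missing, so it is dropped before the join.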

Adding UNDP Data

UNDP Human Development Data can easily be downloaded as csv files at http://hdr.undp.org/en/data. I downloaded the files and cleaned up country names to match the WDI names using Excel before importing into R. It is possible to do this in R, but I felt Excel was more efficient. The UNDP data being joined to the WDI data includes:

<ul> <li>GNI Per Capita <li>Human Development Index (HDI) <li>Education Index <li>Income Index </ul>
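
For readers who prefer to keep the country-name cleanup in R, one possible approach is a small lookup vector that maps UNDP spellings to WDI spellings before joining. The sketch below is not what this project did (the names were cleaned in Excel); the lookup entry, the assumed spellings, and the placeholder values are illustrative only.

```r
# Toy UNDP-style and WDI-style tables; all numbers are placeholder values
undp <- data.frame(country = c("Korea (Republic of)", "Norway"),
                   year = 2018,
                   hdi = c(0.5, 0.6),
                   stringsAsFactors = FALSE)
wdi <- data.frame(country = c("Korea, Rep.", "Norway"),
                  year = 2018,
                  population = c(100, 50),
                  stringsAsFactors = FALSE)

# Lookup from UNDP spelling to WDI spelling (illustrative, not exhaustive)
name_map <- c("Korea (Republic of)" = "Korea, Rep.")

# Replace only the names that appear in the lookup
needs_fix <- undp$country %in% names(name_map)
undp$country[needs_fix] <- unname(name_map[undp$country[needs_fix]])

# Join on the harmonized country name and year
joined <- merge(wdi, undp, by = c("country", "year"))
joined
```

Countries missing from the lookup keep their original spelling, so any that still fail to match would simply drop out of this inner join and can be spotted by comparing row counts.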

After a bit of clean up, joining the UNDP data to the WDI.key data frame, and validation, this is the resulting final key.ind data frame of the Data Engineering phase that will be used for exploratory analysis.

Exploratory Data Analysis

View the R Markdown file for this step

Correlation Matrix

To begin analysis, I removed any rows with NULL values and all non-numerical columns from the key.ind data frame in order to create a correlation matrix. This matr
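
The preparation described in this step (dropping incomplete rows and non-numeric columns, then computing correlations) can be sketched in base R as follows. The toy `key.ind` data frame and its values are made up for illustration, and Pearson correlation is assumed because it is the default for `cor()`.

```r
# Toy stand-in for key.ind; the values are made up for illustration
key.ind <- data.frame(
  country = c("A", "B", "C", "D", "E"),
  hdi = c(0.50, 0.60, 0.70, 0.80, NA),
  life.exp = c(60, 65, 70, 75, 80),
  fertility.rate = c(5, 4, 3, 2, 1),
  stringsAsFactors = FALSE
)

# Drop rows with any missing value, then keep only numeric columns
complete <- na.omit(key.ind)
numeric_only <- complete[, sapply(complete, is.numeric), drop = FALSE]

# Pearson correlation matrix (the default method for cor())
cor_matrix <- cor(numeric_only)
round(cor_matrix, 2)
```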
