Investigating Income Inequality, Corruption, Health and Literacy

Author: David Oguara

** May 2018 **

Introduction

For this project, we will be investigating income inequality, corruption perception, literacy and health. We will consider which countries are the best and worst in each of these metrics, and seek to uncover relationships between these metrics within countries of the world.

Dataset Selection

Gapminder data on literacy, poverty and inequality, and incomes and growth has been selected for use in this analysis. This data comes from multiple original sources, including World Bank, UNESCO, and Transparency International.

For consistency, we have selected a five-year period, 2007 to 2011, as the active window for investigation. Where gapminder is missing data within this window, we have sourced data directly from the source organization.

This analysis is not focused on trends over time; rather, it seeks to identify relationships among variables related to socio-economic indicators. Therefore, in most cases, averages over this five-year window are used to investigate these correlations. Five years is seen as short enough to minimise changes in trendline for each variable, and long enough that gaps in data can be filled from alternate years.

CSV files and their sources are listed below:

Dataset / Provider / Filename csv

Corruption Perception Index (CPI) / Transparency International / corruption_2008_2009.csv, corruption_2007_2010_2011.csv

Infant Mortality Rate (rate per 1000 births) / Various sources / infant.csv

Literacy rate, adult total (% of people ages 15 and above) / UNESCO / literacy.csv

Inequality index (Gini) / The World Bank (WEBSITE) / inequality.csv

Poverty (% people below USD1.90 a day) / The World Bank / poverty_2usd.csv

GDP/capita (USD, inflation-adjusted) / The World Bank / gdp.csv

Investigation Questions

The following questions will be investigated in this analysis.

Q1. Which five countries have the highest and lowest income inequality?

Q2. Is there a relationship between income inequality and corruption?

Q3. Is GDP per capita a good predictor of income inequality?

Q4. Does a correlation exist between literacy and health?

The packages inported below are required to conduct this analysis.

#Import packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
% matplotlib inline

Data Wrangling

In this section, we will load the data, inspect it for fitness, and then perform any necessary cleaning and trimming operations to get the data ready for analysis.

General Properties

Corruption Perception Index (CPI) - Data Inspection

From Transparency International website https://www.transparency.org/cpi2011/results:

The Corruption Perception Index ranks countries/territories based on how corrupt their public sector is perceived to be. A country/territory’s score indicates the perceived level of public sector corruption on a scale of 0 - 10, where 0 means that a country is perceived as highly corrupt and 10 means that a country is perceived as very clean.

# Import Gapminder cpi data

df_corr_a = pd.read_csv('corruption_2008_2009.csv')
df_corr_a.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Country	2008	2009
0	New Zealand	9.4	9.3
1	Denmark	9.3	9.3
2	Sweden	9.2	9.2
3	Singapore	9.2	9.3
4	Switzerland	9.0	8.7

# Check dimensions

df_corr_a.shape

(180, 3)

df_corr_a.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 3 columns):
Country    180 non-null object
2008       180 non-null float64
2009       176 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.3+ KB

Gapminder data only contains 2008 and 2009 data. Note that this data is stored in table format, with years as columns. It will need to be converted to long format, with years in a single column.

Further, our analysis requires 2007 to 2011 data. We have sourced the missing years data from Transparency International website, https://www.transparency.org/research/cpi/overview

# Import missing years data

df_corr_b = pd.read_csv('corruption_2007_2010_2011.csv')
df_corr_b.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score
0	Denmark	2007	9.4
1	Finland	2007	9.4
2	New Zealand	2007	9.4
3	Singapore	2007	9.3
4	Sweden	2007	9.3

# Confirm data for desired years is present

df_corr_b.year.value_counts()

2011    183
2007    180
2010    178
Name: year, dtype: int64

# Inspect data types

df_corr_b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541 entries, 0 to 540
Data columns (total 3 columns):
country      541 non-null object
year         541 non-null int64
cpi_score    541 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 12.8+ KB

Data for 2007, 2010, 2011 is stored in ideal long format. Tasks for this dataset will be:

transform gapminder subset data into long format
check and convert data types as needed
combine gapminder data with manually downloaded missing years data.

Let us proceed to import and inspect the other datasets and identify any data-cleansing tasks.

Infant Mortality Rate (rate per 1000 births) - Data Inspection

According to a 2003 paper published in Journal of Epidiemology and Public Health:

The infant mortality rate (IMR), defined as the number of deaths in children under 1 year of age per 1000 live births in the same year, has in the past been regarded as a highly sensitive (proxy) measure of population health.2 This reflects the apparent association between the causes of infant mortality and other factors that are likely to influence the health status of whole populations such as their economic development, general living conditions, social well being, rates of illness, and the quality of the environment.

Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1732453/

We have therefore selected infant mortality rate as a measure of whole-country population health for this investigation.

Data has been sourced from Gapminder.

# Import infant_mortality data
df_infant = pd.read_csv('infant.csv')
df_infant.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Infant mortality rate	1800	1801	1802	1803	1804	1805	1806	1807	1808	...	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	82.3	80.4	78.6	76.8	75.1	73.4	71.7	69.9	68.1	66.3
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	17.4	16.7	16.0	15.4	14.8	14.3	13.8	13.3	12.9	12.5
4	Algeria	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	27.6	26.4	25.3	24.3	23.5	22.8	22.4	22.1	22.0	21.9

5 rows × 217 columns

Literacy rate, adult total (% of people ages 15 and above) - Data Inspection

Adult literacy rate is the percentage of people aged 15 years and above who can read and write. Data is almost entirely collected by UNESCO Institute for Statistis on behalf of UNESCO, and is collated by mostly surveys wherein participants self-declare. The global literacy rate for adults 15 years and older is 86% as at 2015.

Our dataset will investigate average literacy from 2007 to 2011, and identify any relationships to other metrics like health, income inequality and corruption perception.

# Import literacy data
df_literacy = pd.read_csv('literacy.csv')
df_literacy.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Adult (15+) literacy rate (%). Total	1975	1976	1977	1978	1979	1980	1981	1982	1983	...	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011
0	Afghanistan	NaN	NaN	NaN	NaN	18.157681	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	39.000000
1	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	95.93864	NaN	NaN	96.845299
2	Algeria	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	69.8735	NaN	NaN	NaN	72.648679	NaN	NaN	NaN	NaN	NaN
3	Andorra	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	Angola	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	70.362420

5 rows × 38 columns

Inequality index (Gini) - Data Inspection

World Bank website, https://data.worldbank.org/indicator/SI.POV.GINI explains the GINI index:

Gini index measures the extent to which the distribution of income (or, in some cases, consumption expenditure) among individuals or households within an economy deviates from a perfectly equal distribution. A Lorenz curve plots the cumulative percentages of total income received against the cumulative number of recipients, starting with the poorest individual or household. The Gini index measures the area between the Lorenz curve and a hypothetical line of absolute equality, expressed as a percentage of the maximum area under the line. Thus a Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

Gapminder data only goes to 2010. For convenience, GINI data 2007 to 2011 is downloaded from the Word Bank website above. This will allow us to match time period of data in other comparison variables.

# Import inequality index data data
df_inequality = pd.read_csv('inequality.csv')
df_inequality.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Country Name	1960	1961	1962	1963	1964	1965	1966	1967	1968	...	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
0	Aruba	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Angola	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	42.7	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	30.0	NaN	NaN	NaN	29.0	NaN	NaN	NaN	NaN	NaN
4	Andorra	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 59 columns

Poverty (% people below USD1.90 a day)

According to World Bank website https://data.worldbank.org/indicator/SI.POV.DDAY:

Poverty headcount ratio at USD1.90 a day is the percentage of the population living on less than USD1.90 a day at 2011 international prices. As a result of revisions in PPP exchange rates, poverty rates for individual countries cannot be compared with poverty rates reported in earlier editions.

Given the above, we will select and use data for 2011 only

# Import poverty index data data
df_poverty = pd.read_csv('poverty_below1_90usd.csv')
df_poverty.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Country Name	1960	1961	1962	1963	1964	1965	1966	1967	1968	...	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017
0	Aruba	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Angola	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	30.1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	0.4	NaN	NaN	NaN	1.1	NaN	NaN	NaN	NaN	NaN
4	Andorra	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 59 columns

# Select only 2011 data
df_poverty = df_poverty.loc[:,['Country Name','2011']]
df_poverty.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Country Name	2011
0	Aruba	NaN
1	Afghanistan	NaN
2	Angola	NaN
3	Albania	NaN
4	Andorra	NaN

GDP/capita (USD, inflation-adjusted)

According to Investopedia website:

Per capita GDP is a measure of the total output of a country that takes the gross domestic product (GDP) and divides it by the number of people in that country. The per capita GDP is especially useful when comparing one country to another, because it shows the relative performance of the countries. A rise in per capita GDP signals growth in the economy and tends to reflect an increase in productivity.

Read more: Per Capita GDP https://www.investopedia.com/terms/p/per-capita-gdp.asp#ixzz5ER0jDGOi

GDP per capita data is sourced entirely from Gapminder

# Import gdp per capita data
df_gdp = pd.read_csv('gdp.csv')
df_gdp.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	Income per person (fixed 2000 US$)	1960	1961	1962	1963	1964	1965	1966	1967	1968	...	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011
0	Abkhazia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	Afghanistan	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	Akrotiri and Dhekelia	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	Albania	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	1313.722725	1381.040832	1454.022854	1525.723589	1594.495067	1681.613910	1804.419415	1857.352947	1915.424459	1965.707230
4	Algeria	1280.384828	1085.414612	855.947986	1128.41578	1170.323896	1215.015783	1127.614288	1200.558225	1291.863983	...	1871.921986	1971.512803	2043.135713	2115.186028	2124.957754	2155.485231	2173.787903	2192.703976	2231.980246	2255.225482

5 rows × 53 columns

We have now imported all the data we need for our analysis. During this process we have made a number of observations about the data, and identified initial actions that will need to be carried out to bring the data into a fit state for analysis.

Data Cleaning

These are our observations on imported data, and the cleaning actions that will need to be performed.

Data is held in table format and will need to be converted to long format, so that years appear in a single column
Some data will need to be combined, particularly corruption perception data
Since we only require 2007 to 2011, data will need to be trimmed to this window
Data types will need to be checked and converted as needed
The column names for the metrics we wish to investigate will need to be renamed for ease of referencing

Let us now proceed to carry out these cleaning operations.

Corruption Perception Index (CPI) - Data Cleaning

# Transpose Gapminder 2008/2009 data into long format, using pd.melt() function

df_corr_a = pd.melt(df_corr_a, id_vars='Country', var_name='year', value_name='cpi_score')

# Rename columns to lowercase to match both datasets
# Recall that df_corr_a holds 2008 & 2009 data, and df_corr_b contains data for missing years (2007, 2010, 2011)

df_corr_a.rename(columns=lambda x: x.lower(), inplace=True);
df_corr_b.rename(columns=lambda x: x.lower().replace(' ', '_'), inplace=True);

# Check both corruption_perception dataset columns are aligned

df_corr_a.columns == df_corr_b.columns

array([ True,  True,  True])

# Combine corruption_perception datasets into single dataframe

df_corruption = df_corr_a.append(df_corr_b, ignore_index=True)

# Check number of records in combined dataframe

df_corruption.shape

(901, 3)

# Confirm all years data present

df_corruption.year.value_counts()

2011    183
2007    180
2009    180
2008    180
2010    178
Name: year, dtype: int64

# Check datatypes

df_corruption.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 901 entries, 0 to 900
Data columns (total 3 columns):
country      901 non-null object
year         901 non-null object
cpi_score    897 non-null float64
dtypes: float64(1), object(2)
memory usage: 21.2+ KB

# Inspect 'NaN' records
df_corruption[pd.isnull(df_corruption).any(axis=1)]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score
201	Saint Lucia	2009	NaN
210	Saint Vincent and the Grenadines	2009	NaN
219	Brunei Darussalam	2009	NaN
255	Suriname	2009	NaN

# Inpsect 'year' object

type(df_corruption['year'][0])

str

Observation: 'year' appears to be string object and will need to be converted to integer. Also, NaN values will need to be removed

# convert year to int, remove 'NaN' records
df_corruption.year = pd.to_numeric(df_corruption.year, errors='coerce')
df_corruption.dropna(inplace=True)

# Check data types and dataframe size

df_corruption.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 897 entries, 0 to 900
Data columns (total 3 columns):
country      897 non-null object
year         897 non-null int64
cpi_score    897 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 28.0+ KB

df_corruption.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score
0	New Zealand	2008	9.4
1	Denmark	2008	9.3
2	Sweden	2008	9.2
3	Singapore	2008	9.2
4	Switzerland	2008	9.0

We have now cleaned the 'corruption perception' dataset and brought it to a fit state for further work. Cleaning actions we successfully performed include:

Transposing 2008/2009 data into long format
Renaming columns
Combining 2008/2009 data with 2007/2010/2011 data
Converted data types
Removed NaN records

We will proceed to clean the next dataset, infant mortality rate.

Infant Mortality Rate (rate per 1000 births) - Data Cleaning

# Rename columns and Transpose data to long format
df_infant.rename(columns=lambda x: x.replace('Infant mortality rate', 'country'), inplace=True)
df_infant = pd.melt(df_infant, id_vars='country', var_name='year', value_name='infant_mortality_rate')

# Filter data down to desired years, 2007 to 2011
y1 = '2007'
y2 = '2011'
df_infant = df_infant.query('year >= @y1 and year <= @y2')

# convert year to int and infant_mortality_rate to float, remove 'NAs', reset indexes 
df_infant.year = pd.to_numeric(df_infant.year, errors='coerce')
df_infant.infant_mortality_rate = pd.to_numeric(df_infant.infant_mortality_rate, errors='coerce')
df_infant.dropna(inplace=True)
df_infant.reset_index(drop=True, inplace=True)
df_infant.shape

(983, 3)

# Check records per year
df_infant.year.value_counts()

2008    199
2007    199
2011    195
2010    195
2009    195
Name: year, dtype: int64

#Check data types, NaNs
df_infant.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 983 entries, 0 to 982
Data columns (total 3 columns):
country                  983 non-null object
year                     983 non-null int64
infant_mortality_rate    983 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 23.1+ KB

df_infant.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	infant_mortality_rate
0	Afghanistan	2007	80.4
1	Albania	2007	16.7
2	Algeria	2007	26.4
3	Andorra	2007	2.8
4	Angola	2007	117.1

Infant mortality rate data is now cleaned.

Literacy rate, adult total (% of people ages 15 and above) - Data Cleaning

# Rename columns and Transpose data to long format

df_literacy.rename(columns=lambda x: x.replace('Adult (15+) literacy rate (%). Total', 'country'), inplace=True)
df_literacy = pd.melt(df_literacy, id_vars='country', var_name='year', value_name='literacy_rate')
df_literacy.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	literacy_rate
0	Afghanistan	1975	NaN
1	Albania	1975	NaN
2	Algeria	1975	NaN
3	Andorra	1975	NaN
4	Angola	1975	NaN

# Filter data down to desired years, 2006 to 2010
y1 = '2007'
y2 = '2011'
df_literacy = df_literacy.query('year >= @y1 and year <= @y2')

# convert year to int and literacy_rate to float, remove 'NAs', reset indexes 

df_literacy.year = pd.to_numeric(df_literacy.year, errors='coerce')
df_literacy.literacy_rate = pd.to_numeric(df_literacy.literacy_rate, errors='coerce')
df_literacy.dropna(inplace=True)
df_literacy.reset_index(drop=True, inplace=True)
df_literacy.shape

(206, 3)

# Check records per year
df_literacy.year.value_counts()

2011    84
2010    35
2007    32
2009    28
2008    27
Name: year, dtype: int64

# Check data types, null values
df_literacy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206 entries, 0 to 205
Data columns (total 3 columns):
country          206 non-null object
year             206 non-null int64
literacy_rate    206 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 4.9+ KB

df_literacy.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	literacy_rate
0	Azerbaijan	2007	99.601906
1	Bolivia	2007	90.743470
2	Brazil	2007	90.009370
3	Burkina Faso	2007	28.729214
4	Cameroon	2007	70.679938

Literacy rate data is now cleaned.

Inequality index (Gini) - Data Cleaning

# Rename columns and Transpose data to long format
df_inequality.rename(columns=lambda x: x.replace('Country Name', 'country'), inplace=True)
df_inequality = pd.melt(df_inequality, id_vars='country', var_name='year', value_name='gini_index')
df_inequality.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	gini_index
0	Aruba	1960	NaN
1	Afghanistan	1960	NaN
2	Angola	1960	NaN
3	Albania	1960	NaN
4	Andorra	1960	NaN

# Filter data down to desired years, 2007 to 2011
y1 = '2007'
y2 = '2011'
df_inequality = df_inequality.query('year >= @y1 and year <= @y2')

# Remove 'NaN' values, reset indexes, convert year to int and gini_index to float
df_inequality.dropna(inplace=True)
df_inequality.reset_index(drop=True, inplace=True)
df_inequality.year = pd.to_numeric(df_inequality.year, errors='coerce')
df_inequality.gini_index = pd.to_numeric(df_inequality.gini_index, errors='coerce')
df_inequality.shape

(376, 3)

# Check records per year
df_inequality.year.value_counts()

2010    83
2009    76
2011    74
2007    72
2008    71
Name: year, dtype: int64

# Check data types and null values
df_inequality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 376 entries, 0 to 375
Data columns (total 3 columns):
country       376 non-null object
year          376 non-null int64
gini_index    376 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 8.9+ KB

df_inequality.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	gini_index
0	Argentina	2007	46.3
1	Armenia	2007	31.2
2	Austria	2007	30.6
3	Belgium	2007	29.2
4	Bulgaria	2007	36.1

Income inequality dataset is now cleaned and ready for further work.

Poverty (% people below USD1.90 a day) - Data Cleaning

# Rename columns and Transpose data to long format
df_poverty.rename(columns=lambda x: x.replace('Country Name', 'country'), inplace=True)
df_poverty = pd.melt(df_poverty, id_vars='country', var_name='year', value_name='poverty_index')
df_poverty.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	poverty_index
0	Aruba	2011	NaN
1	Afghanistan	2011	NaN
2	Angola	2011	NaN
3	Albania	2011	NaN
4	Andorra	2011	NaN

# Remove 'NaN' values, reset indexes, convert year to int and poverty_index to float
df_poverty.dropna(inplace=True)
df_poverty.reset_index(drop=True, inplace=True)
df_poverty.year = pd.to_numeric(df_poverty.year, errors='coerce')
df_poverty.poverty_index = pd.to_numeric(df_poverty.poverty_index, errors='coerce')
df_poverty.shape

(90, 3)

# Confirm only 2011 data present
df_poverty.year.value_counts()

2011    90
Name: year, dtype: int64

# Check data types, null values
df_poverty.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90 entries, 0 to 89
Data columns (total 3 columns):
country          90 non-null object
year             90 non-null int64
poverty_index    90 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.2+ KB

df_poverty.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	poverty_index
0	Argentina	2011	0.9
1	Armenia	2011	2.2
2	Austria	2011	0.5
3	Belgium	2011	0.2
4	Benin	2011	53.1

We have now completed cleaning of the poverty_index dataset.

GDP/capita (USD, inflation-adjusted) - Data Cleaning**

We will now perform cleaning on the GDP dataset

# Rename columns and Transpose data to long format
df_gdp.rename(columns=lambda x: x.replace('Income per person (fixed 2000 US$)', 'country'), inplace=True)
df_gdp = pd.melt(df_gdp, id_vars='country', var_name='year', value_name='gdp_per_capita')
df_gdp.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	gdp_per_capita
0	Abkhazia	1960	NaN
1	Afghanistan	1960	NaN
2	Akrotiri and Dhekelia	1960	NaN
3	Albania	1960	NaN
4	Algeria	1960	1280.384828

# Filter data down to desired years, 2007 to 2011
y1 = '2007'
y2 = '2011'
df_gdp = df_gdp.query('year >= @y1 and year <= @y2')

# Remove 'NaN' values, reset indexes, convert year to int and gdp_per_capita to float
df_gdp.dropna(inplace=True)
df_gdp.reset_index(drop=True, inplace=True)
df_gdp.year = pd.to_numeric(df_gdp.year, errors='coerce')
df_gdp.gdp_per_capita = pd.to_numeric(df_gdp.gdp_per_capita, errors='coerce')
df_gdp.shape

(930, 3)

# Check records per year
df_gdp.year.value_counts()

2007    193
2008    191
2009    189
2010    182
2011    175
Name: year, dtype: int64

# Check data types and nulls

df_gdp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 3 columns):
country           930 non-null object
year              930 non-null int64
gdp_per_capita    930 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 21.9+ KB

df_gdp.head(1)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	gdp_per_capita
0	Albania	2007	1681.61391

GDP per capita data is now cleaned.

At this stage, we have cleaned the single-variable data sets. We shall now comine them into a single dataframe, to allow for comparisons across variable

Combined Data

In the cell below, we will combine all individual datasets into a single dataframe.

# Combine individual datasets into a single dataframe

df = pd.merge(df_corruption, df_infant, on=['country', 'year'], how='outer')
df = pd.merge(df, df_literacy, on=['country', 'year'], how='outer')
df = pd.merge(df, df_inequality, on=['country', 'year'], how='outer')
df = pd.merge(df, df_poverty, on=['country', 'year'], how='outer')
df = pd.merge(df, df_gdp, on=['country', 'year'], how='outer')

Let us inspect the combined data.

df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
0	New Zealand	2008	9.4	5.3	NaN	NaN	NaN	15011.18385
1	Denmark	2008	9.3	3.6	NaN	25.2	NaN	32320.10054
2	Sweden	2008	9.2	2.6	NaN	26.8	NaN	32798.73425
3	Singapore	2008	9.2	2.2	NaN	NaN	NaN	30131.61718
4	Switzerland	2008	9.0	4.0	NaN	33.8	NaN	39324.73112

df.shape

(1104, 8)

df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1104 entries, 0 to 1103
Data columns (total 8 columns):
country                  1104 non-null object
year                     1104 non-null int64
cpi_score                897 non-null float64
infant_mortality_rate    983 non-null float64
literacy_rate            206 non-null float64
gini_index               376 non-null float64
poverty_index            90 non-null float64
gdp_per_capita           930 non-null float64
dtypes: float64(6), int64(1), object(1)
memory usage: 77.6+ KB

df.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
count	1104.000000	897.000000	983.000000	206.000000	376.000000	90.000000	930.000000
mean	2009.009058	4.014158	28.860905	84.283146	36.582979	9.026667	8418.930436
std	1.437078	2.088395	26.159773	17.152699	8.551911	15.310287	13276.734101
min	2007.000000	1.000000	1.760000	25.307745	23.700000	0.000000	97.910183
25%	2008.000000	2.500000	7.750000	76.014593	30.175000	0.200000	672.367236
50%	2009.000000	3.300000	18.300000	91.845252	33.950000	1.200000	2503.908898
75%	2010.000000	5.100000	45.400000	96.178175	42.325000	8.425000	9783.327358
max	2011.000000	9.500000	120.500000	99.998262	63.400000	54.200000	108111.212800

# Check records per year
df.year.value_counts()

2011    233
2007    228
2010    217
2008    217
2009    209
Name: year, dtype: int64

# Check for duplicated values
df.duplicated().value_counts()

False    1104
dtype: int64

df.country.value_counts()

Pakistan                          5
Luxembourg                        5
Sudan                             5
Bosnia and Herzegovina            5
Angola                            5
North Korea                       5
Belize                            5
Malawi                            5
Russia                            5
Cameroon                          5
Canada                            5
Dominican Republic                5
Estonia                           5
Nepal                             5
Switzerland                       5
Slovak Republic                   5
Tanzania                          5
Mexico                            5
Finland                           5
United States                     5
Fiji                              5
Vietnam                           5
France                            5
Chad                              5
Afghanistan                       5
Belarus                           5
Turkey                            5
St. Vincent and the Grenadines    5
Iceland                           5
Armenia                           5
                                 ..
Kuweit                            1
Cayman Islands                    1
Channel Islands                   1
Europe & Central Asia             1
IDA total                         1
Viet Nam                          1
Congo  Republic                   1
Moldovaa                          1
Gambia, The                       1
Low income                        1
High income                       1
Congo-Brazzaville                 1
South Asia                        1
USA                               1
Sub-Saharan Africa                1
Korea, Dem. Rep.                  1
Lao PDR                           1
Central African Rep.              1
Democratic Republic of Congo      1
Upper middle income               1
Low & middle income               1
Korea (North)                     1
Brunei Darussalam                 1
East Asia & Pacific               1
Cabo Verde                        1
Aruba                             1
Czech Republik                    1
Middle East & North Africa        1
Yemen, Rep.                       1
Congo, Republic                   1
Name: country, Length: 261, dtype: int64

We have made a number of observations on this combined dataframe.

Our variables where correctly combined into a single dataframe, there appear to be no duplicates, and the data types are correct.

However, we observe that some country names are not consistently spelt, for example, 'Democratic Republic of Congo' and 'Democratic Republic of the Congo' likely refer to the same country. Similarly, 'Russia' and 'Russian Federation' both appear in the data.

Also, some records appear to be for non-state entities, for example, 'High income' and 'Latin America & Carribean' and 'Fragile and conflict affected situations'.

We also spot a row for 'Kuweit' and speculate there may be a separate row for 'Kuwait'.

#Check for occurence of 'Kuwait' or 'Kuweit'

a1 = 'kuweit'
a2 = 'kuwait'

df.query('country.str.lower() == @a1 or country.str.lower() == @a2')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
65	Kuwait	2008	4.1	9.7	93.906206	NaN	NaN	25308.08492
242	Kuwait	2009	4.5	9.5	NaN	NaN	NaN	23116.34061
415	Kuweit	2007	4.3	NaN	NaN	NaN	NaN	NaN
589	Kuwait	2010	4.5	9.2	NaN	NaN	NaN	23114.58667
768	Kuwait	2011	4.6	8.9	NaN	NaN	NaN	24286.11580
908	Kuwait	2007	NaN	9.8	93.664185	NaN	NaN	25100.02810

# Check for occurence of 'Russia' or 'Russian Federation'

a3 = 'russia'
a4 = 'russian federation'

df.query('country.str.lower() == @a3 or country.str.lower() == @a4')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
146	Russia	2008	2.2	11.6	NaN	NaN	NaN	3043.665599
322	Russia	2009	2.1	10.9	NaN	NaN	NaN	2806.414830
500	Russia	2007	2.3	12.4	NaN	NaN	NaN	2888.847355
697	Russia	2010	2.1	10.3	99.684267	NaN	NaN	2928.005033
861	Russia	2011	2.4	9.8	NaN	NaN	NaN	3054.727742
1058	Russian Federation	2007	NaN	NaN	NaN	42.3	NaN	NaN
1061	Russian Federation	2008	NaN	NaN	NaN	41.6	NaN	NaN
1063	Russian Federation	2009	NaN	NaN	NaN	39.8	NaN	NaN
1068	Russian Federation	2010	NaN	NaN	NaN	39.5	NaN	NaN
1069	Russian Federation	2011	NaN	NaN	NaN	39.7	0.0	NaN

As suspected, 'Kuwait' and 'Kuweit' both occur in the data, as do records against both 'Russia' and 'Russian Federation'

We therefore judge that a manual process of cleaning 'country' needs to be performed outside python, and the cleaned data then re-integrated into our analysis data.

To accomplish this, we shall export a unique list of countries, manually create a mapping table containing valid countries with names spelt correctly, and then import this back into our dataframe.

This edit will also afford us the opportunity to augment country data with Region categorization, to pose and answer additional questions. Regional classifications have been obtained from the website below.

https://meta.wikimedia.org/wiki/List_of_countries_by_regional_classification

Important to stress fixing errors in country names cannot be performed programatically.

# This snippet exports the original country list to a csv file
pd.DataFrame.to_csv(pd.DataFrame(df.country.value_counts()),'country_list.csv')
#pd.DataFrame(df.country.value_counts())

After performing offline cleaning of the country list and adding of 'Region' information, we are now ready to import and integrate this list into our dataset.

Filename holding our cleaned country list is: 'country_list_clean.csv'

country_list = pd.read_csv('country_list_clean.csv')
country_list.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	country_fixed	region
0	Uzbekistan	Uzbekistan	CIS
1	Tuvalu	Tuvalu	Asia & Pacific
2	Estonia	Estonia	Europe
3	Papua New Guinea	Papua New Guinea	Asia & Pacific
4	Slovakia	Slovakia	Europe

# Check that same number of records exist in both country lists

print('Dimensions - Original Country List: {}\nDimensions - Cleaned Country List: {}'.
      format(pd.DataFrame(df.country.value_counts()).shape, country_list.shape))

Dimensions - Original Country List: (261, 1)
Dimensions - Cleaned Country List: (261, 3)

# Import columns into analysis dataframe
df = pd.merge(df, country_list, how='left', left_on='country', right_on='country')
df.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region
0	New Zealand	2008	9.4	5.3	NaN	NaN	NaN	15011.18385	New Zealand	Asia & Pacific
1	Denmark	2008	9.3	3.6	NaN	25.2	NaN	32320.10054	Denmark	Europe
2	Sweden	2008	9.2	2.6	NaN	26.8	NaN	32798.73425	Sweden	Europe
3	Singapore	2008	9.2	2.2	NaN	NaN	NaN	30131.61718	Singapore	Asia & Pacific
4	Switzerland	2008	9.0	4.0	NaN	33.8	NaN	39324.73112	Switzerland	Europe

# Take a copy of the dataframe, and perfom cleaning in this copy
df_clean = df.copy()

One of the actions we performed in building a clean country list was to identify non-country records in our original data. These 'countries' have been marked as 'DELETE' in the country_fixed column of our cleaned data, so as to easily identify them for removal.

We will now display and then remove these records.

# List records for non-country entities, which have been identified with "DELETE" 
#   in 'country_fixed' column

df_clean.query('country_fixed == "DELETE"')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region
1070	East Asia & Pacific	2011	NaN	NaN	NaN	NaN	8.6	NaN	DELETE	NaN
1072	Fragile and conflict affected situations	2011	NaN	NaN	NaN	NaN	36.9	NaN	DELETE	NaN
1073	High income	2011	NaN	NaN	NaN	NaN	0.6	NaN	DELETE	NaN
1074	IDA total	2011	NaN	NaN	NaN	NaN	34.1	NaN	DELETE	NaN
1075	Latin America & Caribbean	2011	NaN	NaN	NaN	NaN	5.6	NaN	DELETE	NaN
1076	Low income	2011	NaN	NaN	NaN	NaN	50.5	NaN	DELETE	NaN
1077	Lower middle income	2011	NaN	NaN	NaN	NaN	19.6	NaN	DELETE	NaN
1078	Low & middle income	2011	NaN	NaN	NaN	NaN	16.4	NaN	DELETE	NaN
1079	Middle East & North Africa	2011	NaN	NaN	NaN	NaN	2.7	NaN	DELETE	NaN
1080	South Asia	2011	NaN	NaN	NaN	NaN	20.1	NaN	DELETE	NaN
1081	Sub-Saharan Africa	2011	NaN	NaN	NaN	NaN	44.9	NaN	DELETE	NaN
1082	Upper middle income	2011	NaN	NaN	NaN	NaN	5.7	NaN	DELETE	NaN
1083	World	2011	NaN	NaN	NaN	NaN	13.8	NaN	DELETE	NaN

# Remove records

df_clean = df_clean.query('country_fixed != "DELETE"')

# Check records no longer exist

df_clean.query('country_fixed =="DELETE"')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region

Deletion of non-country records was successful.

Next, let's look at instances where same countries have different spellings. Examples we spotted earlier are 'Kuwait vs Kuweit', and 'Russia vs Russian Federation'. Our manual check uncovered many more. In cells below however, let us display name anomalies for Kuwait and Russian Federation.

# Using earlier example of 'Kuwait' or 'Kuweit'

a1 = 'kuweit'
a2 = 'kuwait'

df.query('country.str.lower() == @a1 or country.str.lower() == @a2')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region
65	Kuwait	2008	4.1	9.7	93.906206	NaN	NaN	25308.08492	Kuwait	Arab States
242	Kuwait	2009	4.5	9.5	NaN	NaN	NaN	23116.34061	Kuwait	Arab States
415	Kuweit	2007	4.3	NaN	NaN	NaN	NaN	NaN	Kuwait	Arab States
589	Kuwait	2010	4.5	9.2	NaN	NaN	NaN	23114.58667	Kuwait	Arab States
768	Kuwait	2011	4.6	8.9	NaN	NaN	NaN	24286.11580	Kuwait	Arab States
908	Kuwait	2007	NaN	9.8	93.664185	NaN	NaN	25100.02810	Kuwait	Arab States

# Using the earlier seen 'Russia' vs 'Russian Federation' example

a3 = 'russia'
a4 = 'russian federation'
df_clean.query('country.str.lower() == @a3 or country.str.lower() == @a4')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region
146	Russia	2008	2.2	11.6	NaN	NaN	NaN	3043.665599	Russian Federation	Europe
322	Russia	2009	2.1	10.9	NaN	NaN	NaN	2806.414830	Russian Federation	Europe
500	Russia	2007	2.3	12.4	NaN	NaN	NaN	2888.847355	Russian Federation	Europe
697	Russia	2010	2.1	10.3	99.684267	NaN	NaN	2928.005033	Russian Federation	Europe
861	Russia	2011	2.4	9.8	NaN	NaN	NaN	3054.727742	Russian Federation	Europe
1058	Russian Federation	2007	NaN	NaN	NaN	42.3	NaN	NaN	Russian Federation	Europe
1061	Russian Federation	2008	NaN	NaN	NaN	41.6	NaN	NaN	Russian Federation	Europe
1063	Russian Federation	2009	NaN	NaN	NaN	39.8	NaN	NaN	Russian Federation	Europe
1068	Russian Federation	2010	NaN	NaN	NaN	39.5	NaN	NaN	Russian Federation	Europe
1069	Russian Federation	2011	NaN	NaN	NaN	39.7	0.0	NaN	Russian Federation	Europe

Column 'country_fixed' contains the correct country names.

However, our comparison metrics are split across variants of country spellings in 'country' column. How many rows are duplicated in this way?

# Number of rows with duplicates on 'country_fixed' and 'year' columns

df_clean.duplicated(['country_fixed','year']).value_counts()

False    1023
True       68
dtype: int64

We have 68 duplicates.

Let us illustrate a single instance of this problem, using Russia data for 2008.

# Show records for Russia, 2008

y = int(2008)
df_clean.query('country_fixed.str.lower() == @a4 and year == @y')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita	country_fixed	region
146	Russia	2008	2.2	11.6	NaN	NaN	NaN	3043.665599	Russian Federation	Europe
1061	Russian Federation	2008	NaN	NaN	NaN	41.6	NaN	NaN	Russian Federation	Europe

As observed earlier, records for Russia in year 2008 are split across two rows, 146 and 1061. All metrics for Russia in year 2008 need to appear in a single row. This principle applies to all similarly duplicated records.

We will use groupby() to consolidate values, using 'country_fixed', 'year' and 'region' as keys, and reset index to restore keys to columns.

# Groupby to consolidate values
df_clean = pd.DataFrame(df_clean.groupby(['country_fixed','year','region']).sum())

#reset index, to restore country, year and region to columns
df_clean.reset_index(level=df_clean.index.names, inplace=True)
df_clean.head(6)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country_fixed	year	region	cpi_score	infant_mortality_rate	literacy_rate	gdp_per_capita
0	Afghanistan	2007	Asia & Pacific	1.8	80.4	0.0	0.00000
1	Afghanistan	2008	Asia & Pacific	1.3	78.6	0.0	0.00000
2	Afghanistan	2009	Asia & Pacific	1.4	76.8	0.0	0.00000
3	Afghanistan	2010	Asia & Pacific	1.4	75.1	0.0	0.00000
4	Afghanistan	2011	Asia & Pacific	1.5	73.4	39.0	0.00000
5	Albania	2007	Europe	2.9	16.7	0.0	1681.61391

# Check to confirm "Russian Federation" record is now correctly displayed

y = int(2008)
df_clean.query('country_fixed.str.lower() == @a4 and year == @y')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country_fixed	year	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
745	Russian Federation	2008	Europe	2.2	11.6	0.0	41.6	0.0	3043.665599

# Check if any duplicates on 'country_fixed' and 'year' columns

df_clean.duplicated(['country_fixed','year']).value_counts()

False    1021
dtype: int64

We sucessfully used groupby() to ensure all metrics for each year, country and region combination appear on a single row, and then removed the resulting duplicated. We now have correct unique records per country and year. We have confirmed this by checking the record for Russia in 2008.

Let us now rename the country_fixed column, and proceed with further checks.

# Rename 'country_fixed' to 'country'

df_clean.rename(columns=lambda x: x.replace('country_fixed', 'country'), inplace=True)
df_clean.head(1)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	year	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
0	Afghanistan	2007	Asia & Pacific	1.8	80.4	0.0	0.0	0.0	0.0

df_clean.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
count	1021.000000	1021.000000	1021.000000	1021.000000	1021.000000	1021.000000	1021.000000
mean	2008.980411	3.522331	27.786748	17.005218	13.472282	0.541528	7668.565432
std	1.416848	2.373268	26.243390	34.703443	18.399647	4.362516	12895.942581
min	2007.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	2008.000000	2.200000	7.000000	0.000000	0.000000	0.000000	498.570158
50%	2009.000000	3.000000	17.300000	0.000000	0.000000	0.000000	2134.037162
75%	2010.000000	4.700000	43.900000	0.000000	31.500000	0.000000	8151.712950
max	2011.000000	11.000000	120.500000	99.998262	63.400000	54.200000	108111.212800

Observe that counts on all columns are 1021. Values that were null in original datasets are now held as zeros, resulting in incorrect counts. Incorrect counts on columns will cause wrong values for mean, min, max and other aggregations.

We will need to change the zeros back to null.

# Replace all zeros with NaN, so our aggregations (count, mean, sum, min) are accurate

df_clean = df_clean.replace(0.0, np.nan)
df_clean.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	year	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
count	1021.000000	894.000000	983.000000	206.000000	376.000000	57.000000	930.000000
mean	2008.980411	4.022707	28.860905	84.283146	36.582979	9.700000	8418.930436
std	1.416848	2.101977	26.159773	17.152699	8.551911	16.006952	13276.734101
min	2007.000000	1.000000	1.760000	25.307745	23.700000	0.100000	97.910183
25%	2008.000000	2.500000	7.750000	76.014593	30.175000	0.500000	672.367236
50%	2009.000000	3.300000	18.300000	91.845252	33.950000	1.800000	2503.908898
75%	2010.000000	5.100000	45.400000	96.178175	42.325000	7.900000	9783.327358
max	2011.000000	11.000000	120.500000	99.998262	63.400000	54.200000	108111.212800

Obeserving the output in the last two cells, we note that the 'count' aggregation now correctly shows the number of non-zero values in each column. Also, min, mean and other aggregations are now correctly calculated.

We will now create a version of our dataset that contains average of each measure by country. This will be used for most of the following analysis. Why have we taken this decision?

Within the five year window of our dataset, countries will have missing data across years. Taking an average of yearly observations by country allows us to run comparisons among countries with reduced concern over availability of data in a given year.

# Create dataframe of average observations by country. 
df_avg = df_clean.drop('year', axis=1)
df_avg = pd.DataFrame(df_avg.groupby(['country','region']).mean())

# reset indexes and restore 'country' and 'region' columns
df_avg.reset_index(level=df_avg.index.names, inplace=True)

df_avg.head(10)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
0	Afghanistan	Asia & Pacific	1.48	76.86	39.000000	NaN	NaN	NaN
1	Albania	Europe	3.16	15.44	96.391969	30.00	NaN	1844.903592
2	Algeria	Arab States	2.90	24.46	NaN	27.60	0.5	2201.836568
3	Andorra	Europe	NaN	2.60	NaN	NaN	NaN	21719.572490
4	Angola	Africa	1.98	112.08	70.362420	42.70	NaN	611.714745
5	Antigua and Barbuda	South/Latin America	NaN	8.22	98.950000	NaN	NaN	11817.136463
6	Argentina	South/Latin America	2.92	13.38	97.858770	44.00	0.9	10321.740247
7	Armenia	CIS	2.70	17.02	99.568170	29.56	2.2	1391.495500
8	Aruba	South/Latin America	NaN	NaN	96.822640	NaN	NaN	NaN
9	Australia	Asia & Pacific	8.70	4.20	NaN	35.05	NaN	25092.326418

'Averages' dataframe was successfully created.

Let us inspect this dataframe, to identify any anomalies and plan any corrective actions.

df_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 8 columns):
country                  211 non-null object
region                   211 non-null object
cpi_score                185 non-null float64
infant_mortality_rate    199 non-null float64
literacy_rate            140 non-null float64
gini_index               134 non-null float64
poverty_index            57 non-null float64
gdp_per_capita           193 non-null float64
dtypes: float64(6), object(2)
memory usage: 13.3+ KB

df_avg.mean()

cpi_score                   4.025874
infant_mortality_rate      28.644548
literacy_rate              82.733571
gini_index                 38.365933
poverty_index               9.700000
gdp_per_capita           9170.833436
dtype: float64

df_avg.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	infant_mortality_rate	literacy_rate	gini_index	poverty_index	gdp_per_capita
count	185.000000	199.000000	140.000000	134.000000	57.000000	193.000000
mean	4.025874	28.644548	82.733571	38.365933	9.700000	9170.833436
std	2.090305	26.047186	18.664967	8.469815	16.006952	14625.856835
min	1.000000	1.930000	25.307745	24.540000	0.100000	103.104815
25%	2.460000	7.677500	71.369112	31.875000	0.500000	731.926135
50%	3.240000	17.680000	90.599093	36.340000	1.800000	2581.767415
75%	5.140000	44.650000	97.713798	43.550000	7.900000	10558.767324
max	9.380000	112.080000	99.998262	63.200000	54.200000	103885.246787

We notice there are null values in the metrics columns, with the worst being 'poverty_index' with only 57 valid records in our dataset of 211 records.

The paucity of values against 'poverty' suggests our ability to run comparisons across countries in this column will be limited. We will therefore remove the 'poverty_index' column from our df_avg dataset.

# Remove 'poverty_index' column

df_avg.drop('poverty_index', axis=1, inplace=True)
df_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 7 columns):
country                  211 non-null object
region                   211 non-null object
cpi_score                185 non-null float64
infant_mortality_rate    199 non-null float64
literacy_rate            140 non-null float64
gini_index               134 non-null float64
gdp_per_capita           193 non-null float64
dtypes: float64(5), object(2)
memory usage: 11.6+ KB

Null values in other columns will need to be filled in. Our approach will be to use the 'Region' averages to fill in missing values.

# Calculate and inspect Region mean values 

df_avg.groupby(['region']).mean()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
region
Africa	2.945891	61.859545	64.777314	45.027143	1256.879586
Arab States	3.478000	30.746667	81.398935	33.212500	7493.379077
Asia & Pacific	3.890270	27.043810	81.383965	37.144133	6875.002351
CIS	2.380000	27.272000	99.603976	30.420000	1361.668615
Europe	5.789512	5.607500	98.149657	31.779065	20446.367688
North America	7.990000	5.660000	NaN	37.225000	42977.916206
South/Latin America	3.983229	18.504306	92.565855	48.778125	5744.894997

We observe that with only one exception, we can obtain region mean for all metrics. These region mean values will be used to fill in the missing values.

Literacy rate appears to be missing for the entire North American region. We will fill this in by using the overall or global literacy rate mean.

First, however, we will now fill in the missing values using region averages where available.

# Temporary dataframe with missing values filled in with region averages 

df_avg_fill = df_avg.groupby('region').transform(lambda x: x.fillna(x.mean()))
df_avg_fill.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
0	1.480000	76.86	39.000000	37.144133	6875.002351
1	3.160000	15.44	96.391969	30.000000	1844.903592
2	2.900000	24.46	81.398935	27.600000	2201.836568
3	5.789512	2.60	98.149657	31.779065	21719.572490
4	1.980000	112.08	70.362420	42.700000	611.714745

We will now replace columns in our 'averages' dataset with columns from our 'filled-in-values' dataset

# Copy filled-in values into our working 'averages' dataset

df_avg.cpi_score = df_avg_fill.cpi_score
df_avg.infant_mortality_rate = df_avg_fill.infant_mortality_rate
df_avg.literacy_rate = df_avg_fill.literacy_rate
df_avg.gini_index = df_avg_fill.gini_index
df_avg.gdp_per_capita = df_avg_fill.gdp_per_capita

Using a temporary dataframe, we obtained corrected metrics columns and then used these fixed columns to replace columns in our dataset. With the exception of literacy_rate for North America, all null values should now be filled.

Let us now inspect our working dataset.

df_avg.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
0	Afghanistan	Asia & Pacific	1.480000	76.86	39.000000	37.144133	6875.002351
1	Albania	Europe	3.160000	15.44	96.391969	30.000000	1844.903592
2	Algeria	Arab States	2.900000	24.46	81.398935	27.600000	2201.836568
3	Andorra	Europe	5.789512	2.60	98.149657	31.779065	21719.572490
4	Angola	Africa	1.980000	112.08	70.362420	42.700000	611.714745

df_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 7 columns):
country                  211 non-null object
region                   211 non-null object
cpi_score                211 non-null float64
infant_mortality_rate    211 non-null float64
literacy_rate            208 non-null float64
gini_index               211 non-null float64
gdp_per_capita           211 non-null float64
dtypes: float64(5), object(2)
memory usage: 11.6+ KB

As observed earlier, 'literacy_rate' has some null values. Let us see details of those records.

# List records with Null

df_avg[pd.isnull(df_avg).any(axis=1)]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
20	Bermuda	North America	7.99	5.66	NaN	37.225	65455.868678
33	Canada	North America	8.78	4.94	NaN	33.700	25781.976322
201	United States	North America	7.20	6.38	NaN	40.750	37695.903618

As decided earlier, we will obtain the global mean of literacy_rate and assign this to North American countries.

# We are interested in the mean of Literacy rate

df_avg.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
count	211.000000	211.000000	208.000000	211.000000	211.000000
mean	4.088697	27.883904	84.767862	39.064161	8890.097876
std	2.002674	25.568050	16.496973	7.960935	14024.659745
min	1.000000	1.930000	25.307745	24.540000	103.104815
25%	2.610000	7.550000	81.383965	33.035000	878.625684
50%	3.480000	17.780000	92.182549	37.144133	3036.536292
75%	5.170000	43.290000	98.149657	45.027143	9054.292336
max	9.380000	112.080000	99.998262	63.200000	103885.246787

All columns now appear to have values, with exception of literacy rate. As observed earlier, literacy rate is missing values for 3 countries in North America.

To remedy this, we will use the global literacy rate average, 84.7, for these North American countries.

# Fill in missing literacy rate values using global average

df_avg.literacy_rate = df_avg.literacy_rate.fillna(df_avg.literacy_rate.mean())

Let us confirm that the average literacy rate value has replaced the null North American values.

df_avg.query('region == "North America"')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
20	Bermuda	North America	7.99	5.66	84.767862	37.225	65455.868678
33	Canada	North America	8.78	4.94	84.767862	33.700	25781.976322
201	United States	North America	7.20	6.38	84.767862	40.750	37695.903618

df_avg[pd.isnull(df_avg).any(axis=1)]

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita

Missing literacy rate for North America was correctly updated.

One final check of our data, to confirm all cleaning actions have been successfully carried out.

df_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 7 columns):
country                  211 non-null object
region                   211 non-null object
cpi_score                211 non-null float64
infant_mortality_rate    211 non-null float64
literacy_rate            211 non-null float64
gini_index               211 non-null float64
gdp_per_capita           211 non-null float64
dtypes: float64(5), object(2)
memory usage: 11.6+ KB

df_avg.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
count	211.000000	211.000000	211.000000	211.000000	211.000000
mean	4.088697	27.883904	84.767862	39.064161	8890.097876
std	2.002674	25.568050	16.378714	7.960935	14024.659745
min	1.000000	1.930000	25.307745	24.540000	103.104815
25%	2.610000	7.550000	81.383965	33.035000	878.625684
50%	3.480000	17.780000	91.559827	37.144133	3036.536292
75%	5.170000	43.290000	98.149657	45.027143	9054.292336
max	9.380000	112.080000	99.998262	63.200000	103885.246787

df_avg.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	cpi_score	infant_mortality_rate	literacy_rate	gini_index	gdp_per_capita
0	Afghanistan	Asia & Pacific	1.480000	76.86	39.000000	37.144133	6875.002351
1	Albania	Europe	3.160000	15.44	96.391969	30.000000	1844.903592
2	Algeria	Arab States	2.900000	24.46	81.398935	27.600000	2201.836568
3	Andorra	Europe	5.789512	2.60	98.149657	31.779065	21719.572490
4	Angola	Africa	1.980000	112.08	70.362420	42.700000	611.714745

Cleaning now appears complete, missing values appear to have been filled in.

What have we done so far? We have :

Consolidated our data from multiple files / sources into a single data set
Removed data for non-country entities
Resolved naming discrepancies, and combined observations where needed
Augmented our data by adding 'region' classification
Created a version of our dataset that holds the five-year average of observations in each metric
Identified missing values
Removed 'Poverty' metric from our dataset, and re-framed our questions, due to large number of missing values
Used Region averages to fill in missing values, and a global average where regional average was not available
Checked and confirmed all cleaning actions successsful and complete.

As a result of these steps, our data is now ready for exploratory analysis.

Before we dive into exploratory analysis however, let us take a quick overview of distributions within our metrics, followed by a brief description of the range of values in each.

df_avg.hist(figsize=(15,15));

A brief reminder of the columns, what they measure and the value range.

CPI Score: Ranks countries/territories based on how corrupt their public sector is perceived to be. 0 means that a country is perceived as highly corrupt and 10 means that a country is perceived as very clean.

GDP per Capita: Per capita GDP is a measure of the total output of a country that takes the gross domestic product (GDP) and divides it by the number of people in that country. Higher values for per capita GDP signals high productivity withiin a country. Values are in USD.

GINI Index: Measures the extent to which the distribution of income among individuals or households within an economy deviates from a perfectly equal distribution. A Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

Infant Mortality Rate (IMR): IMR has been defined as the number of deaths in children under 1 year of age per 1000 live births in the same year.

Adult Literacy Rate: Adult literacy rate is the percentage of people aged 15 years and above who can read and write.

Exploratory Data Analysis

Let us now explore our data, using our research questions as guide.

Q1. Which five countries have the worst and best measures of income inequality?

A Gini index of 0 represents perfect equality, while an index of 100 implies perfect inequality.

# Quick view of distribution of GINI data

df_avg.gini_index.hist()
plt.title('Spread and Count of GINI Index Values')
plt.ylabel('Frequency')
plt.xlabel('GINI Index');

df_avg.gini_index.mean(), df_avg.gini_index.median()

(39.064160699836485, 37.144133333333336)

Data appears to skew to the right, implying that if we split the data around the middle point, we will find there are more countries with high income inequality (low values in the index) than there are with low inequality. Let us identify the five highest and lowest.

The next cell will identify the 5 highest and 5 lowest, and combine the results into a single dataframe.

# Lowest income inequality
l = df_avg.gini_index.drop_duplicates().nsmallest(5)
df_lowest_gini = df_avg.query('gini_index == @l')
df_lowest_gini_t = df_lowest_gini.drop(['cpi_score', 'infant_mortality_rate', 'literacy_rate', 'gdp_per_capita'], axis=1)
df_lowest_gini_t['gini_type'] = 'Lowest inequality'

# Highest income inequality
h = df_avg.gini_index.drop_duplicates().nlargest(5)
df_highest_gini = df_avg.query('gini_index == @h')
df_highest_gini_t = df_highest_gini.drop(['cpi_score', 'infant_mortality_rate', 'literacy_rate', 'gdp_per_capita'], axis=1)
df_highest_gini_t['gini_type'] = 'Highest inequality'

# Combine lowest and Highest GINI
df_gini_lh_t = df_lowest_gini_t.append(df_highest_gini_t)
df_gini_lh_t

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	gini_index	gini_type
51	Czech Republic	Europe	26.30	Lowest inequality
141	Norway	Europe	26.26	Lowest inequality
166	Slovakia	Europe	26.34	Lowest inequality
167	Slovenia	Europe	24.54	Lowest inequality
198	Ukraine	CIS	25.66	Lowest inequality
24	Botswana	Africa	60.50	Highest inequality
36	Central African Republic	Africa	56.20	Highest inequality
130	Namibia	Africa	61.00	Highest inequality
170	South Africa	Africa	63.20	Highest inequality
209	Zambia	Africa	55.60	Highest inequality

# Plot five highest and five lowest on a horizontal bar chart

df_gini_lh_t.sort_values(by=['gini_index'], inplace=True)
ax = df_gini_lh_t.plot(kind='barh')
ax.set_yticklabels(df_gini_lh_t.country)

ax.set_title('Top 5 and Bottom 5 GINI Countries')

# set individual bar labels
for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width(), i.get_y(), str(round((i.get_width()), 2)));
    
# Credit for annotations: http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html

All five worst inequality scores are in Africa, while countries with best (lowest) inequality scores are in Europe. Let's use the 'region' calssification to see if this trend extends beyong the five highest and lowest.

# GINI index by region 

df_avg.groupby('region')['gini_index'].mean()

region
Africa                 45.027143
Arab States            33.212500
Asia & Pacific         37.144133
CIS                    30.420000
Europe                 31.779065
North America          37.225000
South/Latin America    48.778125
Name: gini_index, dtype: float64

# Groupby 'region' in the main dataset, and plot a histogram on the value count

df_avg_mean = df_avg.groupby('region')['gini_index'].mean()
df_avg_mean.sort_values(inplace=True)
ax = df_avg_mean.plot(kind='barh')

ax.set_title('Average Income Inequality Score (GINI) by Region')
ax.set_xlabel('GINI Index')
ax.set_ylabel('Region')

for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width(), i.get_y(), str(round((i.get_width()), 2)));

Surprisingly, although Africa has the five worst countries with income inequality, the region with worst average income equality appears to be South America. Let us try to understand why by comparing inequality among three regions: Africa, Europe and South America

# Create masks
africa = df_avg.region == 'Africa'
europe = df_avg.region == 'Europe'
s_america = df_avg.region == 'South/Latin America'

# Filter and plot
df_avg.gini_index[africa].hist(alpha=0.5, bins=5, label='Africa', figsize=(10,6))
df_avg.gini_index[europe].hist(alpha=0.5, bins=5, label='Europe', figsize=(10,6))
df_avg.gini_index[s_america].hist(alpha=0.5, bins=5, label='S. America', figsize=(10,6))
#plt.figure(figsize=(15,15))
plt.title('GINI Index Distribution in Africa, Europe and South America')
plt.ylabel('Count of Observations')
plt.xlabel('GINI Index Bins')

plt.legend();

The above histogram helps us understand our earlier finding. The GINI observations in South America are tightly clustered between 44 and 55. With Africa on the other hand, observations are spread across a broad value range. A few African countries appear to have low levels of income inequality (low GINI indexes), and these countries push Africa to a superior average score relative to South America.

Lastly, let us identify these unusual African countries with low income inequality scores.

# Top five African countries with best income inequality measures

al = df_avg[africa].gini_index.drop_duplicates().nsmallest(5)
df_africa_lowest_gini = df_avg[africa].query('gini_index == @al')
df_africa_lowest_gini_t = df_africa_lowest_gini.drop(['cpi_score', 'infant_mortality_rate', 'literacy_rate', 'gdp_per_capita'], axis=1)
df_africa_lowest_gini_t['gini_type'] = 'Lowest inequality'
df_africa_lowest_gini_t.sort_values(by='gini_index')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	gini_index	gini_type
159	Sao Tome and Principe	Africa	30.8	Lowest inequality
115	Mali	Africa	33.0	Lowest inequality
62	Ethiopia	Africa	33.2	Lowest inequality
164	Sierra Leone	Africa	34.0	Lowest inequality
137	Niger	Africa	34.4	Lowest inequality

So our observation in relation to this question is that the top five countries with lowest income inequality are in Europe. On the other hand, the top 5 countries with highest income inquality are in Africa. However, a few Afrian countries seem to have admirably high levels of fair income distribution, so that on average, South America, not Africa, appears to have the worst income inequality.

Q2. Is there a relationship between income inequality and corruption?

Important Note:

With GINI, 0 is perfect income distribution and 100 is totally imperfect income distribution

With CPI, 0 is very corrupt and 10 is very clean

# Create dataframe with only these two columns

df_q2 = df_avg.drop(['infant_mortality_rate', 'literacy_rate', 'gdp_per_capita'], axis=1)

df_q2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
country       211 non-null object
region        211 non-null object
cpi_score     211 non-null float64
gini_index    211 non-null float64
dtypes: float64(2), object(2)
memory usage: 6.7+ KB

In earlier exploration, we have looked closely at the income inequality (GINI) data. Let us now examine data in the 'corruption perception' column.

As a reminder, CPI Score ranks countries based on how corrupt their public sector is perceived to be. 0 means that a country is perceived as highly corrupt and 10 means that a country is perceived as very clean.

#Plot  frequency distribution of cpi score

df_q2.cpi_score.plot(kind='hist')
plt.title('Frequency Distribution of CPI Score')
plt.xlabel('CPI Score');

df_q2.cpi_score.describe()

count    211.000000
mean       4.088697
std        2.002674
min        1.000000
25%        2.610000
50%        3.480000
75%        5.170000
max        9.380000
Name: cpi_score, dtype: float64

We note that the CPI distribution is skewed to the right, implying that a larger number of countries score lower than the average corruption perception.

Let us see how this differs by region.

# List all regions

df_q2.region.drop_duplicates()

0          Asia & Pacific
1                  Europe
2             Arab States
4                  Africa
5     South/Latin America
7                     CIS
20          North America
Name: region, dtype: object

# Create masks

asia = df_q2.region == 'Asia & Pacific'
europe = df_q2.region == 'Europe'
arab = df_q2.region == 'Arab States'
africa = df_q2.region == 'Africa'
s_america = df_q2.region == 'South/Latin America'
cis = df_q2.region == 'CIS'
n_america = df_q2.region == 'North America'

# Filter dataframe by masks and plot histogram

df_q2.cpi_score[asia].hist(alpha=0.5, bins=2, label='Asia', figsize=(15,8))
df_q2.cpi_score[europe].hist(alpha=0.5, bins=2, label='Europe', figsize=(15,8))
df_q2.cpi_score[arab].hist(alpha=0.5, bins=2, label='Arab States', figsize=(15,8))
df_q2.cpi_score[africa].hist(alpha=0.5, bins=2, label='Africa', figsize=(15,8))
df_q2.cpi_score[s_america].hist(alpha=0.5, bins=2, label='S. America', figsize=(10,8))
df_q2.cpi_score[cis].hist(alpha=0.5, bins=2, label='CIS', figsize=(15,8))
df_q2.cpi_score[n_america].hist(alpha=0.5, bins=2, label='N. America', figsize=(15,8))

# Set title and label axes
plt.title('Frequency Distribution of CPI Score by Region')
plt.xlabel('CPI Score')
plt.ylabel('Frequency')

plt.legend();

Some interesting observations here regarding CPI Score.

Most countries in Africa, Asia, North America have poor scores on corruption perception, whereas Europe appears to have a small marority of countries with higher than average scores.

Africa in particular appears to have most countries scoring below the global average.

Given that this is a measure of perception as opposed to presence of corruption (which is near impossible to objectively measure), it can be pointed out that a population dissatisfied with its standard of living is likely to percieve it's public servants as corrupt. Obviously this perception will be further heightened when public sector inefficiency is combined with ostentatious living of public officials, and where the justice system does not act consistently and promptly against accused corrupt officers.

Let us now examine the data to see if a relationship exists between income inequality and corruption perception.

Recall that CPI goes from zero to ten, where zero means a country is percieved as highly corrupt. GINI goes from zero to 100, where 100 denotes perfect income inequality, that is, one person earns all income. So the measures are not travelling in the same direction. Careful interpretation and careful use of terms will therefore be required.

# Plot a scatter diagram, using adjusted cpi scores

df_q2.plot(x='gini_index', y='cpi_score', kind='scatter', figsize=(8,8))
plt.title('Income Inequality (GINI) and Corruption Perception Index');

df_q2.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	cpi_score	gini_index
count	211.000000	211.000000
mean	4.088697	39.064161
std	2.002674	7.960935
min	1.000000	24.540000
25%	2.610000	33.035000
50%	3.480000	37.144133
75%	5.170000	45.027143
max	9.380000	63.200000

From above, we observe a very weak negative correlation between income inequality and corruption. The data therefore is inconclusive with regards to a relationship between income inequality and corruption perception.

We will nevertheless seek to confirm this graphical observation by computing the correlation coefficient of these two data columns.

q2_correl = df_q2.gini_index.corr(df_q2.cpi_score)
print('Correlation coefficient between GINI and CPI: {}'.format(q2_correl))

Correlation coefficient between GINI and CPI: -0.22182758050583504

Negative value of -0.2 suggests a very weak negative correlation between income inequality and corruption perception, in line with visual observation.

Q3. Is GDP per capita a good predictor of income inequality?

# Create dataframe with only these two columns

df_q3 = df_avg.drop(['cpi_score', 'infant_mortality_rate', 'literacy_rate'], axis=1)
df_q3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
country           211 non-null object
region            211 non-null object
gini_index        211 non-null float64
gdp_per_capita    211 non-null float64
dtypes: float64(2), object(2)
memory usage: 6.7+ KB

df_q3.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	gini_index	gdp_per_capita
0	Afghanistan	Asia & Pacific	37.144133	6875.002351
1	Albania	Europe	30.000000	1844.903592
2	Algeria	Arab States	27.600000	2201.836568
3	Andorra	Europe	31.779065	21719.572490
4	Angola	Africa	42.700000	611.714745

df_q3.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	gini_index	gdp_per_capita
count	211.000000	211.000000
mean	39.064161	8890.097876
std	7.960935	14024.659745
min	24.540000	103.104815
25%	33.035000	878.625684
50%	37.144133	3036.536292
75%	45.027143	9054.292336
max	63.200000	103885.246787

We have earlier examined GINI data and observed its right-ward skew, confirmed by the median being lower than the mean.

Looking at the gdp_per_capita data, we also observe the median to be significantly below the mean, suggesting there are outliers with large values and the distribution is skewed heavily to the right.

Let us use a histogram to check this.

df_q3.gdp_per_capita.hist()
plt.title('Frequency Distribution of GDP Per Capita')
plt.ylabel('Frequency')
plt.xlabel('GDP Per Capita USD');

We can quickly look to see which countries have the highest GDP Per Capita, to better understand the chart.

# Highest GDP Per Capita
gdp_high = df_q3.gdp_per_capita.drop_duplicates().nlargest(5)
df_largest_gdp = df_q3.query('gdp_per_capita == @gdp_high')
df_largest_gdp.sort_values(by='gdp_per_capita', ascending=False)

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	gini_index	gdp_per_capita
123	Monaco	Europe	31.779065	103885.246787
106	Liechtenstein	Europe	31.779065	82380.816170
20	Bermuda	North America	37.225000	65455.868678
108	Luxembourg	Europe	31.500000	53424.811696
38	Channel Islands	Europe	31.779065	45226.031700

Monaco is the source of the large outlier value, and this is supported by other publicly available data sources, such as CIA World Factbook:

https://www.cia.gov/library/publications/the-world-factbook/rankorder/2004rank.html

One other point of concern is the suggestion, from the mean and the 25th percentile, that a significant number of countries have very low GDP per capita values. It will be instructive to see the countries with the ten lowest values, and possibly also see the distribution spread by region

# Lowest GDP per capita

gdp_low = df_q3.gdp_per_capita.drop_duplicates().nsmallest(10)
df_smallest_gdp = df_q3.query('gdp_per_capita in @gdp_low')
df_smallest_gdp.sort_values(by='gdp_per_capita')

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	gini_index	gdp_per_capita
43	Congo, Dem. Rep.	Africa	45.027143	103.104815
30	Burundi	Africa	45.027143	136.974269
60	Eritrea	Africa	45.027143	152.838384
76	Guinea-Bissau	Africa	50.700000	160.105497
112	Malawi	Africa	45.500000	172.459054
137	Niger	Africa	34.400000	175.920115
62	Ethiopia	Africa	33.200000	203.703223
36	Central African Republic	Africa	56.200000	228.850633
104	Liberia	Africa	36.500000	247.134621
111	Madagascar	Africa	42.400000	249.453045

All ten countries with lowest GDP per capita are in Africa, with the lowest value being Democratic Republic of Congo, showing GDP per capita of USD103. Contrasting this with Monaco at over USD103,000 or even with war-torn Afghanistan at USD6,875 is a shocking illustration of differences in standards of living across the world.

Let us see the average GDP Per Capita by region.

# GDP Per Capita by Region

df_q3_gdp_region = df_q3.groupby('region')['gdp_per_capita'].mean()
df_q3_gdp_region.sort_values(inplace=True)

ax = df_q3_gdp_region.plot(kind='barh')

ax.set_title('GDP Per Capita by Region')
ax.set_xlabel('GDP Per Capita USD')
ax.set_ylabel('Region')

for i in ax.patches:
    # get_width pulls left or right; get_y pushes up or down
    ax.text(i.get_width(), i.get_y(), str(round((i.get_width()), 2)));

Chart above illustrates the huge discrepancies in national wealth and productivity, given that GDP Per Capita is gross national income divided by population.

North America has by far the highest values, most likely driven by the huge USD19 trillion GDP of the USA.

One important point to add is that the GDP Per Capita figure is not adjusted for cost of living differences. Cost of living comparison website https://www.numbeo.com/cost-of-living/ for example, estimates that consumer prices including rent are 40% lower in South Africa compared to USA. This number will be higher for poorer African countries, suggesting that living conditions may not be as stark as the raw difference in GDP per capita suggests.

We will now draw a scatter plot to check for a relationship between GINI index and GDP per capita

# Plot a scatter diagram, using adjusted cpi scores

df_q3.plot(x='gini_index', y='gdp_per_capita', kind='scatter', figsize=(8,8))
plt.title('Income Inequality (GINI) and GDP per Capita')
plt.ylabel('GDP Per Capita USD')
plt.xlabel('Income Inequality (GINI)');

From scatter plot above, there appears to be a weak negative correlation between income inequality and GDP per capita. This suggests that the lower gdp per capita in a given country, the greater its income inequality is likely to be. Put another way, richer countries tend to have a fairer income distribution among their citizens.

Again, this association is weak and is not considered causative, merely correlative.

We will confirm this obervation by calculating the correlation coefficient.

q3_correl = df_q3.gini_index.corr(df_q3.gdp_per_capita)

print('Correlation coefficient between income inequality and GDP per Capita: {}'.format(q3_correl))

Correlation coefficient between income inequality and GDP per Capita: -0.3038398435955632

Negative coefficient of -0.3 confirms the visual observation.

Q4. Does a correlation exist between literacy and health?

In this section, we seek to understand if any correlation exists between literacy and population health. This question is relevant because, for example, it will be useful for public policy if it can be shown that a literate population is more likely to be healthy.

As a reminder:

Adult Literacy Rate: Adult literacy rate is the percentage of people aged 15 years and above who can read and write.

Infant Mortality Rate (IMR): IMR has been defined as the number of deaths in children under 1 year of age per 1000 live births in the same year.

Let us examine the data.

# Create dataframe with only these relevant columns

df_q4 = df_avg.drop(['cpi_score', 'gdp_per_capita', 'gini_index'], axis=1)
df_q4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 211 entries, 0 to 210
Data columns (total 4 columns):
country                  211 non-null object
region                   211 non-null object
infant_mortality_rate    211 non-null float64
literacy_rate            211 non-null float64
dtypes: float64(2), object(2)
memory usage: 6.7+ KB

df_q4.head()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	infant_mortality_rate	literacy_rate
0	Afghanistan	Asia & Pacific	76.86	39.000000
1	Albania	Europe	15.44	96.391969
2	Algeria	Arab States	24.46	81.398935
3	Andorra	Europe	2.60	98.149657
4	Angola	Africa	112.08	70.362420

df_q4.describe()

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	infant_mortality_rate	literacy_rate
count	211.000000	211.000000
mean	27.883904	84.767862
std	25.568050	16.378714
min	1.930000	25.307745
25%	7.550000	81.383965
50%	17.780000	91.559827
75%	43.290000	98.149657
max	112.080000	99.998262

# Histogram of Literacy Rate

df_q4.literacy_rate.plot(kind='hist')
plt.title('Frequency Distribution of Adult Literacy Rate')
plt.xlabel('Literacy Rate %');

Literacy rate is the percentage of adults aged 15 and over who can read and write.

While it's heartening to observe that most countries seem to have a high percentage of educated adults, we observe that in a small number of countries, there are disturbingly high numbers of adults wo are unable to read and write.

Let us quickly try to dentify these countries.

# Copy dataframe and remove unneeded column
literacy_nsmall = df_q4.copy()
literacy_nsmall = literacy_nsmall.drop('infant_mortality_rate', axis=1)

# Filter to 10 lowest literacy rate countries
literacy_nsmall_list = literacy_nsmall.literacy_rate.drop_duplicates().nsmallest(10)
literacy_nsmall = literacy_nsmall.query('literacy_rate in @literacy_nsmall_list')

# Sort dataframe
literacy_nsmall.sort_values('literacy_rate', inplace=True)
literacy_nsmall

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	literacy_rate
75	Guinea	Africa	25.307745
29	Burkina Faso	Africa	28.729214
115	Mali	Africa	32.270483
37	Chad	Africa	35.391470
62	Ethiopia	Africa	38.995982
0	Afghanistan	Asia & Pacific	39.000000
104	Liberia	Africa	42.941084
164	Sierra Leone	Africa	43.283100
161	Senegal	Africa	49.695127
128	Mozambique	Africa	50.583811

The ten countries with lowest literacy rates are overwhelmingly in Africa, with Guinea, Burkina Faso and Mali as the lowest three at 25%, 29%, and 32%.

Let us perform the same quick check of countries with hightest literacy rates.

# Copy dataframe and remove unneeded column
literacy_nlarge = df_q4.copy()
literacy_nlarge = literacy_nlarge.drop('infant_mortality_rate', axis=1)

# Filter to 10 highest literacy rate countries
literacy_nlarge_list = literacy_nlarge.literacy_rate.drop_duplicates().nlargest(10)
literacy_nlarge = literacy_nlarge.query('literacy_rate in @literacy_nlarge_list')

# Sort dataframe
literacy_nlarge.sort_values(by='literacy_rate', ascending=False, inplace=True)
literacy_nlarge

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	literacy_rate
140	North Korea	Asia & Pacific	99.998262
49	Cuba	South/Latin America	99.834250
61	Estonia	Europe	99.796890
101	Latvia	Europe	99.784240
68	Georgia	CIS	99.732470
94	Kazakhstan	CIS	99.732411
150	Poland	Europe	99.730190
198	Ukraine	CIS	99.718740
185	Tajikistan	CIS	99.707060
107	Lithuania	Europe	99.703550

Some surprise appearances are North Korea, Cuba and Kazakhstan.

We should take a similar look at our infant mortality rate data: distribution, ten highest and ten lowest. After taking an overview of the data in this way, we will finally attempt to answer the research question.

Recall that infant mortality rate is the number of deaths in children under 1 year of age per 1000 live births, and this measure is seen as a good indicator of population health.

Let us start by looking at the distribution of infant mortality rates.

df_q4.infant_mortality_rate.plot(kind='hist')
plt.title('Distribution of Infant Mortality Rates')
plt.ylabel('Frequency')
plt.xlabel('Infant Mortality Rate');

Data is skewed to right, with most countries having low IMR. Unfortunately, however, a few countries have high IMR values. We will take a look at the ten highest and ten lowest IMR countries.

# Copy dataframe and remove unneeded column
imr_nlarge = df_q4.copy()
imr_nlarge.drop('literacy_rate', axis=1, inplace=True)

# Filter to 10 highest imr
imr_nlarge_list = imr_nlarge.infant_mortality_rate.drop_duplicates().nlargest(10)
imr_nlarge = imr_nlarge.query('infant_mortality_rate in @imr_nlarge_list')

# Sort dataframe
imr_nlarge.sort_values(by='infant_mortality_rate', ascending=False, inplace=True)
imr_nlarge

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	infant_mortality_rate
4	Angola	Africa	112.08
164	Sierra Leone	Africa	111.54
36	Central African Republic	Africa	103.48
169	Somalia	Arab States	99.80
37	Chad	Africa	95.08
43	Congo, Dem. Rep.	Africa	87.00
115	Mali	Africa	85.30
138	Nigeria	Africa	84.44
59	Equatorial Guinea	Africa	81.34
47	Cote d'Ivoire	Africa	79.26

All ten countries with highest IMR are in Africa, including Somalia which our region classification shows as an Arab State. Let us also look at the ten countries with lowest IMR.

# Copy dataframe and remove unneeded column
imr_small = df_q4.copy()
imr_small.drop('literacy_rate', axis=1, inplace=True)

# Filter to 10 lowest imr
imr_small_list = imr_small.infant_mortality_rate.drop_duplicates().nsmallest(10)
imr_small = imr_small.query('infant_mortality_rate in @imr_small_list')

# Sort dataframe
imr_small.sort_values(by='infant_mortality_rate', inplace=True)
imr_small

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

</style>

	country	region	infant_mortality_rate
106	Liechtenstein	Europe	1.93
82	Iceland	Europe	1.98
108	Luxembourg	Europe	2.06
165	Singapore	Asia & Pacific	2.22
92	Japan	Asia & Pacific	2.44
181	Sweden	Europe	2.54
3	Andorra	Europe	2.60
64	Finland	Europe	2.62
141	Norway	Europe	2.74
167	Slovenia	Europe	2.86

The ten lowest IMR countries are shown above. The metric suggests high levels of populations health for these countries.

We will now investigate if any relationship exists between adult literacy and health.

# Single shot view of distributions and correlations

pd.plotting.scatter_matrix(df_q4);

# Scatter plot between literacy and IMR

q4_1 = df_q4.plot(kind='scatter', x='literacy_rate', y='infant_mortality_rate')
q4_1.set_title('Literacy and Infant Mortality Rate')
q4_1.set_xlabel('Literacy Rate')
q4_1.set_ylabel('Infant Mortality Rate');

There appears to be a strong negative correlation between literacy and infant mortality. The higher the rate of adult literacy, the fewer infant deaths are recorded, indicating a healthier populations are associated with literate populations.

Let us confirm this by calculating the correlation coefficient.

# Calculate correlation coefficient between literacy and infant mortality

df_q4.literacy_rate.corr(df_q4.infant_mortality_rate)

-0.7603129044114025

High negative coefficient of -0.76 confirms the visual observation.

It must be repeated that this merely indicates a correlation and does not confirm causation.

Conclusions

We will now summarize the findings from our data exploration, performed under the guidance of four questions.

It must be stressed that the main tool of analysis here was descriptive statistics. And even that was limited to mainly correlation. This analysis therefore makes no claims about a causal relation between any of the measures used.

Nevetheless, this analysis has explored multiple socio-economic indicators across countries of the world. For each indicator, we have identified global and regional frequency distributions, as well as top and bottom countries. So even though the research questions focused on correlation-type questions, our journey through the exploration process exposed some interesting learnings.

For example, our exploration showed that while we find the worst cases of income inequality in Africa, on average by region it is South America that shows the worst levels.

We also found, surprisingly for me, that North Korea, Cuba and Kazakhstan are in the top six most literate countries in the world.

These insights and correlations exposed in this project can serve as a useful entry point into deeper analysis of the research questions posed here.

Question 1 Findings: Which five countries have the highest and lowest income inequality?

Our observation in relation to this question is that the five countries with lowest income inequality are in Europe. They are Slovenia, Ukraine, Norway, Czech Republic, Slovakia.

On the other hand, the five countries with highest income inequality are in Africa. They are South Africa, Namibia, Botswana, Central African Republic, and Zambia.

However, a few Afrian countries seem to have admirably high levels of fair income distribution, so that on average, South America, not Africa, appears to have the worst income inequality.

Question 2 Findings: Is there a relationship between income inequality and corruption?

Our observation is that a very weak negative correlation exists between income inequality and corruption. The data therefore is inconclusive with regards to a relationship between income inequality and corruption perception.

Question 3 Findings: Is GDP per capita a good predictor of income inequality?

From analysis of our data, it appears there is a negative correlation between income inequality and GDP per capita. This suggests that the lower the gdp per capita in a country, the greater its income inequality is likely to be. Put another way, richer countries tend to have a fairer income distribution among their citizens.

Again, this association is not considered to be causative, merely correlative.

Question 4 Findings: Does a correlation exist between literacy and health?

There appears to be a strong negative correlation between literacy and infant mortality. This implies that the higher the rate of adult literacy, the fewer infant deaths are recorded, or put another way, healthy populations are likely to be literate.

Resources

Coding resources

Python For Data Analysis 2nd Edition, Wes Mckinney, O'REILLY, ISBN: 9781491957660
https://stackoverflow.com/
http://robertmitchellv.com/blog-bar-chart-annotations-pandas-mpl.html
https://matplotlib.org/api/
http://pandas.pydata.org/pandas-docs/stable/
https://morphocode.com/pandas-cheat-sheet/

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.gitattributes		.gitattributes
README.md		README.md
corruption_2007_2010_2011.csv		corruption_2007_2010_2011.csv
corruption_2008_2009.csv		corruption_2008_2009.csv
country_list.csv		country_list.csv
country_list_clean.csv		country_list_clean.csv
gdp.csv		gdp.csv
gini_cpi.csv		gini_cpi.csv
inequality.csv		inequality.csv
infant.csv		infant.csv
investigating-income-inequality-corruption-health-literacy.ipynb		investigating-income-inequality-corruption-health-literacy.ipynb
literacy.csv		literacy.csv
output_168_0.png		output_168_0.png
output_174_0.png		output_174_0.png
output_178_0.png		output_178_0.png
output_181_0.png		output_181_0.png
output_183_0.png		output_183_0.png
output_191_0.png		output_191_0.png
output_196_0.png		output_196_0.png
output_200_0.png		output_200_0.png
output_210_0.png		output_210_0.png
output_217_0.png		output_217_0.png
output_220_0.png		output_220_0.png
output_230_0.png		output_230_0.png
output_238_0.png		output_238_0.png
output_244_0.png		output_244_0.png
output_245_0.png		output_245_0.png

ApophisAlpha/income_inequality

Folders and files

Latest commit

History

Repository files navigation

Investigating Income Inequality, Corruption, Health and Literacy

Author: David Oguara

Table of Contents

Introduction

Dataset Selection

Investigation Questions

Data Wrangling

General Properties

Infant Mortality Rate (rate per 1000 births) - Data Inspection

Literacy rate, adult total (% of people ages 15 and above) - Data Inspection

Inequality index (Gini) - Data Inspection

Poverty (% people below USD1.90 a day)

GDP/capita (USD, inflation-adjusted)

Data Cleaning

Corruption Perception Index (CPI) - Data Cleaning

Infant Mortality Rate (rate per 1000 births) - Data Cleaning

Literacy rate, adult total (% of people ages 15 and above) - Data Cleaning

Inequality index (Gini) - Data Cleaning

Poverty (% people below USD1.90 a day) - Data Cleaning

GDP/capita (USD, inflation-adjusted) - Data Cleaning**

Combined Data

Exploratory Data Analysis

Q1. Which five countries have the worst and best measures of income inequality?

Q2. Is there a relationship between income inequality and corruption?

Q3. Is GDP per capita a good predictor of income inequality?

Q4. Does a correlation exist between literacy and health?

Conclusions

Question 1 Findings: Which five countries have the highest and lowest income inequality?

Question 2 Findings: Is there a relationship between income inequality and corruption?

Question 3 Findings: Is GDP per capita a good predictor of income inequality?

Question 4 Findings: Does a correlation exist between literacy and health?

Resources

Coding resources

Definitions and Additional Information on metrics

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages