World Bank Case Study

World Bank Case Study:
Effects of Education on Economic Outcomes



Research Question: Can selected educational indicators be shown to correlate to economic growth?


One of the projects I completed for my Google Data Analysis Certification through Coursera is a analysis of a fractional subsection of data available through the World Bank. After viewing several years' worth of data for a wide range of variables, I drew ten measures from the vast lists of indicators available there, from table 2.8: Participation in Education. I selected only data from 2014, a year in which many records were more complete than for other years. 

I sought to determine whether any of the variables selected (listed in Table 1 below) explain or show relationship to growth outcomes for selected countries, such as GDP Per Capita. In my analysis, I looked for any correlation in the Government Expenditure, Internet Users per 100, or Enrollment in Upper Secondary variables. With AI becoming a dominant force in most sectors of developed economies, including public and higher education, countries with stronger indicators in public education may prove to have better outcomes economically. 

Table 1
indicator_nameindicator_codevector
Duration of compulsory education (years)SE_COM_DURSx1
Enrolment in primary education, both sexes (number)SE_PRM_ENRLx2
Enrolment in secondary education, both sexes (number)SE_SEC_ENRLx3
GDP per capita (current US$)NY_GDP_PCAP_CDx4
Government expenditure on education as % of GDP (%)SE_XPD_TOTL_GD_ZSx5
Internet users (per 100 people)IT_NET_USER_P2x6
Population of compulsory school age, both sexes (number)UIS_SAP_CEx7
Population, totalSP_POP_TOTLx8
Enrolment in upper secondary education, both sexes (number)UIS_E_3y1
Unemployment, total (% of total labor force)SL_UEM_TOTL_ZSy2

Selection and Preparation of Data

The initial file contained 1824 rows, as each country had data for at least one of the indicators in Table 1. After pivoting in Sheets to show one row per country, there were 241 records. Of these, 98 had no data for UIS_E_3, 17 had no data for SL_UEM_TOTL_ZS, two of the variables I was most interested in; these were removed. Another 22 rows gave data for regions or economic subcategories; these were also removed. The remaining file with 104 rows was then exported to CSV and uploaded to Google BigQuery. 

Using SQL I finished cleaning the data, set the column types and standardized the float for applicable columns. I ensured that there were no duplicate entries, and all empty fields showed null. Most columns with null data are secondarily important to my questions, so those records were left as-is. I then exported the cleaned file to CSV again, and uploaded it to R for analysis. The dataset can be found here.


Analysis

I verified data types for each column and previewed the data, shown in Table 2. I chose to assign vectors to each column in order to create a correlation table (shown in Table 1). 


Table 2
> str(worldbank2014)
data.frame': 104 obs. of 12 variables:
$ country_name : chr "Afghanistan" "Albania" "Armenia" "Australia" ...
$ country_code : chr "AFG" "ALB" "ARM" "AUS" ...
$ SE_COM_DURS : int 9 9 12 10 9 10 9 11 9 12 ...
$ SE_PRM_ENRL : int 6217756 195720 142517 2168531 327247 517708 103806 21171 368783 ...
$ SE_SEC_ENRL : int 2602734 333291 244543 2371284 697388 NA 90092 20603 648541 1210112 ...
$ UIS_E_3 : int 968769 151937 103385 1104162 363188 372641 44100 8410 200900 776413 ...
$ NY_GDP_PCAP_CD : num 612 4579 3995 62215 51733 ...
$ SE_XPD_TOTL_GD_ZS: num 3.78 NA 2.25 5.23 5.5 2.63 2.47 6.57 4.99 6.59 ...
$ IT_NET_USER_P2 : int 7 60 55 84 81 75 91 75 59 85 ...
$ UIS_SAP_CE : int 8034251 357878 NA NA 743482 1135639 NA 41471 801028 1471994 ...
$ SP_POP_TOTL : int 32758020 2889104 2906220 23460694 8541575 9535079 1336397 283385 ...
$ SL_UEM_TOTL_ZS : num 8.6 17.5 17.6 6.1 5.6 4.9 1.2 12.2 0.5 8.5 ...

The correlation table showed many relationships with no observable data, because of the null fields. Of the remaining categories, the strongest correlations observed were:
    
    y1 (Enrolment in upper secondary education) to x8 (Population, total): 0.995     
    x4 (GDP per capita (current US$)) to x6 (Internet users (per 100 people)): 0.749

y2 (Unemployment, total (% of total labor force)) was not closely correlated to any of the variables selected. Its greatest correlation coefficient at r = -0.189 is with GDP per capita.

To explore further relationship between GDP per capita and Internet users, I created a new column calculating their ratio. I then ran a linear regression (model1), selecting GDP per capita as the dependent variable; I used the inverse as model2. These models give a p-value of 2.2x10-16. Variance of 0.561 in both explains more than half the variation in GDP per capita alone--though with the caveat of a substantial error size. It may be a meaningful relationship, but there is of course much left unexplained.

Model3, a multivariate linear regression, shows Enrolment in upper secondary education best explained by Population, Population of compulsory school age and Enrolment in secondary education, respectively, indicating a near proportional relationship between upper secondary student enrollment and population overall (the r = 0.995 from above).
Model4, also multivariate, shows Unemployment, however, as best explained by GDP per capita and Internet users, the relationship with the second-highest correlation coefficient. This is more interesting as it is not as easily explained as model3. To analyze this relationship further I created a new column calculating their ratio, titled GDP_to_InternetUsers.

Variance of this ratio column is very high. However, the correlation coefficient at 0.749 is closer to 1 than 0, showing a strong positive correlation. The relationship I selected to visualize, therefore, is that between the number of internet users per 100 and the GDP per capita.

All coding for this R analysis can be found on my RPubs page.

Visualization

The visualization I chose, at the top of this article, utilizes one of my favorite views. A circle view, with size and color of each circle representing data, give a very visual sense of how countries differ. At a glance, it is easy to identify areas of the world with the highest GDP values (by color) and ratio of GDP to number of internet users (by size). Of the sample set of countries represented in this study, the western world and the OPEC countries can be easily spotted as having the highest GDPs. In this study, the circle view allows for a visually engaging representation of the data as well as more meaningful information.

For view 1 (shown above) I selected my ratio variable, GDP_to_InternetUsers, to represent with circles sized proportionally. I used GDP per capita, shown here by its indicator code 'NY_GDP_PCAP_CD', for color distribution; the darker the color, the higher the GDP Per Capita. I also included Internet users ('IT_NET_USER_P2') in the detail level, shown when a country is hovered over, as below.

Image of legend for Bosnia and Herzegovina, including values for the three data points described in the preceding paragraph.


The GDP_to_InternetUsers column had values from $46 (Kyrgyz Republic) to $1343 (Macao SAR, China). I found, however, that the resulting visualization had very little variation in color, with a tiny group of dark green dots in a sea of gold. To enhance the visual quality of this representation, I edited the filter for this variable to narrow the range to have a top value of $748. This removed the points for four countries and maintained 100. The four excluded records (Macau, Norway, Switzerland and Qatar) are still in the dataset, just not the viz.

Finally, I've included a more standard representation in which the borders of each country are filled by the color representing GDP_to_InternetUsers. This is a more standard visualization but it displays only the one variable. This may be valuable for a single-purpose illustration but it lacks the complexity and interest of view 1. It is previewed below, and is available as view 2 in my Tableau Public portfolio.

Map view of Earth with over 100 countries shaded in five gradient shades of green by variable "GDP Per Capita to Internet Users"

Conclusions and Further Analysis


This relationship, between the GDP per capita and the internet users per 100 people, is only possibly significant, but it may have implications for the success of countries measured here. With the growth and dominance of AI modeling in all sectors of the economy, the gap between the world’s richest and poorest countries will continue to widen. In this sample, no educational factor, from primary enrollment to upper secondary enrollment and completion, showed as strong a correlation in this sample on unemployment or GDP as does the rate of internet users per 100.
The top nations in the sample for internet users were Norway, Denmark, Sweden, Netherlands, United Kingdom, Qatar, Bahrain, and United Arab Emirates, all with 90% or above using the internet to one degree or another. For all nations to succeed economically, each of their GDP-to-Internet Users ratios must decrease in value, maximizing the divisor to as close to 100 (representing 100% of the population) as possible.
Further exploration of the data should include cleaning up the columns with null values, either by researching and replacing the missing data, or by selecting another sample of countries with complete records. The variables without valid correlation coefficients may have very interesting relationships to GDP Per Capita, Enrollment in Upper Secondary, Unemployment, or other economic indicators.





Popular posts from this blog

Texas School District Cell Phone Bans

Background