In a previous R tutorial, Web Scraping Wikipedia World Population with rvest() in R, we scraped global population data from Wikipedia.
Since we were working with 10 countries in the previous tutorial, we will continue to do so moving forward.
Install and Load Packages
Below are the packages and libraries that we will need to load to complete this tutorial.
Input:
install.packages("ggplot2")
install.packages("gridExtra")
install.packages("maptools")
install.packages("RColorBrewer")
library(ggplot2)
library(gridExtra)
library(maptools)
library(RColorBrewer)
Download and Load the Global Population Wikipedia Dataset
Now that the packages are installed and loaded, let’s take a look at the data by importing it with read.csv(). The dataset is available for download on the dataset page or directly from here: Global Population Wikipedia Export – global_population.csv
Input:
global_pop <- read.csv("global_population.csv", stringsAsFactors = FALSE)
View the Global Population Wikipedia Dataset
head() function
Let’s take a look at the population data for the first 10 rows.
Input:
head(global_pop$Population, n = 10)
Output:
[1] "1,388,970,000" "1,327,290,000" "326,547,000" "261,890,900" "210,432,000"
[6] "208,598,000" "193,392,500" "163,916,000" "146,877,088" "126,590,000"
lapply() function
The columns that we will mostly use are Country, Population, and Percentage. Let’s check whether Population and Percentage are numeric or character by using the lapply() function.
Input:
lapply(global_pop, class)
Output:
$Rank
[1] "character"

$Country
[1] "character"

$Population
[1] "character"

$Date
[1] "character"

$Percentage
[1] "character"

$Source
[1] "character"
Clean-up Population and Percentage
All of the columns are stored as character. In order to convert Population and Percentage to numeric, we must remove the commas and percent signs, and then convert the percentages to decimals.
gsub() function
The gsub() call below removes all commas (,) from the Population column. We then re-run the previous head() function to check the result.
Input:
global_pop$Population <- gsub(",", "", global_pop$Population, fixed = TRUE)
head(global_pop$Population, n = 10)
Output:
[1] "1388970000" "1327290000" "326547000" "261890900" "210432000" "208598000" "193392500"
[8] "163916000" "146877088" "126590000"
Now that the commas are gone from the Population column, the next step is to remove the percent sign (%) from the Percentage column and convert the percentage to a decimal. First, let’s look at the current values.
Input:
head(global_pop$Percentage, n = 10)
Output:
[1] "18.3%" "17.5%" "4.3%" "3.45%" "2.77%" "2.75%" "2.54%" "2.16%" "1.93%" "1.67%"
as.numeric() function
The as.numeric(gsub()) call below removes the percent sign (%), converts the remaining text to a number, and divides it by 100 to produce a decimal.
Input:
global_pop$Percentage <- as.numeric(gsub("%", "", global_pop$Percentage)) / 100
head(global_pop$Percentage, n = 10)
Output:
[1] 0.1830 0.1750 0.0430 0.0345 0.0277 0.0275 0.0254 0.0216 0.0193 0.0167
Now that both the Population and Percentage columns are cleaned up, we can make sure both columns are numeric. Percentage was already converted in the previous step, so calling as.numeric() on it again is harmless. Once the columns are converted, we will verify the result by re-running the lapply() function.
Input:
global_pop$Population <- as.numeric(global_pop$Population)
global_pop$Percentage <- as.numeric(global_pop$Percentage)
lapply(global_pop, class)
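The clean-up steps above can also be wrapped in a small helper. The sketch below is only an illustration: clean_number() is a hypothetical function name that does not appear in the original tutorial, and the sample values are copied from the outputs shown earlier rather than read from the CSV.

```r
# Hypothetical helper (not part of the original tutorial): strip thousands
# separators and percent signs, then coerce the result to numeric.
clean_number <- function(x) {
  as.numeric(gsub("[,%]", "", x))
}

# Sample values copied from the outputs shown above.
population_raw <- c("1,388,970,000", "326,547,000")
percentage_raw <- c("18.3%", "3.45%")

population <- clean_number(population_raw)
percentage <- clean_number(percentage_raw) / 100

# as.numeric() returns NA (with a warning) for text it cannot parse,
# so counting NAs is a quick check that the clean-up caught everything.
sum(is.na(population))
sum(is.na(percentage))
```

Counting the NA values after a conversion is a cheap way to spot stray characters, such as footnote markers, that the pattern did not remove.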
Top 10 World Population Plot
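Below is a basic ggplot() + geom_boxplot() to view the population total for the top 10 countries. The plots in the rest of this tutorial use a data frame named pop_top, which is assumed here to hold the first 10 rows of global_pop.
Input:
# Assumption: pop_top is not defined earlier in this tutorial; it is taken to be the 10 most populous countries
pop_top <- head(global_pop, n = 10)
ggplot(pop_top, aes(x = Country, y = Population)) +
  geom_boxplot()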
Output:
Also, the option below disables scientific notation so that the full population number is displayed.
Input:
options(scipen = 999)
And the below will restore the default setting, which allows scientific notation again.
Input:
options(scipen = 0)
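The scipen option changes printing for the whole R session. When only a single value needs to be shown in full, base R's format() is an alternative. The snippet below is a small sketch that is not part of the original tutorial; it uses China's population from the output above.

```r
# format() controls how one value is displayed, independent of the
# session-wide scipen option.
china <- 1388970000

# Fixed (non-scientific) notation.
plain <- format(china, scientific = FALSE)

# Fixed notation with a comma as the thousands separator.
grouped <- format(china, big.mark = ",", scientific = FALSE)

print(plain)
print(grouped)
```

Because format() returns a character string, it suits labels and printed text rather than further calculations.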
reorder() the countries in descending order
In addition to disabling scientific notation, we should plot the countries by population in descending order (most populous to least populous). In a previous tutorial, we used the factor() function with its levels argument to produce this result by manually entering the countries in descending order. The reorder() function can achieve the same output in a fraction of the time.
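To see what reorder() does before using it inside a plot, here is a minimal sketch on a three-row toy data frame. The toy object is an assumption made for illustration (its values are taken from the outputs above); it is not part of the original tutorial.

```r
# Toy data frame with three of the countries shown earlier.
toy <- data.frame(
  Country = c("Brazil", "China", "India"),
  Population = c(208598000, 1388970000, 1327290000),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument.
ascending <- levels(reorder(toy$Country, toy$Population))

# Negating the values reverses the order (most populous first).
descending <- levels(reorder(toy$Country, -toy$Population))

print(ascending)
print(descending)
```

ggplot2 draws a discrete axis in the order of the factor levels, which is why wrapping Country in reorder() changes the order of the bars.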
Below we will create two plots. The first is a bar chart without re-ordering the countries by population. The second re-orders the countries based on population size. Lastly, we will use grid.arrange() from the gridExtra package to combine both plots into a single output. gridExtra builds on the grid package, which provides low-level functions to create graphical objects (grobs) and position them on a page in specific viewports.
Input:
pop_top1 <- ggplot(pop_top, aes(x = Country, y = Population)) +
  geom_bar(stat = "identity")
pop_top2 <- ggplot(pop_top, aes(x = reorder(Country, -Population), y = Population)) +
  geom_bar(stat = "identity")
grid.arrange(arrangeGrob(pop_top1, pop_top2))
Output:
We can also use reorder(Country, Population), without the minus sign, to plot the population in ascending order (least populous to most populous).
Input:
ggplot(pop_top, aes(x = reorder(Country, Population), y = Population)) +
  geom_bar(stat = "identity")
Output:
Now that we have created a plot in descending order of population, we can label the x-axis and y-axis, set the limits of the y-axis, add a centered title, and angle the x-axis labels so the country names are readable.
Input:
ggplot(pop_top, aes(x = reorder(Country, -Population), y = Population)) +
  geom_bar(stat = "identity") +
  ylim(0, 1500000000) +
  xlab("Country") +
  ylab("Population") +
  ggtitle("2018 Top 10 Global Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:
Our population plot looks presentable, but we can add one more effect by mapping fill to Country so that each country gets its own color.
Input:
ggplot(pop_top, aes(x = reorder(Country, -Population), y = Population, fill = Country)) +
  geom_bar(stat = "identity") +
  ylim(0, 1500000000) +
  xlab("Country") +
  ylab("Population") +
  ggtitle("2018 Top 10 Global Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(hjust = 0.5))
Output:

Could we Plot the Countries on a Map?
We can plot the countries on a map with wrld_simpl (Simplified World Country Polygons), a dataset that ships with the maptools package. The object loaded is a SpatialPolygonsDataFrame containing a slightly modified version of Bjoern Sandvik’s improved version of world_borders.zip – TM_WORLD_BORDERS_SIMPL-0.2.zip dataset from the Mapping Hacks geodata site.
Once we load the wrld_simpl object, it will appear as a Large SpatialPolygonsDataFrame with 5 slots; the data slot contains a data.frame with 246 obs. of 11 variables. Now let’s load the object and view wrld_simpl as a plot.
Input:
data("wrld_simpl")
plot(wrld_simpl)
Output:
Now let’s take a look at wrld_simpl and see how we can connect this object to our top 10 global data. We need to find out which variable in wrld_simpl corresponds to the countries in our data.frame pop_top. The easiest way to view the variables is by using the summary() function.
Input:
summary(wrld_simpl)
Output:
Object of class SpatialPolygonsDataFrame
Coordinates:
   min       max
x -180 180.00000
y  -90  83.57027
Is projected: FALSE
proj4string :
[+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0]
Data attributes:
      FIPS          ISO2          ISO3           UN                    NAME
        :  3   AD     :  1   ABW    :  1   Min.   :  4.0   Aaland Islands:  1
 AC     :  1   AE     :  1   AFG    :  1   1st Qu.:215.0   Afghanistan   :  1
 AE     :  1   AF     :  1   AGO    :  1   Median :429.0   Albania       :  1
 AF     :  1   AG     :  1   AIA    :  1   Mean   :431.8   Algeria       :  1
 AG     :  1   AI     :  1   ALA    :  1   3rd Qu.:650.5   American Samoa:  1
 AJ     :  1   AL     :  1   ALB    :  1   Max.   :894.0   Andorra       :  1
 (Other):238   (Other):240   (Other):240                   (Other)       :240
      AREA              POP2005               REGION         SUBREGION           LON
 Min.   :      0.0   Min.   :         0   Min.   :  0.00   Min.   :  0.00   Min.   :-178.13
 1st Qu.:     44.5   1st Qu.:    127508   1st Qu.:  2.00   1st Qu.: 14.00   1st Qu.: -50.16
 Median :   5515.5   Median :   3192616   Median : 19.00   Median : 30.00   Median :  17.66
 Mean   :  52696.1   Mean   :  24636644   Mean   : 65.43   Mean   : 54.84   Mean   :  13.28
 3rd Qu.:  34708.8   3rd Qu.:  12401752   3rd Qu.:142.00   3rd Qu.: 61.00   3rd Qu.:  50.01
 Max.   :1638094.0   Max.   :1312978855   Max.   :150.00   Max.   :155.00   Max.   : 179.22
      LAT
 Min.   :-80.4460
 1st Qu.: -0.3025
 Median : 16.5110
 Mean   : 16.4289
 3rd Qu.: 39.1067
 Max.   : 78.8300
With the above output, we can easily view the variables and choose the one that corresponds to our country names. For comparison, let’s view our countries by name again.
Input:
head(pop_top, n = 10)
Output:
   Rank       Country Population             Date Percentage                       Source
1     1         China 1388970000 January 31, 2018     0.1830    Official population clock
2     2         India 1327290000 January 31, 2018     0.1750    Official population clock
3     3 United States  326547000 January 31, 2018     0.0430    Official population clock
4     4     Indonesia  261890900     July 1, 2017     0.0345   Official annual projection
5     5      Pakistan  210432000 January 31, 2018     0.0277    Official population clock
6     6        Brazil  208598000 January 31, 2018     0.0275    Official population clock
7     7       Nigeria  193392500   March 21, 2016     0.0254     Annual official estimate
8     8    Bangladesh  163916000 January 31, 2018     0.0216    Official population clock
9     9        Russia  146877088  January 1, 2018     0.0193            Official estimate
10   10         Japan  126590000  January 1, 2018     0.0167 Monthly provisional estimate
As we can see from the above outputs, the NAME variable in wrld_simpl corresponds to the Country variable in our data.frame pop_top. We will create a variable holding our countries and then test which entries of NAME appear in it. The plot below reproduces the map from above with our top 10 countries colored blue.
Input:
countries <- pop_top$Country
countries <- wrld_simpl@data$NAME %in% (countries)
plot(wrld_simpl, col = c(gray(.80), "blue")[countries + 1])
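The indexing trick in the plot() call relies on R treating FALSE as 0 and TRUE as 1, so adding 1 turns the logical vector into the positions 1 (gray) and 2 (blue). Below is a minimal sketch with stand-in vectors; map_names and top_names are made-up substitutes for wrld_simpl@data$NAME and pop_top$Country, and "Russian Federation" is a deliberately mismatched spelling used to show how a failed match can be detected.

```r
# Stand-ins for wrld_simpl@data$NAME and pop_top$Country (illustrative only).
map_names <- c("Brazil", "Canada", "China", "France")
top_names <- c("China", "Brazil", "Russian Federation")

# TRUE where a map entry is one of our countries.
is_top <- map_names %in% top_names

# FALSE + 1 is 1 (first color), TRUE + 1 is 2 (second color).
fill <- c("grey80", "blue")[is_top + 1]

# Names in our data that have no counterpart on the map.
unmatched <- setdiff(top_names, map_names)

print(fill)
print(unmatched)
```

Running setdiff() against the real wrld_simpl names before plotting is a quick way to catch countries whose spelling differs between the two sources.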
The plot above colors each of our top 10 countries blue and leaves every other country gray. We know the countries were mapped correctly because they were selected by matching wrld_simpl@data$NAME against our variable countries. Next, let’s take a look at the shades of blue available in the RColorBrewer package, which we installed and loaded at the start of the tutorial. Run the below for 9 shades of blue.
Input:
display.brewer.pal(9, "Blues")
Matching Color Population Against Country
We’re making progress on the map, but we must match a color to each country’s population by creating two additional variables.
What are we even doing below?
- Creating color_population variable based on different shades of blue for each input of unique population
- Creating color_country by matching pop_top$Country and wrld_simpl@data$NAME against color_population variable
Input:
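# One shade of blue per country, from lightest to darkest
color_population <- colorRampPalette(brewer.pal(9, 'Blues'))(length(pop_top$Population))
# Order the shades by population rank (note the capital P in Population)
color_population <- color_population[with(pop_top, findInterval(Population, sort(unique(Population))))]
# Start with every country in gray, then overwrite the matched countries
color_country <- rep(grey(0.8), length(wrld_simpl@data$NAME))
color_country[match(pop_top$Country, wrld_simpl@data$NAME)] <- color_population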
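The two steps listed above can be traced on a tiny example. This is a sketch under stated assumptions: the three populations come from the table shown earlier, map_names is a made-up stand-in for wrld_simpl@data$NAME, and a plain light-to-dark blue ramp replaces the RColorBrewer palette so that the snippet runs with base R only.

```r
# Three populations taken from the table above, in no particular order.
population <- c(China = 1388970000, Japan = 126590000, India = 1327290000)

# One shade per country, from light to dark (stand-in for the Blues palette).
shades <- colorRampPalette(c("lightblue", "darkblue"))(length(population))

# findInterval() against the sorted unique values gives each population its
# rank: 1 for the smallest, 3 for the largest.
rank_index <- findInterval(population, sort(unique(population)))
country_shade <- shades[rank_index]

# Stand-in for the map's country names (illustrative only).
map_names <- c("India", "France", "Japan", "China")

# Every map entry starts gray; matched countries receive their shade.
map_fill <- rep("grey80", length(map_names))
map_fill[match(names(population), map_names)] <- country_shade

print(rank_index)
print(map_fill)
```

If a country name had no counterpart on the map, match() would return NA for it, so it may be worth checking the result with anyNA() before assigning colors.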
Now let’s take a look at the map with the blue shading for each country.
Input:
plot(wrld_simpl, col = color_country)
Output:
Feel free to work with this data and create other plots to visualize the data in other packages. Another well-known package for mapping is rworldmap. Take a look at this package and see what you can create!