Preparation: Data visualization with the Preston curve

For Session 7

Authors

Kim Antunez, François Briatte

This exercise focuses on

This is not an easy exercise: work with your group, and get ready to spend a couple of hours on it.

Scenario

You are interning at the World Bank, and have been asked to plot the most recent version of the Preston curve.

A previous intern has left you some code, which is included in this folder, as well as a plot showing the expected result (also reproduced below). However, the code that produces the plot is actually missing from the script.

Instructions

Execute and fill in the script to get as close as possible to the expected result.

Hints:

  • Everything except the text labels happens with the ggplot2 package. Check its online documentation as much as needed.
  • The text labels are the hardest part to get right. They are produced with the ggrepel package: read its vignette, and see if you can manage something close enough to the expected result.
  • The expected result uses the ‘classic’ ggplot2 theme (?theme_classic), with a base text size of 12 points.
  • The sizes of the data points in the expected result range from 1.5 to 10.5. Use scale_size_continuous to specify that range.

Download the dataset to your computer

  1. wdi.csv

Load the data and the required packages

library(tidyverse) # {dplyr}, {ggplot2}, {readr}, {stringr}, {tidyr}, etc.
library(countrycode)
repository <- "data"
# read the 'wdi' dataset
wdi <- readr::read_csv(file.path(repository, "wdi.csv"), show_col_types = FALSE)

Exercise

Draw this curve!

You will need 2 steps.

Step 1: Preprocessing

  • Using the countrycode package, recode the variable iso3c from iso3c format to iso3c format. This sets every code that is not a country ISO-3C code (for example, the codes of regional aggregates) to NA.
  • Create a new variable called region by converting the same iso3c variable from iso3c to region format.
  • Remove the rows that contain missing values (NA).
  • Keep only the latest (max) year for each country, grouping by iso3c.
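
To see what the first two bullets do, here is a minimal sketch on a toy vector (not taken from wdi.csv). It mixes two country codes with "WLD", the code the World Bank uses for its world aggregate. The sketch assumes the countrycode package is installed; warn = FALSE only silences the warning about unmatched codes.

```r
library(countrycode)

# toy input: two countries and one aggregate code (illustrative only)
codes <- c("FRA", "BRA", "WLD")

# iso3c -> iso3c keeps valid country codes and returns NA for anything else
clean <- countrycode(codes, origin = "iso3c", destination = "iso3c", warn = FALSE)

# iso3c -> region returns the World Bank region of each valid code
region <- countrycode(clean, origin = "iso3c", destination = "region", warn = FALSE)

# compare input and output side by side
data.frame(codes, clean, region)
```

The aggregate code comes back as NA in both new columns, which is what makes the later "remove NA" step drop it.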

Step 2: Draw the graph

Of course, you’ll use the ggplot2 package, and more precisely:

  • geom_smooth
  • geom_point (twice!).

Note: Notice how some circles are labelled and outlined in black. These are the countries for which pop > 200 * 10^6. You may need to filter the dataset at some point!

  • ggrepel::geom_label_repel
  • labs to modify the axis titles
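
The "geom_point twice" idea can be sketched on toy data as follows. The data frame, its values and the threshold of 200 are all hypothetical and chosen only for illustration. The point is that a layer can receive its own data argument, so a filtered subset can be redrawn on top of the full set of points.

```r
library(ggplot2)

# toy data (hypothetical values, for illustration only)
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 3, 5, 4),
  pop = c(10, 250, 40, 300),
  group = c("a", "a", "b", "b")
)

# subset to highlight; the threshold of 200 is arbitrary here
big <- subset(toy, pop > 200)

p <- ggplot(toy, aes(x = x, y = y)) +
  # first layer: all points, coloured by group
  geom_point(aes(size = pop, color = group)) +
  # second layer: only the subset, redrawn with a black border
  # (shape 21 is a filled circle: `fill` sets the inside, `color` the outline)
  geom_point(data = big, aes(size = pop, fill = group),
             shape = 21, color = "black")

p
```

The same pattern applies to the label layer: passing a filtered data frame to it restricts the labels to the highlighted points.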

You can work on the theme with the following code:

# control the minimal and maximal point sizes
scale_size_continuous(range = c(1.5, 10.5)) +
  guides(fill = "none", size = "none") +
  # final cosmetics
  theme_classic(base_size = 12) +
  theme(legend.position = c(0.87, 0.25),
        legend.title = element_blank())

Step 1: Preprocessing

wdi <- wdi %>%
  mutate(
    # set non-country ISO-3C codes to NA
    iso3c = countrycode::countrycode(iso3c, "iso3c", "iso3c"),
    region = countrycode::countrycode(iso3c, "iso3c", "region")
  ) %>%
  # drop rows with missing values
  tidyr::drop_na() %>%
  # subset to most recent year
  group_by(iso3c) %>%
  filter(year == max(year)) %>%
  # drop the grouping now that it is no longer needed
  ungroup()
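
Because the rows with missing values are dropped before the filter, the "most recent year" is the most recent year with complete data, and it can differ from one country to the next. The sketch below illustrates this on a toy panel with hypothetical values. It uses dplyr::slice_max as an alternative to filter(year == max(year)); this is one possible formulation, not the one used in the script above.

```r
library(dplyr)

# toy panel (hypothetical values): two countries with different latest years
toy <- tibble(
  iso3c = c("FRA", "FRA", "BRA", "BRA", "BRA"),
  year  = c(2019, 2020, 2019, 2020, 2021),
  lexp  = c(82.6, 82.2, 75.3, 74.0, 72.8)
)

latest <- toy %>%
  group_by(iso3c) %>%
  # keep the row(s) with the highest year within each country
  slice_max(year, n = 1) %>%
  ungroup()

latest
```

A quick sanity check on the real data is that each iso3c value appears only once after this step.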

Step 2: Draw the graph

ggplot(wdi, aes(y = lexp, x = gdpc)) +
  # draw the Preston curve
  geom_smooth(se = FALSE, method = "loess", color = "grey50") +
  # draw the underlying data points
  geom_point(aes(size = pop, color = region)) +
  # highlight a few very populous countries
  ggrepel::geom_label_repel(
    data = filter(wdi, pop > 200 * 10^6),
    aes(label = country),
    box.padding = 1.75, segment.color = "grey50", fill = "white", label.size = 0, seed = 42) +
  # redraw the highlighted points, with an additional border
  geom_point(data = filter(wdi, pop > 200 * 10^6),
             aes(size = pop, fill = region),
             shape = 21, color = "black") +
  # control the minimal and maximal point sizes
  scale_size_continuous(range = c(1.5, 10.5)) +
  # final cosmetics
  guides(fill = "none", size = "none") +
  theme_classic(base_size = 12) +
  # bug CI
  #theme(legend.position.inside = c(0.87, 0.25), legend.title = element_blank()) +
  theme(legend.title = element_blank()) +
  labs(y = "Life expectancy", x = "GDP per capita")

# export final result
# ggsave("preston-curve.png", width = 9, height = 6)

Side note

The graph above shows the outlier status of the United States as a country, but even at the individual level, the life expectancy of US residents is strikingly lower than it should be (given US income levels).

John Burn-Murdoch, from the Financial Times, has an article and a well-illustrated Twitter thread on the topic. Quoting from it:

“Beyond age 70, US mortality/survival rates are very similar to other rich countries. But between teenage years and early middle age there is a vast gulf… More years of American lives were erased by drugs, guns & road deaths in 2021 alone than from Covid during the whole pandemic.”

I recommend taking a look at the whole thing.

Source

The data were obtained with the WDI package, using the code below:

library(WDI)
what <- c(
  "lexp" = "SP.DYN.LE00.IN",
  "gdpc" = "NY.GDP.PCAP.PP.CD",
  "pop" = "SP.POP.TOTL"
)
wdi <- WDI::WDI(indicator = what, start = 2019)
write.csv(wdi, "wdi.csv", row.names = FALSE)