Exercise 1: Anscombe’s quartet

Session 5

Authors

Kim Antunez, François Briatte

Download datasets on your computer

  1. anscombe.tsv

Load useful packages and data

library(tidyverse) # {dplyr}, {ggplot2}, {readr}, {stringr}, {tidyr}, etc.
repository <- "data"
# read Anscombe's quartet data
anscombe <- readr::read_tsv(file.path(repository, "anscombe.tsv"))
Rows: 44 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
dbl (3): set, x, y

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(anscombe)
Rows: 44
Columns: 3
$ set <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ x   <dbl> 10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5, 10, 8, 13, 9, 11, 14, 6, 4, …
$ y   <dbl> 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68,…

Note how similar the summary statistics of x and y are across the four sets!

anscombe %>% 
  group_by(set) %>% 
  summarise(
    mu_x = mean(x),
    var_x = var(x),
    mu_y = mean(y),
    var_y = var(y)
  )
# A tibble: 4 × 5
    set  mu_x var_x  mu_y var_y
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     9    11  7.50  4.13
2     2     9    11  7.50  4.13
3     3     9    11  7.5   4.12
4     4     9    11  7.50  4.12
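
The resemblance goes beyond means and variances. As a minimal sketch, assuming the same set, x and y columns as above, the correlation and the least-squares line of y on x can be computed for each set. The data are rebuilt here from the built-in datasets::anscombe so that the block runs without the TSV file; quartet, wide and fit_stats are illustrative names, not part of the exercise.

```r
# Rebuild the tidy data from the wide built-in dataset (columns x1..x4, y1..y4)
wide <- datasets::anscombe
quartet <- data.frame(
  set = rep(1:4, each = nrow(wide)),
  x = unlist(wide[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(wide[paste0("y", 1:4)], use.names = FALSE)
)

# Correlation, intercept and slope of y ~ x, computed separately for each set
fit_stats <- do.call(rbind, lapply(split(quartet, quartet$set), function(d) {
  fit <- coef(lm(y ~ x, data = d))
  data.frame(
    set = d$set[1],
    cor_xy = cor(d$x, d$y),
    intercept = unname(fit[1]),
    slope = unname(fit[2])
  )
}))

print(round(fit_stats, 2))
```

All four sets give a correlation of roughly 0.82 and a fitted line close to y = 3 + 0.5x, which is why plotting the data matters.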

Note how different the four sets look once plotted!

ggplot(anscombe, aes(x, y)) +
  geom_point() + 
  facet_wrap(~ set)

Fundamentals of the ggplot2 plotting system

R has ‘base graphics’…

plot(density(anscombe$x))

But ggplot2 just looks better!

ggplot(anscombe, aes(x)) +
  geom_density()

‘Base graphics’ will get you everywhere…

plot(anscombe$x, anscombe$y)

… but ggplot2 has a consistent syntax!

ggplot(anscombe, aes(x, y)) +
  geom_point()

ggplot2 can modify the ‘appearance’ of your data points…

ggplot(anscombe, aes(x, y)) +
  geom_point(size = 5, color = "tomato", fill = "gold", shape = 21)

Question 1

Run the code below, see why the plot does not work properly, and find a fix.

ggplot(anscombe, aes(x, y, color = set)) +
  geom_point()

What is the type of the variable set? Use str(dataset$variable).

str(anscombe$set)
 num [1:44] 1 1 1 1 1 1 1 1 1 1 ...

set is stored as a numeric variable, so ggplot2 maps it to a continuous colour gradient; we prefer to treat it as a factor:

ggplot(anscombe, aes(x, y, color = factor(set))) +
  geom_point()
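
Another option is to convert set once, in the data, rather than inside every aes() call. This is only a sketch: it rebuilds the data from the built-in datasets::anscombe so that it runs without the TSV file, and quartet, wide and p are illustrative names.

```r
library(ggplot2)

# Rebuild the tidy data from the wide built-in dataset (columns x1..x4, y1..y4)
wide <- datasets::anscombe
quartet <- data.frame(
  set = rep(1:4, each = nrow(wide)),
  x = unlist(wide[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(wide[paste0("y", 1:4)], use.names = FALSE)
)

# Convert the numeric identifier to a factor once, in the data
quartet$set <- factor(quartet$set)
str(quartet$set)

# A factor gets a discrete colour scale: one colour and one legend key per set
p <- ggplot(quartet, aes(x, y, color = set)) +
  geom_point()
p
```

With the tidyverse loaded, the same conversion on the imported data would be anscombe %>% mutate(set = factor(set)).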

Let’s use facets (small multiples) instead.

ggplot(anscombe, aes(x, y)) +
  geom_point() + 
  facet_wrap(~ set, nrow = 1) +
  coord_equal()

And let’s finally add another geometry.

ggplot(anscombe, aes(x, y)) +
  geom_point() + 
  geom_smooth(method = "lm", fill = NA, fullrange = TRUE) + 
  facet_wrap(~ set)
`geom_smooth()` using formula = 'y ~ x'

Question 2

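Try removing geom_smooth options to see what happens.

If the whole geom_smooth() layer is removed (commented out below), the regression lines disappear and only the points remain.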

ggplot(anscombe, aes(x, y)) +
  geom_point() + 
  #geom_smooth(method = "lm", fill = NA, fullrange = TRUE) + 
  facet_wrap(~ set)
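
Commenting out the whole layer is the most drastic test. A gentler sketch drops the options one at a time. The comments describe ggplot2's documented defaults as we understand them, so check them against what you actually see. The data are rebuilt from datasets::anscombe so that the block runs on its own; quartet, base_plot and the three plot names are illustrative. The last variant may emit warnings for set 4, where all but one of the x values are equal.

```r
library(ggplot2)

# Rebuild the tidy data from the wide built-in dataset (columns x1..x4, y1..y4)
wide <- datasets::anscombe
quartet <- data.frame(
  set = rep(1:4, each = nrow(wide)),
  x = unlist(wide[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(wide[paste0("y", 1:4)], use.names = FALSE)
)

base_plot <- ggplot(quartet, aes(x, y)) +
  geom_point() +
  facet_wrap(~ set)

# Without fill = NA: the grey confidence band around each line becomes visible
with_band <- base_plot +
  geom_smooth(method = "lm", fullrange = TRUE)

# Without fullrange = TRUE: each line stops at the smallest and largest x of its panel
# (se = FALSE is the documented way to hide the confidence band)
data_range <- base_plot +
  geom_smooth(method = "lm", se = FALSE)

# Without method = "lm": ggplot2 picks a smoother itself (loess for small datasets)
default_smoother <- base_plot +
  geom_smooth(se = FALSE)
```

Print each object (for instance with_band) to draw it and compare it with the plot from the previous section.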

Source

Data source

datasets::anscombe (R package by the R Core Team), which cites Tufte (1989) as its source, and Anscombe (1973) as the initial source:

Tufte, Edward R. (1989). The Visual Display of Quantitative Information, Graphics Press, pp. 13–14.

Anscombe, Francis J. (1973). Graphs in statistical analysis. The American Statistician, 27, 17–21. doi:10.2307/2682899.

The R code to produce the ‘tidy’ version of the dataset was not preserved, but probably looked somewhat like this:

library(tidyverse)
datasets::anscombe %>%
  tidyr::pivot_longer(everything()) %>%
  mutate(coord = str_sub(name, 1, 1), set = str_sub(name, 2, 2)) %>%
  select(set, coord, value) %>%
  tidyr::pivot_wider(names_from = "coord", values_from = "value", values_fn = list) %>%
  tidyr::unnest(c(x, y)) %>%
  readr::write_tsv("data/anscombe.tsv")

Rationale

The point of this demo is to show you that R has several plotting systems. In class, we cover only ggplot2, whose name refers to the ‘grammar of graphics’ logic that it follows, but there are at least two others, such as the base graphics used above and the lattice package.

You’ll be fine learning just the ggplot2 one, which has also been ported to the Python language.