Preparation: Code and data essentials [1/2]

For Session 4

Authors

Kim Antunez, François Briatte

This exercise focuses on

using RStudio to set up a project (code and data)
code essentials: creating a script, loading required packages
data essentials: creating data frames from external data sources
data cleaning and preparation

This is not an easy exercise: work with your group, and get ready to spend a couple of hours on it.

Prerequisites

Prior to completing the following exercise, open RStudio and set the folder of your choice as the working directory.
Also check that you have installed the packages used in previous classes, as they will probably come in handy.
Do not forget to load them at the beginning of your script if you use their functions.
Do not forget to comment your code appropriately.

Scenario

You are interning at the Brussels office of Human Rights Watch, and have been asked to verify whether some form of disaffection with democracy shows up in Eurobarometer survey data for Poland and Hungary.

Instructions

Note – the hints below cite some R help pages, but you can also browse those pages online, by going, for instance, to the dplyr website. You should also feel free to find solutions online: use your best search skills.

Question 1

Download the European Parliament trend lines for ‘satisfaction with democracy in your country’ (end of the page) to the data folder. Feel free to open it in a spreadsheet editor to take a first look at it.

Clue

A copy of the dataset can be downloaded here.

Question 2

The data you just downloaded comes as an Excel spreadsheet. Create a new script in the working directory, and write the required code to load the ‘EU’, ‘HU’ and ‘PL’ sheets of the spreadsheet into three data frames.

Solution

library(tidyverse) # {dplyr}, {ggplot2}, {readxl}, {stringr}, {tidyr}, etc.

This is the path to the spreadsheet.

repository <- "data"

Let’s now open the ‘EU’ sheet, skipping the rows we do not need at the top.

data_eu <-  readxl::read_excel(paste0(repository,
                                 "/Satisfaction_with_democracy_in_your_country.xls"),
                          sheet = "EU", skip = 4)

New names:
• `` -> `...1`

We are going to assemble this dataset with two other similar datasets that come from the same spreadsheet, just from different sheets within it.

data_pl <-  readxl::read_excel(paste0(repository,
                                 "/Satisfaction_with_democracy_in_your_country.xls"),
                          sheet = "PL", skip = 4)

New names:
• `` -> `...1`

data_hu <-  readxl::read_excel(paste0(repository,
                                 "/Satisfaction_with_democracy_in_your_country.xls"),
                          sheet = "HU", skip = 4)

New names:
• `` -> `...1`

Question 3

Add a country column to each data frame to identify the ‘country’ as either "EU", "PL" or "HU", and then assemble all three data frames into a single one.

Clue

There are multiple ways to do this, but an efficient way to achieve it is to use:

dplyr::mutate for the first step.
dplyr::bind_rows for the second step.

Solution

If you want to assemble the datasets, we will need a way to know which rows came from each sheet, so let’s add an identifier column called country.

For that, let’s do a mutate to all three datasets:

data_eu <- data_eu %>%  mutate(country = "EU")
data_pl <- data_pl %>%  mutate(country = "PL")
data_hu <- data_hu %>%  mutate(country = "HU")

Note : another way to do this is with tibble::add_column, with an extra .before optional argument so that you can see the new column add_column(data_eu, country = "EU", .before = 1)

Oce we have done that to each of the three imported sheets, all is left to do is to assemble those datasets, using dplyr::bind_rows – which has a very simple syntax bind_rows( dataset_1, dataset_2, ... ). So, in our case (watch the number of rows change):

eb <- bind_rows(data_eu, data_pl, data_hu)

The only requirement is that the column names should match across all datasets, otherwise chaos (that is, multiple columns and missing values) will ensue, but fortunately, the spreadsheet we are using does not pose any such problem.

Notice how the %>% pipes can help to take all different data import and transformation steps in just a few lines:

eb <- paste0(repository, "/Satisfaction_with_democracy_in_your_country.xls")
eb <- bind_rows(
  readxl::read_excel(eb, sheet = "EU", skip = 4) %>%
    add_column(country = "EU"),
  readxl::read_excel(eb, sheet = "PL", skip = 4) %>%
    add_column(country = "PL"),
  readxl::read_excel(eb, sheet = "HU", skip = 4) %>%
    add_column(country = "HU")
)

New names:
New names:
New names:
• `` -> `...1`

Question 4

Rename the following variables as below and keep only those variables in the dataset:

country => country (no change)
...1 => year (no change)
Total 'Not satisfied' => not_satisfied (no change)

Clue

Adapt the following code to your own in order to restrict the dataset to specific columns while renaming them at the same time:

    select(
      your_dataset,
      country,
      year = "...1",
      not_satisfied = "Total 'Not satisfied'"
      )

Solution

eb <- eb %>%
  select(
    country,
    year = "...1",
    not_satisfied = "Total 'Not satisfied'"
  )

See what happens above? We use the eb dataset we just created, and ask to keep only some columns:

the identifying country variable we created right before,
the ...1 variable, which we rename to year by the same occasion, and
the ‘not satisfied’ variable, which we also rename on-the-fly

Note the quotes, which are required to select columns when their names include spaces or other special characters, such as ‘.’. You can you double quotes, simple quotes and even backticks.

For the specific example of the name "Total 'Not satisfied'" you need to use double quotes because single quotes are already used in the name of the column.

The end of this exercise will be next week!