library(tidyverse) # {dplyr}, {ggplot2}, {readxl}, {stringr}, {tidyr}, etc.
Preparation: Code and data essentials [1/2]
For Session 4
This exercise focuses on
- using RStudio to set up a project (code and data)
- code essentials: creating a script, loading required packages
- data essentials: creating data frames from external data sources
- data cleaning and preparation
This is not an easy exercise: work with your group, and get ready to spend a couple of hours on it.
Prerequisites
Prior to completing the following exercise, open RStudio and set the folder of your choice as the working directory.
Also check that you have installed the packages used in previous classes, as they will probably come in handy.
Do not forget to load them at the beginning of your script if you use their functions.
Do not forget to comment your code appropriately.
Scenario
You are interning at the Brussels office of Human Rights Watch, and have been asked to verify whether some form of disaffection with democracy shows up in Eurobarometer survey data for Poland and Hungary.
Instructions
Note – the hints below cite some R help pages, but you can also browse those pages online, by going, for instance, to the dplyr
website. You should also feel free to find solutions online: use your best search skills.
A copy of the dataset can be downloaded here.
This is the path to the spreadsheet.
<- "data" repository
Let’s now open the ‘EU’ sheet, skipping the rows we do not need at the top.
<- readxl::read_excel(paste0(repository,
data_eu "/Satisfaction_with_democracy_in_your_country.xls"),
sheet = "EU", skip = 4)
New names:
• `` -> `...1`
We are going to assemble this dataset with two other similar datasets that come from the same spreadsheet, just from different sheets within it.
<- readxl::read_excel(paste0(repository,
data_pl "/Satisfaction_with_democracy_in_your_country.xls"),
sheet = "PL", skip = 4)
New names:
• `` -> `...1`
<- readxl::read_excel(paste0(repository,
data_hu "/Satisfaction_with_democracy_in_your_country.xls"),
sheet = "HU", skip = 4)
New names:
• `` -> `...1`
There are multiple ways to do this, but an efficient way to achieve it is to use:
dplyr::mutate
for the first step.dplyr::bind_rows
for the second step.
If you want to assemble the datasets, we will need a way to know which rows came from each sheet, so let’s add an identifier column called country
.
For that, let’s do a mutate
to all three datasets:
<- data_eu %>% mutate(country = "EU")
data_eu <- data_pl %>% mutate(country = "PL")
data_pl <- data_hu %>% mutate(country = "HU") data_hu
Note : another way to do this is with tibble::add_column
, with an extra .before
optional argument so that you can see the new column add_column(data_eu, country = "EU", .before = 1)
Oce we have done that to each of the three imported sheets, all is left to do is to assemble those datasets, using dplyr::bind_rows
– which has a very simple syntax bind_rows( dataset_1, dataset_2, ... )
. So, in our case (watch the number of rows change):
<- bind_rows(data_eu, data_pl, data_hu) eb
The only requirement is that the column names should match across all datasets, otherwise chaos (that is, multiple columns and missing values) will ensue, but fortunately, the spreadsheet we are using does not pose any such problem.
Notice how the %>%
pipes can help to take all different data import and transformation steps in just a few lines:
<- paste0(repository, "/Satisfaction_with_democracy_in_your_country.xls")
eb <- bind_rows(
eb ::read_excel(eb, sheet = "EU", skip = 4) %>%
readxladd_column(country = "EU"),
::read_excel(eb, sheet = "PL", skip = 4) %>%
readxladd_column(country = "PL"),
::read_excel(eb, sheet = "HU", skip = 4) %>%
readxladd_column(country = "HU")
)
New names:
New names:
New names:
• `` -> `...1`
Adapt the following code to your own in order to restrict the dataset to specific columns while renaming them at the same time:
select(
your_dataset,
country,
year = "...1",
not_satisfied = "Total 'Not satisfied'"
)
<- eb %>%
eb select(
country,year = "...1",
not_satisfied = "Total 'Not satisfied'"
)
See what happens above? We use the eb
dataset we just created, and ask to keep only some columns:
- the identifying
country
variable we created right before, - the
...1
variable, which we rename toyear
by the same occasion, and - the ‘not satisfied’ variable, which we also rename on-the-fly
Note the quotes, which are required to select columns when their names include spaces or other special characters, such as ‘.’. You can you double quotes, simple quotes and even backticks.
For the specific example of the name "Total 'Not satisfied'"
you need to use double quotes because single quotes are already used in the name of the column.
The end of this exercise will be next week!