4 Data 2

Session 4

Author

François Briatte
(small modifs by Kim Antunez & ChatGPT)

session date

February 27, 2024

By the end of this session, you have learned more on dplyr tools.

In a nutshell

Utilizing more dplyr functions for data manipulation

merging datasets

wide and long formats

string detection

…

[Bonus] Understanding databases, including dbplyr for connections.

4.1 Continue to wrangle

4.1.1 Exercises

Exercise 1

The article “Measuring Public Support for European Integration across Time and Countries: The ‘European Mood’ Indicator” by Isabelle Guinaudeau and Tinette Schnatterer addresses the challenge of measuring public support for the European Union (EU), in the context of increasing contestation of European integration. They emphasize the need for a consistent and comparable instrument to measure citizens’ preferences and to analyze the dynamic relationship between public opinion, party politics, and policy making.

Exercise 2

Source: https://search.gesis.org/research_data/ZA7581

In this article, Hermann Schmitt, Sara B. Hobolt, and Wouter van der Brug describe the results of a post-election study, conducted in all 28 EU member states after the elections to the European Parliament were held between 23 and 26 May 2019. The main objective of the 2019 EES Voter study is to study electoral participation and voting behaviour in European Parliament elections.

In particular, question 10 ask how probable is it that people will ever vote for different parties in their country.

Source: https://www.europeanelectionstudies.net

4.1.2 More on Tidying data

Reshaping (pivoting)

Reshaping involves restructuring data to transform it from a wide format to a long format or vice versa.

Splitting (separating)

Splitting involves breaking a dataset into smaller groups based on a specified variable, allowing for separate analysis and manipulation of each group’s data.

Merging (joining)

Merging involves combining multiple datasets based on shared variables to create a single unified dataset:

Left Join: Includes all records from the left dataset and matching records from the right dataset.
Right Join: Includes all records from the right dataset and matching records from the left dataset.
Inner Join: Includes only matching records from both datasets.
Full Join: Includes all records from both datasets, filling in with NA where there are no matches.

4.1.3 More Cheatsheets

Source : https://raw.githubusercontent.com/rstudio

Exam 1

Instructions here
- in group & graded
- real-world dataset (road traffic accidents in France)
- data cleaning and data analysis tasks
- Submit a script called exam1_solution.R (Subject : “DSR Exam 1 Submission”, To : kim.antunez@sciencespo.fr)
- Deadline : mar. 12th, before 7:15 pm.
- Make sure your script is well-organized

Homework for next week

Exercise 2 from today (questions 1 to 7)
1 preparation exercise (in group & ungraded, for now)
- Use the hints
- Search for help online (e.g. StackOverflow, more than ChatGPT)
- be persistent (you will need it) and do your best!
Handbooks, videos, cheatsheets
- 3 chapters (1 Irizarry, 1 Healy, 1 Wickham et al)
- 3 compulsory Cheatsheets

4.2 Bonus : databases

Databases serve as organized repositories of data structured into tables with defined relationships:

They employ common ‘id’ columns to establish connections between different tables
Optimised for speed with large data tables/streams
Requires learning a DB query language

Two main approaches are available for managing data within databases:

Standard, SQL : predominantly row-oriented and comes in various versions to suit different database systems.
Alternatively, MonetDB : employs a column-oriented structure

The dbplyr package acts as a connector that extends the capabilities of dplyr. It relies on ODBC/DBI drivers, which enable R to connect and communicate with a diverse array of database management systems:

MySQL
PostgreSQL
SQLite
Microsoft SQL Server
Oracle

There are also specialized drivers available for MonetDB such as:

MonetDB.R
MonetDBLite

But also :

bigrquery R package for Google BigQuery, a cloud-based data warehousing solution.
…

You can refer to Chapter 22 (ʻDatabasesʼ) in the book “R for Data Science” by Grolemund and Wickham or here.