Appendix A — Readings
A large majority is from François’s wiki
Note on availability: almost all resources below are free to read online, in one form or another. In a few exceptional cases, I will supply the readings on Google Drive.
Required readings for each session are shown in bold below.
General
Handbooks
Early in the course, we rely on R-focused handbooks, all of which are readable online, and on a single, short introductory statistics textbook:
Gerring and Christenson, Applied Social Science Methodology (Cambridge University Press, 2017)
This handbook does not contain any R code, but it provides the simplest and most concise introduction to all statistical topics up to linear regression.
Irizarry, Introduction to Data Science (CRC Press, 2022)
This handbook is our ‘primary’ R handbook in the first course sessions.
Ismay and Kim, Statistical Inference via Data Science. A Modern Dive into R and the tidyverse (CRC Press, 2023)
This handbook is our ‘secondary’ R handbook in the first course sessions.
Healy, Data Visualization. A Practical Introduction (Princeton University Press, 2019)
This handbook covers various aspects of plotting things with R and the {ggplot2} package.
Wickham et al., R for Data Science (2nd ed., O’Reilly, 2023)
This is the ‘master’ R handbook to use throughout the course whenever needed, as a reference for both the R language and the {tidyverse} functions.
Later in the course, we continue with R handbooks, but also turn to more research-focused ones for their statistical parts:
Baumer et al., Modern Data Science with R (2nd ed., CRC Press, 2021)
A handbook that goes a bit deeper than we do, and which I will frequently recommend as an optional, more advanced read.
Hanck et al., Introduction to Econometrics with R (2023)
With R examples. Also useful for students who are training in econometrics and need chapters on e.g. panel data, instrumental variables and time series (on that latter topic, you should really check Hyndman and Athanasopoulos, Forecasting: Principles and Practice, 2021).
Imai, Quantitative Social Science (Princeton University Press, 2018)
With R examples.
Llaudet and Imai, Data Analysis for Social Science (Princeton University Press, 2022)
With R examples.
Li, Using R for Data Analysis in Social Sciences. A Research Project-Oriented Approach (Oxford University Press, 2018)
With R examples.
Rodrigues, Modern R with the tidyverse (2022)
R-focused, and a useful complement to the Irizarry handbook.
Sanchez and Marzban, All Models Are Wrong: Concepts of Statistical Learning (2020)
Theory-only, and harder than most of our other handbooks, but highly insightful and very clearly illustrated.
Videos
All sessions will come with a handful of videos to watch, if that’s something that works for you. The main sources that you will encounter are:
Bail, SICSS Boot Camp and SICSS 2020
Videos from the Summer Institute in Computational Social Science (SICSS). The ‘Boot Camp’ is a series of introductory videos on R and RStudio. The ‘SICSS 2020’ series is an entire summer school.
Rooduijn et al., Basic and Inferential Statistics
An excellent introductory statistics online course, from the University of Amsterdam.
Pew Research Center, Methodological Research
This institute has published some very accessible ‘Methods 101’ videos on specific topics.
Robinson, Tidy Tuesday R Screencasts
A series of live R-coded screencasts that show how to perform exploratory data analysis on real-world datasets, with lots of data visualization through the {ggplot2} package.
Silge, Tidy Tuesday R Screencasts
Another series of screencasts showing how to analyse real-world datasets, with many examples of how to run machine learning algorithms through the {tidymodels} package bundle.
Starmer, StatQuest
Hundreds (literally) of very understandable videos on basic statistics and machine learning.
El Khadir, Visually Explained
Short explanatory videos that show the logic behind many machine learning techniques.
Optional
Each session has a list of optional resources that will often go further than what we covered in class.
Some sessions will also recommend using R cheatsheets, although some students have reported that those can be overwhelming.
Many more helpful resources, in multiple languages, are listed on the Rzine website, and the final session contains bonus sections of where to find more R resources.
Session 1: Data Science & Software
Handbooks
The chapters below cover essentially the same topics. Read at least one of them, and explore the contents of all handbooks if you are curious about what is to come.
- Ismay and Kim, ch. 1: ‘Getting started’
- Irizarry, ch. 1: ‘Getting started with R and RStudio’
- Irizarry, ch. 2: ‘R basics’
- Rodrigues, ch. 1: ‘Getting to know RStudio’
Videos
The videos for this session do not involve any R code.
Bail: ‘Installing R and RStudio’
If needed. Also covers RStudio basics.
Rosling, ‘The best stats you’ve ever seen’ (TED, 2007)
An inspirational video that you should definitely watch to understand why some people like me love stats (and plots). The screenshot in the slides is from another, similar video (BBC Four, 2010). You can also check the Gapminder website, and the related book, Factfulness. Rest in peace, Hans.
Tierney, ‘Statistics for Journalists’ (2013)
A 30-minute video that covers the absolute basics about ‘stats and numbers’: big and little numbers, surveys (and polls), averages, uncertainty, p-values, correlation and causation, rare events, risk, and more. Should be compulsory watching for every student (journalist, policymaker, expert) around the world.
Bonus: Help with R
I predict that there will be a point where the course material will feel like it does not provide the right answer to your questions on how to use R to do specific things. When you reach that point, do the following:
Check the course material, again.
The code provided in each session covers a lot of use cases, and the readings also cover a lot of ground. Also note that you can search the entire course by keyword online on GitHub (example search).
If your question is about a specific function or package, check the R help pages.
The help pages are very technical: they have to cover all function arguments, and might be overwhelming, which is why it might be best in many cases to look for package vignettes. Go, for instance, to the {tidyverse} website, and check the ‘Get started’ page for the {dplyr} package.
Google searches will often lead you to Stack Overflow, where many users have asked R-related questions.
Using the R code from the answers that you will find there might be a double-edged sword, as it might lead you to use more base R syntax than we will do in class, or even use different R packages (after installing them). This can be more overwhelming than helpful, but R is a language, and there are many ways to express yourself in it, so feel free to go that way if it helps you find solutions to our exercises.
Bonus: R stuff in French
This course is 100% in English, but its wiki has a page with stuff in French. The Rzine website also links to resources in other languages, like Spanish.
Optional
Briatte, Going with Python (2023)
The wiki page where I provide a few pointers on how to learn data science with Python. We will not use Python at all in class, but RStudio can run Python, and it might be useful to some of you to learn some Python later on, especially if you are interested in things like Machine Learning (ML), Natural Language Processing (NLP) or Web scraping. We will come back to that in our last session.
Huntington-Klein, Library of Statistical Techniques (LOST) (c. 2023)
A website that shows you how to code various things in Python, R and Stata. Useful as a reference guide, and as an introduction to the main topics that we will cover together in this course, namely: data manipulation, data visualization and linear regression (ordinary least squares), plus a quick look at logistic regression, geospatial tools and a few more things.
McCullough and Yalta, ‘Spreadsheets in the Cloud – Not Ready Yet’ (Journal of Statistical Software, 2013)
This article explains one of the reasons why we are using statistical software for this course, instead of a spreadsheet editor. It’s not just that spreadsheets are error-prone, that their workflows are mostly irreproducible, and that they have caused a huge number of mistakes in the past: it’s also that they have a history of being computationally inaccurate. Statistical software is more versatile, but also more numerically reliable, than spreadsheet editors.
McNamara, ‘Key Attributes of a Modern Statistical Computing Tool’ (The American Statistician, 2019)
Free preprint. This article lists the qualities that are gradually being built into statistical software like R. Note the importance of the learning curve for beginners, and the importance of visualization.
Session 2: Workflow
Handbooks
Irizarry, ch. 2: ‘R basics’ (again)
This chapter covers basic R syntax, with all its oddities. Treat R as a language: do not expect to learn it in a single week!
Irizarry, ch. 4: ‘The tidyverse’ (up to Section 4.8)
This chapter introduces the main bundle of packages that we will be using throughout the course.
Irizarry, ch. 5: ‘Importing data’
This chapter anticipates our next session. The key concept to take from it at this stage is that of setting the working directory, which R needs in order to find your files (your datasets).
Rodrigues, ch. 2: ‘Objects, their classes and types, and useful R functions to get you started’
This chapter covers basic R syntax and some specific data formats, like factors and dates. You will hear again about those data formats during our next session.
Rodrigues, ch. 3: ‘Reading and writing data’
Videos
Bail: ‘R basics’
A very easy-to-follow introduction to the core mechanics of R. Think of it as the ‘greetings’ lesson that every language course starts with: ‘hello, my name is, I am x years-old, what is your name?’ and so on.
Cheatsheets
Compulsory:
- This cheatsheet will show you the many different parts of RStudio. You will not need to use even 10% of the software in class: just focus on setting the working directory, opening scripts, and executing code. Some keyboard shortcuts will be very useful for that: learn them as soon as possible.
- This cheatsheet documents the ‘base R’ syntax, which you will need to understand well enough to manipulate R objects like data frames. This course will teach you the basics. Remember to ask for explanations in class if you do not understand the syntax of a function: I need you to tell me when you need help!
Optional:
- This cheatsheet will show that R actually has three sub-syntaxes: ‘base R’ (with lots of hard[brackets] and $ signs), ‘formula’ syntax (of the y ~ x1 + x2:x3 form), and ‘tidy’ syntax, with lots of %>% pipes. We will use all three syntaxes in class, but will give priority to tidy syntax when possible.
- A cheatsheet for Stata users. Obviously recommended only if you are proficient enough in Stata to find this useful. For more advanced users: see also stata2r.github.io, which explains how to translate Stata into R using the {data.table} (for data manipulation) and {fixest} (for regression models) packages.
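To make the three sub-syntaxes mentioned above more concrete, here is a minimal sketch that uses the mtcars dataset shipped with R. The choice of dataset, variables and object names is purely illustrative (none of it comes from the cheatsheets), and the code assumes that the {dplyr} package is installed.

```r
# Illustrative sketch only: dataset, variables and object names are assumptions.
library(dplyr)

# 1. 'Base R' syntax: hard [brackets] and $ signs
heavy_base <- mtcars[mtcars$wt > 3, c("mpg", "wt")]
mean_base <- mean(heavy_base$mpg)

# 2. 'Formula' syntax: y ~ x1 + x2:x3
#    (mpg explained by weight, plus an interaction between horsepower and cylinders)
model <- lm(mpg ~ wt + hp:cyl, data = mtcars)

# 3. 'Tidy' syntax: %>% pipes chaining {dplyr} functions
mean_tidy <- mtcars %>%
  filter(wt > 3) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  pull(mean_mpg)

# The base R and tidy versions compute the same average
all.equal(mean_base, mean_tidy)
```

The first and third parts do the same thing (average fuel consumption among heavier cars), which is a good way to see that the syntaxes are different ways of expressing the same idea.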
Session 3: Data 1
Handbooks
Irizarry, ch. 4: ‘The tidyverse’ (up to Section 4.8)
This chapter was already assigned last week – if you have not yet read it, do so now! It covers the {tidyverse} package bundle and shows how to subset (filter), aggregate (group_by) and summarise (summarise) your data, using functions from the {dplyr} package.
Irizarry, ch. 22: ‘Joining tables’
Read this chapter to learn everything you need to know about a very common data operation: merging – or ‘joining’ – datasets. This will be very helpful, very soon. The chapter also uses the {dplyr} package and its ‘join’ functions, which are inspired by the SQL language.
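As a rough illustration of what the two chapters above cover, the sketch below first chains filter, group_by and summarise on the mtcars dataset shipped with R, and then joins two toy tables. The tables, their made-up values and all object names are assumptions made for the example, not material from the handbook; the code assumes that {dplyr} is installed.

```r
# Illustrative sketch only: toy data and object names are assumptions.
library(dplyr)

# Subset (filter), aggregate (group_by) and summarise (summarise)
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep cars above 100 horsepower
  group_by(cyl) %>%                          # one group per number of cylinders
  summarise(n = n(), mean_mpg = mean(mpg))   # one row per group

# Two toy tables with made-up values, sharing an 'iso' key column
countries <- data.frame(iso = c("FR", "DE", "IT"),
                        name = c("France", "Germany", "Italy"))
scores <- data.frame(iso = c("FR", "DE", "ES"),
                     score = c(1, 2, 3))

# left_join keeps all rows of the first table (Italy gets a missing score)
left <- left_join(countries, scores, by = "iso")

# inner_join keeps only the keys found in both tables
inner <- inner_join(countries, scores, by = "iso")

# full_join keeps the keys found in either table
full <- full_join(countries, scores, by = "iso")
```

Comparing the number of rows in left, inner and full is a quick way to check that a join did what you expected.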
The next handbook readings come from the R for Data Science handbook, by some of the authors of the {tidyverse} package bundle. I am listing them here primarily for future reference: know that those chapters exist if you need (or rather, when you need) help later with data management:
- Wickham et al., ch. 4: ‘Data transformation’
- Wickham and Grolemund, ch. 10: ‘Tibbles’ (from the 1st edition of the handbook)
- Wickham et al., ch. 8: ‘Data import’
- Wickham et al., ch. 21: ‘Joins’ (on joining/merging)
Furthermore, the following chapters might also be useful, for dealing with specific aspects of data management. All of them come from the ‘Transform’ section of the handbook, which has even more to offer:
- Wickham et al., ch. 6: ‘Data tidying’ (on pivoting/reshaping)
- Wickham et al., ch. 16: ‘Strings’
- Wickham et al., ch. 18: ‘Factors’
Cheatsheets
- An overview of how to read common data formats into R. Do not worry too much about this: most of the data that we will use in class will come from CSV and TSV datasets, which are easy to read with the {readr} package. We will also use Stata and possibly SPSS datasets, which can be read with the {haven} package. Last, we will very occasionally read spatial data formats with the {sf} package.
- An overview of ‘tidyverse’ functions to perform data wrangling. Do not feel overwhelmed: you will learn many of those on the fly, as we go. Just remember this cheatsheet exists if you need a quick guide to ‘how to do x to a dataset’ (or more precisely, a ‘data frame’ or a ‘tibble’) in R.
Videos
Bail: ‘Data wrangling’
This video will introduce you to some of the functions available via the {dplyr} package, which is part of the {tidyverse} package bundle. The video is simple enough to reinforce your understanding of R basics, e.g. loading packages.
Bail: ‘Data visualization’
You might have noticed that I have slipped a few plots in the course material. Usually, I build my plots with the {ggplot2} package. This video provides a good example of how to use this package to produce nice plots.
As said in the previous section, visualization will be covered more at length in our next workshop.
There are a few more videos mentioned towards the end of my slides.
The total runtime of those videos is too long for you to watch them all, but just like for the (many) readings above, you should note that they exist, that they are available for future reference, and that you will still be learning data wrangling by the end of this course.
Optional
Ismay and Kim, ch. 3: ‘Data wrangling’
A simple introduction to the main ‘verbs’ (functions) of the {dplyr} package, in very similar fashion to what we did in class.
Briatte, Quantitative Social Science Data (2023)
A Web page that lists many social science datasets, in case you need some for other research projects.
Broman and Woo, ‘Data Organization in Spreadsheets’ (The American Statistician, 2018)
Free to read online. While spreadsheet editors are not suitable for analysing data (see the McCullough and Yalta 2013 reading from Session 1), organising data within spreadsheets is a different topic. This article covers the basics of how to do so in order for the result to be ‘machine-readable’ (i.e. understandable by a computer for import).
Elff, Data Management with R (Sage, 2020)
A full book on the topic of data wrangling, with many ‘notebooks’ to illustrate how to handle survey, spatial and text data in R. Cited in the slides. I am citing this book for reference: it actually covers more than we need (for now), and you have limited time, so you might prefer sticking to the other readings.
Ellis and Leek, ‘How to Share Data for Collaboration’ (PeerJ Preprints, 2017)
Free to read online. Like the Broman and Woo 2018 reading above, part of a series of preprints on data science, and a useful read if you end up working in e.g. a research team and need to learn about good practices in data sharing.
Weidmann, Data Management for Social Scientists (Cambridge University Press, 2023)
Free to read online (open access). A book that touches all bases, from using spreadsheets to relational databases, with chapters on special (spatial, text, network) data types. Uses R, of course.
Wickham, ‘Tidy Data’ (Journal of Statistical Software, 2014)
Free to read online. A paper that explains why it makes sense to strive for ‘tidy’ data, which we will cover in class. (A related argument is that ‘tidy’ data allows for ‘split-apply-combine’ operations, but that’s less central to our goals right now.)
Session 4: Data 2
Handbooks
- Irizarry, ch. 7: ‘Introduction to data visualization’
An alternative to the chapters above is Healy’s Data Visualization handbook, which goes a bit further into how to use {ggplot2} effectively:
- Healy, ch. 3: ‘Make a plot’
Last, note that there is also an entire ‘Visualize’ section in the Wickham and Grolemund handbook:
- Wickham et al., ch. 11: ‘Layers’
Cheatsheet
- Data transformation (again)
Some cheatsheets will come in handy when you face special data formats, such as:
- Strings
- Dates
- Labelled data (used in surveys)
Videos
Bail: ‘Data wrangling’
This video was already assigned last week. It is still relevant to watch it now if you did not do so then, as it covers the basics of data wrangling.
Session 5: Visualization
Handbooks
- Irizarry, ch. 10: ‘Data visualization in practice’
- Irizarry, ch. 11: ‘Data visualization principles’
- Healy, ch. 4: ‘Show the right numbers’
- Healy, ch. 5: ‘Graph tables, add labels, make notes’
The Healy handbook even has a chapter on maps, which we only touched upon and will come back to at the end of the course:
- Healy, ch. 7: ‘Draw maps’
And the end of the ‘Visualize’ section in the Wickham and Grolemund handbook:
- Wickham et al., ch. 12: ‘Exploratory data analysis’
- Wickham et al., ch. 13: ‘Communication’
Cheatsheet
Videos
Bail: ‘Data visualization’
Robinson: ‘Analyzing the Kenya census in R’
A screencast example of how to use {ggplot2} to explore a dataset, including through maps.
Robinson: ‘Analyzing deforestation in R’
Another good screencast example of how to explore a survey, including through maps.
Optional
Ismay and Kim, ch. 2: ‘Data visualization’ (https://moderndive.com/2-viz.html)
Chang, R Graphics Cookbook (O’Reilly, 2nd ed., 2023)
Free to read online. A very good go-to reference handbook for producing common plots (bar plots, line graphs, etc.) with the {ggplot2} package, with dozens of examples.
Emaasit, {ggplot2} extensions (c. 2023)
A gallery of {ggplot2} extensions, that is, R packages that have been built on top of the package to facilitate various types of plots. See also Erik Gahner Larsen’s awesome {ggplot2} list for a nice selection of themes and color palettes to use with it.
Heiss, Data Visualization with R (2023)
An entire course that covers the core principles of graphic design, and how to apply them with {ggplot2}. I do not usually recommend courses in this section, because there is a special wiki page for that, but this course is exceptionally good, with highly informative slides, video-recorded examples, and lots of good code and other resources.
Holtz, R Graph Gallery (c. 2023)
Many different types of plots, all coded in R, mostly with the {ggplot2} package. Cited in the slides.
Munzner, Visualization Analysis and Design (CRC Press, 2014)
An excellent book on visualization ‘theory’ – the fundamentals of how data abstraction works. The website linked to above has an entire online course on the topic, plus many shorter talks, all of which are, like the book, extremely clear and well-illustrated. Tangential to the course, but highly recommended.
Tufte, The Visual Display of Quantitative Information (2nd ed., Graphics Press, 2001)
A beautiful treatise on data visualization. ‘VDQI’ is the kind of book that will make you fall in love with a topic, and that you will keep mentioning over the years on every possible occasion. Tangential to the course, but highly recommended.
Wickham, ‘A Layered Grammar of Graphics’ (Journal of Computational and Graphical Statistics, 2010)
Hadley Wickham is the main author of the {ggplot2} package, which implements the ‘grammar of graphics’ logic in R. This article explains what that grammar, which was designed by Leland Wilkinson, actually is.
Wickham et al., ggplot2: Elegant Graphics for Data Analysis (Springer, 3rd ed., 2023)
Free to read online. The book that explains the complete mechanics of {ggplot2}.
Wickham et al., ‘Visualizing Statistical Models’ (Statistical Analysis and Data Mining, 2015)
Free preprint. A paper on how to visualize various kinds of models, and why doing so is crucial. The transition from tables to graphs is still very much an ongoing process in academic research, and R is helping with that.
Wilke, Fundamentals of Data Visualization (O’Reilly, 2019)
Free to read online. Yet another great book on the principles of data visualization.
Bonus: SQL
Special section for those of you who want to understand how to use SQL, within R or on its own.
Baumer et al., ch. 15: ‘Database querying using SQL’
Baumer et al., ch. 16: ‘Database administration’
The two chapters above cover a lot of SQL basics, and how to implement them from within R. The first chapter uses its own SQL source, but the second one explains how to create database connections.
Posit/RStudio, Best Practices in Working with Databases
An absolute must-read if you are going to use databases (DBs) through RStudio, which has been extended with many features with DB users in mind.
Weidmann, Data Management for Social Scientists (Cambridge University Press, 2023)
Free to download from the publisher’s website. Explains how to use SQL, and the PostgreSQL relational database system in particular, from within R. Comes with a useful set of additional chapters on spatial, network and text data.
Wickham et al., ch. 23: ‘Databases’
From the handbook that you have already started reading a lot from. Introduces the two key packages beyond {dplyr} to work with databases in R, the {DBI} package, which contains ‘drivers’ to handle database connections, and the {dbplyr} package.
Session 6: Univariate Exploratory analysis
Handbooks
Here are a few relevant chapters from the handbooks that we have already used:
Irizarry, ch. 12: ‘Summary statistics’
This chapter covers summary statistics (mean, median, percentiles etc.) and their relationship to density curves, boxplots, and the normal distribution. It also explains methods to identify outliers.
Irizarry, ch. 14: ‘Random variables’
This chapter explains how the theoretical properties of random variables and probability distributions can be leveraged to estimate population statistics from samples. The key concept that connects them is the standard error (SE). The chapter covers the Central Limit Theorem (CLT) and the Law of Large Numbers (LLN), which explain why we strive for large sample sizes (large ‘N’) when we collect data.
Irizarry, ch. 15: ‘Statistical inference’ (up to Section 15.7)
This chapter covers estimation (going from ‘sample’ to ‘population’), confidence intervals, p-values, and two association tests: the Chi-square test, and odds ratios.
At this stage, I also want to introduce another handbook, Llaudet and Imai’s Data Analysis for Social Science, which comes with more details on the statistical side of things, and which also includes lots of R code examples. I recommend reading the following chapters:
- Llaudet and Imai, ch. 3: ‘Inferring population characteristics via survey research’
- Llaudet and Imai, ch. 6: ‘Probability’
- Llaudet and Imai, ch. 7: ‘Quantifying uncertainty’
(The chapters are on Google Drive.)
Videos
Rooduijn et al., Basic Statistics Module 4: ‘Probability distributions’
Rooduijn et al., Basic Statistics Module 5: ‘Sampling distributions’
Rooduijn et al., Basic Statistics Module 6: ‘Confidence intervals’
Use the three modules above to catch up on (or revise) your introductory statistics course – you know, the one you took as an undergraduate, but which you then quickly forgot almost everything about. I am not recommending Module 1 on variables, distributions and descriptive statistics, but it might also help if you cannot even remember e.g. what a median is.
Starmer: ‘Statistics Fundamentals’
Throughout this course, I will mostly recommend watching the Rooduijn et al. videos on various topics. However, if those do not work for you for any reason, you might be able to ‘fall back’ on this other source, which covers very similar ground. Make a note of it, as I will not systematically link to it in the next few weeks.
Optional
Ismay and Kim, Appendix A: ‘Statistical Background’
This reading is presented as a glossary. It covers basic summary statistics (mean, median, etc.), the normal distribution, and log10 transformations. We use log-transformations at several points in this course, so take a look at that part if you are not familiar with them.
Gerring and Christenson, ch. 18: ‘Univariate statistics’
Gerring and Christenson, ch. 19: ‘Probability distributions’
(On Google Drive.) A no-code introduction to descriptive statistics and distributions. Recommended if you are looking for something short: each chapter is less than 10 pages, with many graphs.
Ismay and Kim, ch. 7: ‘Sampling’
A full dive into how sampling works, and what we can derive from it. Ends on an interesting section on polls, which is a bit more developed than what you might have read on that topic in Irizarry’s handbook.
Ismay and Kim, ch. 8: ‘Bootstrapping and confidence intervals’
This chapter goes deeper into the core topics of the session. It specifically covers bootstrapping, a method that involves resampling the data in order to produce ‘bootstrapped’ confidence intervals. To see how to use those in association tests, take a look at Appendix B of the book, ‘Inference Examples’.
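For readers who want to see the resampling idea in code before opening the chapter, here is a minimal base R sketch of a ‘percentile’ bootstrap confidence interval for a mean. It is not the chapter’s own code (Ismay and Kim work with the {infer} package): the dataset (mtcars, shipped with R), the number of resamples, the seed and the object names are all assumptions made for illustration.

```r
# Illustrative sketch only: base R instead of {infer}; all names are assumptions.
set.seed(42)  # arbitrary seed, so that the resampling is reproducible

x <- mtcars$mpg  # the observed sample

# Resample the data with replacement, many times over,
# and compute the mean of each 'bootstrapped' sample
boot_means <- replicate(2000, mean(sample(x, size = length(x), replace = TRUE)))

# Percentile method: the middle 95% of the bootstrapped means
ci <- quantile(boot_means, probs = c(0.025, 0.975))

mean(x)  # the point estimate
ci       # the bootstrapped 95% confidence interval around it
```

The width of the interval reflects how much the sample mean varies from one resample to the next, which is the intuition behind the standard error.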
Session 7: Correlation
Handbooks
Gerring and Christenson, ch. 20: ‘Statistical inference’
Gerring and Christenson, ch. 21: ‘Bivariate statistics’ (before p. 322)
(On Google Drive.) You really do not need to read more on the topic than this, as we will very soon delve into something much more powerful and interesting. The important thing to take away from that part of the course is the logic behind statistical significance, which you already got last week from reading some earlier pages out of Gerring and Christenson’s handbook.
Videos
Rooduijn et al., Basic Statistics Module 2: ‘Correlation and regression’
Like our class, this module quickly jumps from linear correlation to (simple) linear regression, which we will come back to next week. Video 2.2 is the one that focuses strictly on correlation, in 7 minutes.
Starmer: ‘Pearson’s correlation, clearly explained’
A 20-minute treatment of linear correlation. Longer than the Rooduijn et al. one, but possibly easier to follow.
Optional
Bueno de Mesquita and Fowler, Thinking Clearly with Data. A Guide to Quantitative Reasoning and Analysis (Princeton University Press, 2021)
(On Google Drive.) If you have to read a single book on the topic of correlation and causation, read this one. Correlation is the topic of only three (excellent) chapters. The rest of the book covers 90% of what you might want to learn about causal inference: regression, samples, (randomised) experiments, regression discontinuity designs, and differences-in-differences. Oh, and there’s a bonus section on measurement and quantification at the end.
Hijmans, ‘Spatial autocorrelation’
Very optional reading. People very often speak of correlation when looking at either time series (serial correlation, or autocorrelation) or maps (spatial correlation). You might know the former from studying econometrics. This tutorial briefly mentions both, with a focus on the latter, and shows one way to measure it, Moran’s I. We will not cover time series at any point in this course, and although we will briefly come back to spatial analysis later on, we will probably not have the time to revisit the specific topic of spatial dependence.
Hirschman, ‘Stylized Facts in the Social Sciences’ (Sociological Science, 2016)
Free to read online. Let’s pause at this stage of the course: why are we interested in relationships between things, and what are we actually looking to extract from those relationships? Read the article, and possibly this comment on the text (and its related article), to get some answers. Trigger warning: contains causal language and some implicit philosophy of science (as do my slides for this session).
Session 8: Statistical Inference
Handbooks
Read again:
Gerring and Christenson, ch. 20: ‘Statistical inference’
Gerring and Christenson, ch. 21: ‘Bivariate statistics’ (after p. 322)
(Offline) Read those if you have never studied association tests before, or if you need a refresher on how they work. The chapters are very short, to the point, and well illustrated. Focus on understanding what statistical significance really means.
Imai, Quantitative Social Science, ch. 7: ‘Uncertainty’
(On Google Drive.) This chapter covers the same theoretical points as those above, and adds a bunch of useful examples with R code.
Videos
Rooduijn et al., Basic Statistics Module 7: ‘Significance tests’
Rooduijn et al., Inferential Statistics Module 1: ‘Comparing two groups’
Rooduijn et al., Inferential Statistics Module 2: ‘Categorical association’
Pretty much everything you need to know about association tests and statistical significance. This is a very broad topic, and the ‘Inferential Statistics’ course has modules that we will not cover at all, e.g. non-parametric tests, which are complementary to those that we are looking at in class, and analysis of variance (ANOVA), which we will ignore entirely in order to focus on regression models.
Optional
Gelman et al., Regression and Other Stories (Cambridge University Press, 2020), ch. 3 (‘Some basic methods in mathematics and probability’) and 4 (‘Statistical inference’)
A fuller treatment of the basics: weighted averages, logarithms, probability distributions, standard errors, statistical significance and hypothesis testing. The authors are rightfully hostile to a lot of the language used to describe p-values and errors: read the second chapter to understand why. The rest of the book covers regression modelling in depth, using Bayesian inference: I will recommend it again (as an optional reading) in due time.
Ismay and Kim, ch. 9: ‘Hypothesis testing’
This chapter is interesting because it shows how to use the {infer} package to perform association tests from bootstrapped statistics (see the optional readings of the previous session for an introduction to those). The chapter also contains a helpful section on interpreting hypothesis tests.
Lindeløv, Common statistical tests are linear models (2019)
This link shows you that what we are covering right now in class can in fact be completely subsumed into our later topics: linear (and nonlinear) models. This is why, next week, I will very briefly cover linear correlation, and then skip directly to simple linear regression: because correlation is just the standardized coefficient in a bivariate regression model.
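The last claim above can be checked in two lines. The notation below is my own shorthand rather than Lindeløv’s: r_{xy} is the Pearson correlation between x and y, s_x and s_y are their sample standard deviations, and \hat{\beta} is the ordinary least squares slope of the bivariate regression of y on x.

```latex
% OLS slope of y on x, rewritten in terms of the correlation coefficient
\hat{\beta} = \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
            = \frac{r_{xy} \, s_x \, s_y}{s_x^2}
            = r_{xy} \, \frac{s_y}{s_x}

% Standardizing the slope (equivalently, regressing z-scores on z-scores,
% for which s_x = s_y = 1) leaves only the correlation
\hat{\beta}^{\text{std}} = \hat{\beta} \, \frac{s_x}{s_y} = r_{xy}
```

In other words, the slope and the correlation differ only by the ratio of the two standard deviations, which disappears once both variables are standardized.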
Session 9: Linear regression
Handbooks
Note that regression models are a vast topic, and that the handbook chapters below will almost certainly not suffice for a full understanding of how they work, or of how to perform all the required operations for them, e.g. dummies, interactions, diagnostics and marginal effects. For those, dig into the optional readings of this session.
Healy, ch. 6: ‘Work with models’
This chapter shows how to use the {broom} package to manipulate model results, as we do in class, and shows how to plot their results in great detail. The end covers working with complex survey data. For more ways to plot regression coefficients in R, also take a look at the {ggstats} and {dotwhisker} packages.
Videos
El Khadir: ‘Linear Regression in 2 Minutes’
From the Visually Explained YouTube channel, which is more focused on machine learning than statistics, hence the language used in the video (e.g. ‘features’ instead of ‘variables’). Ignore the language: start with this, to get the ‘geometry’ behind linear regression.
Optional
There are a lot of optional readings for this session, but they each cover different grounds. Go through the list, and choose one or two at most, depending on your level of familiarity with linear models and interests.
Gerring and Christenson, ch. 22: ‘Regression’
Offline. A short introduction to the topic.
Imai, Quantitative Social Science, ch. 4: ‘Prediction’
Offline. A similar one-chapter treatment of linear regression, with slightly different vocabulary, and some R examples. This is a very long chapter that spans over 60 pages, but you can focus on the key parts, Sections 4.2, 4.3.2 and 4.3.3, which also cover linear correlation and interaction terms.
Caffo, Regression Models for Data Science in R (Leanpub, 2019)
Free to download. This handbook covers pretty much everything you need to know on regression models, and comes with videos and coded examples. It contains all essential equations, and two chapters on models with binary and count ‘responses’ (dependent variables), which is why I recommend it if you are already familiar enough with linear models but want to revise them and go a bit further at the same time. Otherwise, go with the Hanck et al. handbook below for a more compact treatment.
Gelman et al., Regression and Other Stories (Cambridge University Press, 2020)
Everything that we do in this course is done through frequentist inference, but this book will show you that there is another way to think statistically: go through it for an introduction to Bayesian reasoning and inference, an advanced topic on which there is also a dedicated page on the course wiki.
Hanck et al., ch. 4: ‘Linear regression with one regressor’
Hanck et al., ch. 6: ‘Regression models with multiple regressors’
Hanck et al., ch. 9: ‘Assessing studies based on multiple regression’
This handbook covers a bit more ground, and also comes with coded examples. Chapter 9 is particularly useful to understand regression diagnostics. I recommend checking the handbook if you are also taking econometrics, and/or are interested in more advanced modelling than what we manage to do in class.
Huntington-Klein, The Effect (CRC Press, 2022), ch. 13: ‘Regression’
Free to read online. An attempt to cover all of linear (and nonlinear) regression in a single chapter, complete with an extra section on standard error correction, and code for R but also Stata and Python. From an excellent book that also covers a lot of econometrics (e.g. fixed effects) and causal inference (e.g. instrumental variables, differences-in-differences and regression discontinuity designs). It goes further than we need for class, but like the Hanck et al. handbook, I recommend it for economists especially.
Kam and Franzese, Modeling and Interpreting Interactive Hypotheses in Regression Analysis (University of Michigan Press, 2007)
This book is all about interaction terms in regression models, when to use them, and how to interpret them. See esp. ch. 2 (‘Interactions in Social Science’) and 3 (‘Theory to Practice’).
Rodrigues, ch. 6: ‘Statistical models’
This chapter covers more than just linear regression: it also mentions other models (in Section 6.7), and provides a brief overview of regularization in Section 6.8, and cross-validation in Section 6.9. Read those parts if you are interested in statistical and machine learning.
Sanchez and Marzban, ch. 5: ‘Linear regression’
This reading clearly shows what the intuition behind linear regression is, and also gives you its mathematical foundations in matrix algebra and a few equations. Skip those if you find them too challenging, and focus on the geometrical insights, which are perfectly understandable on their own.
Schrodt, ‘Seven Deadly Sins of Contemporary Quantitative Political Analysis’ (Journal of Peace Research, 2014)
Free preprint. Some hard truths about how (mostly linear) regression is misused in our academic discipline. Any scientific method that gets used a lot will suffer from overuse. This paper is a detailed critique of what form that overuse takes in political science research, and in similar research fields.
Shalizi, The Truth About Linear Regression (2019)
Free to read online. Delivers exactly what it says on the tin: the truth about linear regression, in 406 detailed pages. A much longer version of the Sanchez and Marzban reading above, one might say, with much harder mathematical parts. Check chapter 13 on regression diagnostics in particular.
Silge, ‘Predict childcare costs in US counties with xgboost and early stopping’ (2023)
A screencast video that shows how to perform regression within a machine learning workflow, using XGBoost (eXtreme Gradient Boosting). Watch this if you are curious about machine learning, which is performed here with functions from the {tidymodels} package bundle.
Videos
Rooduijn et al., Inferential Statistics Module 3: ‘Simple regression’
Rooduijn et al., Inferential Statistics Module 4: ‘Multiple regression’
These modules cover simple regression, with a single predictor (independent variable), and multiple regression, with multiple ones. They cover all the basics you need to know (outside of the R code) on the topic.
Silge: ‘Resampling to understand gender in art history textbooks’
A screencast example of how to use bootstrapped statistics in the context of a linear model. Recommended if you dug deep enough into the previous links to read on that topic.
Session 10: Logistic regression
Handbooks
Hanck et al., ch. 8: ‘Nonlinear regression functions’
This chapter goes back to nonlinearity, which we introduced when we covered linear and nonlinear correlation. It shows how to carry the insights of that session over to regression models, with polynomials and logarithms, and also covers interactions.
Hanck et al., ch. 11: ‘Regression with a binary dependent variable’
Covers the full spectrum of models that you need to read about: linear probability models (and their many issues), logit and probit models (focus on probit if you are an economist, ignore it otherwise), and maximum likelihood estimation (MLE).
Videos
Hastie and Tibshirani, Statistical Learning 4.2: ‘Logistic Regression’
Hastie and Tibshirani, Statistical Learning 4.3: ‘Multivariate Logistic Regression’
Two videos from an online course on statistical learning, by two legends in the field whose book also deals with the topic of classification. The videos are all about understanding the mathematics behind logistic regression, but the authors also have a lab session with an example.
Silge: ‘Fit and predict with logistic regression for bird bath observations in Australia’
A screencast example of how to use logistic regression within a machine learning (ML) workflow, which involves resampling and cross-validation. Watch this if you are curious about ML in general.
Starmer: ‘Logistic regression’ (2020)
A 2-hour treatment of logistic regression, broken down into digestible videos, and complete with a final one on how to perform it in R. Highly recommended, especially to beginners: at least watch the introduction, which is less than 10 minutes. If you want to go even deeper into the topic, also check the video on ROC and AUC, and how to do it in R (you can skip the parts on random forests).
Optional
Li, Appendix: ‘A Brief Introduction to Analyzing Categorical Data and Finding More Data’
Offline. A 20-page case study of how to perform logistic regression in R, using data from the World Values Survey. The next pages focus on finding more data online: you can skip those.
Boehmke and Greenwell, Hands-On Machine Learning with R (CRC Press, 2020), ch. 5: ‘Logistic regression’
As the title indicates, this chapter comes from a machine learning book, which means that the vocabulary used will be a bit different from what we use in class.
Sanchez and Marzban, ch. 25: ‘Logistic regression’
This chapter explains the mathematical underpinnings of logistic regression. Read it to understand why logistic regression is a generalization of what we have covered earlier with linear regression, and to understand how the estimation of the model differs from OLS.
Vegetti, Introduction to Generalized Linear Modeling (GLM) (2017)
A short course that covers the basics that you will need for this course: logit and probit models (you can skip the parts on probit), Maximum Likelihood Estimation (MLE), and how to interpret (log-)odds, odds ratios and interaction terms. The last set of slides goes beyond our scope by also covering ordinal and multinomial logit. Comes with R lab sessions: check out Day 3 in particular.
Session 11: Spatial analysis
Handbooks
Baumer et al., ch. 17: ‘Working with geospatial data’
Baumer et al., ch. 18: ‘Geospatial computations’
Two very efficient chapters to get familiar with the basics.
Healy, ch. 7: ‘Draw maps’
This chapter was already mentioned in the session on visualization. Contains many very well-coded examples.
Pebesma and Bivand, Spatial Data Science (2022)
An online book coauthored by the author of the {sf} and {stars} packages that we used in class.
Lovelace et al., Geocomputation with R (CRC Press, 2022)
Free to read online. A fairly advanced book on spatial analysis, with useful chapters on its application in e.g. ecology and transportation.
Jung, Spatial Analysis with R (2023)
An online tutorial that covers every aspect of manipulating spatial data in R, through the {sf} package.
Moraga, Spatial Statistics for Data Science: Theory and Practice with R (2023)
Free to read online. Yet another online book that covers all of the basics, from using {sf} data to drawing maps to estimating spatial models with areal data. Includes a chapter on plotting raster and vector data with the {terra} package.
Videos
Silge: ‘Spatial resampling to understand drought in Texas’
A screencast that goes into a topic that was not mentioned in class: spatial resampling, a method that aims at controlling for spatial correlation. Check out her other videos too.
Session 12: Classification ?
Handbooks
Note that some of the ‘handbook’ readings below are not from the course handbooks, but from other sources.
Baumer et al., ch. 12: ‘Unsupervised learning’
This chapter covers almost the same topics as we did in class, and contains an interesting example that uses voting data from the Scottish Parliament, which was collected for a senior thesis.
James et al., An Introduction to Statistical Learning (Springer, 2013), ch. 10: ‘Unsupervised Learning’
(On Google Drive.) A one-chapter introduction to Principal Components Analysis (PCA) and clustering, from an excellent, seminal book that you can get for free online (2nd edition, Springer, 2021). Each chapter comes with a notebook that shows the R code and examples featured in the chapter: here’s the notebook for chapter 12, which used to be chapter 10 in the first edition (I will distribute the first edition chapter because it is a bit shorter). There is also a translation of the code with the {tidymodels} package bundle.
Waggoner, Modern Dimension Reduction (Cambridge University Press, 2021)
Free preprint. A short book on the topic: see ch. 2, which focuses on PCA, in particular. The full code and data are also available on GitHub. Cited in the slides.
Waggoner, Unsupervised Machine Learning for Clustering in Political and Social Research (Cambridge University Press, 2020)
Another book with lots of well-coded examples: see ch. 2–4 (on clustering, k-means, and hierarchical clustering) in particular. Cited in the slides.
Videos
Pew Research Center, Methods 101: What is machine learning, and how does it work? (2020)
A short and very accessible introduction to machine learning, which was mentioned a lot during class.
Savage, ‘The Importance of Class in an Age of Inequality’
This talk is mentioned in my slides because it goes through a classic graph from Bourdieu’s Distinction. The graph is an example of Multiple Correspondence Analysis (MCA), a technique that is useful to reveal latent dimensions, such as social class, in high-dimensional data like survey data on cultural consumption.
Starmer: Principal Component Analysis (PCA), Step-by-Step
A clear 20-minute explainer on what principal components are. The same channel has lots of other videos on the topic, as well as on many other techniques that are commonly used in machine learning for dimensionality reduction.
El Khadir, Visually Explained: Principal Component Analysis (PCA) (in 6 minutes)
El Khadir, Visually Explained: Support Vector Machine (SVM) in 2 minutes
Shorter treatments of two common topics in dimensionality reduction and machine learning. The rest of the channel has more to offer in the same vein.
Silge: ‘Dimensionality reduction of UN voting patterns’
A screencast example. Related blog post.
Hastie and Tibshirani, ‘In-depth introduction to machine learning in 15 hours of expert videos’
The James et al. book that I recommended earlier (An Introduction to Statistical Learning) actually comes with an entire online course, from which it emerged. Here’s an index for the videos, which might be better than the one on the original website, or the one from YouTube.
Optional
Breiman, ‘Statistical Modeling: The Two Cultures’ (Statistical Science, 2001)
Free to read online. An article that I will mention in order to explain the different ‘logics’ behind the models that we look at in class.
Conlen and Hohman, The Beginner’s Guide to Dimensionality Reduction (2018)
An excellent interactive visual explanation of the topic. Cited in the slides.
Sanchez and Marzban, ch. 4: ‘Principal Components Analysis’
Sanchez and Marzban, ch. 30: ‘Clustering’
Sanchez and Marzban, ch. 31: ‘K-Means’
Sanchez and Marzban, ch. 32: ‘Hierarchical Clustering’
More detailed chapters on the various methods covered in the James et al. reading. Those chapters are very well illustrated, not with R examples, but with actual illustrations that explain the logic followed by the different methods and algorithms covered.
Bonus
Learning more R
Interested in learning more R, either through courses or by teaching yourself? Try this:
If your university course catalogue does not have something on offer, the Summer Institute in Computational Social Science (SICSS) offers R training through summer schools. This is one efficient way to continue learning data science in the near future, especially if you are interested in scientific research (similar workshops exist in the context of scientific conferences).
There are lots of online courses that use R to teach data science or quantitative methods applied to various disciplines. For a list of examples, go to the course wiki and check the other similar courses listed there. Many of them have detailed examples and lecture notes.
Additional online data science courses are offered through commercial providers, which can deliver course completion certificates. Take a look, for instance, at Rafael Irizarry’s Professional Certificate in Data Science at edX, at the Google Data Analytics course at Coursera, and at Jeff Leek’s Advanced Data Science course, formerly offered at DataCamp, now offered at Coursera.
For very focused/specific learning, and if you are comfortable with code, head over to GitHub, which I mentioned in class, and search for example R code using specific packages. The trick is to use the language:r search term. This is a more advanced way to find real-world examples of how to use very specific functions/models.
Keeping up with R
Interested in following R news outside of a classroom environment? Try the following:
- The R Weekly and R Views blogs offer an easy way to hear about new R packages and tutorials.
- R users hold annual conferences: see rstudio::conf, now posit::conf, and regional useR meetings.
- Twitter has a popular #rstats hashtag to identify R content.
Various things
Just a bunch of extra stuff:
Cohen, ‘What is life expectancy’ (2021)
We looked at life expectancy several times in class, but you might still be confused about what it really is – most people are. The very short answer is that life expectancy is a model, built out of something called cohorts and life tables. Watch this video, which lasts less than 7 minutes, for a perfect explanation.
Hout, ‘America’s Liberal Social Climate and Trends: Change in 283 General Social Survey Variables between and within US Birth Cohorts, 1972–2018’ (Public Opinion Quarterly, 2021)
Free to read online. I mention Age-Period-Cohort (APC) effects in almost all of my courses. Read this paper to understand why they are so important. It focuses on the US, but APC effects are just as important in France and elsewhere.
Zimmer and Collins, ‘What Do Vaccine Efficacy Numbers Actually Mean?’ (New York Times, 2021)
Will this course help you to better understand the news? Yes it will. Here’s an example. (Link found thanks to Lucy d’Agostino McGowan, on Twitter.)
Keeping up to date
Various authors, CRAN Task Views
Lists of carefully selected R packages to perform lots of tasks. Check the ones on e.g. econometrics, spatial analysis, and Web technologies (the latter being especially useful for e.g. advanced Web scraping).
Bryan et al., Happy Git with R (n.d.)
Pretty much all you need to know to use Git (and GitHub) from R (and RStudio), in an easy-to-read online tutorial.
RStudio/Posit, R Markdown
RStudio/Posit, Quarto
Documentation pages for the two main technologies that will allow you to produce reports, slides and other documents with a mix of text, R code, figures and tables.
Briatte, Going with Python
The wiki page where I provide a few pointers on how to learn data science with Python. Already recommended in Session 1.
Briatte, Going Bayesian
The wiki page where I provide a few pointers on how to learn Bayesian data analysis, using R. I might have mentioned Bayesian reasoning and Bayesian models at a few points during the course. There was no time for us to cover it, but it might (just might) serve you well to learn about it if you go deeper into statistical analysis in the future.
Going further with R
Baumer et al., Modern Data Science with R (2nd ed., CRC Press, 2021)
Free to read online. The book covers the same topics as the course (and was assigned to some of its sessions), but it pushes a bit deeper into each of them, with some bonus chapters on SQL databases, geospatial models and network data. If I had to recommend a single book to read in full after taking the course, it would be that one.
Boehmke and Greenwell, Hands-On Machine Learning with R (CRC Press, 2020)
Free to read online. A book that covers many of the topics that we covered in class (Sessions 8, 9 and 11 especially), plus many more, all from a machine learning perspective. Very much recommended if you are curious about ML methods and algorithms.
Kuhn and Silge, Tidy Modeling with R (O’Reilly, 2023)
Free to read online. A book that introduces the {tidymodels} package bundle, which allows you to fit many statistical and machine learning models (as also shown through Julia Silge’s excellent videos).
On text analysis
Silge: ‘Topic modeling for Spice Girls lyrics’
A fun and accessible screencast example of topic modelling.
Ornstein, Text as Data (2022)
An online course/book companion for Grimmer et al.’s Text as Data (Princeton University Press, 2021).
Silge and Robinson, Tidy Text Mining (O’Reilly, 2022)
Free to read online. The book that we kind of followed in class, except for the final chapter on topic models.
Hvitfeldt and Silge, Supervised Machine Learning for Text Analysis in R (CRC Press, 2022)
Free to read online. A machine learning approach to text mining, with an entire section on ‘deep learning’ through neural networks. Check it out if you are curious about large language models like ChatGPT, for instance: this is pretty much how they work. (On that topic, see also this explainer, and this example of how to build such a model. Both posts use Python.)
Bail: Text Analysis Basics
Bail: Dictionary-Based Text Analysis (covers TF-IDF)
Bail: Text Networks
Bail: Topic Models
A short course offered by the Summer Institute in Computational Social Science (SICSS), from which we got the ‘R Boot Camp’ videos cited earlier in the course.
On surveys
Fugard, Using R for Social Research (2022), ch. 9: ‘Complex surveys’
A tutorial that got turned into an online book. The rest of the chapters are also very good, but this one is remarkably clear on how to use survey weights properly.
Vegetti, Introduction to Survey Statistics (2018)
A short course that covers survey methods, survey weights and measurement, in 3 sets of compact slides. Very much recommended as a ‘Survey 101’ crash course. Comes with R lab sessions: check out Day 2 in particular.
Zimmer et al., Tidy Survey Book (forthcoming)
An online handbook on survey analysis with the {survey} and {srvyr} packages. Not yet finished, but already very helpful in its current form.
Open University, ‘Opinion polls in a nutshell’ (2015)
The shortest introduction to the topic that you will be able to watch: the total runtime is 10 minutes at most. Focuses on opinion polls and forecasting.
Pew Research Center, ‘Survey basics’ (2018-2019)
- ‘Can we still trust polls?’
- ‘Methods 101: How is polling done around the world?’
- ‘Methods 101: Random sampling’
- ‘Methods 101: What are nonprobability surveys?’
- ‘Methods 101: Survey question wording’
- ‘Phone vs. online surveys: Why do respondents’ answers sometimes differ by mode?’
A more detailed series of excellent, very understandable videos, with many more details on survey methodology. From an institute that has extensive experience with running worldwide cross-country surveys.
On Web scraping
If you hear or read about Web scraping during this class and are interested in learning more on how to get data from the Web into R, start with the following:
Bail: APIs
Bail: Web scraping
Two short videos that introduce their respective topics very well, in case you just want a very quick overview of what lies ahead.
McCrain, RSelenium Tutorial (2020)
An online tutorial on how to use the {RSelenium} package to scrape complex Web pages, where the user needs to click on elements of the Web page to access some of its content. This aspect of Web scraping relies on ‘headless browsing’ (see the description of the resource below).
Pittard, Web Scraping with R (online book, 2022)
An online book that covers the essentials, including working with APIs. Very well-illustrated, but fairly limited on a key topic, ‘headless browsing’ with packages like {RSelenium}, which is a way to use a Web browser programmatically in order to render JavaScript and get data from complex Web pages (see this example for a demo, as well as the McCrain tutorial above).