6  Univariate Exploratory Analysis

Session 6

Author

François Briatte
(small modifications by Kim Antunez & ChatGPT)

Session date

March 12, 2024

By the end of this session, you will have learned the process of exploratory analysis and why it matters for understanding data.

In a nutshell

  • How to summarize numerical variables using measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).
  • How to visualize distributions through histograms and density curves.
  • How to summarize categorical variables, including dummy variables, using tables and percentages.
  • Techniques to transform and manipulate variables, including creating factor variables.

6.1 Exploratory Analysis

6.1.1 Definition

Exploratory analysis is the process of summarizing multiple values of one or more variables using a set of concise summary statistics. It provides an initial understanding of the data, aiding in decision-making, hypothesis generation, and identifying potential outliers or anomalies.

6.1.2 Purpose

It aims to uncover patterns, insights, and key characteristics in the data, helping to understand the underlying structure and relationships:

  • Descriptive Statistics (Numbers): Measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).
  • Distributions (Plots): Histograms and density curves.

6.2 Distributions

This section introduces the concept of distributions in statistics. A distribution refers to the way data values are spread out or organized. It’s a fundamental concept for understanding the characteristics of data and making informed decisions based on it.

6.2.1 Descriptive statistics (numbers)

Descriptive statistics are numerical measures that provide insight into the central tendency and variability of a dataset.

  1. Measures of Central Tendency:
  • Mean: The arithmetic average of all data points in a dataset.
  • Median: The middle value in a sorted dataset; it separates the data into two equal halves. Other quantiles work the same way: quartiles, for example, split the sorted data into four equal parts.
  • Mode: The most frequently occurring value in a dataset.
  2. Measures of Dispersion:
  • Range: The difference between the maximum and minimum values in a dataset.
  • Variance: The average squared deviation of the values from the mean.
  • Standard Deviation: The square root of the variance; it measures spread in the same units as the data.
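
The measures above can be computed directly in base R. The following is a minimal sketch on a small made-up vector (the values are illustrative, not course data). Base R has no built-in function for the statistical mode (`mode()` returns the storage type of an object), so the helper `stat_mode()` is a hypothetical name introduced here.

```r
# Small illustrative vector (made-up values)
x <- c(2, 4, 4, 5, 7, 9, 11)

# Hypothetical helper: most frequent value(s) of a numeric vector
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

# Central tendency
mean(x)       # arithmetic average: 6
median(x)     # middle value of the sorted data: 5
stat_mode(x)  # most frequent value: 4

# Dispersion
diff(range(x))  # range as a single number (max - min): 9
var(x)          # sample variance (divides by n - 1): 10
sd(x)           # standard deviation, equal to sqrt(var(x))
```

Note that `var()` and `sd()` use the sample formulas with n - 1 in the denominator, which is the usual choice when the data are a sample rather than a full population.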

6.2.2 Distributions (plots)

Distributions can be visualized with many kinds of plots. Two of the most common are histograms and density curves.

Source: https://flowingdata.com


  1. Histograms: A graphical representation of the frequency distribution of data. It divides the data into intervals (bins) and displays the number of data points in each bin. Histograms help understand the shape and spread of data.

  2. Density Curves: A smoothed representation of the distribution of data. For continuous data, a density curve estimates the probability density function, so the area under the curve equals 1. Density curves are often used to approximate the shape of a distribution.

Source: https://flowingdata.com
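
Both plot types are available in base R. The following is a minimal sketch using the `Temp` column of the built-in `airquality` dataset; the number of bins (`breaks = 10`) and the labels are arbitrary choices made for illustration.

```r
# Daily temperatures from the built-in airquality dataset
temp <- airquality$Temp

# Histogram on the density scale, so that the bar areas sum to 1
hist(temp, breaks = 10, freq = FALSE,
     main = "Distribution of Temp", xlab = "Temperature (degrees F)")

# Kernel density estimate drawn on top of the histogram
lines(density(temp), lwd = 2)

# The same histogram can be computed without drawing it
h <- hist(temp, breaks = 10, plot = FALSE)
h$breaks  # bin boundaries
h$counts  # number of observations per bin
```

Setting `freq = FALSE` puts the histogram on the same vertical scale as the density curve, which is what makes the overlay meaningful. `breaks` is only a suggestion to `hist()`, so the actual number of bins may differ slightly.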


6.3 Before Exercise 1

6.3.1 Quantitative / Continuous variables

Quantiles:

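x <- rnorm(100)  # 100 random draws from a standard normal distribution
x[1:10]
 [1]  0.36561408  0.40024337 -1.13486749  1.08600194 -0.02618361 -0.15728523
 [7] -0.32553286 -0.52802395  1.00480576  0.13271770
# quantile(x) returns the extremes and quartiles by default;
# probs selects specific quantiles (here, the three quartiles)
quantile(x, probs = c(0.25, 0.5, 0.75))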
        25%         50%         75% 
-0.77212753  0.01283203  0.85969475 
median(x)
[1] 0.01283203

Measures of dispersion:

range(x)  # returns the minimum and maximum; diff(range(x)) gives their difference
[1] -2.089173  2.540854
var(x)
[1] 1.067652
sd(x)
[1] 1.033272
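
Because `rnorm()` draws random numbers, the values shown above will differ on every run. The following is a minimal sketch of how to make the draw reproducible and obtain several of these summaries at once; the seed value 42 is arbitrary.

```r
set.seed(42)     # arbitrary seed: fixes the random draw
x <- rnorm(100)

quantile(x)      # default: minimum, quartiles, maximum
summary(x)       # the same five numbers plus the mean
IQR(x)           # interquartile range: third quartile minus first quartile

# The standard deviation is the square root of the variance
all.equal(sd(x), sqrt(var(x)))
```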

6.3.2 Qualitative / Categorical variables

Factors:

x <- c("Animal 2", "Animal 1", "Animal 2", "Animal 3")
as.factor(x)
[1] Animal 2 Animal 1 Animal 2 Animal 3
Levels: Animal 1 Animal 2 Animal 3
xf <- factor(x,
             levels = paste0("Animal ", 1:3),
             labels = c("Cat", "Dog", "Mouse"))
xf
[1] Dog   Cat   Dog   Mouse
Levels: Cat Dog Mouse
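
Once a variable is a factor, it can be summarized with counts and percentages. The following is a minimal sketch that rebuilds the same four animals directly from their labels; the dummy variable name `is_dog` is a hypothetical name chosen for illustration.

```r
xf <- factor(c("Dog", "Cat", "Dog", "Mouse"),
             levels = c("Cat", "Dog", "Mouse"))

counts <- table(xf)                 # frequency of each level
counts
round(100 * prop.table(counts), 1)  # percentages: Cat 25, Dog 50, Mouse 25

# Dummy (0/1) variable: 1 if the animal is a dog, 0 otherwise
is_dog <- as.integer(xf == "Dog")
is_dog
mean(is_dog)  # the mean of a dummy is the share of 1s: 0.5
```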

Cross-tabulation (1/2)

head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
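# Cut Temp at its quartiles, then cross-tabulate with Month.
# The first interval, (56,72], is open on the left, so the single
# observation with the minimum Temp (56) is dropped: 152 of 153 rows remain.
tb <- with(airquality, table(cut(Temp, quantile(Temp)), Month))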
tb
         Month
           5  6  7  8  9
  (56,72] 24  3  0  1 10
  (72,79]  5 15  2  9 10
  (79,85]  1  7 19  7  5
  (85,97]  0  5 10 14  5
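
The counts in this table add up to 152, although `airquality` has 153 rows: by default `cut()` builds intervals that are open on the left, so the observation equal to the minimum falls outside the first interval. The following is a minimal sketch of how `include.lowest = TRUE` keeps it; the object names `tb_open` and `tb_closed` are introduced here for illustration.

```r
breaks <- quantile(airquality$Temp)

# Default: intervals are open on the left, so min(Temp) becomes NA
tb_open <- table(cut(airquality$Temp, breaks), airquality$Month)

# include.lowest = TRUE closes the first interval on the left
tb_closed <- table(cut(airquality$Temp, breaks, include.lowest = TRUE),
                   airquality$Month)

sum(tb_open)      # observations kept by the default call
sum(tb_closed)    # all observations
nrow(airquality)
```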

Cross-tabulation (2/2)

prop.table(tb, 1)  # row proportions: each row sums to 1
         Month
                   5          6          7          8          9
  (56,72] 0.63157895 0.07894737 0.00000000 0.02631579 0.26315789
  (72,79] 0.12195122 0.36585366 0.04878049 0.21951220 0.24390244
  (79,85] 0.02564103 0.17948718 0.48717949 0.17948718 0.12820513
  (85,97] 0.00000000 0.14705882 0.29411765 0.41176471 0.14705882
prop.table(tb, 2)  # column proportions: each column sums to 1
         Month
                   5          6          7          8          9
  (56,72] 0.80000000 0.10000000 0.00000000 0.03225806 0.33333333
  (72,79] 0.16666667 0.50000000 0.06451613 0.29032258 0.33333333
  (79,85] 0.03333333 0.23333333 0.61290323 0.22580645 0.16666667
  (85,97] 0.00000000 0.16666667 0.32258065 0.45161290 0.16666667
prop.table(tb)  # overall proportions: the whole table sums to 1
         Month
                    5           6           7           8           9
  (56,72] 0.157894737 0.019736842 0.000000000 0.006578947 0.065789474
  (72,79] 0.032894737 0.098684211 0.013157895 0.059210526 0.065789474
  (79,85] 0.006578947 0.046052632 0.125000000 0.046052632 0.032894737
  (85,97] 0.000000000 0.032894737 0.065789474 0.092105263 0.032894737

Percentages with group_by and mutate are also possible!
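
The following is a minimal sketch of that approach, assuming the `dplyr` package is installed. The dummy variable `hot`, the column name `pct`, and the 79-degree threshold (the median of `Temp`) are illustrative choices, not part of the exercise.

```r
library(dplyr)

pct_by_month <- airquality %>%
  mutate(hot = Temp > 79) %>%          # illustrative dummy: above 79 degrees
  count(Month, hot) %>%                # one row per Month and hot combination
  group_by(Month) %>%
  mutate(pct = 100 * n / sum(n)) %>%   # share within each month
  ungroup()

pct_by_month
```

Grouping by `Month` before `mutate()` makes `sum(n)` a within-month total, so the percentages add up to 100 for each month, much like the column proportions returned by `prop.table(tb, 2)`.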

Homework for next week

  • Finish Exercise 1 (Colonialism / democracy & life expectancy)

  • One application exercise about data visualization (in groups, ungraded)

    • Use the hints
    • Search for help online (e.g. Stack Overflow, in preference to ChatGPT)
    • Be persistent (you will need it) and do your best!
  • Handbooks, videos, cheatsheets

    • 3 chapters of Irizarry’s handbook