8  Statistical inference

Session 8

Author

François Briatte
(small modifications by Kim Antunez & ChatGPT)

Session date

March 26, 2024

By the end of this session, you will have learned the foundational concepts of significance testing by understanding distributions, confidence intervals, and test statistics.

In a nutshell

  • The importance of normality in statistical analysis.
  • Significance tests: the chi-squared test and the t-test.
  • Notions of p-values, the null hypothesis (\(H_0\)), and confidence intervals, with their mathematical foundations (Law of Large Numbers, Central Limit Theorem).

8.1 Normal distribution

8.1.1 The (standard) normal distribution

Also known as the Gaussian distribution, the normal distribution is a symmetric bell-shaped distribution that is commonly observed in various natural phenomena.

Characteristics:

  • Unimodal: a single peak.
  • Skewness = 0: symmetric around the center, so Mean = Median = Mode.
  • Kurtosis ~ 3: moderate tail weight (kurtosis describes how concentrated values are around the mean and how heavy the tails are).
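
To see these characteristics in practice, here is a minimal base-R sketch (the sample size and seed are arbitrary choices) that simulates standard normal data and computes skewness and kurtosis by hand:

set.seed(42)
x <- rnorm(10000)                 # draws from the standard normal
mean(x); median(x)                # both close to 0: mean = median

z <- (x - mean(x)) / sd(x)        # standardized values
mean(z^3)                         # skewness: close to 0 (symmetric)
mean(z^4)                         # kurtosis: close to 3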

8.1.2 Normality and Inference

If you are a statistician, you will often ask yourself how well a dataset’s distribution resembles the normal distribution because many statistical methods assume normality.

Indeed, many common inference procedures rely on (at least approximate) normality to produce valid generalizations from a sample to a population.

Several techniques can help assess normality (see the sketch after this list):

  • visual inspection (e.g., histograms),
  • normal probability (Q-Q) plots,
  • statistical tests (e.g., the Shapiro-Wilk test).
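
As an illustration, here is a minimal base-R sketch that applies all three techniques to simulated data (the variable and sample size are arbitrary):

set.seed(1)
x <- rnorm(100)          # simulated data whose normality we assess

hist(x)                  # visual inspection: roughly bell-shaped?
qqnorm(x); qqline(x)     # normal probability (Q-Q) plot: points near the line?
shapiro.test(x)          # Shapiro-Wilk test: H0 = the data are normal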

8.2 About Inference

8.2.1 Comparison: Exploring Differences and Similarities

Comparison involves examining the differences or similarities between two (or more) distinct groups or categories.

  • of Means: Analyzing the differences in the average value of a variable across distinct groups (t-test).
  • of Proportions: Analyzing the differences in proportions or percentages between different groups; useful with categorical data, e.g. when examining how certain attributes are distributed across categories (chi-squared test).

8.2.2 Ingredient 1: Estimation

Utilizing techniques to infer characteristics of a population from the information available in a sample.

Distribution

Illustrates the way data values are spread out or organized:

  • normal distribution (bell-shaped curve),
  • binomial distribution (often associated with binary outcomes such as counting successes in a fixed number of trials)

Each type of distribution describes how data behaves under specific conditions.

Imagine you have a set of data, like the test scores of your classmates. A distribution is a way to describe how these scores are spread out or organized. It tells you the different values that the scores can take and how often each value appears.

Example: If you have 30 test scores in your class, a distribution might show that 5 students scored 90, 10 students scored 80, 8 students scored 70, and so on.
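
In R, a quick way to see such a distribution is a frequency table. A minimal sketch with hypothetical scores (the exact values are made up to match the example):

# Hypothetical test scores for a class of 30 students
scores <- c(rep(90, 5), rep(80, 10), rep(70, 8), rep(60, 4), rep(50, 3))
table(scores)               # how often each value appears
prop.table(table(scores))   # the same distribution, as proportions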


Mathematical Foundations

These mathematical results validate the reliability of estimation techniques:

  • Law of Large Numbers: This principle states that as the size of a sample increases, the estimate of the population parameter becomes more accurate.

  • Central Limit Theorem: The distribution of sample means approaches a normal distribution as the sample size increases, irrespective of the shape of the population distribution.

This theorem is a powerful tool that enables statisticians to make inferences about population parameters even when the population distribution is unknown or non-normal.
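
Both results can be checked by simulation. Here is a minimal sketch using a skewed (exponential) population, whose true mean is 1, so the CLT has something to fix (the sample sizes are arbitrary):

set.seed(123)

# Law of Large Numbers: the sample mean approaches the true mean (1)
mean(rexp(100))      # rough estimate
mean(rexp(100000))   # much closer to 1

# Central Limit Theorem: the means of many samples look normal,
# even though the exponential population is strongly skewed
sample_means <- replicate(5000, mean(rexp(30)))
hist(sample_means)   # approximately bell-shaped
qqnorm(sample_means); qqline(sample_means)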

8.2.3 Ingredient 2: Test Statistics

A test statistic measures the evidence for the presence of a relationship between variables.

t-test (parametric)

It determines whether the observed difference in the mean of a variable between two groups is statistically significant. It relies on the Student’s t-distribution.

Chi-squared test (nonparametric)

It is used to analyze the association / independence between categorical variables. It compares observed frequencies to expected frequencies in a contingency table. It relies on the Chi-squared distribution.


Null Hypothesis (\(H_0\))

It is a statement that suggests, for example, that there is no significant relationship between the variables being studied (default assumption).

p-values

A measure of compatibility with the null hypothesis; values lower than a threshold (e.g., p < 0.05) lead to the rejection of the null hypothesis.

Confidence Intervals

Provides a range of values that you use to estimate a population parameter (like the average test score of all students) based on a sample of data.

A confidence interval helps you estimate something about a larger group based on what you’ve seen in a smaller group, and it gives you a sense of how confident you can be in that estimate.

Example: Suppose you have a sample of the test scores for 30 students, and you want to estimate the average test score for all students in your school. You calculate a confidence interval, which might say, “We are 95% confident that the average test score for all students in the school falls between 10/20 and 11/20.”

If you were to take many different random samples and calculate confidence intervals from them, about 95% of those intervals would contain the true population parameter.
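
This coverage interpretation can itself be checked by simulation. A minimal sketch, assuming a hypothetical population of scores with true average 10.5/20 (all values are arbitrary):

set.seed(2024)
true_mean <- 10.5   # hypothetical population average, out of 20

covers <- replicate(10000, {
  s  <- rnorm(30, mean = true_mean, sd = 2)   # one random sample of 30 scores
  ci <- t.test(s)$conf.int                    # its 95% confidence interval
  ci[1] <= true_mean && true_mean <= ci[2]    # does it contain the truth?
})
mean(covers)   # close to 0.95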


There are many significance tests

https://statsandr.com/

The choice of which test to use depends on factors such as the nature of the data, the number of groups being compared, and the assumptions underlying each test. It’s essential to understand the characteristics of your data and the requirements of each test to select the most appropriate one for your analysis.

8.3 Before Exercise 1

8.3.1 Chi-squared Test for Independence

chisq.test performs chi-squared contingency table tests.

(\(H_0\)) Independence of a pair of variables: knowing an individual’s value on one variable tells you nothing about their value on the other.

M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party = c("Democrat","Independent", "Republican"))
M
      party
gender Democrat Independent Republican
     F      762         327        468
     M      484         239        477
Xsq <- chisq.test(M)   # run the test and store the results
Xsq$p.value < 0.05     # TRUE: reject H0 (independence) at the 5% level
[1] TRUE

Xsq$observed   # observed counts (same as M)
      party
gender Democrat Independent Republican
     F      762         327        468
     M      484         239        477
Xsq$expected   # expected counts under the null
      party
gender Democrat Independent Republican
     F 703.6714    319.6453   533.6834
     M 542.3286    246.3547   411.3166
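
Each expected count comes from the table margins: (row total × column total) / grand total. As a sanity check, this minimal sketch recomputes them by hand and compares the result to Xsq$expected:

E <- outer(rowSums(M), colSums(M)) / sum(M)            # margin products / total
all.equal(E, Xsq$expected, check.attributes = FALSE)   # TRUE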

8.3.2 Student’s t-Test

Two-sample t-test

(\(H_0\)) No difference in means between group 1 and group 2

head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
t.test(extra ~ group, data = sleep)

    Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 
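
As with the chi-squared test above, the fitted test object can be stored and queried programmatically. A minimal sketch:

tt <- t.test(extra ~ group, data = sleep)   # store the test results
tt$p.value < 0.05    # FALSE: we fail to reject H0 at the 5% level
tt$conf.int          # the 95% CI includes 0, consistent with that decision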

Homework for next week

  • Finish Exercise 1 (Colonialism / democracy & life expectancy)

  • 1 preparation exercise

    • Search for help online (e.g. on StackOverflow, more than on ChatGPT)
    • Be persistent (you will need it) and do your best!
  • Handbooks, videos, cheatsheets

    • 3 chapters of Gerring and Christenson’s handbook (ch. 20 again and the end of ch. 21)
    • 1 chapter of Imai’s handbook