François Briatte (small modifs by Kim Antunez & ChatGPT)
Session date: March 26, 2024
By the end of this session, you will have learned the foundational concepts of significance testing: distributions, confidence intervals, and test statistics.
In a nutshell
The importance of normality in statistical analysis.
Significance tests: Chi-squared test and t-test.
Key notions: p-values, the null hypothesis (\(H_0\)), confidence intervals, and their mathematical foundations (Law of Large Numbers, Central Limit Theorem).
8.1 Normal distribution
8.1.1 The (standard) normal distribution
Also known as the Gaussian distribution, the normal distribution is a symmetric bell-shaped distribution that is commonly observed in various natural phenomena.
Characteristics:
Unimodal: Single peak.
Skewness = 0: Symmetric around the center => Mean = Median = Mode.
Kurtosis = 3: Moderate tail weight compared to other distributions (kurtosis describes how much of the distribution sits in the tails rather than around the mean).
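As a rough check of these characteristics, the sketch below simulates draws from a standard normal distribution and computes the mean, median, skewness, and kurtosis. The sample size, the seed, and the variable names are illustrative assumptions. Base R has no built-in skewness or kurtosis function, so the moment formulas are written out by hand.

```r
set.seed(42)                      # arbitrary seed, for reproducibility only
x <- rnorm(100000)                # 100,000 draws from the standard normal N(0, 1)

z <- (x - mean(x)) / sd(x)        # standardized values
skewness <- mean(z^3)             # third standardized moment, close to 0
kurtosis <- mean(z^4)             # fourth standardized moment, close to 3

round(c(mean = mean(x), median = median(x),
        skewness = skewness, kurtosis = kurtosis), 2)
```

With a sample this large, the mean and median should nearly coincide, the skewness should be near 0, and the kurtosis near 3.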
8.1.2 Normality and Inference
If you are a statistician, you will often ask yourself how well a dataset’s distribution resembles the normal distribution because many statistical methods assume normality.
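When the normality assumption holds, at least approximately, the usual formulas for test statistics and confidence intervals are valid, which is what allows generalizing from a sample to a population.
Several techniques help assess normality:
visual inspection (e.g., histograms)
normal probability plots (Q-Q plots)
statistical tests (e.g., the Shapiro-Wilk test)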
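The three techniques can be sketched in base R as follows. The data are simulated (hypothetical test scores out of 20), and the Shapiro-Wilk test is only one of several possible normality tests, chosen here as an illustration.

```r
set.seed(1)
scores <- rnorm(200, mean = 12, sd = 3)   # hypothetical test scores out of 20

hist(scores, main = "Histogram of simulated scores")  # visual inspection
qqnorm(scores)                            # normal probability (Q-Q) plot
qqline(scores)                            # reference line: points close to it suggest normality
sw <- shapiro.test(scores)                # Shapiro-Wilk test; H0: the data are normally distributed
sw$p.value                                # a small p-value would cast doubt on normality
```

No single technique is conclusive: with large samples, formal tests tend to flag even small departures from normality, so the plots remain useful.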
8.2 About Inference
8.2.1 Comparison: Exploring Differences and Similarities
Comparison involves examining the differences or similarities between two (or more) distinct groups or categories.
Comparison of means: Analyzing differences in the average value of a variable across distinct groups (t-test).
Comparison of proportions: Analyzing differences in proportions or percentages between groups, which is useful for categorical data, when examining how an attribute is distributed across categories (chi-squared test).
8.2.2 Ingredient 1: Estimation
Estimation uses the information available in a sample to infer characteristics of a population.
Distribution
A distribution describes how data values are spread out or organized:
normal distribution (bell-shaped curve),
binomial distribution (often associated with binary outcomes such as counting successes in a fixed number of trials)
…
Each type of distribution describes how data behaves under specific conditions.
Imagine you have a set of data, like the test scores of your classmates. A distribution is a way to describe how these scores are spread out or organized. It tells you the different values that the scores can take and how often each value appears.
Example: If you have 30 test scores in your class, a distribution might show that 5 students scored 90, 10 students scored 80, 8 students scored 70, and so on.
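The example can be reproduced in R as a frequency table. The first three groups come from the example above. The last group (7 students scoring 60) is a made-up completion of the "and so on", added only so that the counts sum to 30.

```r
# 30 scores: 5 at 90, 10 at 80, 8 at 70 (from the example),
# plus 7 at 60 (hypothetical completion, not in the original example)
scores <- c(rep(90, 5), rep(80, 10), rep(70, 8), rep(60, 7))

freq <- table(scores)         # frequency distribution: how often each value appears
freq
prop.table(freq)              # the same distribution expressed as proportions
barplot(freq, xlab = "Score", ylab = "Number of students")
```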
Mathematical Foundations
Two results justify the reliability of estimation techniques:
Law of Large Numbers: As the sample size increases, the sample mean converges to the population mean, so the estimate of the population parameter becomes more accurate.
Central Limit Theorem: The distribution of sample means approaches a normal distribution as the sample size increases, irrespective of the shape of the population distribution (provided its variance is finite).
This theorem is a powerful tool that enables statisticians to make inferences about population parameters even when the population distribution is unknown or non-normal.
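Both results can be illustrated with a small simulation. The sketch below assumes an exponential population with mean 1, chosen only because it is clearly non-normal. The seed and the sample sizes are arbitrary.

```r
set.seed(123)

# Law of Large Numbers: running mean of exponential draws (true mean = 1)
draws <- rexp(10000, rate = 1)
running_mean <- cumsum(draws) / seq_along(draws)
running_mean[c(10, 100, 1000, 10000)]     # gets closer to 1 as n grows

# Central Limit Theorem: means of many samples of size 50 from the same skewed population
sample_means <- replicate(5000, mean(rexp(50, rate = 1)))
hist(sample_means, breaks = 40,
     main = "Sampling distribution of the mean (n = 50)")
c(mean = mean(sample_means), sd = sd(sample_means))   # theory: mean 1, sd 1 / sqrt(50)
```

The histogram of `sample_means` should look roughly bell-shaped even though individual exponential draws are strongly right-skewed.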
8.2.3 Ingredient 2: Test Statistics
A test statistic is used to assess whether there is a relationship between variables.
t-test (parametric)
It determines whether the observed difference between the means of a variable in two groups is statistically significant. It relies on Student’s t-distribution.
Chi-squared test (nonparametric)
It is used to analyze the association (or independence) between categorical variables. It compares observed frequencies to the frequencies expected under independence in a contingency table. It relies on the chi-squared distribution.
Null Hypothesis (\(H_0\))
It is the default assumption being tested: for example, that there is no relationship between the variables being studied.
p-values
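The probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed. It measures how compatible the data are with the null hypothesis; values lower than a chosen threshold (e.g., p < 0.05) lead to the rejection of the null hypothesis.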
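As a minimal sketch of where a p-value comes from, the code below recomputes the two-sided p-value of the Welch t-test reported in section 8.3.2, starting from its rounded test statistic and degrees of freedom. The variable names are illustrative.

```r
t_stat <- -1.8608     # test statistic reported for the sleep data (section 8.3.2)
df     <- 17.776      # Welch degrees of freedom reported in the same output

# Two-sided p-value: probability, under H0, of a statistic at least this far from 0
p_value <- 2 * pt(-abs(t_stat), df = df)
p_value               # close to the reported 0.07939
p_value < 0.05        # FALSE: H0 is not rejected at the 5% level
```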
Confidence Intervals
A confidence interval provides a range of plausible values for a population parameter (like the average test score of all students), computed from a sample of data.
A confidence interval helps you estimate something about a larger group based on what you’ve seen in a smaller group, and it gives you a sense of how confident you can be in that estimate.
Example: Suppose you have a sample of the test scores for 30 students, and you want to estimate the average test score for all students in your school. You calculate a confidence interval, which might say, “We are 95% confident that the average test score for all students in the school falls between 10 and 11 out of 20.”
This means that if you were to take many different random samples and calculate a confidence interval from each of them, about 95% of those intervals would contain the true population parameter.
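The repeated-sampling interpretation can be checked by simulation. The sketch below assumes a normally distributed population of scores with a true mean of 10.5 and a standard deviation of 3. Both values are hypothetical, as is the helper function `ci_mean`.

```r
set.seed(2024)
true_mean <- 10.5     # hypothetical school-wide average (out of 20)
n <- 30               # students per sample, as in the example

# 95% confidence interval for a mean: mean +/- t quantile * standard error
ci_mean <- function(x, level = 0.95) {
  se <- sd(x) / sqrt(length(x))
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = length(x) - 1) * se
}

one_sample <- rnorm(n, mean = true_mean, sd = 3)
ci_mean(one_sample)   # interval from a single sample

# Draw many samples and record whether each interval covers the true mean
covers <- replicate(2000, {
  ci <- ci_mean(rnorm(n, mean = true_mean, sd = 3))
  ci[1] <= true_mean && true_mean <= ci[2]
})
mean(covers)          # should be close to 0.95
```

The interval returned by `ci_mean()` is the same one that `t.test(one_sample)$conf.int` reports.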
There are many significance tests
The choice of which test to use depends on factors such as the nature of the data, the number of groups being compared, and the assumptions underlying each test. It’s essential to understand the characteristics of your data and the requirements of each test to select the most appropriate one for your analysis.
8.3.1 Chi-squared Test
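M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"), party = c("Democrat", "Independent", "Republican"))
M # contingency table of observed counts
      party
gender Democrat Independent Republican
     F      762         327        468
     M      484         239        477
Xsq <- chisq.test(M) # stores the test result; wrap the call in parentheses to also print the summary
Xsq$p.value < 0.05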
[1] TRUE
Xsq$observed # observed counts (same as M)
      party
gender Democrat Independent Republican
     F      762         327        468
     M      484         239        477
Xsq$expected # expected counts under the null
      party
gender Democrat Independent Republican
     F 703.6714    319.6453   533.6834
     M 542.3286    246.3547   411.3166
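The expected counts and the test statistic can also be computed by hand, which shows what the chi-squared test compares. The sketch below rebuilds the gender-by-party table from the counts printed above. The names `E`, `chi_sq`, and `p_val` are illustrative.

```r
# The gender-by-party table printed above, rebuilt so this block runs on its own
M <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(M) <- list(gender = c("F", "M"),
                    party  = c("Democrat", "Independent", "Republican"))

# Expected count under independence: row total * column total / grand total
E <- outer(rowSums(M), colSums(M)) / sum(M)
round(E, 4)                                  # same values as Xsq$expected

# Pearson statistic: squared gaps between observed and expected, scaled by expected
chi_sq <- sum((M - E)^2 / E)
df     <- (nrow(M) - 1) * (ncol(M) - 1)      # (2 - 1) * (3 - 1) = 2
p_val  <- pchisq(chi_sq, df = df, lower.tail = FALSE)
c(statistic = chi_sq, df = df, p.value = p_val)
```

For example, the expected number of female Democrats is 1557 * 1246 / 2757, about 703.67, while 762 were observed.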
8.3.2 Student’s t-Test
Two-sample t-test
\(H_0\): No difference in means between group 1 and group 2
head(sleep)
  extra group ID
1   0.7     1  1
2  -1.6     1  2
3  -0.2     1  3
4  -1.2     1  4
5  -0.1     1  5
6   3.4     1  6
t.test(extra ~ group, data = sleep)
Welch Two Sample t-test
data: extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
-3.3654832 0.2054832
sample estimates:
mean in group 1 mean in group 2
0.75 2.33
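To see what `t.test()` computes, the Welch statistic, its degrees of freedom, and the confidence interval can be rebuilt by hand from the two group means and variances. This is a sketch using the built-in `sleep` data. The intermediate names (`vx`, `vy`, `se_diff`) are illustrative.

```r
x <- sleep$extra[sleep$group == "1"]   # group 1 (sleep ships with base R)
y <- sleep$extra[sleep$group == "2"]   # group 2

vx <- var(x) / length(x)               # squared standard error of each group mean
vy <- var(y) / length(y)
se_diff <- sqrt(vx + vy)               # standard error of the difference in means

t_stat <- (mean(x) - mean(y)) / se_diff
# Welch-Satterthwaite approximation to the degrees of freedom
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
ci <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df) * se_diff

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 4)
```

Because the 95% confidence interval for the difference in means contains 0, the test does not reject \(H_0\) at the 5% level, consistent with the p-value of 0.07939.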
Homework for next week
Finish Exercise 1 (Colonialism / democracy & life expectancy)