François Briatte (small modifs by Kim Antunez & ChatGPT)
Session date: March 12, 2024
By the end of this session, you will have learned what exploratory analysis is and why it matters for understanding data.
In a nutshell
How to summarize numerical variables using measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).
How to visualize distributions through histograms and density curves.
How to summarize categorical variables, including dummy variables, using tables and percentages.
Techniques to transform and manipulate variables, including creating factor variables.
6.1 Exploratory Analysis
6.1.1 Definition
Exploratory analysis is the process of summarizing multiple values of one or more variables using a set of concise summary statistics. It provides an initial understanding of the data, aiding in decision-making, hypothesis generation, and identifying potential outliers or anomalies.
6.1.2 Purpose
Exploratory analysis aims to uncover patterns, insights, and key characteristics in the data, helping to understand its underlying structure and relationships. It relies on two kinds of tools:
Descriptive Statistics (Numbers): Measures of central tendency (mean, median, mode) and dispersion (range, variance, standard deviation).
Distributions (Plots): Histograms and density curves.
6.2 Distributions
This section introduces the concept of distributions in statistics. A distribution refers to the way data values are spread out or organized. It’s a fundamental concept for understanding the characteristics of data and making informed decisions based on it.
6.2.1 Descriptive statistics (numbers)
Descriptive statistics are numerical measures that provide insight into the central tendency and variability of a dataset.
Measures of Central Tendency:
Mean: The arithmetic average of all data points in a dataset.
Median: The middle value in a sorted dataset; it separates the data into two equal halves. Other quantiles work the same way: quartiles, for instance, split the sorted data into four equal parts.
Mode: The most frequently occurring value in a dataset.
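The three measures above can be sketched in R on a small made-up vector. Base R provides `mean()` and `median()`, but its `mode()` function returns the storage type of an object rather than the most frequent value, so `stat_mode()` below is a hypothetical helper written only for this illustration.

```r
# Small illustrative vector (made-up values)
x <- c(2, 3, 3, 5, 7, 10)

mean(x)    # arithmetic average: 30 / 6 = 5
median(x)  # middle of the sorted values: (3 + 5) / 2 = 4

# Hypothetical helper (not part of base R): most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)  # frequency of each distinct value
  as.numeric(names(counts)[counts == max(counts)])
}

stat_mode(x)  # 3, the only value that appears twice
```

Note that `stat_mode()` returns every value tied for the highest frequency, so its result can have more than one element.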
Measures of Dispersion:
Range: The difference between the maximum and minimum values in a dataset.
Variance: The average squared deviation of the values from the mean; the larger it is, the more the values are spread out around the mean.
Standard Deviation: The square root of the variance; it expresses the spread in the same units as the data.
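As a rough illustration, the measures of dispersion can be computed in base R on a small made-up vector. Two details are worth noting: `range()` returns the minimum and the maximum rather than their difference, and `var()` computes the sample variance, which divides by n - 1 rather than n.

```r
# Small illustrative vector (made-up values)
x <- c(2, 3, 3, 5, 7, 10)

range(x)        # minimum and maximum: 2 and 10
diff(range(x))  # range as a single number: 10 - 2 = 8
var(x)          # sample variance (divides by n - 1): 46 / 5 = 9.2
sd(x)           # standard deviation: square root of var(x)
```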
6.2.2 Distributions (plots)
Distributions can be represented visually with several kinds of plots.
Histograms: A graphical representation of the frequency distribution of data. It divides the data into intervals (bins) and displays the number of data points in each bin. Histograms help understand the shape and spread of data.
Density Curves: A smoothed representation of the distribution of data. It provides insights into the probability density function of continuous data. Density curves are often used to approximate the shape of distributions.
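Both plots can be drawn with base R graphics. The sketch below assumes simulated data from `rnorm()`; the seed and the number of bins are arbitrary choices made for the illustration, not recommendations.

```r
set.seed(1)      # arbitrary seed, only so the example is reproducible
x <- rnorm(100)  # 100 draws from a standard normal distribution

# Histogram on the density scale (freq = FALSE), so that a density
# curve can be overlaid on the same vertical axis
h <- hist(x, breaks = 10, freq = FALSE,
          main = "Histogram of x", xlab = "x")

# Kernel density estimate: a smoothed view of the same distribution
d <- density(x)
lines(d, lwd = 2)
```

The `breaks` argument is treated by `hist()` as a suggestion, so the number of bins actually drawn may differ slightly from 10.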
6.3 Before Exercise 1
6.3.1 Quantitative / Continuous variables
Quantiles:
x <- rnorm(100)
# Extremes & Quartiles by default
x[1:10]