Session 1
2024-01-30
Exploratory Data Analysis (EDA): The process of summarizing, visualizing, and understanding the main characteristics of a dataset.
Descriptive Statistics: Calculating summary measures such as mean, median, and standard deviation.
Data Visualization: Creating graphical representations of data to identify patterns and relationships.
Standard Statistical Modeling: Using techniques like regression and econometrics to analyze relationships between variables.
Machine Learning and Deep Learning: Techniques for building predictive models and artificial intelligence, including neural networks and big data processing.
➡️ Applying the quantitative skills mentioned above to specific fields or industries, such as economics, biology, finance, healthcare, etc. This involves tailoring statistical methods to address domain-specific questions and challenges.
Because…
Data science allows us to uncover patterns, trends, and relationships within data, enabling us to make informed predictions about future events and outcomes.
For instance, this article titled “Predicting Crimes Before They Happen” from the Los Angeles Times explores the concept of predictive policing and how data science techniques are being used to anticipate and prevent crimes before they happen. By examining factors such as location, time of day, weather conditions, and historical crime trends, data scientists can build predictive models and forecast where and when certain types of crimes are more likely to occur.
Data visualization is a powerful tool that translates complex data into visual representations, making it easier to understand and communicate insights to diverse audiences. Visualizations help reveal patterns, outliers, and correlations that might be difficult to grasp from raw data alone.
The example from the Bricole Urbanism project provides insights into the spatial arrangement and characteristics of urban areas. Through data visualization techniques, urban data such as building footprints, street layouts, and land use can be transformed into interactive maps and visualizations.
Many real-world phenomena are influenced by multiple factors and variables. Data science allows us to analyze multidimensional data to uncover hidden relationships and dependencies. Techniques like clustering and dimensionality reduction help us gain insights from high-dimensional data.
The example from the Last.fm user library exemplifies the depth of multidimensional data exploration. The diverse facets of music listening habits can be analyzed through data points which include genres, artists, albums, play counts, timestamps, and more.
The ability to work with data, analyze it, and extract insights has become a sought-after skill in many professions. Quantitative skills, including statistical analysis, machine learning, and programming, are in high demand across industries. Data-driven decision-making leads to improved strategies, better customer experiences, and optimized operations.
Other possibilities:
This course is a practical workshop, not a lecture
Form groups of 3 students before next week
Get ready to work together
Plan ahead to be able to do readings between lessons
Grading will happen through group coding exercises
There are many statistical programming languages:
All fundamentally differ from spreadsheet editors such as Excel
is a statistical programming language and software environment for statistical computing and graphics.
It’s popularity in data science is due to:
is an integrated development environment (IDE) specifically designed for R programming.
It provides a user-friendly interface for writing, executing, and managing R code :
Usable Computer Make sure you have access to a functional computer, preferably a laptop or desktop. Mobile devices like smartphones and tablets won’t suffice for the demands of data science work.
Hardware Specifications: While Microsoft Surfaces and Google Chromebooks are compatible, ensure your computer has a reasonable amount of disk space and ample memory (RAM). Close unused software and browser tabs to optimize performance.
Desktop Organization: Keep your desktop and windows clutter-free to create a focused and organized workspace.
Power Source: Connect your computer to a power source to avoid interruptions during your learning sessions.
Internet Connection: A stable and fast internet connection is crucial for accessing online resources, downloading software, and participating in interactive activities.
Download and install R : a popular programming language used in statistical computing and data analysis. Visit the R Project website and choose the appropriate download page for your operating system (Mac or Windows).
Download and install RStudio (IDE) : Install RStudio, an integrated development environment (IDE) designed for R programming. Visit the posit website and download the RStudio Desktop version.
Troubleshooting
Overview of the RStudio Interface: Console, Scripts, Environment, Plots, Help, Files…
Select File → New R Script (Ctrl/Cmd-Shift-N
)
Save the script (somewhere you remember) as hello.r
(for future use)
Try the followings. Press Ctrl-Enter
(Win) or Cmd-Enter
(Mac) to execute / run
R packages are essential for additional functionalities. Lots of them are on the CRAN (Comprehensive R Archive Network)
# install a package from the CRAN (required only once)
install.packages("stringr")
# you now have access to lots of new functions
?stringr::str_detect
# package::function(arguments)
txt <- c("apple", "banana", "pear", "pineapple")
stringr::str_detect(txt, "p")
# load a package (in every script where needed)
library(stringr)
Well done! You have just executed / ran your first script
Your goal: get familiar with RStudio
Also: gather in groups of 3 and mention your group here