R Mastery: A Comprehensive Guide to Statistical Analysis, Data Visualization, and Programming Essentials
Overview
This comprehensive document contains all the information I have ever used in R, ranging from fundamental statistical analysis and basic graphical representations to more advanced visualizations using various packages and the essential principles of programming with R. Whether you are a beginner or an advanced user, this document has something for everyone.
You will find a wealth of information on statistical analysis, including descriptive statistics, hypothesis testing, regression analysis, and more. In addition to these basic concepts, I also cover more advanced topics such as data visualization with ggplot2, which will enable you to create complex and aesthetically pleasing graphs to better illustrate your data.
Moreover, this document provides an overview of R programming essentials, including the data structures, functions, loops, and conditionals that are the building blocks of the language. By understanding these basic concepts, you can more easily manipulate and analyze data in R.
To ensure that this code can be reproduced if needed, I have included a chapter on “Reproducibility” at the end of this document. In this section, I provide a summary of every package I used, as well as their versions, which can be useful if the code breaks. Overall, this document is a valuable resource for anyone who wants to improve their R programming skills and gain a deeper understanding of statistical analysis and data visualization.
Data manipulation
To begin with R, it is essential to know your way around a data frame. In this section I will go over the functions needed to shape your data the way you need it.
Base R
The most basic manipulation and overview functions are built into base R; no additional packages are necessary.
Import your data
The first step is to import your data. To import data into R, there are several functions you can use, including read.table(), read.csv(), read.csv2(), read.delim(), read.delim2(), and more. Here’s a brief explanation of each function:
- read.table(): This function is used to read data from a text file. It is a very versatile function that can handle many different file formats, but it requires you to specify some arguments such as the file path, the delimiter used in the file, and whether the file contains a header or not.
- read.csv(): This function is similar to read.table(), but it is specifically designed to read comma-separated value (CSV) files. It assumes that the file has a header row, and it automatically sets the delimiter to a comma.
- read.csv2(): This function is similar to read.csv(), but it is specifically designed to read CSV files where the field separator is a semicolon (;) and the decimal separator is a comma (,).
- read.delim(): This function is similar to read.csv(), but it is specifically designed to read tab-delimited files.
- read.delim2(): This function is similar to read.delim(), but it is specifically designed to read tab-delimited files where the decimal separator is a comma (,) and the field separator is a tab (\t).

For example, to read a CSV file named “data.csv” in your current working directory, you can use the read.csv() function as follows:
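data <- read.csv("data.csv")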
This assumes that the file has a header row, uses a comma as the delimiter, and that the file is located in your current working directory.
You can adjust the parameters of these functions based on the specifics of your data and the file format. For more information, you can use the R help system by typing ?read.csv, ?read.table, etc.
What does your data look like?
In this section I will cover some of the most commonly used functions in R for exploring and summarizing data.
The function head() is used to display the first few rows of a data frame. By default, it displays the first six rows, but you can specify the number of rows you want to display.
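head(iris)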
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
The function summary() is used to generate a summary of a data frame, including measures of central tendency and variability for each variable. It is particularly useful for identifying missing values, outliers, and the distribution of the data.
summary(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 versicolor:50
Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
The str() function is used to display the structure of a data frame, including the number of observations and variables, their names, data types, and a preview of the data. It is particularly useful for identifying the format of the data and for checking if variables are of the right data type.
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Calling colSums(is.na(data)) is a convenient way to check for missing values in a data frame: it returns the total number of missing values per column. It is particularly useful for identifying which variables have missing data before deciding how to handle them (for example, by imputation).
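For example, with a small made-up data frame:

# create a sample data frame with missing values
df <- data.frame(x = c(1, NA, 3),
                 y = c(4, 5, NA))
# count the missing values in each column
colSums(is.na(df))

x y 
1 1 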
Basic functions
To effectively modify and customize your data frames, you’ll need to utilize a variety of functions available in R. Fortunately, there are numerous options available to help you achieve your desired results. Below are some examples of functions you may find helpful in your data frame manipulations.
One of the most useful functions is subset(). It allows you to subset a data frame based on one or more conditions. Here’s an example of using subset() to select only the rows of the iris data frame where the value in the Sepal.Length column is greater than 5:
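# display the first rows of the subset
head(subset(iris, Sepal.Length > 5))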
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
11 5.4 3.7 1.5 0.2 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
The function merge() allows you to merge two or more data frames based on a common variable. Here’s an example of using merge() to combine two data frames based on the common id column:
# create two sample data frames
df1 <- data.frame(id = c(1, 2, 3), x = c(4, 5, 6))
df2 <- data.frame(id = c(2, 3, 4), y = c(7, 8, 9))
# use merge to combine the data frames based on the "id" column
merged_df <- merge(df1, df2, by = "id")
merged_df
id x y
1 2 5 7
2 3 6 8
The function apply() can be used to apply a function to the rows or columns of a data frame. In this example, the code applies the sum function to each row of the data frame; if you wanted to sum each column instead, you would change the MARGIN argument from 1 to 2.
# create a sample data frame
df <- data.frame(x = c(1, 2, 3),
y = c(4, 5, 6),
z = c(7, 8, 9))
# use apply to calculate the row sums
row_sums <- apply(df, 1, sum)
row_sums
[1] 12 15 18
The tapply() function can be used to apply a function to subsets of a data frame defined by one or more variables. Here’s an example of using tapply() to calculate the mean of a variable for each level of a grouping variable.
# create a sample data frame
df <- data.frame(x = c(1, 2, 3, 4),
y = c(5, 6, 7, 8),
group = c("A", "A", "B", "B"))
# use tapply to calculate the mean of y for each group
mean_by_group <- tapply(df$y, df$group, mean)
mean_by_group
A B
5.5 7.5
Another function from the apply family is lapply(). This function can be used to apply a function to each column of a data frame and return a list of the results. Here’s an example of using lapply() to calculate the range of values for each column of a data frame:
# create a sample data frame
df <- data.frame(x = c(1, 2, 3),
y = c(4, 5, 6),
z = c(7, 8, 9))
# use lapply to calculate the range of values for each column
range_by_col <- lapply(df, range)
range_by_col
$x
[1] 1 3
$y
[1] 4 6
$z
[1] 7 9
If you want to simplify the output to a vector (or matrix if possible), you can use the sapply() function. In this example we calculate the mean of each column of a data frame:
# create a sample data frame
df <- data.frame(x = c(1, 2, 3),
y = c(4, 5, 6),
z = c(7, 8, 9))
# use sapply to calculate the mean of each column
mean_by_col <- sapply(df, mean)
mean_by_col
x y z
2 5 8
If you want to change the column names, you can use the colnames() function, as in this example:
# Create a sample data frame
df <- data.frame(x = c(1, 2, 3),
y = c(4, 5, 6))
# Set column names to be "Var1" and "Var2"
colnames(df) <- c("Var1", "Var2")
# Print the data frame with column names changed
df
Var1 Var2
1 1 4
2 2 5
3 3 6
You can also change the row names with the row.names() function, as displayed here:
# Create a sample data frame
df <- data.frame(x = c(1, 2, 3),
y = c(4, 5, 6))
# Set row names to be "A", "B", and "C"
row.names(df) <- c("A", "B", "C")
# Print the data frame with row names changed
df
x y
A 1 4
B 2 5
C 3 6
Moving to the tidyverse
The tidyverse is a collection of R packages that are designed to work together to make data analysis more efficient and intuitive. The packages in the tidyverse include:
- ggplot2 for creating visualizations
- dplyr for data manipulation
- tidyr for reshaping data
- readr for importing data
- purrr for functional programming
- tibble for creating data frames
In the tidyverse, something very useful is the pipe operator: %>%. It allows you to chain together multiple functions in a way that makes the code easier to read and understand. It works by taking the output of one function and passing it as the first argument to the next function in the chain. This eliminates the need to create intermediate variables to store the output of each function, and makes it easier to write code that is both concise and expressive.
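For example, here is the same operation written with and without the pipe (a small sketch, assuming dplyr or another package that exports %>% is loaded):

library(dplyr)
# without the pipe: nested calls read inside-out
summary(head(iris))
# with the pipe: the same operation reads left to right
iris %>% head() %>% summary()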
The readr package
This package provides a fast and friendly way to read in flat files, including CSV, TSV, and fixed-width files. It is designed to be faster and more memory-efficient than the base R functions for reading in data.
The package provides several functions for reading in data, including:
- read_csv(): for reading in CSV files
- read_tsv(): for reading in TSV files
- read_delim(): for reading in files with a delimiter of your choice
- read_fwf(): for reading in fixed-width files
These functions are designed to be fast and memory-efficient, and they also automatically detect and handle a wide range of data formats and encodings.
One of the key features of readr is its ability to automatically detect the types of columns in your data, such as numeric, character, or date/time columns. However, if you have specific requirements for the types of columns in your data, you can also specify the column types manually using the col_types argument.
For example, if you want to specify that the column named col2 in your data should be read as a character column and col4 as a numeric column, you can do:
# load the readr package
library(readr)
my_data <- read_csv("my_data.csv", col_types = cols(
col2 = col_character(),
col4 = col_double()
))
readr provides several options for handling missing values in your data. By default, readr will treat any value that is empty or matches the string “NA” as a missing value. You can also specify additional strings to be treated as missing values using the na argument. For example, if you want to treat the string “MISSING” as a missing value, you can do:
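my_data <- read_csv("my_data.csv", na = c("", "NA", "MISSING"))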
The dplyr package
dplyr is a powerful and widely used package in R for data manipulation. It provides a concise and consistent syntax for transforming data sets, which allows you to easily filter, arrange, summarize, and join data. Here’s a brief overview of the main functions in dplyr:
- select(): Allows you to select specific columns from a data set, using a simple syntax that includes the column names separated by commas.
- filter(): Allows you to select specific rows from a data set based on a set of conditions, using a simple syntax that includes logical operators such as ==, >, <, etc.
- arrange(): Allows you to sort a data set based on one or more columns, using a simple syntax that includes the column names separated by commas, with optional modifiers like desc() for descending order.
- mutate(): Allows you to create new columns in a data set based on calculations or transformations of existing columns, using a simple syntax that includes the name of the new column followed by an equals sign (=) and the calculation or transformation expression.
- summarize(): Allows you to aggregate data by computing summary statistics such as mean, median, count, etc., using a simple syntax that includes the summary function name followed by the name of the column to be summarized.
- group_by(): Allows you to group a data set by one or more columns, which is often used in conjunction with summarize() to compute summary statistics for each group.
- The join functions (left_join(), inner_join(), etc.): Allow you to combine two or more data sets based on common columns, using a simple syntax that includes the names of the data sets and the columns to be joined on.
Here’s an example of how to use dplyr to perform some common data manipulation tasks:
library(dplyr)
# load a sample data set
my_data <- iris
# select specific columns
my_data <- select(my_data, Sepal.Length, Sepal.Width, Species)
# filter rows based on a condition
my_data <- filter(my_data, Sepal.Length > 5)
# arrange data by a column
my_data <- arrange(my_data, Sepal.Length)
# create a new column based on an existing one
my_data <- mutate(my_data, Sepal.Area = Sepal.Length * Sepal.Width)
# summarize data by a column
my_summary <- summarise(my_data, Mean.Sepal.Length = mean(Sepal.Length))
# group data by a column and summarize
my_summary2 <- my_data %>%
group_by(Species) %>%
summarise(Mean.Sepal.Length = mean(Sepal.Length))
# join two data sets
my_data2 <- data.frame(Species = c("setosa", "versicolor", "virginica"),
Petal.Length = c(1.4, 4.2, 6.0))
my_data <- left_join(my_data, my_data2, by = "Species")
In this example, we load the iris data set, which contains information about flowers, and use various dplyr functions to select specific columns, filter rows based on a condition, arrange the data by a column, create a new column based on an existing one, summarize the data by a column, group the data by a column and summarize, and join two data sets based on a common column.
Overall, dplyr provides a powerful and flexible set of tools for data manipulation in R, and is a key package to have in your data science toolkit.
The tidyr package
The tidyr package is a data manipulation package in R that provides functions for reshaping and tidying data sets. It is a complementary package to dplyr, which focuses on manipulating data sets by filtering, summarizing, and transforming variables.
The gather() function is used to reshape a data frame from a wide format to a long format. It does this by taking multiple columns and “gathering” them into a key-value pair format. For example, if you have a data frame with columns for year, quarter, and sales for each quarter, you can use gather() to reshape it into a format where each row represents a year-quarter combination and the sales value is in a single column.
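Here is a minimal sketch of that reshaping, using made-up quarterly sales figures (the data frame and column names are illustrative):

# load tidyr
library(tidyr)
# create an example data frame in wide format: one sales column per quarter
df_wide <- data.frame(year = c(2018, 2019),
                      Q1 = c(100, 110),
                      Q2 = c(120, 130))
# gather the quarter columns into key-value pairs (one row per year-quarter)
df_long <- gather(df_wide, key = "quarter", value = "sales", Q1, Q2)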
The function spread() is the opposite of gather(). It is used to reshape a data frame from a long format to a wide format. It does this by taking a key-value pair and “spreading” the values across multiple columns. For example, if you have a data frame with columns for year, quarter, and sales, where each row represents a year-quarter combination, you can use spread() to reshape it into a format where each row represents a year and there is a column for each quarter’s sales value.
# create example data frame
df <- data.frame(month = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
year = c(2018, 2018, 2018, 2019, 2019, 2019),
rainfall = c(50, 75, 100, 60, 80, 110))
# reshape data frame
df2 <- spread(df, key = "month", value = "rainfall")
The separate() function is used to split a single column into multiple columns based on a delimiter or a fixed number of characters. For example, if you have a data frame with a name column that contains both first and last names, you can use separate() to split it into two columns.
# create example data frame
df <- data.frame(name = c("John Smith", "Jane Doe", "Bob Johnson"))
# separate name column into first and last name columns
df2 <- separate(df, col = name, into = c("first_name", "last_name"), sep = " ")
The function unite() is the opposite of separate(). It is used to combine multiple columns into a single column by concatenating them with a delimiter. For example, if you have a data frame with separate columns for first and last names, you can use unite() to combine them into a single column.
# create example data frame
df <- data.frame(first_name = c("John", "Jane", "Bob"),
last_name = c("Smith", "Doe", "Johnson"))
# combine first and last name columns into a single name column
df2 <- unite(df, col = "name", first_name, last_name, sep = " ")
The fill() function is used to fill in missing values in a data frame. It does this by replacing each missing value with the most recent non-missing value in the same column. For example, if you have a data frame with a price column that has missing values, you can use fill() to fill them in with the most recent non-missing price value.
# create example data frame
df <- data.frame(date = c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04"),
price = c(100, NA, 150, NA))
# fill in missing values with most recent non-missing value
df2 <- fill(df, price)
df2
date price
1 2022-01-01 100
2 2022-01-02 100
3 2022-01-03 150
4 2022-01-04 150
drop_na() is a useful function when we have a data frame with missing values and we want to remove any rows that contain them.
# create example data frame
df <- data.frame(id = 1:5,
name = c("John", NA, "Bob", "Jane", "Bill"),
age = c(30, 25, NA, 40, 35))
# drop any rows that contain missing values
df2 <- drop_na(df)
Overall, tidyr is a powerful package that provides a range of functions for tidying and reshaping data frames. It is a key component of the “tidyverse” of packages in R that are designed to work together to provide a consistent and efficient data analysis workflow.
The tibble package
The tibble package is an alternative to R’s built-in data.frame object. It is designed to be more user-friendly and to make working with data frames easier. Some of the key features of tibble include:
- Printing: tibble objects print nicely to the console, making them easier to read.
- Error messages: tibble objects have improved error messages that provide more context than the default data.frame error messages.
- Subsetting: tibble objects behave more consistently than data.frame objects when subsetting.
- Column types: tibble objects preserve column types better than data.frame objects.
Here is an example of how to create and work with a tibble object:
# load tibble (and dplyr, whose filter() and mutate() are used below)
library(tibble)
library(dplyr)
# create a tibble with three columns
my_data <- tibble(
name = c("Alice", "Bob", "Charlie"),
age = c(25, 30, 35),
has_pet = c(TRUE, FALSE, TRUE)
)
# print the tibble to the console
my_data
# A tibble: 3 × 3
name age has_pet
<chr> <dbl> <lgl>
1 Alice 25 TRUE
2 Bob 30 FALSE
3 Charlie 35 TRUE
# filter the tibble to only include people with pets
my_data_with_pets <- filter(my_data, has_pet == TRUE)
# add a new column to the tibble
my_data_with_pets <- mutate(my_data_with_pets, pet_type = c("dog", "cat"))
# print the modified tibble to the console
my_data_with_pets
# A tibble: 2 × 4
name age has_pet pet_type
<chr> <dbl> <lgl> <chr>
1 Alice 25 TRUE dog
2 Charlie 35 TRUE cat
In this example, we create a tibble object called my_data with three columns: name, age, and has_pet. We then print the tibble to the console, which shows the nicely formatted output. We use filter() to create a new tibble called my_data_with_pets that only includes rows where has_pet is TRUE. We then use mutate() to add a new column called pet_type to my_data_with_pets. Finally, we print the modified tibble to the console, which shows the added pet_type column.
Other packages
ggplot2 is an R package for data visualization based on the grammar of graphics, which provides a consistent framework for building graphics by breaking them down into components. The package allows for a high degree of flexibility and customization, allowing users to create a wide range of visualizations, from simple scatterplots to complex multi-layered graphics. ggplot2 works by mapping data to aesthetic properties of the plot, such as color, shape, and size, and then layering geometric objects, such as points, lines, and bars, on top of this mapping. This allows for easy manipulation and comparison of different aspects of the data. The package also provides many tools for modifying and fine-tuning the appearance of the plot, such as changing axis labels and limits, adjusting colors and themes, and adding annotations. Overall, ggplot2 is a powerful and flexible tool for creating high-quality visualizations in R. We will dive deeper into ggplot2 in a later chapter.
The other tidyverse package not yet mentioned is purrr. It is a powerful and flexible package for working with functions and vectors in R, providing a consistent set of tools for manipulating and transforming data in a functional programming style. Some of the key features of purrr include the ability to work with functions in a more flexible way, including functions with multiple arguments and complex inputs, as well as the ability to apply functions to data in a variety of ways, such as mapping over lists and data frames. purrr also provides many functions for working with vectors, including tools for filtering, grouping, and reshaping data, as well as for handling missing values and dealing with errors. Overall, purrr brings a consistent, functional approach to working with data in R.
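As a small taste (a sketch, assuming purrr is loaded), map_dbl() applies a function to each element of a list (here, each column of a data frame) and returns a numeric vector:

library(purrr)
# compute the mean of each numeric column of iris
map_dbl(iris[1:4], mean)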
Statistical analysis
In this chapter we will focus on statistical tests: how to use them, but also how to interpret them.
Test for normality and transformations
In statistical analysis, it is often essential to check whether a dataset follows a normal distribution before applying certain parametric tests. R provides various tools to assess the normality of data and apply transformations if necessary. In this section, we will explore some common methods to test for normality and demonstrate how to perform data transformations.
Shapiro-Wilk Test
The Shapiro-Wilk test is a popular method to assess normality. It tests the null hypothesis that a sample is drawn from a normally distributed population. We can use the shapiro.test() function in R to perform this test. Let’s see an example:
# Generate a random sample from a normal distribution
set.seed(123)
normal_sample <- rnorm(100)
# Perform Shapiro-Wilk test
shapiro.test(normal_sample)
Shapiro-Wilk normality test
data: normal_sample
W = 0.99388, p-value = 0.9349

Here the p-value (0.93) is well above 0.05, so we fail to reject the null hypothesis: the sample is consistent with a normal distribution.
Histogram and Q-Q Plot
Another visual approach to check for normality is by creating a histogram and a quantile-quantile (Q-Q) plot of the data. The histogram provides a rough representation of the data’s distribution, while the Q-Q plot compares the quantiles of the data to the quantiles of a theoretical normal distribution. If the points on the Q-Q plot fall approximately along a straight line, it suggests the data is normally distributed. Here’s an example:
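Below is a sketch in which a right-skewed sample drawn from an exponential distribution (an illustrative choice) plays the role of the non_normal_sample examined in the next subsection:

# Generate a right-skewed, non-normal sample
set.seed(321)
non_normal_sample <- rexp(100)
# Create a histogram of the data
hist(non_normal_sample, main = "Histogram of Non-Normal Data")
# Create a Q-Q plot of the data
qqnorm(non_normal_sample)
qqline(non_normal_sample, col = "red")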
Data Transformation
If the data significantly deviates from normality, we can try applying transformations to make it approximately normal. Common transformations include logarithmic, square root, and reciprocal transformations. Let’s apply a logarithmic transformation to our non-normal data and re-check for normality:
# Apply logarithmic transformation
log_transformed_data <- log(non_normal_sample)
# Create a histogram of transformed data
hist(log_transformed_data, main = "Histogram of Log-Transformed Data")
# Create a Q-Q plot of transformed data
qqnorm(log_transformed_data)
qqline(log_transformed_data, col = "red")
# Perform Shapiro-Wilk test on transformed data
shapiro.test(log_transformed_data)
Shapiro-Wilk normality test
data: log_transformed_data
W = 0.95742, p-value = 0.002644

The p-value remains below 0.05, so the logarithmic transformation has not fully normalized this sample; a different transformation (square root, reciprocal) could be tried instead.
Remember that data transformations should be applied judiciously and be interpreted cautiously in the context of your specific analysis.
In conclusion, R offers powerful tools to assess the normality of data and apply appropriate transformations to meet the assumptions of parametric tests. Always validate the results and consider the specific requirements of your analysis before making any conclusions based on the normality tests and transformations.
The Chi-square test of independence
The Chi-square test of independence is a fundamental statistical method used to determine whether there is a significant association between two categorical variables. It is applicable when we have two or more categorical variables, and we want to investigate if they are dependent or independent of each other. The test assesses whether the observed frequencies in a contingency table differ significantly from the frequencies we would expect under the assumption of independence between the variables.
Let’s go through an example using R to illustrate the Chi-square test of independence.
Suppose we have survey data on the preference for ice cream flavors among two groups of people: Group A and Group B. The data is organized in a contingency table:
|         | Vanilla | Chocolate | Strawberry |
|---------|---------|-----------|------------|
| Group A | 25      | 15        | 10         |
| Group B | 30      | 20        | 25         |
We can use the chisq.test() function in R to perform the Chi-square test:
# Create the contingency table
ice_cream_table <- matrix(c(25, 15, 10, 30, 20, 25), nrow = 2, byrow = TRUE,
dimnames = list(c("Group A", "Group B"),
c("Vanilla", "Chocolate", "Strawberry")))
# Perform the Chi-square test
chi_sq_result <- chisq.test(ice_cream_table)
print(chi_sq_result)
Pearson's Chi-squared test
data: ice_cream_table
X-squared = 2.7056, df = 2, p-value = 0.2585
The test will provide us with a chi-square statistic and a p-value. If the p-value is below a predetermined significance level (often set at 0.05), we reject the null hypothesis of independence and conclude that there is a significant association between the two categorical variables. Conversely, if the p-value is greater than 0.05, we fail to reject the null hypothesis, suggesting that there is no significant association between the variables.
In summary, the Chi-square test of independence is a valuable tool to assess the relationship between categorical variables. It is commonly used in various fields, including social sciences, medicine, marketing, and more, to gain insights into the associations between different groups and make informed decisions based on the observed data.
Correlation tests
Correlation tests are essential statistical techniques used to assess the strength and direction of the relationship between two continuous variables. In R, we can employ various correlation tests to determine if there is a significant linear association between the variables. Two commonly used correlation tests are the Pearson correlation coefficient and the Spearman rank correlation coefficient.
Pearson Correlation Coefficient
The Pearson correlation coefficient, also known as Pearson’s r, measures the linear relationship between two continuous variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 suggests no linear relationship. The cor() function in R can be used to compute Pearson’s correlation coefficient. Here’s an example:
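The simulated data below is illustrative (the seed, variable names, and strength of the relationship are arbitrary):

# Generate sample data with a built-in linear relationship
set.seed(654)
variable1 <- rnorm(100)
variable2 <- 0.5 * variable1 + rnorm(100)
# Calculate Pearson correlation
cor(variable1, variable2)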
Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient assesses the strength and direction of a monotonic relationship between two variables. It is suitable for variables that may not have a linear relationship but exhibit a consistent monotonic trend. The cor() function with method = "spearman" can be used to compute Spearman’s correlation coefficient. Here’s an example:
# Generate sample data
set.seed(987)
variable3 <- runif(100)
variable4 <- runif(100)
# Calculate Spearman correlation
cor(variable3, variable4, method = "spearman")
[1] -0.1977798
In both cases, cor() returns only the correlation coefficient. To obtain a p-value as well, use cor.test() with the same method argument; the p-value indicates the statistical significance of the correlation, and a value below the chosen significance level (often 0.05) suggests that the correlation is statistically significant.
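For instance, applying cor.test() to the simulated variables from the Spearman example above:

# cor.test() reports the coefficient together with a p-value
cor.test(variable3, variable4, method = "spearman")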
Understanding the relationship between variables is crucial in data analysis and can provide valuable insights for decision-making and model building. Correlation tests help us quantify the association between variables, enabling us to make informed conclusions about their interdependence in the dataset.
Linear regression
Linear regression is a powerful statistical technique used to model the relationship between a dependent variable and one or more independent variables. It is widely employed in various fields, including economics, social sciences, and machine learning. In R, we can use the lm() function to perform linear regression and estimate the coefficients of the regression equation. Let’s explore three aspects of linear regression: simple linear regression, multiple linear regression, and variable selection with model comparison.
Simple linear regression
Simple linear regression involves modeling the relationship between two continuous variables: a dependent variable and a single independent variable. It assumes that the relationship can be approximated by a straight-line equation (y = mx + b), where y represents the dependent variable, x is the independent variable, m is the slope, and b is the intercept. We can use the lm() function to perform simple linear regression in R. Here’s an example:
# Generate sample data
set.seed(456)
independent_var <- rnorm(100)
dependent_var <- 2 * independent_var + rnorm(100)
# Perform simple linear regression
simple_lm_result <- lm(dependent_var ~ independent_var)
summary(simple_lm_result)
Call:
lm(formula = dependent_var ~ independent_var)
Residuals:
Min 1Q Median 3Q Max
-2.21548 -0.81266 -0.04914 0.91531 1.91311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.11472 0.09834 -1.166 0.246
independent_var 1.98347 0.09797 20.246 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9763 on 98 degrees of freedom
Multiple R-squared: 0.807, Adjusted R-squared: 0.8051
F-statistic: 409.9 on 1 and 98 DF, p-value: < 2.2e-16
The output of the summary() function provides information about the following key components:
- Coefficients table: shows the estimated regression coefficients for the intercept ((Intercept)) and the independent variable (independent_var). The coefficient estimates represent the change in the dependent variable (dependent_var) for a one-unit increase in the independent variable while holding other variables constant.
- Standard error: an estimate of the variability or uncertainty associated with the coefficient estimates. Smaller standard errors indicate more precise estimates.
- t-value: measures how many standard errors the coefficient estimate is away from zero. It is used to test the null hypothesis that the true coefficient is zero (i.e., no relationship between the independent and dependent variables). A high absolute t-value indicates a significant relationship between the variables.
- Pr(>|t|): the p-value associated with the t-value. It indicates the probability of observing the t-value or more extreme values under the null hypothesis. A p-value below the chosen significance level (often 0.05) suggests a statistically significant relationship.
- Residual standard error (RSE): an estimate of the standard deviation of the residuals (the differences between the observed and predicted values). It measures the goodness of fit of the model; a smaller RSE indicates a better fit.
- R-squared (R2): the proportion of variance in the dependent variable explained by the independent variable(s). It ranges from 0 to 1, with higher values indicating a better fit. For simple linear regression, R-squared is equal to the square of the correlation coefficient between the dependent and independent variables.
Multiple linear regression
Multiple linear regression extends the concept of simple linear regression to include multiple independent variables. It models the relationship between a dependent variable and two or more independent variables, assuming a linear combination of these variables to predict the dependent variable. The lm()
function can also be used for multiple linear regression. Let’s see an example:
# Generate sample data
set.seed(789)
independent_var1 <- rnorm(100)
independent_var2 <- rnorm(100)
dependent_var_multiple <- 2 * independent_var1 + 3 * independent_var2 + rnorm(100)
# Perform multiple linear regression
multiple_lm_result <- lm(dependent_var_multiple ~ independent_var1 + independent_var2)
summary(multiple_lm_result)
Call:
lm(formula = dependent_var_multiple ~ independent_var1 + independent_var2)
Residuals:
Min 1Q Median 3Q Max
-3.1398 -0.6257 0.1290 0.6580 2.2478
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09354 0.09893 -0.946 0.347
independent_var1 2.08573 0.10162 20.524 <2e-16 ***
independent_var2 2.95072 0.09538 30.937 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.986 on 97 degrees of freedom
Multiple R-squared: 0.9311, Adjusted R-squared: 0.9297
F-statistic: 655.2 on 2 and 97 DF, p-value: < 2.2e-16
The summary() output provides coefficient estimates for each independent variable, along with their standard errors, t-values, and p-values. Additionally, it gives an overall assessment of the model’s fit.
The summary table for multiple linear regression is similar to the one for simple linear regression. However, in multiple linear regression, the coefficients table will contain estimates for each independent variable, along with the intercept.
Additionally, you will find the following:

- Multiple R-squared: indicates the proportion of variance in the dependent variable explained by all the independent variables in the model.
- Adjusted R-squared: adjusts the multiple R-squared value based on the number of independent variables and the sample size. It penalizes the addition of unnecessary variables to prevent overfitting. A higher adjusted R-squared suggests a better-fitting model.
The summary() output also includes information on residuals and various diagnostics that can be useful for evaluating the model’s assumptions and checking for potential issues like heteroscedasticity or influential outliers.
Remember that while the summary table provides useful information about the model, it is essential to consider other diagnostic plots and statistical tests to ensure the model’s validity and interpret the results properly.
Variable selection and model comparison
In linear regression, selecting the most relevant independent variables is essential to build a robust and interpretable model. There are various approaches to perform variable selection, such as forward selection, backward elimination, and stepwise regression. Additionally, model comparison helps us evaluate different models and choose the best-fit model based on several metrics, including R-squared, adjusted R-squared, AIC (Akaike Information Criterion), and BIC (Bayesian Information Criterion).
Let’s consider a dataset with multiple potential independent variables. We can use different variable selection techniques to identify the subset of variables that contribute most significantly to the model. In this example, we’ll use the step() function in R to perform a stepwise regression for variable selection:
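# load MASS, which this session attaches (it provides stepAIC(), a more flexible variant of step())
library(MASS)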
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
# Generate sample data
set.seed(123)
independent_var1 <- rnorm(100)
independent_var2 <- rnorm(100)
dependent_var <- 3 * independent_var1 + 2 * independent_var2 + rnorm(100)
# Perform stepwise regression for variable selection
model <- lm(dependent_var ~ independent_var1 + independent_var2)
selected_model <- step(model)
Start: AIC=-7.03
dependent_var ~ independent_var1 + independent_var2
Df Sum of Sq RSS AIC
<none> 87.78 -7.032
- independent_var2 1 378.22 466.01 157.903
- independent_var1 1 676.30 764.08 207.350
summary(selected_model)
Call:
lm(formula = dependent_var ~ independent_var1 + independent_var2)
Residuals:
Min 1Q Median 3Q Max
-1.8730 -0.6607 -0.1245 0.6214 2.0798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.13507 0.09614 1.405 0.163
independent_var1 2.86683 0.10487 27.337 <2e-16 ***
independent_var2 2.02381 0.09899 20.444 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9513 on 97 degrees of freedom
Multiple R-squared: 0.9198, Adjusted R-squared: 0.9182
F-statistic: 556.3 on 2 and 97 DF, p-value: < 2.2e-16
The step() function will perform a stepwise selection and provide the final selected model with the most significant variables.
Once we have multiple candidate models, we need to evaluate their performance to choose the best one. Several metrics can be used for model comparison:
R-squared (R^2)
R-squared measures the proportion of the variance in the dependent variable that is explained by the independent variables in the model. A higher R-squared value indicates a better fit of the model to the data.
# Calculate R-squared for the selected model
rsquared <- summary(selected_model)$r.squared
print(paste("R-squared:", round(rsquared, 3)))
[1] "R-squared: 0.92"
Adjusted R-squared (R^2_adj)
Adjusted R-squared is a modified version of R-squared that considers the number of independent variables in the model. It penalizes the inclusion of irrelevant variables, providing a more realistic assessment of model fit.
# Calculate Adjusted R-squared for the selected model
rsquared_adj <- summary(selected_model)$adj.r.squared
print(paste("Adjusted R-squared:", round(rsquared_adj, 3)))
[1] "Adjusted R-squared: 0.918"
Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
AIC and BIC are information criteria that balance the goodness of fit with model complexity. Lower AIC and BIC values indicate a better model fit with fewer variables, making them useful for model comparison.
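# Calculate AIC and BIC for the selected model
print(paste("AIC:", round(AIC(selected_model), 3)))
print(paste("BIC:", round(BIC(selected_model), 3)))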
[1] "AIC: 278.756"
[1] "BIC: 289.177"
By comparing these metrics across different candidate models, we can select the model that provides the best trade-off between goodness of fit and simplicity, ensuring a reliable representation of the relationship between the variables.
In conclusion, variable selection and model comparison are critical steps in building an effective linear regression model. Properly selected models, based on meaningful metrics such as R-squared, adjusted R-squared, AIC, and BIC, can lead to better insights and more accurate predictions. Always interpret the results with caution and consider the context and assumptions of the analysis.
Validity of the model
Linear regression makes several assumptions about the data, such as:

- Linearity of the data: the relationship between the predictor (x) and the outcome (y) is assumed to be linear.
- Normality of residuals: the residual errors are assumed to be normally distributed.
- Homogeneity of residual variance: the residuals are assumed to have a constant variance (homoscedasticity).
- Independence of residual error terms.
Diagnostic Plot: Q-Q Plot
The Q-Q (quantile-quantile) plot helps us assess the assumption of normality in the residuals. If the residuals follow a normal distribution, the points on the Q-Q plot should approximately fall along a straight line.
# Create a Q-Q plot of residuals
qqnorm(simple_lm_result$residuals, main = "Q-Q Plot of Residuals")
qqline(simple_lm_result$residuals, col = "red")
Statistical Test: Shapiro-Wilk Test
To complement the Q-Q plot, we can perform the Shapiro-Wilk test to formally test for the normality of the residuals. The null hypothesis is that the residuals are normally distributed. A significant p-value (p < 0.05) indicates a departure from normality.
# Shapiro-Wilk test for normality of residuals
shapiro_test <- shapiro.test(simple_lm_result$residuals)
print(shapiro_test)
Shapiro-Wilk normality test
data: simple_lm_result$residuals
W = 0.96982, p-value = 0.0214

Here the p-value (0.021) falls just below 0.05, suggesting a mild departure from normality in the residuals that is worth a closer look.
Statistical Test: Durbin-Watson Test
The Durbin-Watson test is used to detect autocorrelation in the residuals. Autocorrelation occurs when the residuals are correlated with their own lagged values. The Durbin-Watson test statistic ranges from 0 to 4, where a value around 2 indicates no autocorrelation. Values significantly below 2 indicate positive autocorrelation, and values above 2 indicate negative autocorrelation.
# Durbin-Watson test for autocorrelation in residuals
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
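# Perform the Durbin-Watson test on the simple regression model
dwtest(simple_lm_result)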
Durbin-Watson test
data: simple_lm_result
DW = 2.0848, p-value = 0.6637
alternative hypothesis: true autocorrelation is greater than 0
Diagnostic Plot: Residual Plot
A residual plot helps us check for the assumption of homoscedasticity, which assumes that the variance of the residuals is constant across all levels of the independent variable. We plot the residuals against the fitted values (predicted values) and look for patterns. If the plot shows a constant spread with no apparent pattern, it indicates that the model meets the assumption.
# Create a residual plot
plot(simple_lm_result$fitted.values, simple_lm_result$residuals,
main = "Residual Plot", xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red", lwd = 2)
Statistical Test: Breusch-Pagan Test (for Homoscedasticity)
The Breusch-Pagan test is used to test for the assumption of homoscedasticity. The null hypothesis is that the variance of the residuals is constant (homoscedastic). A significant p-value suggests the presence of heteroscedasticity.
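# Breusch-Pagan test for heteroscedasticity (bptest() is part of the lmtest package loaded earlier)
bptest(simple_lm_result)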
studentized Breusch-Pagan test
data: simple_lm_result
BP = 0.26677, df = 1, p-value = 0.6055
Remember that diagnostic plots and statistical tests should be used as tools to assess the model’s validity and identify potential issues. It is crucial to interpret the results in conjunction with domain knowledge and the specific context of the analysis. Addressing any deviations from assumptions may involve data transformation or choosing a different model to improve the model’s performance and reliability.
This is still a work in progress…
Generalized linear model
ANOVA
One factor
Two factors
ANCOVA
PCA
Distance matrix and classification
DAPC
Graphical visualisation
Base R
ggplot2
Programming in R
Reproducibility
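As promised in the overview, here is the session information listing every attached package and its version, generated with sessionInfo():

sessionInfo()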
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Paris
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lmtest_0.9-40 zoo_1.8-12 MASS_7.3-60 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[7] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[13] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.4 jsonlite_1.8.8 compiler_4.3.2 tidyselect_1.2.0 scales_1.3.0
[6] yaml_2.3.8 fastmap_1.1.1 lattice_0.21-9 R6_2.5.1 generics_0.1.3
[11] knitr_1.45 htmlwidgets_1.6.4 munsell_0.5.0 pillar_1.9.0 tzdb_0.4.0
[16] rlang_1.1.2 utf8_1.2.4 stringi_1.8.3 xfun_0.41 timechange_0.2.0
[21] cli_3.6.2 withr_2.5.2 magrittr_2.0.3 digest_0.6.33 grid_4.3.2
[26] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 evaluate_0.23
[31] glue_1.6.2 fansi_1.0.6 colorspace_2.1-0 rmarkdown_2.25 tools_4.3.2
[36] pkgconfig_2.0.3 htmltools_0.5.7
Note
The writing process was conducted with the help of ChatGPT (OpenAI).