R Workshop 1 - The basics

This workshop aims to show you the basics of R, including opening data, visualising data, wrangling it, running basic analyses and covering some other good practices. Thanks to Dr Chris Crawford for recording this at short notice.

The start of the video was lost, so here’s where to go to install R and RStudio: https://posit.co/download/rstudio-desktop/ . Make sure you install both, then open RStudio, which will bring you up to the start of the video. Note that you don’t need to open R as well. That will open from within RStudio.

Once you’ve worked through this workshop, there’s a second workshop with more examples of what R can do: writing papers directly in RStudio using RMarkdown, a couple of things you might not have considered possible (like text analysis in R, or web scraping as an alternative way to collect data), and the cool figures you can get from ggplot2.

Over time, I will aim to expand this series, so please let me know if you’d like to see something else.

I’ve also created a page on using Jamovi, a free point-and-click program that is built on R. If R seems a bit too much (I get it), Jamovi is a great free program that you can use to build your stats skills, or keep them fresh.

R Workshop Workbook 1: Introduction to R

This is an introduction to R, for people who are interested or curious.

R is free, and there’s a lot to like about it. But it has a very steep learning curve. It’s not point-and-click (although there are some attempts at this).

There are loads of packages that extend the functionality of R. There isn’t much it can’t do! Sometimes these packages become a little outdated, and may break, but usually new ones spring up.

I’m AProf Alex Russell and I’ve been teaching stats since 2007. I’ve used loads of software, including learning on R’s predecessor, S-Plus. I use R all the time, and even I find it a bit tricky at times.

This workshop isn’t about learning statistics – if you’re not comfortable with a mean or a t-test, you might find this moves a bit fast for you.

Thanks to you for attending, and thanks to Prof Tania Signal (DDR-HMAS) for providing the resources to make this happen, and to Associate Professor Grace Vincent for helping too.


What’s on this page?

Downloading R and RStudio

Projects

Let’s set up a script, including making notes to help you remember what you’re doing

Packages, to extend R’s functionality

Data wrangling, including choosing specific rows or columns, creating new variables, and recoding existing variables

Operations

Basic Statistical Analyses, including independent samples t-test, one-way ANOVA, correlation, simple linear regression, and ways to get your output out of R

Opening your own data, whether it’s a csv, an Excel file, an SPSS data file or a Stata data file. R can open more, too – I’ve just picked these examples.

Example of good practices around scripts

Fun stuff to finish

Pros and cons of R

Downloading R and RStudio

  1. Download R: Go to https://cran.r-project.org/ and install R.

  2. Download RStudio: Visit https://www.rstudio.com/ to download and install RStudio, a user-friendly interface for R.

We can work in R by itself. If we open it, it looks something like this. 

But we’re going to use RStudio. It will open R from within.

Open RStudio.

Here’s what RStudio looks like once you’ve actually started doing some stuff in a project. Yours will look a bit blank to begin – I’ll have screenshots for you on this in a moment.

  • Top left is the Source Editor. This is where you can edit analysis scripts. If you don’t have a script open, this won’t be here. This is also where you can view datasets (if you ask for them).

  • Bottom left is the Console. This is R working away.

  • Top right is the Workspace Browser and History. This shows what’s in your workspace, like datasets and other things. We’ll see this build as we go along.

  • Bottom right is where you can see files, plots, help and other things.

You can move these around if you prefer them in different places.

Projects

RStudio works with “projects”, which are collections of everything you see on the screen. You can have a project set up for one research paper and another for a different research paper. When you reopen a project, it’ll take you to the folder that has all of the scripts and data and everything ready to go. 

I set up a working folder for each project. In that, I have my data, my scripts and my output. You can use subfolders if you like (e.g., Data in one, Scripts in another, Output in another), whatever you like. I find this can be a little annoying and slow me down – I have to remember to save and load data from that subfolder, for example.

Let’s set up a project for today.

To do this, first decide where you want this project to be. Set up a folder somewhere on your computer.

Click on File > New Project and you will see a screen like this.

I tend to do Existing Directory and go to the folder I’ve set up. Or, you can set up the folder here by using New Directory. Note that it’s fine with paths with spaces.

I’ll select Existing Directory. Navigate to where you set up your folder and select that. (If you selected New Directory instead, you can go to this folder – it’ll just set up a new folder within the folder you already created. No big deal.)

This will then create an Rproj file (R project file), which saves the RStudio setup within it.

On the top right of your screen, you’ll now see the Project name, and you can click on that to switch between projects.

Let’s set up a script

R is a scripted language, which means that you’ll be typing a lot. And the code is often the same for certain things across projects. Most people don’t have all the code memorised – they look it up each time and modify it to their needs.

Go to File > New File and select R Script. This will open on the top left (unless you’ve moved it). It’s empty, and it’s called Untitled1. Type something in it, anything.

Notice how now Untitled1 is red and has an asterisk next to it. This means that something has changed and it hasn’t been saved. So, save this script into your folder. File > Save. Call it something like Intro or whatever you like.

Later, I’ll talk about best practices with scripts. For now, we’ll use this to play around.

R can open lots of different kinds of data, like csv, Excel, etc. If you have an SPSS data file, it can open that too. Before we do that, we’ll need to talk about packages. Let me get to that in a bit.

For now, we’ll use R’s built-in datasets.

To see the list of datasets that are built into R, type the following into your script – and be careful, R is case sensitive:

data() 

To run it, put your cursor on that line and select “Run” from the top of the script window.

You’ll see a window like this now in the top left. These are all datasets that you can call on to play around with R. Of course, we’ll talk about opening our own data later.

You’ll see the iris data in there, which is what we used in the Jamovi workshop.

Let’s open one called mtcars.

data(mtcars)

On the right now, in the Environment section, you’ll see we have something called mtcars, with 32 obs. of 11 variables. 32 obs = number of cases (here, cars).

Click on it and you’ll see in the Console some stuff will appear. Note that it shows that it has run force(mtcars) for us to show the data. We didn’t have to type it. This is one of the benefits of RStudio, but most of the time we’ll need to type what we want. Note also that force(mtcars) has not appeared in our Script. Our script is whatever we put into it, and the idea is that we choose what we send to the Console (bottom left, i.e., R). The Console does the work.
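If you’d rather do the same thing by typing commands (which keeps a record in your script), these do much the same job:

head(mtcars)  # Print the first six rows in the Console

View(mtcars)  # Open the spreadsheet-style viewer, same as clicking in the Environment pane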

Making notes

Let’s start to keep track of what we’re doing in the Script.

If there is a # character in a line, R will ignore everything after it until the next line. You can have the # at the start of the line (similar to the * in SPSS).

data(mtcars)

# Note to self – I’m loading the mtcars data here

Or you can have it part way through a line.

data(mtcars) # Note to self – I’m loading the mtcars data here

 

I’ll often make notes like this for big headings, because they stand out in the script:

############################

# Loading data

############################

Packages

Packages are a really important part of R. These are add-ons that can be created by anyone. They often contain functions – combinations of commands that enable new things to be done. This is how analyses are added, for example. Packages can include anything, though – data, for example. If you wanted to, you could make a package for teaching R, or for sharing your analysis and data for a project.

Packages must be installed. At first, you’ll install packages using code like this:

install.packages("dplyr")

Once they’re installed, you can load packages like this:

library(dplyr)

Note the difference between the two – one has inverted commas, one doesn’t. Quick note – Word uses curly inverted commas. R won’t accept them. Try pasting and running this and see what happens.

install.packages(“dplyr”)

You don’t need to install packages every time. Just call on them (library) in each script.
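If you’d like a script to install a package only when it’s missing, here’s a common pattern you could adapt (a small sketch, not something we’ll rely on today):

# Install dplyr only if it isn’t already installed, then load it

if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")

library(dplyr)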

Let’s use dplyr. It’s a very common tool for data manipulation.

When you load it, you’ll see a message in the Console. It’s saying: “Hey, normally in R, if you use filter or lag, we have a set idea of what that means. But dplyr has functions with those names too. So, since you’re using dplyr, if you ask to filter something, it’ll do it the dplyr way.”

RStudio has gotten pretty smart. If you’re trying to load a package and it realises you don’t have it installed, it will often prompt you.

Data wrangling

A big part of R is setting up our data.

Let’s get a sense of how mtcars is set up.

summary(mtcars)

Here are our 11 variables, with a five number summary (min, max, median, 1st and 3rd quartiles) and also a mean.

Notice how it’s giving these values for all variables, even though the values of “vs”, “am” and “gear” suggest they’re probably categorical. Let’s check what R thinks they are.

str(mtcars)

The “num” indicates that they’re being treated as numeric, i.e., continuous.

We can change this.

mtcars$cyl <- as.factor(mtcars$cyl)

Now try str(mtcars) again.

Notice how cyl is now a factor with three levels, and those levels have labels “4”, “6” and “8”.

Try summary(mtcars) again. 

Now it’s treating cyl as groups, with 11 cases with 4 cylinders, 7 cases with 6 cylinders and 14 with 8 cylinders.


Choosing data or cases based on values

If we want to, we can easily create a subset of the data by selecting rows, columns or both.

Here’s an example of setting up a subset of the mtcars data, by filtering rows so that we’re only selecting rows where mpg is more than 20.

mtcars_filtered <- filter(mtcars, mpg > 20)  # Filter for cars with mpg > 20

And if we want to select particular columns (say if we wanted a smaller, cleaner dataset to work with rather than a huge one with hundreds or thousands of variables), we can also do this.

mtcars_selected <- select(mtcars, mpg, hp)  # Select only mpg and hp columns


Creating new variables

We can add variables to mtcars, such as if we wanted to create something called hp_per_wt (i.e., horsepower divided by weight). One way of doing this is using the mutate command (from the dplyr package).

library(dplyr)

mtcars <- mutate(mtcars, hp_per_wt = hp / wt)  # Add horsepower-to-weight ratio

With dplyr, we can also structure this using pipes, which look like %>%.

Pipes take the output of one command and pass it along to the next, letting you chain multiple commands together.
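For example, here’s a small sketch chaining together the filter and select commands from above with a mutate:

# Keep only efficient cars, keep two columns, then add a weight-in-kg variable

mtcars %>%

  filter(mpg > 20) %>%

  select(mpg, wt) %>%

  mutate(wt_kg = wt * 453.592)  # wt is in 1000s of lbs, so multiply by 453.592 to get kg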

Here are some examples of mutating variables:

# Create a variable called kpl, for kilometres per litre, which is just mpg with a conversion factor

mtcars %>%

  mutate(kpl = mpg * 0.425144)

# Classifying cars as efficient or not based on mpg

mtcars %>%

  mutate(efficiency = ifelse(mpg > 20, "Efficient", "Inefficient"))

# Categorising cars into low, medium and high horsepower

mtcars %>%

  mutate(hp_category = case_when(

    hp < 100 ~ "Low",

    hp >= 100 & hp < 150 ~ "Medium",

    TRUE ~ "High"

  ))

# Add an identifier column

mtcars %>%

  mutate(car_id = row_number())

# Calculate mean mpg for each number of cylinders and add it as a new column

mtcars %>%

  group_by(cyl) %>%

  mutate(mean_mpg_by_cyl = mean(mpg))

# Modify an existing variable to change horsepower from imperial to metric

mtcars %>%

  mutate(hp_metric = hp * 1.01387)

# Create an interaction variable between hp and wt (useful for regression)

mtcars %>%

  mutate(hp_wt_interaction = hp * wt)

# Standardise a variable (i.e., make it so that its mean is 0 and SD is 1)

mtcars %>%

  mutate(mpg_z = (mpg - mean(mpg)) / sd(mpg))

# Create a logical variable, where its values are not 0 and 1, but instead TRUE and FALSE

mtcars %>%

  mutate(is_efficient = mpg > 20)

# Convert multiple variables to a log scale in one go using the across function

mtcars %>%

  mutate(across(c(mpg, hp, wt), log))

# Add a column with summarised data

mtcars %>%

  mutate(total_hp = sum(hp))

# Rename a variable

mtcars %>%

  rename(miles_per_gallon = mpg)

# Create a variable that determines if a car has mpg above the median of the dataset

mtcars %>%

  mutate(above_median_mpg = mpg > median(mpg)) 

# Make a variable into a factor, specifying both the levels and the labels for the levels. Note the c() command, which is “concatenate”.

mtcars %>%

  mutate(cyl_factor = factor(cyl,

                             levels = c(4, 6, 8),

                             labels = c("Four cylinders", "Six cylinders", "Eight cylinders"))) 

We can also combine multiple mutate operations into one chunk of code. Just separate each with a comma.

# Combining multiple mutate operations into one command

mtcars %>%

  mutate(

    kpl = mpg * 0.425144,

    hp_category = case_when(

      hp < 100 ~ "Low",

      hp >= 100 & hp < 150 ~ "Medium",

      TRUE ~ "High"

    ),

    mpg_z = (mpg - mean(mpg)) / sd(mpg),

    is_efficient = mpg > 20

  )
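One thing to watch: the examples above print the result to the Console, but they don’t save it. If you want to keep the new variables, assign the result back to an object, like this:

# Assign the result back so the new variable is actually stored

mtcars <- mtcars %>%

  mutate(kpl = mpg * 0.425144)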

Operations

Here is some standard code to look at mean and SD of a particular variable. Note that we need to specify which dataset (even if we only have one open) and then the variable, i.e., mtcars$mpg specifies mtcars as the data, and mpg as the thing within the data that we want to use.

# Calculate Mean and Standard Deviation

mean_mpg <- mean(mtcars$mpg)

sd_mpg <- sd(mtcars$mpg)

cat("Mean of mpg:", mean_mpg, "\n")

cat("Standard Deviation of mpg:", sd_mpg, "\n")

Let’s look at a standard histogram of this. Note that in the next workshop, we’ll look at nicer figures in ggplot2.

# Histogram of mpg

hist(mtcars$mpg, main = "Histogram of MPG", xlab = "Miles Per Gallon", col = "lightblue", border = "black")

Or a boxplot.

# Boxplot of mpg

boxplot(mtcars$mpg, main = "Boxplot of MPG", ylab = "Miles Per Gallon", col = "lightgreen")

Here are some other options. Work through them and see if you can work out what they are.

# Counts for each category

cyl_counts <- table(mtcars$cyl)

print(cyl_counts)

# Percentages for each category

cyl_percentages <- prop.table(cyl_counts) * 100

print(cyl_percentages)

# Bar plot of cylinder counts

barplot(cyl_counts, main = "Bar Plot of Cylinders", xlab = "Number of Cylinders", ylab = "Frequency", col = "orange", border = "black")

Basic Statistical Analyses

This section outlines basic statistical analyses in R. As you’ll see, they are pretty basic and there are often better ways of doing the same analyses. However, these highlight some cool things.

First, an independent samples t-test. Note that the way we run this is to create an object called t_test_results which is all the output of the t-test. This includes the t-value, the degrees of freedom, means for each group, the p-value, etc. We can call on all of these, and these will be useful in the next workshop.

1. Independent Samples T-Test

t_test_results <- t.test(mpg ~ factor(cyl), data = mtcars)  # T-test for mpg by cylinder group

t_test_results  # Display the results

Note that it won’t let you run a t-test with cyl, because cyl has three levels, and a t-test only works with two. Let’s change to vs, which has only two levels. That works.

t_test_results <- t.test(mpg ~ factor(vs), data = mtcars)  # T-test for mpg by engine type (vs)

t_test_results  # Display the results 
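Because t_test_results is an object holding all the output, we can pull out individual pieces of it. For example:

t_test_results$statistic  # The t value

t_test_results$parameter  # Degrees of freedom

t_test_results$p.value  # The p-value

t_test_results$estimate  # The means for each group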

An ANOVA is also quite simple. Note the setup: dependent variable is specified first, then a tilde (~), then the IV(s). Also note how we can specify that gear should be treated as a grouping variable within the analysis. We don’t necessarily have to change it in the data.


2. One-Way ANOVA

aov_results <- aov(mpg ~ factor(gear), data = mtcars)  # ANOVA for mpg by gear

summary(aov_results)  # Display the ANOVA summary

And if we want to explore things a bit further, let’s do Tukey tests. These will compare each group to each other group.

tukey_results <- TukeyHSD(aov_results)

print(tukey_results)
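If you’d like a quick visual, base R can plot the Tukey results too, showing the confidence interval for each pairwise difference:

plot(tukey_results)  # Plots the family-wise confidence intervals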


3. Correlation

Correlations are simple, but the core correlation function is a bit limited.

correlation <- cor(mtcars$mpg, mtcars$hp)  # Pearson correlation between mpg and hp

correlation  # Display the correlation coefficient

See how this only gives us the actual correlation coefficient, but not any other data, like a p-value? This is a nice example of how sometimes we want to use commands other than stock R commands.
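A stock command that does give us more is cor.test, which returns the correlation along with a p-value and a confidence interval:

cor_test_results <- cor.test(mtcars$mpg, mtcars$hp)  # Pearson by default

cor_test_results  # Displays r, t, df, the p-value and a 95% CI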

Let’s put together a series of commands that make a much nicer correlation output. Don’t worry, you won’t understand what all of this does just yet, but it’s a nice example of using a series of R commands to get much better output. And, I didn’t write this! Lots of people make code like this available online, so you can take it and modify it for what you need.

# Load necessary packages

# install.packages("Hmisc")

# install.packages("dplyr")

# install.packages("reshape2")

data(mtcars)

library(Hmisc)

library(dplyr)

library(reshape2)

# Calculate correlations and p-values

cor_matrix <- rcorr(as.matrix(mtcars[, c("mpg", "hp", "wt", "qsec")]))

correlations <- cor_matrix$r

p_values <- cor_matrix$P

# Define the add_stars function with NA handling

add_stars <- function(cor, p) {

  if (is.na(p)) {

    return("")  # or return NA if you prefer

  } else if (p < 0.001) {

    return(paste0(round(cor, 2), "***"))

  } else if (p < 0.01) {

    return(paste0(round(cor, 2), "**"))

  } else if (p < 0.05) {

    return(paste0(round(cor, 2), "*"))

  } else {

    return(round(cor, 2))

  }

}

# Melt correlation and p-value matrices into long format

cor_df <- melt(correlations)

p_df <- melt(p_values)

# Combine data frames and apply the function to add stars

cor_df <- cor_df %>%

  rename(Correlation = value) %>%

  left_join(p_df, by = c("Var1", "Var2")) %>%

  rename(P_value = value) %>%

  mutate(Correlation_with_stars = mapply(add_stars, Correlation, P_value)) %>%

  select(Var1, Var2, Correlation_with_stars) %>%

  dcast(Var1 ~ Var2, value.var = "Correlation_with_stars") 

# Print the final table

print(cor_df, row.names = FALSE)


4. Simple Linear Regression

Regression is pretty simple in R, but also very powerful as you get into more advanced applications. Again, same kind of format – dv ~ iv. Note that the command/function here is lm, for linear model.

lm_model <- lm(mpg ~ hp, data = mtcars)  # Linear regression predicting mpg from hp

summary(lm_model)  # Display the regression summary
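As with the t-test, lm_model is an object that we can pull pieces out of:

coef(lm_model)  # The intercept and slope

confint(lm_model)  # 95% confidence intervals for the coefficients

fitted(lm_model)  # Predicted mpg for each car

resid(lm_model)  # Residuals for each car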


5. Getting output out of R

You’ll have noticed that the R output isn’t very nicely formatted. There are a bunch of packages around that can help with output, especially for regression models. Stargazer is one, but I have found that it is sometimes a bit fickle and I’m not sure that it’s still being updated.

# You can run a model within the stargazer function

library(stargazer)

stargazer(lm(mpg ~ hp, data=mtcars), type="html", star.cutoffs=c(.05, .01, .001), out="regressiontest.html")

# Or you can call on a model that you’ve run previously

library(stargazer)

stargazer(lm_model, type="html", star.cutoffs=c(.05, .01, .001), out="regressiontest2.html")

These will create html files that you can then copy and paste wherever you like.
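If you just want a quick look in the Console rather than an html file, stargazer can also print a plain-text version of the table:

stargazer(lm_model, type = "text", star.cutoffs = c(.05, .01, .001))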

Opening your own data

So far, we’ve been opening R’s built-in datasets, for demo purposes. But you’ll almost always want to open your own data files. So how do we do this?

While R can read in certain types pretty easily, there are packages that make life easier. Here’s how you can open a variety of common data types.

I typically recommend downloading data in SPSS format where possible, because it stores labels (e.g., “man”, “woman”, “non-binary/gender diverse”, “prefer not to say”) with data values (e.g., “1”, “2”, “3”, “4”). But of course you could have data stored in .csv formats, Excel formats, text files, or all sorts of things.

Opening a csv file

# Install the package if you don’t already have it installed

# install.packages("readr")

# Load package

library(readr)

# Assuming the file is named "data.csv" and located in your working directory

csv_data <- read_csv("data.csv")

Opening an Excel file

# Install necessary packages if you haven't already

# install.packages("readxl")

# Load package

library(readxl)

# Assuming the file is named "data.xlsx" and located in your working directory

# Specify the sheet if there are multiple sheets; e.g., sheet = "Sheet1"

excel_data <- read_excel("data.xlsx", sheet = 1)

Opening an SPSS data file (.sav)

# Install necessary packages if you haven't already

# install.packages("haven")

# Load package

library(haven)    

# Assuming the file is named "data.sav" and located in your working directory

spss_data <- read_sav("data.sav") 
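One of the nice things about haven is that it keeps those SPSS value labels attached to the variables. To turn a labelled variable into an R factor with those labels, use as_factor (the variable name gender here is just a hypothetical example):

spss_data$gender <- as_factor(spss_data$gender)  # Convert a labelled SPSS variable to a factor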

Opening a Stata data file (.dta)

# Install necessary packages if you haven't already

# install.packages("haven")

# Load package

library(haven)    

# Assuming the file is named "data.dta" and located in your working directory

stata_data <- read_dta("data.dta")

Obviously you wouldn’t load the same data in multiple formats! But, if you need to, you can have multiple data files open in R at the same time. Remember, when you refer to variables in datasets, you specify which one.

So you might have loaded baseline data from a study as baseline_data, and then wave 1 data as wave1_data. Let’s say both contain a variable called “gambling_spend”. You can refer to each spend from each dataset separately using R’s usual approach for specifying variables: baseline_data$gambling_spend and wave1_data$gambling_spend.

And you can combine data from multiple datasets into one data frame in R as well. Let’s say we want to see if gambling_spend has changed between the waves, using a simple repeated measures t-test. We’ll first combine them into a common data frame in R (i.e., put them into the same object with an identifier to link them), and then run a repeated measures t-test. Earlier we ran an independent samples t-test. To change it to a repeated measures t-test, we just add paired = TRUE into the code.

# Combine the two data frames by creating a common ID column (assuming each row represents an individual)

# Also, make sure that `gambling_spend` columns in each data frame are properly aligned

combined_data <- data.frame(

  ID = 1:nrow(baseline_data),  # Creating a simple ID column

  gambling_spend_baseline = baseline_data$gambling_spend,

  gambling_spend_wave1 = wave1_data$gambling_spend

)

# Run the paired t-test

t_test_results <- t.test(combined_data$gambling_spend_baseline,

                         combined_data$gambling_spend_wave1,

                         paired = TRUE)

# Display the results

print(t_test_results)

Example of good practices around scripts

Here’s what I often do with my code – split it out into multiple scripts.

First load the dataset(s) with an initial script. It might look like this:

rm(list=ls())

library(readr)

library(haven)

mtcars <- read_csv("mtcars.csv")

save(mtcars, file="mtcars_data.RData")

What this does is clears the environment, loads any required packages, and then reads in the data (assuming the dataset is mtcars.csv – it could be any dataset). Note that this is just example code. Then, it saves the data as an RData object, which R can use from then on.

As mentioned in the next workshop, this might also include functions that I’ll use going forward.

 

A second script is then used for any changes to the data (e.g., recoding, scoring scales, setting things up as factors, etc).

load("mtcars_data.RData")

mtcars$cyl <- factor(mtcars$cyl, levels = c(4, 6, 8))

save(mtcars, file="mtcars_cleaned.RData")

This one reads the RData object made by the first script, does some stuff to it (here recoding the cyl variable as a categorical variable), and then saves it into a separate RData object.

 

A third script is then used for analyses.

load("mtcars_cleaned.RData")

cor_results <- cor(mtcars$mpg, mtcars$hp)

library(stargazer)

stargazer(lm(mpg ~ hp, data=mtcars), type="html", star.cutoffs=c(.05, .01, .001), out="regressiontest.html")

This one loads the cleaned data from the second script, runs a correlation, then a regression between two variables (mpg and hp), and then uses a package called stargazer to make nice tables for that regression.

Fun stuff to finish workshop 1

Try demo(), which will show you a list of the demos available. 

You could try demo(graphics), as an example.

In the workshop I also demonstrated some interactive examples of figures, and how these can be hosted online for other people to see. Important: Note that the data in these are not necessarily accurate, especially the ACT regions plots. They are just a demonstration of function.

https://www.alexmtrussell.com.au/gambling-stats-product-by-state

https://www.alexmtrussell.com.au/gambling-stats-act-regions

https://www.alexmtrussell.com.au/actegmlocations/ 

These are made in R using a package called Shiny (you’ll often see them called RShiny apps), which builds interactive web apps.

I’ll showcase some more examples of things like text analysis and web scraping in the next workshop.

Pros and cons of R

Pros:

  • Open source and free.

  • Loads of packages. There isn’t much it can’t do!

    • Lots of things that Jamovi and SPSS and the like can’t do, like mediation with categorical outcomes. R’s got you covered.

  • R was created by statisticians, not software engineers.

  • Loads of community support.

  • Integrates with other tools.

  • Beautiful visualisations. More on this in the next workshop.

    • Including interactive visualisations.

  • Can handle loads of data types. Doesn’t have to be your standard “rows and columns” kind of approach.

  • Can combine data across complex datasets.

  • Reproducible workflows, e.g., with RMarkdown.

Cons:

  • Steep learning curve for students (and staff!)

  • Not point and click (although there are things that can help with this, like Rcmdr)

  • Inconsistent behaviour between packages.

  • Packages break as R is updated, and/or functions become outdated. Many aren’t updated, so lots of dead or dying packages around.

  • Dependencies between packages. Some packages rely on other packages, so when something breaks, it can have a bit of a chain reaction.

  • It’s hard to get an intuitive sense of what’s happening in the data or analysis sometimes.

  • Error messages aren’t intuitive either.

  • Often, something will run but you’re not sure what it is!

    • Double check results using another analysis to make sure that the results are what you think they are.

  • R can be slower/less efficient than some other languages like Python. But it usually doesn’t make too much of a difference.

  • Can be difficult to get output out, but there are packages for this.