3 Coding in RStudio
This chapter introduces the basics of coding in RStudio: variables, calculations, vectors, named vectors, indexing and subsetting vectors, lists, dataframes, column and row names, subsetting dataframes, filtering with conditionals, and if-else statements. Use your RStudio Cloud account to follow along with the examples. The best way to learn how to code is with hands-on experience!
Leaving Certificate Syllabus
This chapter is complementary to:
- Leaving Certificate Computer Science Section 2: Data
Variables
The first concept we will cover is variables. Variables are placeholders that store pieces of information. This information can take many forms – a number, a vector, a matrix, a function – the list goes on.
To assign a variable in R, you can use the <- notation or the = notation.
In the code block below, we create a character variable by taking the string "Hello World" and assigning it to the variable greeting.
greeting <- "Hello World"
To display the contents of greeting, you can use the print() function.
print(greeting)
Notice how the variable has been stored in our environment (the upper right box in RStudio) as a value.
As previously mentioned, variables are placeholders and as such, can be overwritten and modified.
Change the contents of the greeting variable to hold the string "Hello user" and print the contents. Notice how the contents have changed.
greeting <- "Hello user"
print(greeting)
Note that the value of greeting shown in the environment pane has been updated to reflect this change.
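As a small sketch of the two assignment notations and of overwriting, the snippet below stores the same value with <- and with =, then replaces the contents of a variable. The names a, b and message_text are made up for illustration.

```r
# Both notations assign a value to a name; <- is the more common style in R
a <- 10
b = 10

# identical() checks that the two variables hold exactly the same value
same_value <- identical(a, b)
print(same_value)

# Overwriting: the old contents are replaced by the new ones
message_text <- "Hello World"
message_text <- "Hello user"
print(message_text)
```

After running this, the environment pane should list a, b, same_value and message_text, with message_text holding only the most recent string.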
Calculations
R is statistical computing software and, at its core, an oversized calculator.
R performs addition, subtraction, multiplication, division, exponentiation and modulo with the operators + - * / ^ and %%.
1+4
10-1
12*3
1/6
2^6
3%%9
Returns the results:
[1] 5
[1] 9
[1] 36
[1] 0.1666667
[1] 64
[1] 3
It is common practice to store the results of any calculation in a variable so that you can use the result later:
x <- 2+4
print(x)
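To illustrate why storing results is useful, here is a short sketch in which each stored result feeds into a later calculation. The variable names total, share, power and remainder are hypothetical.

```r
# Store intermediate results so they can be reused later
total <- 12 * 3        # multiplication
share <- total / 6     # division, reusing the stored result
power <- 2 ^ 6         # exponentiation
remainder <- 9 %% 4    # modulo: the remainder after dividing 9 by 4

print(share)       # 6
print(power)       # 64
print(remainder)   # 1
```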
Data Types
Data types help R interpret our code inputs.
- Anything surrounded by double (or single) quotes is interpreted as a character string.
- Integers and floats are interpreted as numerics, on which we can perform mathematical operations.
- TRUE / FALSE values are known as Booleans (R calls them logicals).
The box below shows examples of each data type being assigned to a variable. Note that everything after the # on a line is treated as a comment and is not run as code.
my_age <- 28 # Numeric variable
my_name <- "Nicholas" # Character variable
is_datascientist <- TRUE # Logical (Boolean) variable
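To check how R has interpreted a value, the base function class() reports its data type. The sketch below repeats the three variables and adds a hypothetical age_text variable to show that a number stored as text must be converted before it can be used in a calculation.

```r
my_age <- 28
my_name <- "Nicholas"
is_datascientist <- TRUE

class(my_age)            # "numeric"
class(my_name)           # "character"
class(is_datascientist)  # "logical"

# A number wrapped in quotes is text until it is converted
age_text <- "28"
age_number <- as.numeric(age_text)
age_number + 1           # 29
```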
Vectors
A vector is a collection of values of the same data type. Be careful not to mix data types in a vector: R will silently convert every element to a single common type.
To initialise a vector, we use the c() function, which stands for concatenate.
Below we will create two vectors:
racing_number <- c(33,44,11,4,3)
driver_names <- c("Verstappen", "Hamilton", "Perez", "Norris", "Ricciardo")
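The warning about mixing data types can be seen in a short sketch. The vectors mixed and numbers are made up for illustration; the point is that a single piece of text turns every element of the vector into a character string.

```r
# Mixing numbers and text forces every element to become a character string
mixed <- c(33, "Hamilton", 11)
print(mixed)
class(mixed)       # "character"

# A vector of a single type keeps that type and supports arithmetic
numbers <- c(33, 44, 11)
class(numbers)     # "numeric"
numbers * 2        # 66 88 22
length(numbers)    # 3
```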
Named Vectors
We can use the driver_names vector to assign names to the racing_number vector using the names() function. Then copy the named vector to a new variable called drivers to avoid confusion!
names(racing_number) <- driver_names
drivers <- racing_number
print(drivers)
Returns the five racing numbers, each labelled with its driver’s name.
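Once a vector has names, its values can be looked up by name as well as by position. A minimal sketch, using a shortened three-driver vector called numbers (a hypothetical name) so that it runs on its own:

```r
numbers <- c(33, 44, 11)
names(numbers) <- c("Verstappen", "Hamilton", "Perez")

names(numbers)                # the three driver names
numbers["Hamilton"]           # look up a value by its name
unname(numbers["Hamilton"])   # 44, with the name stripped
```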
Manipulating Vectors
Let’s update our previous vectors to include two new drivers and their racing numbers:
# Add two new numbers to the pre-existing vector
racing_number <- c(racing_number, 16, 24)
# Add two new names to the existing names vector
driver_names <- c(driver_names, "Leclerc", "Zhou")
# Update the names of the racing_number vector
names(racing_number) <- driver_names
# Assign the racing_number vector to drivers
drivers <- racing_number
# Inspect the output
print(drivers)
Returns the seven racing numbers, each labelled with its driver’s name.
Indexing
Before demonstrating how to delete items from a vector, we need to cover vector indexing. Indexing allows us to access specific items in a vector.
In the example below, we will access the first, last and 2nd to 4th drivers in our vector:
drivers[1]
drivers[7]
drivers[2:4]
Returns Verstappen (33), Zhou (24) and then Hamilton, Perez and Norris (44, 11 and 4), each value labelled with its name.
Note:
Instead of drivers[7] we could use drivers[length(drivers)] to access the last element in the vector. This saves you from having to count the items manually and is robust to future changes to the vector.
To delete an item from the vector, we place a minus sign in front of the index we want to drop.
Drop Lewis Hamilton from our drivers vector. Don’t forget to assign the result back to the drivers variable if you want to save the changes.
drivers[-2]
Returns the drivers vector with Hamilton removed.
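The indexing ideas above can be combined in one short sketch. It uses a four-driver version of the vector so that it is self-contained; last_driver and without_first_two are hypothetical names.

```r
drivers <- c(Verstappen = 33, Hamilton = 44, Perez = 11, Norris = 4)

# Last element without counting by hand
last_driver <- drivers[length(drivers)]

# Negative indices drop elements; assign the result to keep the change
drivers <- drivers[-2]

# Several elements can be dropped at once with a vector of negative indices
without_first_two <- drivers[-c(1, 2)]

print(last_driver)
print(drivers)
print(without_first_two)
```

Note that indexing in R starts at 1, so drivers[1] is the first element rather than the second.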
Lists
Lists can be used to store multiple vectors in a single data structure. Each vector in the list can be given a name, which adds a further layer of labelling to the structure.
We will create a named list attributing four driver pairings to their respective teams:
F1_teams <- list(
  Scuderia_Ferrari = c("Charles Leclerc", "Carlos Sainz"),
  Scuderia_Alpha_Tauri_Honda = c("Pierre Gasly", "Yuki Tsunoda"),
  Alfa_Romeo_Racing_ORLEN = c("Valtteri Bottas", "Guanyu Zhou"),
  HASS_F1_Team = c("Mick Schumacher", "Kevin Magnussen")
)
Constructing a list is simple: pass several named vectors (e.g. Scuderia_Ferrari=c("Charles Leclerc", "Carlos Sainz")), separated by commas, to the list() function.
The benefit of lists like these is that you can easily access items in the list using human-readable names instead of numerical indexes (which still work!).
Below are a few examples of how to access the Ferrari drivers, which are the first vector in our list, followed by the second Haas driver:
F1_teams$Scuderia_Ferrari
F1_teams[1]
F1_teams["Scuderia_Ferrari"]
F1_teams$HASS_F1_Team[2]
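The access methods are not quite equivalent: single square brackets return a smaller list, whereas $ and double square brackets return the vector itself. The sketch below shows the difference on a two-team list called teams (a hypothetical, shortened version of the list above).

```r
teams <- list(
  Scuderia_Ferrari = c("Charles Leclerc", "Carlos Sainz"),
  Scuderia_Alpha_Tauri_Honda = c("Pierre Gasly", "Yuki Tsunoda")
)

# Single brackets keep the list wrapper: the result is a list of length 1
class(teams[1])                    # "list"

# Double brackets and $ extract the vector itself
class(teams[[1]])                  # "character"
class(teams$Scuderia_Ferrari)      # "character"

# Once extracted, the vector can be indexed as usual
teams[["Scuderia_Ferrari"]][2]     # "Carlos Sainz"

# A misspelled name returns NULL rather than raising an error
is.null(teams$Scuderia_Farrari)    # TRUE
```

Because a misspelled name fails silently, it is worth checking names(teams) if an access unexpectedly returns NULL.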
Dataframes
Dataframes are usually a better fit than lists for storing multiple vectors of the same length as a table. Typically, each row in a dataframe corresponds to an observation (person, event, sample), whilst each column corresponds to a variable being recorded (e.g. height, age, eye colour).
Go to RStudio Cloud and open your session. Load in the Iris dataset:
iris <- datasets::iris
You can see a newly created Data object in your environment called iris with "150 obs. of 5 variables". That is to say, we have 150 rows and 5 columns.
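The same information shown in the environment pane can be retrieved in code. A brief sketch using base R functions:

```r
iris <- datasets::iris

dim(iris)     # 150 5: rows first, then columns
nrow(iris)    # 150
ncol(iris)    # 5
head(iris)    # first six rows
str(iris)     # the data type of each column
```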
Colnames & Rownames
A simple rule applies to colnames and rownames: they should be unique. R uses colnames and rownames to index each column and row respectively, so duplicate row names are not allowed and duplicate column names make it ambiguous which column a name refers to.
Inspect the column names and row names of a dataframe:
colnames(iris)
rownames(iris)
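Returns the five column names (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and Species), followed by the row names "1" to "150".
Note that the rownames in this dataset are not important; they are just automatically incremented (unique) integers.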
Dataframe Indexes
There are situations where we will need to isolate columns or rows for an analysis. The same numerical indexing logic from vectors applies, but there are two entries to the square brackets – one for rows, and one for columns.
As with lists, we can use human-readable names to access a specific column: iris$Sepal.Width.
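A short sketch of the two-entry square bracket notation, where the entry before the comma selects rows and the entry after it selects columns. Leaving an entry empty selects everything along that dimension.

```r
iris <- datasets::iris

iris[1, 2]              # row 1, column 2: a single cell (3.5)
iris[1:3, ]             # rows 1 to 3, all columns
iris[, 5]               # all rows, column 5 (Species)
iris[2, "Sepal.Width"]  # columns can also be selected by name (3)
head(iris$Sepal.Width)  # first six values of one column
```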
Subsetting Dataframes
Now that we know how to isolate specific parts of a dataframe, the next step is to save the result as a new dataframe by ‘slicing the dataframe’. Slicing and subsetting are interchangeable terms.
In our Iris dataset, make a new dataframe that contains only numerical measurements for Petals (i.e. the 3rd and 4th columns):
petal_data <- iris[,3:4]
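Make another dataframe that drops the only non-numerical column (the 5th column, Species):
numerical_data <- iris[,-5]
Columns can also be selected or dropped by name with the subset() function. The above operations are performed using subset() below:
petal_data <- subset(iris, select = c(Petal.Length, Petal.Width))
numerical_data <- subset(iris, select = -c(Species))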
Note:
It is rare that you would select/drop observations from a dataset in this manner (do not cherry-pick your data). This is why the examples are performed on columns.
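After slicing, it is worth checking that the new dataframe has the shape you expect. The sketch below does this with dim() and colnames(), and also selects the petal columns by name; petal_by_name and same_columns are hypothetical names.

```r
iris <- datasets::iris

petal_data <- iris[, 3:4]
numerical_data <- iris[, -5]

dim(petal_data)          # 150 rows, 2 columns
colnames(petal_data)     # "Petal.Length" "Petal.Width"
dim(numerical_data)      # 150 rows, 4 columns

# Selecting columns by name keeps working even if the column order changes
petal_by_name <- iris[, c("Petal.Length", "Petal.Width")]
same_columns <- all.equal(petal_data, petal_by_name)
print(same_columns)
```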
Filtering Dataframes
Filtering dataframes is an extension of dataframe subsetting, performed using logical operators:
- < : less than
- <= : less than or equal to
- > : greater than
- >= : greater than or equal to
- == : exactly equal to
- != : not equal to
- !x : Not x
- x | y : x OR y
- x & y : x AND y
Using the Iris dataset as an example, subset the original dataframe to isolate data that belongs to the species setosa. Note that the species names are stored in lowercase and the comparison is case-sensitive:
setosa_data <- subset(iris, Species == "setosa")
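The operators listed above can be combined inside a single filter. The following sketch shows a few variations on the iris data; the dataframe names long_wide, not_setosa and setosa_brackets are made up for illustration.

```r
iris <- datasets::iris

# Rows where the species is setosa (the labels in the data are lowercase)
setosa_data <- subset(iris, Species == "setosa")
nrow(setosa_data)        # 50

# Combine conditions with & (AND)
long_wide <- subset(iris, Sepal.Length > 7 & Sepal.Width >= 3)

# != keeps everything except the named species
not_setosa <- subset(iris, Species != "setosa")
nrow(not_setosa)         # 100

# The same filter written with square brackets and a logical vector
setosa_brackets <- iris[iris$Species == "setosa", ]
```

The square bracket version puts the condition in the rows position and leaves the columns position empty, so all columns are kept.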
Updating Dataframes
To create a new variable in our dataframe, we can use the $ operator.
In the example below, we will add a column called sepal_less_petal_len to the original dataframe.
This column will contain Sepal.Length - Petal.Length:
iris$sepal_less_petal_len <- iris$Sepal.Length - iris$Petal.Length
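A quick way to confirm that the column was added is to count the columns before and after. The sketch below also shows that assigning NULL to a column removes it again; n_with_new is a hypothetical name.

```r
iris <- datasets::iris

# Add a derived column with $
iris$sepal_less_petal_len <- iris$Sepal.Length - iris$Petal.Length

n_with_new <- ncol(iris)           # 6: the original 5 columns plus the new one
head(iris$sepal_less_petal_len, 3) # first three values of the new column

# Assigning NULL removes the column again
iris$sepal_less_petal_len <- NULL
ncol(iris)                         # 5
```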
Exercise!
Make your own Data frame
We can also make our very own data frames and lists by combining vectors.
- Open up RStudio Cloud and create 5 vectors: monday, tuesday, wednesday, thursday and friday. Make the content of each vector a shopping list containing 5 items for a restaurant on each day of the week.
- Monday might look like this:
monday <- c("pasta", "bacon", "mushrooms", "milk", "cheese")
- Do this for all 5 days.
- Combine all 5 vectors into a data frame using the data.frame() function:
dataframe <- data.frame(monday, tuesday, wednesday, thursday, friday)
- Use some of the subsetting techniques described above, such as indexing and filtering, to explore how data frames are, in essence, combinations of vectors that can be subset.
If Else
We can use ifelse() to create new variables based on conditional statements.
In the example below we will use a vector; however, this applies to dataframe columns too:
Note:
The ifelse() function works like a ternary operator applied to every element of a vector. It reads as “A ? B : C”: if A is true choose B, else choose C.
vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
test <- ifelse(vector > 5, "greater than 5", "less than 5")
print(test)
Returns "less than 5" for the first five elements and "greater than 5" for the last five.
But what about the 5th element? 5 is not less than 5.
To add a second layer of conditionals we will re-use the ifelse() function:
test <- ifelse(vector == 5, "five", test)
print(test)
Returns the same labels as before, except that the 5th element is now "five".
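The two steps can also be written as a single nested call, where the second ifelse() takes the place of the "else" value. This is one possible sketch rather than the only way to do it; values and labels are hypothetical names.

```r
values <- 1:10

# The inner ifelse() supplies the label wherever the outer test is FALSE
labels <- ifelse(values > 5, "greater than 5",
                 ifelse(values == 5, "five", "less than 5"))
print(labels)

# table() counts how many values fall into each label
table(labels)
```

Nesting more than two or three levels quickly becomes hard to read, so for many categories it may be clearer to build the labels in several steps as shown earlier.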