5 Machine Learning: Clustering


Machine learning is a technique that enables computers to learn from data without being explicitly programmed. It is becoming a big part of our everyday lives, and you might not even realize it!

Here are some examples of machine learning seen every day:

  • Facebook tailors your news feed based on the data from posts you’ve liked in the past.
  • Netflix recommends videos based on data from your watch history.
  • Apple iPhones recognize your friends’ faces based on the data from your camera roll.

Note that in all of these cases, the decision made by the machine learning algorithm is based on what has been learned from the input data.

  • Facebook learns the kind of content you like to see.
  • Netflix learns what kind of movies you’re into.
  • iPhones learn what your friends’ faces look like.


Clustering is a machine learning technique used to group samples together based on the similarity of their variables. This is useful to identify underlying patterns in our dataset and is an example of unsupervised machine learning.

Leaving Certificate Syllabus

This chapter is complementary to:

  • Leaving Certificate Computer Science, strand 1: Computers and Society.
  • Leaving Certificate Mathematics, strand 4: Algebra.

Toy Dataset

First, we will use a toy dataset so that you can identify patterns by eye before applying the same principles to a larger dataset where the patterns are not as obvious.

This dataset contains the expression levels of 3 genes for 4 patients:

          IRX4 OCT4 PAX6
patient1    11   10    1
patient2    13   13    3
patient3     2    4   10
patient4     1    3    9

Please copy and paste the code below to analyse the data in RStudio Cloud:

df <- data.frame(IRX4 = c(11, 13, 2, 1),
                 OCT4 = c(10, 13, 4, 3),
                 PAX6 = c(1, 3, 10, 9),
                 row.names = c("patient1", "patient2", "patient3", "patient4"))

Visualise Toy Dataset

We will begin by making a barplot of the gene expression data. You will need to use the 'ggpubr' package.

First we need to reshape the data into long format (do not worry about the details of this step). You should recognise the facet.by parameter in ggpubr, which produces a separate plot for each patient:

# Load library for barplots
library(ggpubr)

df2 <- tidyr::gather(cbind(patient=rownames(df),df), key= "gene", value= "expression", IRX4, PAX6, OCT4)

ggbarplot(df2, x="gene", y="expression", fill="patient", facet.by = "patient")

Hopefully, it is obvious from the barplot that both Patient 1 and Patient 2 share similar gene expression profiles – that is to say they have similar features – whilst Patient 3 and Patient 4 exhibit inverse gene expression patterns when compared to Patient 1 and Patient 2.

We can see this by eye, but if there were hundreds of patients with expression data for hundreds of genes, it would not be so easy. That’s where machine learning comes in: it can recognize patterns that we cannot see.


Recall we used scatterplots to assess relationships between two variables. Scatter plots that have been coloured by groupings can also tell us which variables are good at delineating samples.

Manually plotting each combination of variables in a dataframe is tedious, so we can use a scatter plot matrix to do this automatically:

# Telling R that each patient ID is a unique name
id <- as.factor(rownames(df))

# Base R plotting
pairs(df, col=id, lower.panel = NULL, cex = 2, pch = 20)
par(xpd = TRUE)
legend(x = 0.05, y = 0.4, cex = 1, legend = id, fill = id)


A guide on how to interpret the scatter plot matrix is given below. Note that one is primarily concerned with patterns emerging in the plots, not necessarily the expression values associated with each point.

Distance Metrics

The human eye is extremely efficient at recognising patterns in plots, which we have demonstrated using barplots and scatter plots. But how does a computer ‘see’ these patterns? It does not have eyes so it must use a different method to define how similar (or dissimilar) samples are.

We will use the Manhattan Distance to compute sample similarity mathematically:
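For reference, the Manhattan distance between two samples is the sum of the absolute differences between their values, taken one variable at a time. A common way to write this is shown below, where x and y are the two samples and n is the number of variables (here, genes); the symbols i and n are introduced for illustration only.

```latex
d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert
```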

Using patient 1 and patient 2 from our toy dataset, let’s work the solution by hand:

Patient 1 vector (x): 11, 10, 1
Patient 2 vector (y): 13, 13, 3

Formula: |x1 - y1| + |x2 - y2| + |x3 - y3|

Fill in: |11 - 13| + |10 - 13| + |1 - 3|

Solve: |-2| + |-3| + |-2|

Solve: 2 + 3 + 2

Answer: Manhattan Distance(Patient 1, Patient 2) = 7
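The hand calculation above can be checked in R with a couple of lines. This is a minimal sketch: the vector names x and y follow the worked example, while abs_diff and manhattan are names chosen here for illustration.

```r
# Expression values for patient 1 and patient 2 (from the toy dataset)
x <- c(IRX4 = 11, OCT4 = 10, PAX6 = 1)
y <- c(IRX4 = 13, OCT4 = 13, PAX6 = 3)

# Absolute difference for each gene
abs_diff <- abs(x - y)
print(abs_diff)

# Add the absolute differences together to get the Manhattan distance
manhattan <- sum(abs_diff)
print(manhattan)  # 7
```

Swapping in the values for any other pair of patients gives the distance between that pair.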

dist() function

Solving the distance metrics by hand is a useful exercise to understand how distance metrics are generated, but would take much too long in the case of large datasets.

To automate this, use the dist() function in R. Pass the dataframe (which must only contain numerics) and select the distance metric you want to use (in this case we are using ‘manhattan’):

dist(df, method="manhattan", diag=TRUE, upper=TRUE)
##          patient1 patient2 patient3 patient4
## patient1        0        7       24       25
## patient2        7        0       27       28
## patient3       24       27        0        3
## patient4       25       28        3        0

Notice the zeros on the diagonal. This is because there is zero distance between patient1 and itself, patient2 and itself, and so on.
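If you want to look up a single distance or confirm the properties of the table yourself, one option is to convert the result of dist() to an ordinary matrix. The sketch below re-creates the toy dataset so that it runs on its own; the object name d_mat is chosen here for illustration.

```r
# Toy dataset from this chapter (re-created so the example is self-contained)
df <- data.frame(IRX4 = c(11, 13, 2, 1),
                 OCT4 = c(10, 13, 4, 3),
                 PAX6 = c(1, 3, 10, 9),
                 row.names = c("patient1", "patient2", "patient3", "patient4"))

# Convert the dist object to a matrix so it can be indexed by name
d_mat <- as.matrix(dist(df, method = "manhattan"))

# Look up one pair of patients (matches the hand calculation)
print(d_mat["patient1", "patient2"])  # 7

# Every sample is at distance zero from itself
print(all(diag(d_mat) == 0))  # TRUE

# The distance from A to B equals the distance from B to A
print(isSymmetric(d_mat))  # TRUE
```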

Sample Heatmaps

The table of results generated by the dist() function is tedious to interpret. Instead, we can use data visualisations to quickly convey this information.

# Load library for heatmaps
library(pheatmap)

# use Manhattan distance (store in matrix)
d <- as.matrix(dist(df, method="manhattan"))

# add patient ID to rows & columns for the heatmap
rownames(d) <- rownames(df)
colnames(d) <- rownames(df)

pheatmap(d, cluster_rows = FALSE, cluster_cols = FALSE,
         show_rownames = TRUE, show_colnames = TRUE, display_numbers = TRUE)



We can see the results for each sample comparison and once again, visually, we can see clusters (groups) forming. High values (red) represent dissimilarity. Here, we can see that patients 1 and 2 are similar to each other but different from patients 3 and 4, and vice versa.


The next step is to perform clustering via computational methods.

Hierarchical Clustering

The hierarchical clustering algorithm works by:

  1. Calculating the distance between all samples.
  2. Joining the two ‘closest’ samples (those with the smallest distance) together to form the first cluster.
  3. Re-calculating the distances between all samples (and the new cluster) and repeating the process until every sample has been added to a cluster.
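The steps above can be sketched with base R's hclust() function. This is an illustration under stated assumptions rather than the exact procedure used elsewhere in this chapter: it uses the Manhattan distance from the previous section and "complete" linkage (the hclust() default) to measure the distance between clusters. The object names d and hc are chosen here for illustration.

```r
# Toy dataset from this chapter (re-created so the example is self-contained)
df <- data.frame(IRX4 = c(11, 13, 2, 1),
                 OCT4 = c(10, 13, 4, 3),
                 PAX6 = c(1, 3, 10, 9),
                 row.names = c("patient1", "patient2", "patient3", "patient4"))

# Step 1: calculate the distance between all samples
d <- dist(df, method = "manhattan")

# Steps 2 and 3: hclust() repeatedly joins the closest samples or clusters.
# Assumption: "complete" linkage (the hclust default) defines the distance
# between clusters.
hc <- hclust(d, method = "complete")

# Each row of 'merge' is one joining step: negative numbers are single
# samples (by row position in df), positive numbers are clusters that
# were formed in an earlier step.
print(hc$merge)

# The distance at which each join happened
print(hc$height)  # 3 7 28
```

Here the first join happens at distance 3 (patients 3 and 4), the second at distance 7 (patients 1 and 2), and the final join brings the two pairs together.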

Watch the animation below to see how hierarchical clustering works:

How can we portray this GIF in a static plot? By using a dendrogram, which looks like a tree with branches. Each branch represents a cluster, until you work all the way down to the bottom, where each branch represents a single sample:


We will add a dendrogram to our toy dataset heatmap to define clusters:

# Plot the distance matrix using a heatmap
pheatmap(d, cluster_rows = TRUE, cluster_cols = TRUE,
         show_rownames = TRUE, show_colnames = TRUE,
         treeheight_row = 100, treeheight_col = 100)
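A dendrogram shows the clusters visually, but it can also be useful to get the cluster membership of each sample as plain values. The sketch below uses base R's hclust() and cutree() for this. It is not necessarily identical to the clustering pheatmap performs internally; it assumes Manhattan distance, complete linkage, and a choice of two clusters (k = 2). The object names hc and clusters are chosen here for illustration.

```r
# Toy dataset from this chapter (re-created so the example is self-contained)
df <- data.frame(IRX4 = c(11, 13, 2, 1),
                 OCT4 = c(10, 13, 4, 3),
                 PAX6 = c(1, 3, 10, 9),
                 row.names = c("patient1", "patient2", "patient3", "patient4"))

# Hierarchical clustering on the Manhattan distances
# (assumption: default "complete" linkage)
hc <- hclust(dist(df, method = "manhattan"))

# Draw the dendrogram on its own
plot(hc, main = "Toy dataset dendrogram", xlab = "", sub = "")

# Cut the tree into two groups and report which group each patient is in
clusters <- cutree(hc, k = 2)
print(clusters)
```

With this toy dataset, patient1 and patient2 are assigned to cluster 1 and patient3 and patient4 to cluster 2, matching the groups seen by eye.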



Feature Heatmaps

Sample heatmaps are used to assess sample heterogeneity. Feature heatmaps are representations of the values present in each of the variables in the dataset for each sample.

Using feature heatmaps in conjunction with clustering, we can identify the underlying variables (genes) that differentiate samples.

You do not need to compute any distance metric here; simply pass the numeric dataframe to the pheatmap function:

# rotate the dataframe to make plot easier to interpret.
df_t <- as.data.frame(t(df))

pheatmap(df_t, cluster_cols = TRUE,
         cluster_rows = TRUE,
         treeheight_row = 100,
         treeheight_col = 100,
         display_numbers = TRUE)
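To put numbers on what the feature heatmap shows, one option is to average the expression of each gene within each cluster. This is a minimal sketch, assuming two clusters obtained with base R's hclust() and cutree() on Manhattan distances; the object names clusters and cluster_means are chosen here for illustration.

```r
# Toy dataset from this chapter (re-created so the example is self-contained)
df <- data.frame(IRX4 = c(11, 13, 2, 1),
                 OCT4 = c(10, 13, 4, 3),
                 PAX6 = c(1, 3, 10, 9),
                 row.names = c("patient1", "patient2", "patient3", "patient4"))

# Assign each patient to one of two clusters
# (assumption: Manhattan distance, default "complete" linkage, k = 2)
clusters <- cutree(hclust(dist(df, method = "manhattan")), k = 2)

# Mean expression of each gene within each cluster
cluster_means <- aggregate(df, by = list(cluster = clusters), FUN = mean)
print(cluster_means)
```

Genes whose means differ strongly between the clusters (here IRX4 and OCT4 are high in cluster 1, while PAX6 is high in cluster 2) are the ones that differentiate the samples.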


Introduction to Genomics Data Science Copyright © by Barry Digby; Clodagh Murray; and Pilib Ó Broin. All Rights Reserved.
