#load the packages needed
library(data.table)
library(purrr)
library(NbClust)

#read and assign the data. Please adjust the directory to where the data lives.
dt_cars <- fread(file = "cars.csv", sep = ";")

#have a quick look into the dataset.
head(dt_cars)
#create a "Drive" variable, to indicate the type of the drive. The command first creates the Drive variable and assign FWD to all the observations, since the data doesn't have a variable for FWD, unlike AWD and RWD. Then change the Drive variable for cars that have AWD or RWD, then the command changes the Drive variable to factor variable.dt_cars[, "Drive":="FWD"][AWD==1, "Drive":="AWD"][RWD==1, "Drive":="RWD"][, "Drive":=as.factor(Drive)]#omit the missing values from the dataset. Essential later to get k-means as the function cannot handle the missing values/require data without missing values.dt_cars<-na.omit(dt_cars)
\(\bullet\)k-Means:
#get the k-means: the mean for each cluster based on the three variables inserted, with two clusters in total.
k_means <- kmeans(dt_cars[, c("HP", "EngineSize", "Length")], 2)
k_means
kmeans divides the dataset, based on horsepower, engine size, and length, into two clusters: the first with \(122\) observations and the second with \(265\) observations.
Cluster means shows the mean of each variable per cluster.
Clustering vector shows, for each observation, the number of the cluster it belongs to, i.e. whether the observation is in the first or the second cluster.
Within cluster sum of squares by cluster shows the sum of squares per cluster.
The fitted object contains the following components: cluster: a vector of integers (from 1:k) indicating the cluster to which each point is allocated, which was shown in the clustering vector; centers: a matrix of cluster centres; totss: the total sum of squares; withinss: a vector of within-cluster sums of squares, one component per cluster, which was shown above; tot.withinss: the total within-cluster sum of squares, i.e. sum(withinss); betweenss: the between-cluster sum of squares, i.e. totss - tot.withinss; size: the number of points in each cluster; iter: the number of (outer) iterations; ifault: an integer indicating a possible algorithm problem.
We can obtain the total within sum of squares (total WSS) by retrieving the tot.withinss component:
k_means$tot.withinss
[1] 798521.9
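The other components listed above can be accessed from the fitted object in the same way; a brief sketch (output values depend on the run):

#access further components of the kmeans object.
k_means$centers   #matrix of cluster centres
k_means$size      #number of points in each cluster
k_means$betweenss #between-cluster sum of squares, i.e. totss - tot.withinss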
\(\bullet\)Rerun kmeans for k=\(1\) to \(15\), with elbow plot:
#I use "lapply" to run the function 15 times for k=1 to 15.k15_all<-lapply(1:15, function(k){kmeans(dt_cars[,c("HP","EngineSize","Length")],k)})#Another way is to create a loop to run the code 15 times for k=1 to 15.#k15_all<-for(k in 1:15){ #specify the range for k.#k15<-print(kmeans(dt_cars[,c("HP","EngineSize","Length")],k))#}#alternatively, we can use "map_dbl" from "purrr" package.k15_alt<-map_dbl(1:15,function(k){ #create a function for k, to cover the range 1:15. meansk<-kmeans(dt_cars[,c("HP","EngineSize","Length")],k) #the same function as before, using kmeans. meansk$tot.withinss #extracting the total within sum of squares.})#create a new dataframe to store the output of the previous function, so that it can be plotted later.new_df<-data.frame(k=1:15, tot_withinss=k15_alt) #the "tot_withinss" will retrieve the total within sum of squares, which was extracted in the previous function using: "meansk$tot.withinss" in the previous function.#now we plot the new dataframe. "type" here will display both: line and points. "xlim" is for the limits of the x-axis.plot(x=new_df$k,y=new_df$tot_withinss,type="b",xlim=c(1,15),xlab="k",ylab="Total within sum of squares")axis(side=1,at=1:15) #I use "axis" function to display the whole range from 1 to 15, and with a break of one.
#Another way is to use "fviz_nbclust" from the "factoextra" package, which directly plots the number of clusters against the total within sum of squares, so the optimal number of clusters can be read off the "elbow" in the plot.
library(factoextra)

#first we specify the data, then that kmeans is used for clustering, then the maximum number of clusters (k.max), and finally the method; "wss" stands for total within sum of squares.
fviz_nbclust(dt_cars[, c("HP", "EngineSize", "Length")], kmeans, k.max = 15, method = "wss")
\(\bullet\)Change the random starting locations:
#I create a loop that directly produces three plots with three different starting locations (nstart): 1, which is the default, 10, and 100.
for(i in c(1, 10, 100)){
  print(fviz_nbclust(dt_cars[, c("HP", "EngineSize", "Length")], kmeans, k.max = 15, method = "wss", nstart = i))
}
The three plots look almost exactly the same.
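To back this up numerically (not part of the original output), one could compare the total WSS across the different nstart values for a fixed number of clusters; a minimal sketch, assuming k = 4:

#compare the total within sum of squares for k = 4 under different numbers of random starts (exact values depend on the random seed).
set.seed(1) #fix the seed so the comparison is reproducible.
sapply(c(1, 10, 100), function(n){
  kmeans(dt_cars[, c("HP", "EngineSize", "Length")], centers = 4, nstart = n)$tot.withinss
})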
\(\bullet\)Determine the optimal k for k-means clustering:
#run NbClust to determine the best number of clusters.
NbClust(dt_cars[, c("HP", "EngineSize", "Length")], method = "kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 4 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 6 proposed 4 as the best number of clusters
* 3 proposed 7 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 2 proposed 13 as the best number of clusters
* 2 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 4
*******************************************************************
The output shows a lot of information, including the indices used to determine the optimal number of clusters; in particular, it describes the Hubert index and the D index. Based on these and all other indices, the function concludes that the best number of clusters is \(4\), according to the majority rule, which is consistent with the elbow in the plot at \(k=4\).
The function provides \(30\) indices to determine the number of clusters, and it does so “by varying all combinations of number of clusters, distance measures, and clustering methods”. The documentation provides further information about the indices and the reference(s) for each of them.
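If the call is assigned to an object, the per-index recommendations and the resulting partition can be inspected directly; a minimal sketch (component names as documented in the NbClust package, object name illustrative):

#store the NbClust result and inspect its components.
nb_res <- NbClust(dt_cars[, c("HP", "EngineSize", "Length")], method = "kmeans")
nb_res$Best.nc        #best number of clusters proposed by each index
nb_res$Best.partition #partition corresponding to the best number of clusters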
\(\bullet\)Standardization of variables:
#standardize the variables.
dt_cars[, "HP_st" := (HP - mean(HP)) / sd(HP)][
  , "EngineSize_st" := (EngineSize - mean(EngineSize)) / sd(EngineSize)][
  , "Length_st" := (Length - mean(Length)) / sd(Length)]

#rerun NbClust to determine the optimal number of clusters, with the standardized variables.
NbClust(dt_cars[, c("HP_st", "EngineSize_st", "Length_st")], method = "kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 9 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 2 proposed 4 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 4 proposed 12 as the best number of clusters
* 1 proposed 15 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
The optimal number of clusters is \(2\) instead of \(4\). The difference after standardization is due to the change in the variables’ ranges. Since the aim is to minimise the squared Euclidean distance between the data points and their nearest prototypes, the range of each variable affects this distance (e.g. through large differences in magnitude). The larger the range of a variable, the stronger its influence, because it contributes more to the distance between the data points and their nearest prototypes.
Another important point is that the variables have different units: they do not measure the same thing and are not expressed on the same scale (e.g. as percentages), which is the main reason for the different ranges among the variables. Standardization controls for this, makes the variables more comparable, and hence leads to a lower number of clusters.
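As an aside (not part of the original code), the same standardization can be obtained with base R's scale(), which centres each column at its mean and divides by its standard deviation; a minimal sketch:

#equivalent standardization using scale().
st_vars <- scale(dt_cars[, c("HP", "EngineSize", "Length")])
head(st_vars) #should match the HP_st, EngineSize_st and Length_st columns computed above.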
#plot the number of clusters against the total within sum of squares.
fviz_nbclust(dt_cars[, c("HP_st", "EngineSize_st", "Length_st")], kmeans, k.max = 15, method = "wss")
The elbow can be seen at \(k=2\) here as well.
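Given this, one could refit k-means with \(k=2\) on the standardized variables and inspect the resulting clusters; a minimal sketch (object name illustrative; cluster labels and sizes depend on the run):

#refit k-means with k = 2 on the standardized variables.
set.seed(1)
k_means_st <- kmeans(dt_cars[, c("HP_st", "EngineSize_st", "Length_st")], centers = 2, nstart = 10)
k_means_st$size    #number of cars in each cluster
k_means_st$centers #cluster centres on the standardized scale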
\(\bullet\)The Expectation-Maximization (EM) algorithm to implement k-means clustering:
#first I subset the two variables into a matrix, so the calculations are easier.
newdata <- data.matrix(dt_cars[, c("HP_st", "EngineSize_st")])

#then I define some parameters:
A <- sample(1:2, length(newdata)/2, replace = TRUE) #a random initial assignment of each observation to one of the two clusters, with replacement, since the assignment to 1 or 2 is not unique.
C <- newdata[1:2, ] #subset the first two rows, to be used later as the cluster centers. Any two points can be used as cluster centers.
colors <- seq(2, 3) #a color parameter with two values, one per cluster, used later to color the plots by cluster.

#create a loop to plot HP against engine size for the two clusters: the initialization stage, as per slides p. 12.
plot(newdata, xlab = "HP (standardized)", ylab = "Engine Size (standardized)", main = "Initialize")
for(i in 1:2) { #1:2 is for the two clusters
  points(C[i, , drop = FALSE], col = colors[i])
  points(newdata[A == i, , drop = FALSE], col = colors[i])
}
#create a function, similar to the previous loop, to plot each iteration of the algorithm.
k_plot <- function(newdata, C, txt) {
  plot(newdata, main = txt, xlab = "", ylab = "")
  for(i in 1:2) {
    points(C[i, , drop = FALSE], pch = 23, lwd = 3, col = colors[i]) #the arguments of "points" are explained in its documentation.
    points(newdata[A == i, , drop = FALSE], col = colors[i])
  }
}

#repeat the iterations until no data point changes its cluster membership. The loop stops with the "break" command.
repeat {
  #calculate the Euclidean distance between objects and cluster centers.
  D <- matrix(data = NA, nrow = 2, ncol = length(newdata)/2) #an empty matrix (NA values), with two rows and as many columns as observations, to be filled with the distance calculations.
  for(i in 1:2) { #nested loop: calculate the distance of each data point to each cluster center.
    for(j in 1:(length(newdata)/2)) {
      D[i, j] <- sqrt((newdata[j, 1] - C[i, 1])^2 + (newdata[j, 2] - C[i, 2])^2)
    }
  }
  O <- A #store the current cluster assignment, to check later whether any data point changed.

  #E-step: parameters are fixed.
  A <- max.col(t(-D)) #assign objects to the nearest cluster center: "t" transposes the matrix after changing the sign, then the column with the maximum is taken per row.
  if(all(O == A)) break #the algorithm stops when no assignment changes.
  k_plot(newdata, C, "E-step") #plot the E-step using the function created above.

  #M-step: assignments are fixed.
  #determine new cluster centers based on the mean of the assigned objects.
  for(i in 1:2) {
    C[i, ] <- apply(newdata[A == i, , drop = FALSE], 2, mean) #"apply" takes the column means over the observations assigned to cluster i; its other arguments are explained in the documentation.
  }
  k_plot(newdata, C, "M-step") #plot the M-step.
}
After \(6\) successive expectation (E) and maximisation (M) steps, the algorithm converges to an optimum in which no data point changes its cluster membership any more (i.e. the clusters are stable).
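As a cross-check (not part of the original code), the partition found by this EM implementation could be compared with the one returned by kmeans() on the same two standardized variables; a minimal sketch, using the objects defined above:

#compare the EM-based assignment "A" with the partition from kmeans() (labels 1/2 may be swapped between the two solutions).
k2 <- kmeans(newdata, centers = 2, nstart = 10)
table(EM = A, kmeans = k2$cluster) #most observations should lie on one diagonal of this cross-tabulation, up to label switching.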