Let's start by loading some data from a CSV file on the elemental composition of some pottery. Five elements are reported for each sample.
data = read.csv('https://vincentarelbundock.github.io/Rdatasets/csv/car/Pottery.csv')
data[1:6,]
First we need to trim the data, because no one likes text in their data frames. Notice that there are 26 rows.
data = data[,3:7]
head(data)
nrow(data)
Time to load some libraries.
library(fpc)
library(cluster)
Clustering techniques do not care about the values in the data frame per se; rather, they care about how far each observation is from the others. So we create a distance matrix, which looks like this:
d = dist(data)
as.matrix(d)
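Under the hood, dist() defaults to Euclidean distance. As a quick sanity check, we can compute one distance by hand on a tiny made-up matrix (the values here are illustrative, not from the pottery data):

```r
# Two made-up 3-variable observations (illustrative values only)
m <- rbind(a = c(1, 2, 3),
           b = c(4, 6, 3))

# Euclidean distance computed by hand: sqrt(sum((a - b)^2))
by.hand <- sqrt(sum((m["a", ] - m["b", ])^2))

# dist() should agree (it defaults to method = "euclidean")
d.small <- dist(m)
as.numeric(d.small)   # 5
by.hand               # 5
```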
In the fpc library we have a tool called pamk(), which we can use to estimate the best number of clusters to form (by default, the k that maximizes the average silhouette width). Here it is 2, as reported by '$nc'.
pamk(d)$nc
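A minimal sketch of the criterion pamk() uses by default: fit pam() for a range of k and keep the k with the largest average silhouette width. The three-group data below is invented for illustration:

```r
library(cluster)

set.seed(1)
# Synthetic 2-D data with three well-separated groups (illustrative only)
x <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
           matrix(rnorm(20, mean = 5),  ncol = 2),
           matrix(rnorm(20, mean = 10), ncol = 2))
d.x <- dist(x)

# Average silhouette width for each candidate number of clusters
widths <- sapply(2:5, function(k) pam(d.x, k)$silinfo$avg.width)
best.k <- (2:5)[which.max(widths)]
best.k   # 3, matching the three simulated groups
```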
Using 2 as the number of clusters, we generate pam.data. Plotting the results, we can see the chosen clusters projected onto the first two principal components, as well as a silhouette plot of the clusters (see below).
pam.data = pam(d, 2)
clusplot(pam.data)
plot(pam.data)
The silhouette width measures how well supported a particular clustering is by the data. An average width below about 0.3 generally indicates weak support, while a value above about 0.7 indicates strong support.
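For a single observation i, the silhouette width is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance to the other members of its own cluster and b(i) is the mean distance to the members of the nearest other cluster. A sketch computing it by hand for one point and checking against cluster::silhouette() (the tiny 1-D data set here is invented):

```r
library(cluster)

# Tiny invented 1-D data: two obvious clusters {1, 2} and {8, 9}
x  <- c(1, 2, 8, 9)
cl <- c(1, 1, 2, 2)
dm <- as.matrix(dist(x))

# Silhouette of observation 1 by hand
a1 <- mean(dm[1, cl == 1][-1])   # mean distance within own cluster: |1 - 2| = 1
b1 <- mean(dm[1, cl == 2])       # mean distance to other cluster: (7 + 8) / 2 = 7.5
s1 <- (b1 - a1) / max(a1, b1)    # (7.5 - 1) / 7.5 = 0.8666...

# cluster::silhouette() should agree
sil <- silhouette(cl, dist(x))
sil[1, "sil_width"]              # 0.8666...
```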
Loading another library, we will try a different approach using hierarchical clustering.
library(pvclust)
This method, pvclust, assesses how significantly each member differs from the others by bootstrapping the hierarchical clustering, and forms a dendrogram annotated with p-values from the results.
# pvclust clusters the columns of its input, so transpose the scaled
# data so that each pottery sample is a column
pv.data = pvclust(t(scale(data)), method.dist="cor", method.hclust="average", nboot=1000)
plot(pv.data)
pvrect(pv.data, alpha=0.95)
Here the data were partitioned through the use of pvclust, which estimates cluster p-values by repeatedly bootstrap-resampling the provided data. Two clusters of p > 0.95 significance were found.
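The bootstrap probability (BP) idea behind pvclust can be sketched directly: resample the observations with replacement, re-cluster, and count how often a given grouping reappears (pvclust's AU p-values refine this with multiscale resampling). The small matrix below is invented: v1 and v2 are built to be strongly correlated, so they should cluster together in nearly every resample:

```r
set.seed(42)
# Invented matrix: 30 observations (rows) of 4 variables (columns);
# v1 and v2 share a common signal, v3 and v4 are pure noise
base <- rnorm(30)
X <- cbind(v1 = base + rnorm(30, sd = 0.1),
           v2 = base + rnorm(30, sd = 0.1),
           v3 = rnorm(30),
           v4 = rnorm(30))

# Is {v1, v2} the first pair merged when we hclust the columns?
pair.first <- function(M) {
  hc <- hclust(as.dist(1 - cor(M)), method = "average")
  setequal(hc$merge[1, ], c(-1, -2))
}

# Bootstrap: resample rows, re-cluster, count how often {v1, v2} reappears
B <- 200
bp <- mean(replicate(B, {
  idx <- sample(nrow(X), replace = TRUE)
  pair.first(X[idx, ])
}))
bp   # close to 1: {v1, v2} is a highly stable cluster
```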