Monday, October 12, 2015

Removing Clustering Ambiguity due to Binary Data with Independent Component Analysis

This post shows that one can really reduce or eliminate clustering ambiguity that is a result of discrete data by the use of a transformation of the data to independent components. However, the value of such a transformation in terms of the grouping of the data is a completely open question… one that would be fun to explore. Refer to the references in the previous posts on Naive Bayes and Independent Component Analysis on such binary independent component analysis (BICA) transformations.

library(proxy) #For bit vector (dis)similarity measures
library(cluster) #For hierarchical clustering and dendrogram formatting 
library(fpc)   #For relative differences between clusterings.
library(fastICA) #For fast ICA algorithm
library(lsr) #For the function permuteLevels()
library(compare) #For the compare() function to compare the two factor objects

clusterMeasure <- "Jaccard"
clusterMethod <- "complete"

Refer to notes in the Clustering Ambiguity blog posts I or II concerning the provenance of the data.

tag_data <- read.csv("model-tags2.csv", header=TRUE,stringsAsFactors = FALSE)

lentag_data <- dim(tag_data)[[1]] #Get the number of models (the number of items in the data set)
#Set the column indices
columnIndices <- 2:42
#These will have to be changed if the input data is changed
#This `dist` function is from the proxy package 
D <- dist(tag_data[,columnIndices], method = clusterMeasure)
#Convert `dist` output to a matrix object
Dm <- as.matrix(D)

In the Clustering Ambiguity blog posts, I permuted the input data and then computed the dissimilarity matrix. There was really no reason to recompute the matrix with the data permuted. So here I simply permute the rows and columns with the same permutation and this has the same effect as permuting the input data.

#for the same permutation, set random number seed
set.seed(13)
cutsize = 5
#Generate a permutation
a.permutation <- sample(lentag_data) 

DmPermutation <- Dm[a.permutation,a.permutation]

PModelCompleteClust <- agnes(Dm,diss=TRUE,clusterMethod)

PModelCompleteClustPermute <- agnes(DmPermutation,diss=TRUE,clusterMethod)

Clusters <- cutree(PModelCompleteClust,cutsize)
ClustersPermute <- cutree(PModelCompleteClustPermute,cutsize)

table(Clusters)
## Clusters
##  1  2  3  4  5 
## 42 29 35 21 16
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5 
## 73 29 21  4 16
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5
##        1 42  0  0  0  0
##        2 29  0  0  0  0
##        3  2 29  0  4  0
##        4  0  0 21  0  0
##        5  0  0  0  0 16

With both the original order of the input (Dm) and the newly permuted data (DmPermutation), an example will show that there are no ties in proximity when using the ICA transformation of the binary data, given 1. a transformation to just 10 continuous independent components of the original bit vectors having a length of 41 bits; and 2. the same cutsize of 5 used in the Clustering Ambiguity posts. There is a good chance that further tests will show ties in proximity will likely be very rare given an ICA transformation.

fpICA <- fastICA(tag_data[,columnIndices],10) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)
#This represents roughly just shy of 0.8 dissimilarity -- see distribution plot above

#One change the cut size (e.g., try 4, 6, and 15) and see how cluster matching changes, and the ambiguity gets more, or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##   1   2   3   4   5 
## 105   9  13  11   5
table(ClustersPermute)
## ClustersPermute
##   1   2   3   4   5 
##  13 105   9   5  11
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters   1   2   3   4   5
##        1   0 105   0   0   0
##        2   0   0   9   0   0
##        3  13   0   0   0   0
##        4   0   0   0   0  11
##        5   0   0   0   5   0
Clusters
##   [1] 1 2 3 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 4 1 3 1 4 1 1 1 1 1 1 1 1 1
##  [36] 1 1 4 1 1 1 1 1 1 1 5 1 1 1 1 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 2 2 1 1 2 1 1 1 3 3 3 3 1 1 1 2 1 1 1 4 4 2 5 1 1 1 1 3 2 2 2
## [106] 1 4 3 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 5 1 4 4 5 1 1 1 1 1 1 1 1 1
## [141] 1 5 1
ClustersPermuted
##   [1] 2 3 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 5 2 1 2 5 2 2 2 2 2 2 2 2 2
##  [36] 2 2 5 2 2 2 2 2 2 2 4 2 2 2 2 5 5 5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 3 3 2 2 3 2 2 2 1 1 1 1 2 2 2 3 2 2 2 5 5 3 4 2 2 2 2 1 3 3 3
## [106] 2 5 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 4 2 5 5 4 2 2 2 2 2 2 2 2 2
## [141] 2 4 2
#To show that the clusters are exactly the same takes a little be of comparing.

#By inspection, the results are the same, but to do it automatically to make sure...

#Turn the clustering results into factors
ClustersFactored <- factor(Clusters)
ClustersPermFactored <- factor(ClustersPermuted)
#Permute the levels of the permuted clustering results to the levels of the 
#non-permuted clustering results.  The Level permutation is found via inspection.
#One could write a loop to do the comparisons
NewClustersPermFactored <- permuteLevels(ClustersPermFactored,c(2,3,1,5,4))
#Now compare the two factor vectors in terms of the corresponding level for each element in the vector.
#If the factor vectors are the same, compare will return "TRUE"
compare(NewClustersPermFactored,ClustersFactored,allowAll=TRUE)
## TRUE
##   dropped attributes
#Another way to do it is to use the integer vectors returned by the clusterings #and loop through the first vector, finding each cluster and comparing it to the other clustering vector.
compareEqualClusterings <- function(i, x, y, size){
  j <- which(x==i)[1]
  all(which(x==i),which(y==y[j]))
}
#Warning: the above function is devoid of error checking!

#If all clusters are the same, then the sum of the true cluster comparisons will be equivalent to the cutsize
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE
#There are plenty of other ways to do this, some probably shorter and possibly more efficient.

Namely, the exact same clusters are returned. Though they are quite different from the clusters returned from the binary vectors and the Jaccard measure. Let's see what happens when we decrease the number of components.

fpICA <- fastICA(tag_data[,columnIndices],5) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)
#This represents roughly just shy of 0.8 dissimilarity -- see distribution plot above

#One change the cut size (e.g., try 4, 6, and 15) and see how cluster matching changes, and the ambiguity gets more, or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##  1  2  3  4  5 
## 42  2 12 64 23
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5 
## 42 64 23  2 12
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5
##        1 42  0  0  0  0
##        2  0  0  0  2  0
##        3  0  0  0  0 12
##        4  0 64  0  0  0
##        5  0  0 23  0  0
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE

Let's see what happens if we increase the cut size. The further down the dendrogram near the leaves is where most of the ties occur…

cutsize = 10
fpICA <- fastICA(tag_data[,columnIndices],5) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)
#This represents roughly just shy of 0.8 dissimilarity -- see distribution plot above

#One change the cut size (e.g., try 4, 6, and 15) and see how cluster matching changes, and the ambiguity gets more, or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##  1  2  3  4  5  6  7  8  9 10 
## 29  2 13 11 18 35 23  3  8  1
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5  6  7  8  9 10 
## 13 29 35 23  2 18 11  8  1  3
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5  6  7  8  9 10
##       1   0 29  0  0  0  0  0  0  0  0
##       2   0  0  0  0  2  0  0  0  0  0
##       3  13  0  0  0  0  0  0  0  0  0
##       4   0  0  0  0  0  0 11  0  0  0
##       5   0  0  0  0  0 18  0  0  0  0
##       6   0  0 35  0  0  0  0  0  0  0
##       7   0  0  0 23  0  0  0  0  0  0
##       8   0  0  0  0  0  0  0  0  0  3
##       9   0  0  0  0  0  0  0  8  0  0
##       10  0  0  0  0  0  0  0  0  1  0
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE

Further down the dendrogram doesn't appear to make a difference and it is far enough down with a cut size of 10 to begin to get a singleton cluster.

No comments:

Post a Comment