Monday, July 27, 2015

Clustering Ambiguity

Clustering Bit Vector Data and Clustering Ambiguity

There are many examples of bit vector data in biology, chemistry, market basket analysis, etc., where clustering such data is a common and useful task. This R session explores clustering ambiguity due to discrete data, in this case bit vector data and the associated bit vector (dis)similarity measures. It is important to realize that clustering results may not be unique, even though the algorithm and the (dis)similarity measure used are the same. Similar results can be found with count data, depending on the number of count features and the count range.

Below, a simple example with hierarchical clustering shows that clustering results are not invariant to input order, and that the difference in the results can be significant. Different clustering algorithms exhibit similar problems, independent of other, possibly stochastic, attributes the algorithms might have (e.g., k-means initializations). More generally, any clustering algorithm that encounters decision ties can produce ambiguity, and all major groups of clustering algorithms have such ties. This is not a new problem, but it can be significant and is often overlooked.
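
Before turning to the real data, here is a tiny illustrative sketch (the three two-bit vectors and their labels are made up for illustration, not taken from the model-tag data, and base R's hclust is used for brevity): two of the three pairwise Jaccard dissimilarities are tied, so complete linkage must break a tie at its first merge, and the resulting two-cluster partition can depend on the input order.

library(proxy)  # loaded here just for this sketch; it is loaded again below
# Three toy bit vectors: d(A,B) = d(A,C) = 0.5 (a tie), d(B,C) = 1
toy <- rbind(A = c(1, 1), B = c(1, 0), C = c(0, 1))
dToy <- as.dist(as.matrix(dist(toy, method = "Jaccard")))
cutree(hclust(dToy, method = "complete"), k = 2)
# Reverse the input order and cluster again; because of the tie at 0.5 the
# two-cluster partition need not group the same pair together
dToyRev <- as.dist(as.matrix(dist(toy[c("C", "B", "A"), ], method = "Jaccard")))
cutree(hclust(dToyRev, method = "complete"), k = 2)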

The data in this example comes from an attempt to describe 143 predictive modeling algorithms in R: regression models, classification models, and those that can be used for either. In spite of the clustering ambiguity that arises, it is an entertaining, thoughtful, and possibly useful attempt to group predictive models.

library(proxy)  #For bit vector (dis)similarity measures
library(cluster)  #For hierarchical clustering and dendrogram formatting 
library(fpc)  #For relative differences between clusterings.
clusterMeasure <- "Jaccard"
clusterMethod <- "complete"
## Here one can try other measures and clustering methods, such as the Ochiai
## (Cosine), but be aware of other labeling changes needed below.  Note, the
## Farey Sequence code below depends on the Jaccard measure.  Some other
## measures' discrete distributions are easy to determine (Euclidean),
## others very difficult (Cosine/Ochiai).

# E.g., clusterMeasure <- 'Ochiai'
# E.g., clusterMethod <- 'average'

The following data is from http://blog.revolutionanalytics.com/2014/01/predictive-models-in-r-clustered-by-tag-similarity-1.html by Max Kuhn, though I have modified it slightly: the row names are the actual predictive model function names in R from various packages, and the full model names are stored as strings in the second column of the .csv file. A tag is a feature of a predictive model, such as: is it used for classification, for regression, or for both? There are 41 tag features and 143 predictive model functions in R. Note, aside from proxy, the package vegan also has designdist() and vegdist(), which generate (dis)similarity measures for bit vectors and, in some cases, count vectors. ChemmineR and rcdk, two cheminformatics R packages, also contain bit vector (chemical compound fingerprint) and corresponding measure utilities.

tag_data <- read.csv("model-tags2.csv", header = TRUE, stringsAsFactors = FALSE)
head(tag_data[, 2:5])  #A small snippet of the data without the model names in full (column 1)
##          Classification Regression Bagging Bayesian.Model
## ada                   1          0       0              0
## avNNet                1          1       1              0
## bagEarth              1          1       1              0
## bagFDA                1          0       1              0
## bayesglm              1          1       0              0
## bdk                   1          1       0              0
head(tag_data[, 1])  #The corresponding model names in full
## [1] "Boosted Tree (ada)"                            
## [2] "Model Averaged Neural Networks (avNNet)"       
## [3] "Bagged MARS (bagEarth)"                        
## [4] "Bagged Flexible Discriminant Analysis (bagFDA)"
## [5] "Bayesian Generalized Linear Model (bayesglm)"  
## [6] "Self-Organizing Maps (bdk)"
lentag_data <- dim(tag_data)[[1]]  #Get the number of models (the number of items in the data set)
# Set the column indices
columnIndices <- 2:42
# These will have to be changed if the input data is changed
# This `dist` function is from the proxy package
D <- dist(tag_data[, columnIndices], method = clusterMeasure)
# Convert `dist` output to a matrix object
Dm <- as.matrix(D)
# show a small block of the proximity matrix
Dm[1:3, 1:3]
##             ada avNNet bagEarth
## ada      0.0000 0.7778    0.625
## avNNet   0.7778 0.0000    0.500
## bagEarth 0.6250 0.5000    0.000
par(cex = 0.3, col = "darkred")  #Shrink the leaf text labels
PModelCompleteClust <- agnes(Dm, diss = TRUE, method = clusterMethod)
plot(as.dendrogram(as.hclust(PModelCompleteClust), hang = 0.1), horiz = TRUE)

[Figure: horizontal dendrogram of the 143 predictive models]

# The somewhat odd as.dendrogram(as.hclust(), hang =), horiz = combination is
# used to obtain a horizontal dendrogram and to control how far each leaf
# hangs below its respective merge height

Since there are 143 models, the leaves of the dendrogram are hard to read, one of the downsides of large hierarchical clustering visualizations. At least we can cut the tree and get the labels of a specific cluster.

# E.g., cut the tree such that there are 20 clusters
twentyClusters <- cutree(PModelCompleteClust, 20)
table(twentyClusters)
## twentyClusters
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 13  9 13  8  5 16  1  1  9 11  4 16 11  7  2  3  8  2  1  3
# Show the functions that are clustered in the largest cluster.  First find
# the index that represents the largest cluster
largeclustindex <- which(table(twentyClusters) == max(table(twentyClusters)))[1]
modelindices <- which(twentyClusters == largeclustindex)
# Show the full model names
tag_data[modelindices, 1]
##  [1] "C5.0 Rules (C5.0Rules)"                      
##  [2] "Single C5.0 Tree (C5.0Tree)"                 
##  [3] "Conditional Inference Tree (ctree)"          
##  [4] "Conditional Inference Tree (ctree2)"         
##  [5] "MARS (earth)"                                
##  [6] "Tree Models from Genetic Algorithms (evtree)"
##  [7] "Flexible Discriminant Analysis (fda)"        
##  [8] "MARS (gcvEarth)"                             
##  [9] "C4.5-like Trees (J48)"                       
## [10] "Rule-Based Classification (Jrip)"            
## [11] "Oblique Trees (oblique.tree)"                
## [12] "Single Rule Classification (OneR)"           
## [13] "Rule-Based Classification (PART)"            
## [14] "CART (rpart)"                                
## [15] "CART (rpart2)"                               
## [16] "Cost-Sensitive CART (rpartCost)"

Notice that ctree and ctree2 have 0 dissimilarity and thus have identical tag bit strings. By changing the array indices into modelindices below, you can find other pairs that are identical.

Dm[modelindices[1:4], modelindices[1:4]]
##           C5.0Rules C5.0Tree ctree ctree2
## C5.0Rules       0.0     0.50  0.60   0.60
## C5.0Tree        0.5     0.00  0.25   0.25
## ctree           0.6     0.25  0.00   0.00
## ctree2          0.6     0.25  0.00   0.00

Here are the identical bit strings of ctree and ctree2, respectively:

paste0(as.character(tag_data[modelindices[3], columnIndices]), collapse = "")
## [1] "11000000000001000000000000000000000000001"
paste0(as.character(tag_data[modelindices[4], columnIndices]), collapse = "")
## [1] "11000000000001000000000000000000000000001"

The tags that are set for both correspond to:

# The +1 maps positions within columnIndices back to column indices of tag_data
ctreetagnameindices <- which(tag_data[modelindices[3], columnIndices] == 1) + 1
dimnames(tag_data)[[2]][ctreetagnameindices]
## [1] "Classification"             "Regression"                
## [3] "Implicit.Feature.Selection" "Tree.Based.Model"

They are both tree-based models that can be used for classification and regression, and both have implicit feature selection. The feature set (the set of tags) does not discriminate between them at all.

(If one wants to strip out the duplicates, leaving only one representative of each unique vector, then the following code can be used, where “data” is a set of bit vectors (one per row) and “data2” is the unique set.

pastestring <- function(i, data){paste(as.character(data[i,]),collapse="")}
# !duplicated() avoids trouble when there are no duplicates (-which(...) would then drop every row)
data2 <- data[!duplicated(unlist(sapply(1:nrow(data),pastestring,data))),]

Yes, Virginia, that is ugly… A shell script to do this is actually shorter and faster, and therefore better for very large sets.)
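
For what it's worth, a shorter base-R alternative is to apply duplicated() directly, since it works row-wise on matrices and data frames (a sketch; “data” again stands for any set of bit vectors, one per row):

# Keep only the first occurrence of each distinct bit vector
data2 <- data[!duplicated(data), ]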

By permuting the input order, we can show that the clustering is NOT necessarily invariant to input order.

# for the same permutation, set random number seed
set.seed(13)
# Generate a permutation
a.permutation <- sample(lentag_data)
# Permute the order of the tag data
tag_dataPermutation <- tag_data[a.permutation, columnIndices]
# Show a small section of the now permuted data
tag_dataPermutation[1:4, 1:5]
##          Classification Regression Bagging Bayesian.Model Boosting
## qrf                   0          1       1              0        0
## gcvEarth              1          1       0              0        0
## lda2                  1          0       0              0        0
## C5.0Cost              1          0       0              0        1
# Create the new dissimilarity matrix
Dpermutation <- dist(tag_dataPermutation, method = clusterMeasure)
# Convert the dist() vector output to a full matrix
DmPermutation <- as.matrix(Dpermutation)
par(cex = 0.3, col = "red")
PModelCompleteClustPermute <- agnes(DmPermutation, diss = TRUE, method = clusterMethod)
plot(as.dendrogram(as.hclust(PModelCompleteClustPermute), hang = 0.1), horiz = TRUE)

[Figure: horizontal dendrogram of the clustering after permuting the input order]

The dendrograms are clearly different, but maybe that is just a series of rotations, as in a kinetic mobile sculpture (see Alexander Calder)? Let's compare clusters from the two clusterings and see which, if any, are identical.

# Create a cut size
cutsize <- 5
# A cut into 5 clusters corresponds to a merge height of roughly just shy
# of 0.8 dissimilarity

# One can change the cut size (e.g., try 4, 6, and 15) and see how the
# cluster matching changes, and how the ambiguity gets more or less severe
Clusters <- cutree(PModelCompleteClust, cutsize)
ClustersPermute <- cutree(PModelCompleteClustPermute, cutsize)

# for loops ok in small example, otherwise, vectorize!
for (i in 1:cutsize) {
    for (j in 1:cutsize) {
        compare1 <- which(Clusters == i)
        compare2 <- which(ClustersPermute == j)
        if (length(compare1) == length(compare2)) {
            print(paste("Cluster Equals", as.character(i), as.character(j), 
                sep = " "))
            compare1 <- dimnames(tag_data[compare1, ])[[1]]
            compare2 <- dimnames(tag_dataPermutation[compare2, ])[[1]]
            if (setequal(compare1, compare2)) {
                print(paste("Cluster Match", as.character(i), as.character(j), 
                  sep = " "))
            }
        }
    }
}
## [1] "Cluster Equals 2 2"
## [1] "Cluster Equals 4 3"
## [1] "Cluster Match 4 3"
## [1] "Cluster Equals 5 5"
## [1] "Cluster Match 5 5"
table(Clusters)
## Clusters
##  1  2  3  4  5 
## 42 29 35 21 16
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5 
## 73 29 21  4 16
# If the tables are large, you can quickly tell which clusters are at least
# the same size by creating a cross table:
table(Clusters, ClustersPermute[order(a.permutation)])
##         
## Clusters  1  2  3  4  5
##        1 42  0  0  0  0
##        2 29  0  0  0  0
##        3  2 29  0  4  0
##        4  0  0 21  0  0
##        5  0  0  0  0 16
# `Clusters` will be by the rows of the table; `ClustersPermute` will be by
# the columns.  We first have to permute its order back to the original
# input order of the first clustering.  Any row with only one non-zero
# entry, where that entry is also the only non-zero entry in its column,
# corresponds to a pair of clusters with exactly the same members (and
# hence the same size).  Matching clusters by size alone from the two
# separate one-way tables does not work if there are ties (say, the first
# clustering has a cluster with a total of 6 members, and the second
# clustering has several clusters with a total of 6 members)

Only a few clusters match in this instance. Let's look at the distribution of values in the proximity matrix for the measure we used; it begins to give a sense of why there are so many ties in proximity, and therefore clustering ambiguity.

# We use just the lower half of the matrix, `D`
plot(table(D), col = 3, xlab = clusterMeasure, ylab = "Frequency")
title("Discrete Distribution of Dissimmilarities")

[Figure: discrete distribution of the Jaccard dissimilarities]

The length of the tag_data bit string is 41 bits. We can count the total number of possible Jaccard (Tanimoto) values by way of the Farey sequence of order N = 41, namely the number of reduced fractions between 0 and 1 inclusive with a denominator of at most 41. The asymptotic number of fractions for large N is \( \frac{3N^2}{\pi^2} \), which gives a rough estimate of the count.

lengthFeatureCol <- 41  # the number of tag features (bits) per model
ceiling((3 * lengthFeatureCol^2)/(pi^2))
## [1] 511

But we can also calculate the exact number with a simple function.

# Count of the terms of the Farey sequence of order N.  The count is
# appended to the list 'Terms' and returned; with term collection commented
# out it is the only element, so it can be retrieved with [[1]].  Uncomment
# the line Terms <- append(Terms, A/B) if you also want to collect the
# fractions (the count then becomes the last element).

FareyCount <- function(N) {
    if (is.numeric(N) && N == round(N) && N > 0) {
        NumTerms <- 1
        A <- 0
        B <- 1
        C <- 1
        D <- N
        Terms <- list()  #collect terms as well.
        while (C < N) {
            # Appending the list is slow when N gets over 50 or so, so not run.

            # Terms <- append(Terms,A/B)
            NumTerms <- NumTerms + 1
            K <- (N + B)%/%D
            E <- K * C - A
            F <- K * D - B
            A <- C
            B <- D
            C <- E
            D <- F
        }
        return(append(Terms, NumTerms))
    } else {
        print("Error: N needs to be a numeric, positive integer")
    }
}

FareyTermsCount <- FareyCount(lengthFeatureCol)
FareyTermsCount[[1]]
## [1] 531

In spite of the fact that there are 531 possible fractions, and that this particular data set of \( n = 143 \) items has \( \frac{n(n-1)}{2} = \frac{143 \cdot 142}{2} = 10153 \) proximities (dissimilarities), there are only

length(table(D))
## [1] 36

distinct proximities in this particular data set: \( 36 < F(41) = 531 < \frac{n(n-1)}{2} = 10153 \). Very few of the possible proximities are actually used, and those that are tend overwhelmingly to be low-order fractions, \( 1/2, 1/3, 2/5 \), etc.; fractions like, say, \( 19/21 \) occur far more rarely. This is tied to the probability of matching bits: the great majority of the unreduced fractions produced by the Jaccard calculation reduce to low-order fractions. If the size of the data set is considerably smaller than the number of bits in the bit vector, the chance of ambiguity is smaller and, depending on the clustering algorithm and measure, may not be a significant issue at all.
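
To see which fractions actually occur, MASS::fractions() can display the distinct dissimilarities as reduced fractions (a small sketch reusing the D object computed above; fractions() returns rational approximations, which is adequate here).

library(MASS)  # for fractions()
# The 36 distinct dissimilarity values, shown as reduced fractions
fractions(sort(unique(as.vector(D))))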

A note on precision: the number of bits in a bit vector needs to be over 1813 before the number of possible Farey fractions exceeds one million, at which point the typical gap between adjacent values falls below the roughly six decimal digits of precision of a single-precision float on the interval \( [0,1] \). With that many possible values, ties are quite rare in continuous data with sufficient precision (a C float is often good enough; a double is obviously better). Conversely, continuous-valued data without sufficient precision (e.g., two digits of precision on the interval \( [0,1] \)) also leads to ties in proximity and clustering ambiguity.
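
As a rough check of the 1813-bit figure using the asymptotic count \( \frac{3N^2}{\pi^2} \) from above (a sketch; the exact Farey count differs slightly from the asymptotic value):

(3 * 1813^2)/(pi^2)  # just under one million possible fractions
(3 * 1814^2)/(pi^2)  # just over one million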

Because there are some duplicate fingerprints, as we showed above with ctree and ctree2, there are a number of 0 dissimilarities. There are also some functions that have no tags in common, and hence some dissimilarities equal to 1.
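
A quick count of those extremes, reusing the D object from above (identical bit vectors produce an exact 0 and disjoint ones an exact 1 under the Jaccard measure, so direct comparison is safe here):

sum(D == 0)  # pairs of models with identical tag bit strings
sum(D == 1)  # pairs of models with no tags in common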

We can compare the two clusterings at cut size 5 from above with the function cluster.stats in the R package fpc. We have to put the permuted cluster labels back into the original input order (apply the inverse permutation). The clusterings differ enough to show considerable ambiguity. Different non-trivial permutations will show similar differences, some larger, some smaller. This shows that, given this data, the measure we used, and the clustering algorithm, the results are not invariant to input order. Definitive statements about clustering results in this setting need to take this issue into consideration.
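
The reordering works because order(a.permutation) is the inverse permutation; a one-line sanity check using the objects defined above (it should return TRUE):

# order(a.permutation) puts the permuted rows (and hence the permuted
# cluster labels) back into the original input order
identical(rownames(tag_dataPermutation)[order(a.permutation)], rownames(tag_data))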

One can try other permutations, measures, and algorithms and compare the results with cluster.stats() from the fpc package. Here we use two relative indices to compare the clusterings, then show those same indices with a clustering compared with itself.

CompareClusterings <- cluster.stats(Dm, Clusters, ClustersPermute[order(a.permutation)])

# Use the same cluster
CompareSameClusterings <- cluster.stats(Dm, Clusters, Clusters)

# Compare the Rand Index and the Variation of Information
CompareClusterings$corrected.rand
## [1] 0.6245
CompareClusterings$vi
## [1] 0.5387

# Values for the Rand Index and the Variation of Information when the
# clusterings are identical
CompareSameClusterings$corrected.rand
## [1] 1
CompareSameClusterings$vi
## [1] 0

An explanation of the Rand Index can be found at http://en.wikipedia.org/wiki/Rand_index, and of the Variation of Information at http://en.wikipedia.org/wiki/Variation_of_information.
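
For reference, the corrected (adjusted) Rand index can also be computed directly from the contingency table of the two labelings. A minimal sketch (adjRand is a name of my own; its result should agree with CompareClusterings$corrected.rand up to rounding):

adjRand <- function(labels1, labels2) {
    tab <- table(labels1, labels2)        # contingency table of the two clusterings
    sumij <- sum(choose(tab, 2))          # co-clustered pairs counted cell by cell
    suma <- sum(choose(rowSums(tab), 2))  # pairs within clusters of labeling 1
    sumb <- sum(choose(colSums(tab), 2))  # pairs within clusters of labeling 2
    expected <- suma * sumb/choose(sum(tab), 2)
    (sumij - expected)/((suma + sumb)/2 - expected)
}
adjRand(Clusters, ClustersPermute[order(a.permutation)])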

Another way we can look at the differences is to calculate the cophenetic correlation. This is, however, computed over the entire hierarchy. Note how the two cophenetic correlations are nearly identical: significant differences between particular partitions are largely hidden by the cophenetic correlation. See http://en.wikipedia.org/wiki/Cophenetic_correlation

cor(D, cophenetic(PModelCompleteClust))
## [1] 0.8121
cor(Dpermutation, cophenetic(PModelCompleteClustPermute))
## [1] 0.8112
