Monday, October 12, 2015

Removing Clustering Ambiguity due to Binary Data with Independent Component Analysis

This post shows that one can greatly reduce, or even eliminate, the clustering ambiguity that arises from discrete data by transforming the data to independent components. Whether such a transformation produces groupings that are themselves meaningful is a completely open question… one that would be fun to explore. Refer to the references in the previous posts on Naive Bayes and Independent Component Analysis for more on such binary independent component analysis (BICA) transformations.

library(proxy) #For bit vector (dis)similarity measures
library(cluster) #For hierarchical clustering and dendrogram formatting 
library(fpc)   #For relative differences between clusterings.
library(fastICA) #For fast ICA algorithm
library(lsr) #For the function permuteLevels()
library(compare) #For the compare() function to compare the two factor objects

clusterMeasure <- "Jaccard"
clusterMethod <- "complete"

Refer to notes in the Clustering Ambiguity blog posts I or II concerning the provenance of the data.

tag_data <- read.csv("model-tags2.csv", header=TRUE,stringsAsFactors = FALSE)

lentag_data <- dim(tag_data)[[1]] #Get the number of models (the number of items in the data set)
#Set the column indices
columnIndices <- 2:42
#These will have to be changed if the input data is changed
#This `dist` function is from the proxy package 
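#For bit vectors, the Jaccard dissimilarity is 1 - (bits on in both)/(bits on in either)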
D <- dist(tag_data[,columnIndices], method = clusterMeasure)
#Convert `dist` output to a matrix object
Dm <- as.matrix(D)

In the Clustering Ambiguity blog posts I permuted the input data and then recomputed the dissimilarity matrix. There is really no need to recompute: permuting the rows and columns of the existing dissimilarity matrix with the same permutation has exactly the same effect as permuting the input data first.
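A quick sanity check of that equivalence (a sketch using the objects defined above; p.check is just an illustrative permutation):

#The dissimilarity matrix of permuted rows equals the row/column-permuted
#dissimilarity matrix of the original data
p.check <- sample(lentag_data)
all.equal(unname(as.matrix(dist(tag_data[p.check, columnIndices], method = clusterMeasure))),
          unname(Dm[p.check, p.check]))
#Expected: TRUE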

#for the same permutation, set random number seed
set.seed(13)
cutsize = 5
#Generate a permutation
a.permutation <- sample(lentag_data) 

DmPermutation <- Dm[a.permutation,a.permutation]

PModelCompleteClust <- agnes(Dm,diss=TRUE,clusterMethod)

PModelCompleteClustPermute <- agnes(DmPermutation,diss=TRUE,clusterMethod)

Clusters <- cutree(PModelCompleteClust,cutsize)
ClustersPermute <- cutree(PModelCompleteClustPermute,cutsize)

table(Clusters)
## Clusters
##  1  2  3  4  5 
## 42 29 35 21 16
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5 
## 73 29 21  4 16
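#Map the permuted-order cluster labels back to the original item order so the two
#clusterings can be compared element by element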
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5
##        1 42  0  0  0  0
##        2 29  0  0  0  0
##        3  2 29  0  4  0
##        4  0  0 21  0  0
##        5  0  0  0  0 16

With both the original ordering of the input (Dm) and the newly permuted data (DmPermutation), the example below shows that there are no ties in proximity when the binary data are first transformed with ICA, given (1) a transformation of the original 41-bit vectors to just 10 continuous independent components, and (2) the same cut size of 5 used in the Clustering Ambiguity posts. A quick count of repeated dissimilarity values (see the check in the code below) suggests that ties in proximity will be very rare after an ICA transformation, though further tests would be needed to confirm this.

fpICA <- fastICA(tag_data[,columnIndices],10) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
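#A rough check of why ties become unlikely after the transformation: Jaccard
#dissimilarities of 41-bit vectors can only take a limited set of values, so many
#off-diagonal entries repeat, whereas the continuous Euclidean distances between
#independent components are essentially all distinct.
sum(duplicated(Dm[lower.tri(Dm)]))           #many repeated dissimilarity values
sum(duplicated(distICA[lower.tri(distICA)])) #expected to be at or near zero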
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)

#One can change the cut size (e.g., try 4, 6, or 15) and see how the cluster matching changes and whether the ambiguity becomes more or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##   1   2   3   4   5 
## 105   9  13  11   5
table(ClustersPermute)
## ClustersPermute
##   1   2   3   4   5 
##  13 105   9   5  11
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters   1   2   3   4   5
##        1   0 105   0   0   0
##        2   0   0   9   0   0
##        3  13   0   0   0   0
##        4   0   0   0   0  11
##        5   0   0   0   5   0
Clusters
##   [1] 1 2 3 3 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 4 1 3 1 4 1 1 1 1 1 1 1 1 1
##  [36] 1 1 4 1 1 1 1 1 1 1 5 1 1 1 1 4 4 4 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 2 2 1 1 2 1 1 1 3 3 3 3 1 1 1 2 1 1 1 4 4 2 5 1 1 1 1 3 2 2 2
## [106] 1 4 3 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 1 1 1 5 1 4 4 5 1 1 1 1 1 1 1 1 1
## [141] 1 5 1
ClustersPermuted
##   [1] 2 3 1 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 5 2 1 2 5 2 2 2 2 2 2 2 2 2
##  [36] 2 2 5 2 2 2 2 2 2 2 4 2 2 2 2 5 5 5 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 3 3 2 2 3 2 2 2 1 1 1 1 2 2 2 3 2 2 2 5 5 3 4 2 2 2 2 1 3 3 3
## [106] 2 5 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 4 2 5 5 4 2 2 2 2 2 2 2 2 2
## [141] 2 4 2
#Showing that the clusters are exactly the same takes a little bit of comparing.

#By inspection the results are the same, but here is how to check automatically...

#Turn the clustering results into factors
ClustersFactored <- factor(Clusters)
ClustersPermFactored <- factor(ClustersPermuted)
#Permute the levels of the permuted clustering results to the levels of the 
#non-permuted clustering results.  The level permutation was found by inspection.
#One could write a loop to do the comparisons
NewClustersPermFactored <- permuteLevels(ClustersPermFactored,c(2,3,1,5,4))
#Now compare the two factor vectors in terms of the corresponding level for each element in the vector.
#If the factor vectors are the same, compare will return "TRUE"
compare(NewClustersPermFactored,ClustersFactored,allowAll=TRUE)
## TRUE
##   dropped attributes
#Another way to do it is to use the integer vectors returned by the clusterings
#and loop through the first vector, finding each cluster and comparing it to the
#other clustering vector.
compareEqualClusterings <- function(i, x, y, size){
  #Check that the items in cluster i of x are exactly the items of the matching
  #cluster of y (size is unused, but kept so the sapply calls below still work)
  j <- which(x==i)[1]
  setequal(which(x==i), which(y==y[j]))
}
#Warning: the above function is devoid of error checking!

#If all clusters are the same, then the sum of the true cluster comparisons will be equivalent to the cutsize
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE
#There are plenty of other ways to do this, some probably shorter and possibly more efficient.
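#For example (a sketch, assuming the cluster labels run 1..cutsize): recover the
#label mapping from the cross-tabulation and compare the relabeled vectors directly
matchLevel <- apply(table(Clusters, ClustersPermuted), 1, which.max)
all(matchLevel[Clusters] == ClustersPermuted)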

In short, the exact same clusters are returned, though they are quite different from the clusters obtained from the binary vectors with the Jaccard measure. Let's see what happens when we decrease the number of components.

fpICA <- fastICA(tag_data[,columnIndices],5) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)

#One can change the cut size (e.g., try 4, 6, or 15) and see how the cluster matching changes and whether the ambiguity becomes more or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##  1  2  3  4  5 
## 42  2 12 64 23
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5 
## 42 64 23  2 12
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5
##        1 42  0  0  0  0
##        2  0  0  0  2  0
##        3  0  0  0  0 12
##        4  0 64  0  0  0
##        5  0  0 23  0  0
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE

Let's see what happens if we increase the cut size. Most of the ties occur further down the dendrogram, near the leaves…

cutsize = 10
fpICA <- fastICA(tag_data[,columnIndices],5) 
distICA <- stats::dist(fpICA$S,method="euclidean")
distICA <- as.matrix(distICA)
ICAClustering <- agnes(distICA,diss=TRUE,clusterMethod)

ICAClusteringPerm <- agnes(distICA[a.permutation,a.permutation],
                           diss=TRUE,clusterMethod)

#One can change the cut size (e.g., try 4, 6, or 15) and see how the cluster matching changes and whether the ambiguity becomes more or less severe
Clusters <- cutree(ICAClustering,cutsize)
ClustersPermute <- cutree(ICAClusteringPerm,cutsize)
table(Clusters)
## Clusters
##  1  2  3  4  5  6  7  8  9 10 
## 29  2 13 11 18 35 23  3  8  1
table(ClustersPermute)
## ClustersPermute
##  1  2  3  4  5  6  7  8  9 10 
## 13 29 35 23  2 18 11  8  1  3
ClustersPermuted <- ClustersPermute[order(a.permutation)]
table(Clusters, ClustersPermuted)
##         ClustersPermuted
## Clusters  1  2  3  4  5  6  7  8  9 10
##       1   0 29  0  0  0  0  0  0  0  0
##       2   0  0  0  0  2  0  0  0  0  0
##       3  13  0  0  0  0  0  0  0  0  0
##       4   0  0  0  0  0  0 11  0  0  0
##       5   0  0  0  0  0 18  0  0  0  0
##       6   0  0 35  0  0  0  0  0  0  0
##       7   0  0  0 23  0  0  0  0  0  0
##       8   0  0  0  0  0  0  0  0  0  3
##       9   0  0  0  0  0  0  0  8  0  0
##       10  0  0  0  0  0  0  0  0  1  0
cutsize == sum(sapply(1:cutsize, compareEqualClusterings, Clusters, 
                      ClustersPermuted,cutsize))
## [1] TRUE

Cutting further down the dendrogram doesn't appear to make a difference, and a cut size of 10 is far enough down to begin producing a singleton cluster.

Friday, October 2, 2015

Independent Component Analysis and Naive Bayes, Part 2: Binary Data

Naive Bayes, Independent Component Analysis, and fixed-length, bit vector data.

See the previous post on Naive Bayes and Independent Component Analysis for a general discussion. This post concerns bit vector data exclusively, whereas the previous post was a warm-up with continuous data.

Libraries and data

library(fastICA) # For ICA functions
library(ica) # More variety and control of ICA functions
library(e1071) # For naive bayes
library(klaR)  # For naive bayes with greater control
## Loading required package: MASS
library(AppliedPredictiveModeling) # For Permeability Data
library(caret)  # For Modeling infrastructure
## Loading required package: lattice
## Loading required package: ggplot2
library(lattice) # For histogram plotting function
library(gridExtra) # For plotting grid
library(gtools) # For odd() function
## 
## Attaching package: 'gtools'
## 
## The following object is masked from 'package:e1071':
## 
##     permutations
data(permeability)

Exploratory Data Analysis with molecular features (fingerprints).

These are the binary predictor variables. Such data often needs some preprocessing to remove useless features and/or identical fingerprints. The first section (rather long) deals with these issues in addition to having a look at the data.

#There are ...
(numCompounds <- nrow(fingerprints))
## [1] 165
#molecules

# Some subset of the compounds however may have duplicate fingerprints (This is quite common, as most fingerprint generators may not be able to distinguish between very similar compounds.)
uniqFps <- !duplicated(fingerprints)
nrow(fingerprints[uniqFps,])
## [1] 162
# There are some.  This gets rid of them:
fingerprints <- fingerprints[uniqFps,]
#Fix permeability data to reflect the deleted duplicates
permeability <- permeability[uniqFps]
#Reset numCompounds
numCompounds <- nrow(fingerprints)
# Find the features with no data.  Sometimes there are molecular features that never appear in the data set.  These are useless for the classifier or regression model.  Might as well strip these out.

NoBitOnIndices <- which(apply(fingerprints, 2, sum) == 0)
# There are ... 
length(NoBitOnIndices)
## [1] 20
# such features without any data.
# Ergo, ...

fingerprints <- fingerprints[,-NoBitOnIndices]

# Equally useless are features where all bits are turned on.

AllBitsOnIndices <- which(apply(fingerprints, 2, sum) == numCompounds)
length(AllBitsOnIndices)
## [1] 18
fingerprints <- fingerprints[,-AllBitsOnIndices]
# Which in this case turns out to be 18 features, removed as well.

#Features that are identical for each compound add no discrimination.
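#duplicated() compares rows, so transpose, drop the duplicate rows (features), and transpose back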
tfp <- t(fingerprints)
fingerprints <- t(tfp[!duplicated(tfp),])

# How many features are there now
(numFeatures <- ncol(fingerprints))
## [1] 325
# Perfectly anti-correlated features can also be removed, leaving just one feature from each anti-correlated pair

negCor <- function(j,fingerprints,i){
  if(cor(fingerprints[,i],fingerprints[,j]) < -0.999999)
    return(c(i,j))
}

outerloopNegCor <- function(i, fingerprints, numFeatures){  
  j <- i+1
  sapply(j:numFeatures, negCor, fingerprints, i)
}

negCorIndices <- unlist(sapply(1:(numFeatures-1), outerloopNegCor, 
                        fingerprints, numFeatures))

negCorIndices
## [1]   5  73  21 105  58  70
# Since the indices are unique (we've already gotten rid of all identical, i.e., perfectly positively correlated, features), we can get rid of one feature per pair

fingerprints <- fingerprints[,-negCorIndices[odd(1:length(negCorIndices))]]

(numFeatures <- ncol(fingerprints))
## [1] 322
# How many bits are turned on per feature?

#countFeatureBits(i) returns the number of features with exactly i bits turned on
countFeatureBits <- function(i, fingerprints){
  length(which(apply(fingerprints, 2, sum) == i))
}

featureBitsOn <- sapply(1:numCompounds, countFeatureBits, fingerprints)

# Since there are only 162 molecular fingerprints, and we've ruled out all of the bits being turned on for any particular feature, the largest number of bits turned on for any feature is:
(maxFBO <- max(which(featureBitsOn > 0)))
## [1] 160
featureBitsOn <- featureBitsOn[1:maxFBO]
# This will be used to plot the bar plot of bits turned on per feature.
barplot(featureBitsOn,space = 0)

axis(side = 1, at = maxFBO, labels = FALSE)

# add x-axis with centered position, with labels, but without ticks.
axis(side = 1, at = seq_along(featureBitsOn) - 0.5, tick = FALSE, labels = 1:maxFBO)

plot of chunk EDA

# So there are some features with only 1 or 2 bits on, and several features with most, but not all bits on.

# Here is the distribution of the permeability
# It is skewed, so here is also a negative log transformation of the data.

#Fix size of dataframe 
twoResponsesLen <- 2*numCompounds
responses <- numeric(twoResponsesLen)
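#Interleave the raw and negative-log values so they line up with the alternating transType factor built below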
responses[seq(1,twoResponsesLen, by = 2)] <- permeability
responses[seq(2,twoResponsesLen, by = 2)] <- -log(permeability)

permdf <- data.frame(Permeability = responses,
                     transType = 
                       factor(rep(c('Raw Permeability',
                                    'Negative Log Permeability'),
                                  numCompounds)))

histogram(~ Permeability | transType, data = permdf, 
          type = "density",
          panel=function(x, ...) {
              panel.histogram(x, ...)
              panel.densityplot(x, darg=list(bw = 2, kernel="gaussian"),...)
          },
          breaks = NULL,
          scales = list(x = list(relation = 'free')))

plot of chunk EDA

# Permeability is a continuous value, but in terms of a drug's effectiveness, we can make it a categorical response with the designation of "low" and "high" permeability for a specific assay, e.g., A-PAMPA and Caco-2.  

PAMPAPerm <- vector(length = numCompounds)
PAMPAPerm[permeability >= 1] <- "high"
PAMPAPerm[permeability < 1] <- "low"
PAMPAPerm <- factor(PAMPAPerm)

# In addition, the naiveBayes function in library e1071 doesn't like binary features as numeric data.  You'll need the following:

# Converting them to factors works however

factorfp <- fingerprints
factorfp <- factor(factorfp)
factorfp <- matrix(factorfp,nrow=numCompounds,ncol=numFeatures, byrow=TRUE)
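#An alternative sketch that factors each column on its own and preserves the
#row/column layout (naiveBayes also accepts a data frame):
#factorfpDF <- as.data.frame(lapply(as.data.frame(fingerprints), factor))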

#And you can run naiveBayes as:
#naiveBayes(factorfp,PAMPAPerm)

Preliminary tests with fingerprints and ICA

set.seed(13)

# First, does the model work at all:

naiveFit <- naiveBayes(factorfp,PAMPAPerm,laplace=1)
suppressWarnings(table(predict(naiveFit, factorfp), PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high   57   5
##      low    76  24
# Yes, pretty badly!

# Now trying with ICA...
# First, calculate some smaller set of independent components, using numICAs as a guess.

numICAs <- 30
fpICA <- fastICA(fingerprints,numICAs) 
naiveFit <- naiveBayes(fpICA$S,PAMPAPerm,laplace=1)
suppressWarnings(table(predict(naiveFit, fpICA$S), PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high  103   2
##      low    30  27
# NaiveBayes from the klaR package is more flexible:

naiveFit <- NaiveBayes(factorfp,PAMPAPerm, usekernel=TRUE,fL=1)
suppressWarnings(table(predict(naiveFit, factorfp)$class, PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high   57   5
##      low    76  24
# Fingerprints by themselves are still crummy, but ...

naiveFit <- NaiveBayes(fpICA$S,PAMPAPerm, usekernel=TRUE,fL=1)
suppressWarnings(table(predict(naiveFit, fpICA$S)$class, PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high  129   1
##      low     4  28
# Gangbusters!!
# Much better!  But this is cheating of course: to know what is really going on, we need to perform training (with parameter estimation) and testing with some form of cross-validation.  

Are the independent components correlated with any of the binary features?

This is one way of possibly making the modeling interpretable: associate the independent components back to the original binary features. If the independent components are highly correlated with some of the binary features, far more so than would be expected by random chance, then those features may be the important ones that discriminate the classes.

# See how correlated the ICAs are to the features
fpindicesMaxCor <-  function(i, icas, fingerprints){
  cors <- abs(cor(icas[,i],fingerprints))
  which(cors==max(cors))
}

fpMaxCors <-  function(i, icas, fingerprints){
    max(abs(cor(icas[,i],fingerprints)))
}

maxFPIndices <- sapply(1:numICAs,fpindicesMaxCor,fpICA$S,fingerprints)

maxFPICACors <- sapply(1:numICAs,fpMaxCors,fpICA$S,fingerprints)
# There are 
length(unique(unlist(maxFPIndices)))
## [1] 30
# unique fingerprint features that are maximally correlated with the independent components, and
length(unlist(maxFPIndices)) == length(unique(unlist(maxFPIndices)))
## [1] TRUE
#no two independent components share the same most-correlated fingerprint feature.

# Plot the sorted correlations and compare them to the maximum correlations obtained from random vectors:

fillunif <- function(i, size){runif(size)}

randmatICA <- sapply(1:numICAs,fillunif, numCompounds)

randmatFPS <- sapply(1:numFeatures,fillunif, numCompounds)

maxRandCors <- sapply(1:numICAs, fpMaxCors, randmatICA,randmatFPS)

plot(1:numICAs,sort(maxFPICACors),type="l",
     xlab=
       paste("Blue dotted line = typical random correlation \n of ", 
             as.character(numICAs), " by ", as.character(numFeatures), 
             " random vectors with ", as.character(numCompounds),  
             " elements"),
     ylab = "Absolute Correlation",ylim=c(0,1),col="darkred")

lines(1:numICAs, sort(maxRandCors),lty = 2, col="darkblue")

plot of chunk backingOutICA

A good two thirds of the correlations are 0.8 and above. The weaker the correlation, the more likely it is that the independent component is a combination of several (many?) binary features.

So what's next? Clearly, as in the previous post with continuous data, the models need to be cross-validated, and the test sets need to have their independent components generated from the training mixing matrix.
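In fastICA's notation this means centering the held-out rows with the training column means and then applying the pre-whitening matrix K and un-mixing matrix W estimated on the training split. A minimal sketch (trainIndices, icaTrain, and testS are illustrative names for a hypothetical two-thirds split):

trainIndices <- sort(sample(numCompounds, floor(0.66*numCompounds)))
icaTrain <- fastICA(fingerprints[trainIndices,], n.comp = numICAs)
trainMeans <- colMeans(fingerprints[trainIndices,])
#Center the held-out rows to the training means, then unmix with K and W
testS <- scale(fingerprints[-trainIndices,], center = trainMeans, scale = FALSE) %*%
  icaTrain$K %*% icaTrain$W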

Cross validation

As in the previous post, there is a need for a function that can perform the cross-validation. The function is slightly different here: with these (binary) data there is a greater chance of running into problems computing the SVD, so there is an error catch that resamples whenever the SVD fails.

#  We need to alter the randomSS function to accommodate the cross-validation.
randomSSCV <- function(i,               # iterator
                     dimen,             # number of dimensions
                     splitsize,          # size of training set
                     lenData,           # total length of the data
                     numpred,           # number of predictors
                     predictors,        # predictor variables
                     classVar,          # dependent variable
                     predictActStrings  # dimnames for the confusion table (predicted/actual)
                     ){
  trainIndices <- sort(sample(lenData, splitsize))
  tryICA <- try(fastICA(predictors[trainIndices,],
                        n.comp=dimen, maxit=500,tol=0.000001))
  #Notice the maximum number of iterations has been increased over the default for fastICA, and the tolerance has been tightened.

  #Occasionally, SVD for PCA fails in computing fastICA.  
  #Redo sampling until SVD works.  This is a pretty rare event.
  while(inherits(tryICA, "try-error")){
      trainIndices <- sort(sample(lenData,splitsize))
      tryICA <- try(fastICA(predictors[trainIndices,], n.comp=dimen, maxit=500, tol=0.000001))
      print(paste("Sample Failure at the i_th sampling, i = ", 
                  as.character(i)))
  }
  #needed to mean center testing data to training data means
  means <- apply(predictors[trainIndices,], 2, mean) 
  trainSet <- tryICA$S
  #Note: deriving the test set S requires that the raw test set data be mean-
  #centered to the training data means, hence the centering call to "scale"
  #testSet <- cbind(scale(predictors[-trainIndices,numpred],center=means) 
  #                %*% solve(ICA$A),
  #                classVar[-trainIndices])

  testSet <- (scale(predictors[-trainIndices,],center=means) 
                 %*% tryICA$K) %*% tryICA$W
  trainNBclassifier <-
               NaiveBayes(trainSet,as.factor(classVar[trainIndices]),
                          fL=1)
  ConfusionTable <- table(suppressWarnings(
                          predict(trainNBclassifier,testSet)$class),
                          classVar[-trainIndices], dnn=predictActStrings)

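  #Return the total test misclassification rate in percent: the off-diagonal
  #counts of the 2 x 2 confusion table divided by the test set size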
  100*(ConfusionTable[[2]] + ConfusionTable[[3]])/(lenData-splitsize)
}

With this function in hand, run the cross-validation:

set.seed(14) # Let's make things deterministic
lenData <- numCompounds
K <- 100 # Iterations (the number of repeats)
trainpc <- 0.66 # Take 66 percent of the data for the training set
splitsize <-  floor(trainpc*lenData)
#Dimnames for the confusion tables (as in the previous post)
predictActStrings <- list("predicted","actual")
ErrorVector <- numeric(K)
#Searching an interval for the number of independent components from 20 to 100
#where the test error is lowest shows that the error flattens out around 60 
#independent components, then the error goes up and down a small amount until 
#after 90 independent components, where the error begins to climb to nearly 50% 
#(random) around 100.

dimen <- 60 
numpred <- 1:dimen

#Commented code is to search an interval for a number of 
#independent components where the test error is small.
#This code is best run offline or in parallel as it takes some time.
#Plots of those results are shown below.  They show that at roughly 60
#independent components the test error reaches a relative minimum, hence the
#choice of 60 dimensions for the training/testing runs that follow.

#interval <- 81
#componentMat <- matrix(0,nrow=interval, ncol=2)
#iter <- 1
#for(i in 20:(20+interval)){

  ErrorVector <- sapply(1:K,randomSSCV,
                                  dimen,splitsize,
                                  lenData,numpred,
                                  fingerprints,PAMPAPerm,
                                  predictActStrings)
## [1] "Sample Failure at the i_th sampling, i =  71"
  #componentMat[iter,] <- c(mean(ErrorVector),sd(ErrorVector))
#iter <- iter + 1
#}

#The test error as a function of the number of independent components
#(errbar is from the Hmisc package):
#errbar(20:interval, componentMat[,1], componentMat[,1] + componentMat[,2],
#       componentMat[,1] - componentMat[,2],
#       xlab = "Number of Independent Components",
#       ylab = "Mean Expected Total Classification Error")
#
#OR
#plot(20:100,componentMat2[,1],xlab = "Number of Independent Components", 
#ylab="Total Mean Classification Error",type="l")
#
#And plot the standard deviation of the test error over all of the possible
#numbers of independent components between 20 and 100
#plot(20:100,componentMat2[,2],xlab = "Number of Independent Components", 
#ylab="Standard Deviation of the Error",type="l")

hist(ErrorVector,freq=FALSE,10,xlim=c(0,50), xlab="Error in Percent",
     main="Total Test Classification Error Percent",
     col="lightblue")
lines(density(sort(ErrorVector)),lty=2,col="darkred")

plot of chunk KFCVICAMix

Here is what the full test error and the corresponding standard deviations look like for 20 through 100 independent components:

[plot: mean test classification error and its standard deviation versus the number of independent components, 20 through 100]

This little exploration leaves out important aspects of a thorough modeling project, such as tuning parameters (e.g., kernel versus Gaussian density X Laplace correction X dimension) and feature selection of the independent components. The latter might proceed by a form of backward selection in which components are removed in ranked order of their correlation to individual bit features (see the correlation plot above, with 30 independent components correlated against the bit features and against random continuous vectors), starting from the lowest correlation; a rough sketch follows. Some preliminary tests showed that this was of little use in improving the test error, though more extensive tests could prove otherwise.
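A sketch of that ranked-order backward selection, evaluated here only on resubstitution error (a proper version would plug into the cross-validation function above; resubErrorDropWeakest is an illustrative helper name):

#Drop the k components with the weakest bit-feature correlation, refit naive Bayes,
#and report the percent misclassified on the full data set
#(assumes both classes are predicted, as in the 2 x 2 tables above)
resubErrorDropWeakest <- function(k, S, cors, classVar){
  keep <- order(cors)[-(1:k)]
  fit <- naiveBayes(S[, keep], classVar, laplace = 1)
  tab <- suppressWarnings(table(predict(fit, S[, keep]), classVar))
  100*(tab[[2]] + tab[[3]])/length(classVar)
}
#e.g., sapply(1:10, resubErrorDropWeakest, fpICA$S, maxFPICACors, PAMPAPerm)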

Lastly, here is the correlation plot for 60 independent components using all of the data and the corresponding random vector comparison.

fpICA <- fastICA(fingerprints,dimen, maxit=500,tol=0.000001) 

maxFPIndices <- sapply(1:dimen,fpindicesMaxCor,fpICA$S,fingerprints)

maxFPICACors <- sapply(1:dimen,fpMaxCors,fpICA$S,fingerprints)
# There are 
length(unique(unlist(maxFPIndices)))
## [1] 59
# unique fingerprint features that are maximally correlated with the 60 independent components, and
length(unlist(maxFPIndices)) == length(unique(unlist(maxFPIndices)))
## [1] FALSE
#so this time there is a small overlap: two independent components share the same most-correlated fingerprint feature.

# Plot the sorted correlations and compare them to the maximum correlations obtained from random vectors:

fillunif <- function(i, size){runif(size)}

randmatICA <- sapply(1:dimen, fillunif, numCompounds)

randmatFPS <- sapply(1:numFeatures, fillunif, numCompounds)

maxRandCors <- sapply(1:dimen, fpMaxCors, randmatICA, randmatFPS)

plot(1:dimen,sort(maxFPICACors),type="l",
     xlab=
       paste("Blue dotted line = typical random correlation \n of ", 
             as.character(dimen), " by ", as.character(numFeatures), 
             " random vectors with ", as.character(numCompounds),  
             " elements"),
     ylab = "Absolute Correlation",ylim=c(0,1),col="darkred")

lines(1:dimen, sort(maxRandCors),lty = 2, col="darkblue")

plot of chunk correlation60

#And if we use this to do a little simple feature selection... notice how it
#takes all 60 components to produce the lowest (resubstitution) error.

#Removing just the 8 components with the lowest correlation to the bit features
length(which(maxFPICACors >0.7))
## [1] 52
naiveFit <- naiveBayes(fpICA$S[,which(maxFPICACors >0.7)],PAMPAPerm,laplace=1)
suppressWarnings(table(predict(naiveFit, fpICA$S), PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high   96  21
##      low    37   8
#Versus not removing any

naiveFit <- naiveBayes(fpICA$S,PAMPAPerm,laplace=1)
suppressWarnings(table(predict(naiveFit, fpICA$S), PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high   94   2
##      low    39  27
#If we remove, say, the top 6 correlated components (i.e., start the backward
#selection by removing the most highly correlated), we get somewhat the same
#performance as removing a few from the bottom end.

naiveFit <- naiveBayes(fpICA$S[,which(maxFPICACors < 0.945)],PAMPAPerm,laplace=1)
suppressWarnings(table(predict(naiveFit, fpICA$S), PAMPAPerm, 
      dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high  100  14
##      low    33  15
#How about if the correlated fingerprint features are used instead:

naiveFit <- NaiveBayes(factorfp[,maxFPIndices],PAMPAPerm, 
                       usekernel=TRUE,fL=1)
suppressWarnings(table(predict(naiveFit, factorfp[,maxFPIndices])$class, 
                       PAMPAPerm, 
                       dnn=list("predicted","actual")))
##          actual
## predicted high low
##      high   71   5
##      low    62  24
#This is only marginally better than random, with an error rate of roughly 41
#percent, but it is better than the roughly 50 percent error rate of the initial
#fit on the raw fingerprint factors.  These fingerprint features are probably of
#some importance.