Friday, October 2, 2015

Independent Component Analysis and Naive Bayes, Part 1: Continuous Data

Naive Bayes and Independent Component Analysis

Here is some fun with Naive Bayes modeling and Independent Component Analysis (ICA). Naive Bayes modeling assumes that the features are independent, yet the method works well, or at least reasonably well, even when there is some dependency among the features (see Survey of Improving Naive Bayes for Classification). Using ICA this way is nothing new. For instance, ICA is available as a preprocessing transformation in the predictive modeling framework of the R caret package. It may or may not be appropriate, of course, depending on the data.
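
A minimal sketch of that caret usage (the choice of n.comp = 3 is arbitrary here, and caret calls fastICA under the hood, so that package needs to be available):

library(caret)
# Estimate an ICA preprocessing transform on the predictors, then apply it
icaPreProc <- preProcess(iris[,1:4], method = "ica", n.comp = 3)
head(predict(icaPreProc, iris[,1:4]))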

What is tougher to understand is how well Naive Bayes works with fixed-length, binary vector data. Empirically, it appears to do fine with some, possibly a lot of, dependency among the binary features (see Independent Component Analysis for Binary Data: An Experimental Study).

So, naturally, if a modeler suspects the binary feature data to be highly dependent, transforming the features to a set of independent components might well be advised. In the simplest methods, the binary features are transformed to continuous independent components, as in the citation above. Transforming them optimally to binary independent components is NP-hard, so approximation approaches are used instead, and these tend to be more computationally intensive. Here are some very worthy attempts at this problem: ICA-based Binary Feature Construction and Generalized Independent Component Analysis Over Finite Alphabets.

I was first drawn to this problem upon seeing that Naive Bayes was being used to model fixed-length bit vector data (“fingerprints”), and had become quite popular a few years back in chemoinformatics. As it turns out, the construction of such fingerprints (molecular feature fingerprints) is rife with dependency among the bits (and this is not hard to show). It is something I know from having used them extensively and, indeed, having designed several.

I am not going to launch into binary data just yet (see the next post), but will instead use a simple continuous data set to see what kind of effect ICA has on Naive Bayes and several other classifiers. Below there is also a comparison with PCA, which addresses correlation rather than independence. As chance would have it, the results show that independence has a greater impact on the modeling with these data than correlation does.

Libraries and data

library(fastICA) # For ICA functions
library(ica) # More specific ICA functions
library(e1071) # For naive bayes
library(klaR)  # For Naive Bayes and partimat (plotting features pair-wise to show 2D decision surface)
## Loading required package: MASS
library(AppliedPredictiveModeling) # For modelling tools
data(iris) # The hackneyed standby.  There could be some dependency among these features, but how much and why is not necessarily something that would be known beforehand.

Naive Bayes with Raw Iris Data (no holdout or validation)

Here is a peek at all of the data, with no holdout; we know that with validation the performance will degrade.

predictActStrings <- list("predicted","actual")

irisNBclassifier <- NaiveBayes(iris[,1:4],iris[,5])

table(predict(irisNBclassifier,iris[,1:4])$class,iris[,5], 
      dnn=predictActStrings)
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47

Two of the classes have a small number of misclassifications.
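
As an aside, overall accuracy can be read straight off such a confusion table (a small sketch re-using the classifier above; confTab is just an illustrative name):

confTab <- table(predict(irisNBclassifier,iris[,1:4])$class,iris[,5],
                 dnn=predictActStrings)
sum(diag(confTab))/sum(confTab) # proportion correct, here 144/150 = 0.96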

Scaling

Throw in some scaling, as is typical.

#scale the data, in the event that the scales of the predictor values are significantly different (common)
scaledIris <- scale(iris[,1:4])
irisNBclassifierScaled <- NaiveBayes(scaledIris[,1:4],iris[,5])

table(predict(irisNBclassifierScaled,scaledIris[,1:4])$class,iris[,5], 
      dnn=predictActStrings)
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         3
##   virginica       0          3        47

Scaling does not really change things.

Example with 3-Dimensional ICA

Here again is a peek at all of the data, this time after ICA, first with 3 dimensions (one fewer than the raw data), then with 2, and then with 1.

irisICA3 <- fastICA(scaledIris[,1:4],3)$S

irisNBclassifier3 <- NaiveBayes(irisICA3,iris[,5])

table(predict(irisNBclassifier3,irisICA3)$class,iris[,5], 
      dnn=predictActStrings)
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         3
##   virginica       0          2        47

3D ICA is ever so slightly better.

Example with 2-Dimensional ICA

irisICA2 <- fastICA(scaledIris[,1:4],2)$S

irisNBclassifier2 <- NaiveBayes(irisICA2,iris[,5])

table(predict(irisNBclassifier2,irisICA2)$class,iris[,5], 
      dnn=predictActStrings)
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         45        11
##   virginica       0          5        39

2D ICA is worse.

Example with 1-Dimensional ICA

irisICA1 <- fastICA(scaledIris[,1:4],1)$S

irisNBclassifier1 <- NaiveBayes(irisICA1,iris[,5])

table(predict(irisNBclassifier1,irisICA1)$class,iris[,5], 
      dnn=predictActStrings)
##             actual
## predicted    setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         44         5
##   virginica       0          6        45

1D ICA is, oddly, better than 2D, but not as good as 3 dimensions (components).

Remove Setosa and Create a Two-Class Problem

As a class, setosa is well separated from the other two, so let's just drop it for simplicity. Scaling is ignored as well.

iris2Class <- iris[iris[,5] != "setosa",]
iris2Class[,5] <- factor(iris2Class[,5])
# (no scaling this time)

iris2ClassNBclassifier <- NaiveBayes(iris2Class[,1:4],iris2Class[,5])

table(predict(iris2ClassNBclassifier,iris2Class[,1:4])$class,iris2Class[,5], 
      dnn=predictActStrings)
##             actual
## predicted    versicolor virginica
##   versicolor         47         3
##   virginica           3        47

Removing the setosa class plays no role in the misclassification of the other two.

Cross-Validate with Repeated Random Sub-Sampling Validation

Now comes the fun part, where the model is cross-validated, in this case with repeated random sub-sampling. There are other equally good validation methods, each with its own pros and cons, which I will not go into here.

We'll need a little function where we can change various parameters.

randomSS <- function(i,                 # iterator
                     trainpc,           # training percent of data
                     lenData,           # total length of the data
                     numpred,           # indices of the predictor columns
                     predictors,        # predictor variables
                     classVar,          # dependent variable
                     predictActStrings  # class labels
                     ){
  trainIndices <- sort(sample(lenData, trainpc*lenData))
  trainSet <- cbind(predictors[trainIndices,numpred],classVar[trainIndices])
  testSet <- cbind(predictors[-trainIndices,numpred],classVar[-trainIndices])
  trainNBclassifier <- NaiveBayes(trainSet[,numpred],as.factor(trainSet[,ncol(trainSet)]))
  ConfusionTable <- table(predict(trainNBclassifier,testSet[,numpred])$class,
                          testSet[,ncol(testSet)], dnn=predictActStrings)
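  # Percent misclassified: the two off-diagonal cells of the 2x2 confusion table over the test-set size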
  100*(ConfusionTable[[2]] + ConfusionTable[[3]])/(lenData-length(trainIndices))
}

With all columns of the two-class data…

set.seed(13) # Let's make things deterministic

# The next three variables won't be changed throughout
lenData <- length(iris2Class[,1])
K <- 100 # Iterations (the number of repeats)
trainpc <- 0.66 # Take 66 percent of the data for the training set

# We will however change the number of predictor (or component) variables
numpred <- 1:4    # Number of predictors

#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2Class,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 7.176471
sd(ErrorVector)
## [1] 3.6996

Try PCA for a comparison.

PCA tries to ferret out correlation, not statistical dependence in general.
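
To make the distinction concrete: principal component scores are uncorrelated by construction, but zero correlation is a weaker condition than the independence ICA aims for. A quick check (a sketch):

round(cor(prcomp(iris2Class[,1:4])$x), 4) # off-diagonal correlations are essentially zero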

PCA2Class <- prcomp(iris2Class[,1:4])
iris2ClassPCA <- as.data.frame(predict(PCA2Class,iris2Class))

plot(PCA2Class) #Plot the scree plot (the relative variability of the components from first (largest is left most) to last (smallest is right most))

[Plot: scree plot of the principal components (chunk PCAComparison)]

ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2ClassPCA,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

mean(ErrorVector)
## [1] 6.970588
sd(ErrorVector)
## [1] 3.868965

The scree plot shows that most of the variability is in the first principal component, with only a small contribution from the second and third, and very little from the fourth. Nevertheless, the model performs ever so slightly better.
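
If the exact numbers behind the scree plot are wanted, summary() on the prcomp object reports the proportion of variance per component (a quick sketch):

summary(PCA2Class)$importance # standard deviation, proportion of variance, and cumulative proportion per PC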

We can try it again with just the top three components.

iris2ClassPCA <- iris2ClassPCA[,1:3]
numpred <- 1:3
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2ClassPCA,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

mean(ErrorVector)
## [1] 6.323529
sd(ErrorVector)
## [1] 3.794165

This is a bit better. Let's see what happens if we take just the first two principal components.

iris2ClassPCA <- iris2ClassPCA[,1:2]
numpred <- 1:2

ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2ClassPCA,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

mean(ErrorVector)
## [1] 6.735294
sd(ErrorVector)
## [1] 3.745957

Not as good.

Cross-Validate with the First 3 Raw Predictors (no ICA or PCA)

numpred <- 1:3    # Number of predictors

#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2Class,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 17
sd(ErrorVector)
## [1] 5.095639

Cross-Validate with the First 2 Raw Predictors (no ICA or PCA)

numpred <- 1:2    # Number of predictors

#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2Class,iris2Class[,ncol(iris2Class)],
                      predictActStrings)

print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 31.08824
sd(ErrorVector)
## [1] 5.781292

Transforming the testing data in relation to the training data

Different and better! The error has dropped some. But I am cheating. By calculating ICA for all of the data, I am including some information about the holdout test data. Instead, we can calculate the ICA for the training data only, retrieve the estimated mixing matrix, and use its inverse to generate the ICA values for the testing predictor variables. Thus, if \( X \) is the pre-processed data, \( A \) is the mixing matrix, and \( S \) is the matrix of ICA components, then \[ X = SA. \] Given that, we can compute \[ X'A^{-1} = S', \] where \( X' \) holds the testing predictor variables and \( S' \) the corresponding estimated ICA values for the testing data, derived via the mixing matrix \( A \) estimated from the training data (this requires \( A \) to be square, i.e., as many components as features). There is one more detail, however: in fastICA the matrix \( X \) is mean-centered first. Those means are the training-data means, and they must also be used to center the testing data in order to calculate \( S' \). First off, we will use all of the features.

As it turns out, \( K \), the pre-whitening matrix, and \( W \), the un-mixing matrix, can be used instead of \( A^{-1} \), such that \[ X'KW = S'. \] Code for both ways is included. One can play around with system.time() to check for any performance difference, but the two are roughly equivalent for this small problem.
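
As a quick sanity check of these identities (a sketch; the particular seed and the use of all four features are arbitrary choices):

set.seed(42)
icaCheck <- fastICA(as.matrix(iris2Class[,1:4]), 4)
# icaCheck$X is the mean-centered data; both identities should hold to numerical precision
max(abs(icaCheck$X %*% icaCheck$K %*% icaCheck$W - icaCheck$S)) # S = XKW, essentially 0
max(abs(icaCheck$S %*% icaCheck$A - icaCheck$X))                # X = SA, essentially 0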

#  We need to alter the randomSS function to compute ICA inside each cross-validation split.
randomSSCV <- function(i,               # iterator
                     dimen,             # number of dimensions
                     splitsize,         # size of training set
                     lenData,           # total length of the data
                     numpred,           # indices of the predictor columns
                     predictors,        # predictor variables
                     classVar,          # dependent variable
                     predictActStrings  # class labels
                     ){
  trainIndices <- sort(sample(lenData, splitsize))
  trainSet <- predictors[trainIndices,numpred]
  #needed to mean center testing data to training data means
  means <- apply(trainSet, 2, mean) 
  ICA <- fastICA(trainSet, dimen)
  trainSet <- cbind(ICA$S,classVar[trainIndices])
  #Note, deriving the testSet S, requires that the raw test set data is mean
  #centered to the training raw data means, hence the centering call to "scale"
  #testSet <- cbind(scale(predictors[-trainIndices,numpred],center=means) 
  #                %*% solve(ICA$A),
  #                classVar[-trainIndices])

  testSet <- cbind(scale(predictors[-trainIndices,numpred],center=means) 
                 %*% (ICA$K %*% ICA$W),
                 classVar[-trainIndices])
  trainNBclassifier <- 
               NaiveBayes(trainSet[,numpred],as.factor(trainSet[,dimen+1]))
  ConfusionTable <- table(suppressWarnings(
                          predict(trainNBclassifier,testSet[,numpred])$class),
                          testSet[,dimen+1], dnn=predictActStrings)

  100*(ConfusionTable[[2]] + ConfusionTable[[3]])/(lenData-length(trainIndices))
}
ErrorVector <- numeric(K)
dimen <- 4
numpred <- 1:dimen
iris2Classmat <- as.matrix(iris2Class[,1:dimen])

splitsize <- floor(trainpc * lenData)
system.time(ErrorVector <- sapply(1:K,randomSSCV,dimen, splitsize,lenData,numpred,
                      iris2Class,iris2Class[,ncol(iris2Class)],
                      predictActStrings))
##    user  system elapsed 
##   3.621   0.023   3.656
mean(ErrorVector)
## [1] 20.05882
sd(ErrorVector)
## [1] 11.18897
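# The same cross-validation done explicitly in a loop, this time using solve(ICA$A)
# instead of K %*% W (all 4 features, 4 independent components)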
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:4])
for(i in 1:K){
  trainIndices <- sort(sample(lenData, 0.66*lenData))

  trainSet <- iris2Classmat[trainIndices,]
  means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
  ICA <- fastICA(trainSet[,1:4],4)
  trainSet <- ICA$S
  #Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
  testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
  trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
  ConfusionTable <- table(suppressWarnings(
                          predict(trainNBclassifier,testSet)$class),
                      iris2Class[-trainIndices,5], 
                      dnn=predictActStrings)
  ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
    (lenData-length(trainIndices))
}

mean(ErrorVector)
## [1] 18.94118
sd(ErrorVector)
## [1] 9.572309
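# Repeat, using only the first 3 features and a 3-component ICA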
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:3])
for(i in 1:K){
  trainIndices <- sort(sample(lenData, 0.66*lenData))

  trainSet <- iris2Classmat[trainIndices,]
  means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
  ICA <- fastICA(trainSet[,1:3],3)
  trainSet <- ICA$S
  #Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
  testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
  trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
  ConfusionTable <- table(suppressWarnings(
                          predict(trainNBclassifier,testSet)$class),
                      iris2Class[-trainIndices,5], 
                      dnn=predictActStrings)
  ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
    (lenData-length(trainIndices))
}

mean(ErrorVector)
## [1] 18.70588
sd(ErrorVector)
## [1] 7.736801
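# Repeat, using only the first 2 features and a 2-component ICA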
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:2])
for(i in 1:K){
  trainIndices <- sort(sample(lenData, 0.66*lenData))

  trainSet <- iris2Classmat[trainIndices,]
  means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
  ICA <- fastICA(trainSet[,1:2],2)
  trainSet <- ICA$S
  #Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
  testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
  trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
  ConfusionTable <- table(suppressWarnings(
                          predict(trainNBclassifier,testSet)$class),
                      iris2Class[-trainIndices,5], 
                      dnn=predictActStrings)
  ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
    (lenData-length(trainIndices))
}

mean(ErrorVector)
## [1] 36.64706
sd(ErrorVector)
## [1] 7.377203

Pairs Plots of Two-Class Scaled Raw Data

[Plot: pairs plot of the two-class scaled raw data (chunk unnamed-chunk-1)]

The Correlation Matrix of the Two-Class Raw Data

cor(iris2Class[,1:4])
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000   0.5538548    0.8284787   0.5937094
## Sepal.Width     0.5538548   1.0000000    0.5198023   0.5662025
## Petal.Length    0.8284787   0.5198023    1.0000000   0.8233476
## Petal.Width     0.5937094   0.5662025    0.8233476   1.0000000

Plot class distributions by raw data features

par(mfrow=c(1,4))
for(i in 1:4){
  plot(density(iris2Class[iris2Class$Species == "versicolor",i]), col = "red",
       main=names(iris2Class)[i])
  lines(density(iris2Class[iris2Class$Species == "virginica",i]), col = "blue")
}

[Plot: density of each raw feature by class (chunk plotclassdistraw)]

Pairs Plots of 3D ICA

iris3DICA <- fastICA(iris2Class[,1:4],3)$S # ICA on the two-class data (irisICA3 above was fit to all three classes)
pairs(iris3DICA, main="Iris Data 2 Class ICA 3D", pch=21,
      bg=c("red","blue")[unclass(iris2Class$Species)])

[Plot: pairs plot of the 3D independent components, colored by species (chunk plot3DICA)]

Distributions given ICA 3D features

Here we can look at the class separation by independent component.

iris3DICA <- fastICA(iris2Class[,1:4],3)$S
VectorNames <- c("ICAV1","ICAV2","ICAV3")
par(mfrow=c(1,3))
for(i in 1:3){
   xlimits <- c(min(iris3DICA[,i]),max(iris3DICA[,i]))
   plot(density(iris3DICA[iris2Class$Species == "versicolor",i]), 
        xlim = xlimits,col = "red",main=VectorNames[i])
   lines(density(iris3DICA[iris2Class$Species == "virginica",i]), col = "blue")
}

[Plot: density of each independent component by class (chunk plotdist3Dclassseparation)]

Mean and Variance of ICA 3D Features

meanVarTables <- NaiveBayes(iris3DICA,
                            iris2Class[,5])$tables
meanVarTables[[1]]
##                   [,1]     [,2]
## versicolor  0.02231372 0.953892
## virginica  -0.02231372 1.062963
Vector1Matrix <- meanVarTables[[1]]
Vector2Matrix <- meanVarTables[[2]]
Vector3Matrix <- meanVarTables[[3]]

Plot Mean and Variance of ICA 3D Features

par(mfrow=c(1,3))
plot(function(x) dnorm(x, Vector1Matrix[1,1], Vector1Matrix[1,2]), 
     min(Vector1Matrix[,1])-(3*max(Vector1Matrix[,2])), 
     max(Vector1Matrix[,1])+(3*max(Vector1Matrix[,2])),
     col="red", main="Vector 1 for the 2 different species",
     ylab="density")
curve(dnorm(x, Vector1Matrix[2,1], Vector1Matrix[2,2]), 
      add=TRUE, col="blue")

plot(function(x) dnorm(x, Vector2Matrix[1,1], Vector2Matrix[1,2]), 
     min(Vector2Matrix[,1])-(3*max(Vector2Matrix[,2])), 
     max(Vector2Matrix[,1])+(3*max(Vector2Matrix[,2])),
     col="red", main="Vector 2 for the 2 different species",
     ylab="density")
curve(dnorm(x, Vector2Matrix[2,1], Vector2Matrix[2,2]), 
      add=TRUE, col="blue")

plot(function(x) dnorm(x, Vector3Matrix[1,1], Vector3Matrix[1,2]), 
     min(Vector3Matrix[,1])-(3*max(Vector3Matrix[,2])), 
     max(Vector3Matrix[,1])+(3*max(Vector3Matrix[,2])),
     col="red", main="Vector 3 for the 2 different species",
     ylab="density")
curve(dnorm(x, Vector3Matrix[2,1], Vector3Matrix[2,2]), 
      add=TRUE, col="blue")

[Plot: fitted normal densities of the three independent components, by species (chunk plotmeanVarICA3D)]

Mean and Variance of ICA 2D Features

iris2DICA <- fastICA(iris2Class[,1:4],2)$S
meanVarTables <- NaiveBayes(iris2DICA,
                            iris2Class[,5])$tables
meanVarTables[[1]]
##                 [,1]      [,2]
## versicolor -0.853422 0.5374195
## virginica   0.853422 0.5153742
Vector1Matrix <- meanVarTables[[1]]
Vector2Matrix <- meanVarTables[[2]]

Plot Mean and Variance of ICA 2D Features

par(mfrow=c(1,2))
plot(function(x) dnorm(x, Vector1Matrix[1,1], Vector1Matrix[1,2]), 
     min(Vector1Matrix[,1])-(3*max(Vector1Matrix[,2])), 
     max(Vector1Matrix[,1])+(3*max(Vector1Matrix[,2])),
     col="red", main="Vector 1 for the 2 different species",
     ylab="density")
curve(dnorm(x, Vector1Matrix[2,1], Vector1Matrix[2,2]), 
      add=TRUE, col="blue")

plot(function(x) dnorm(x, Vector2Matrix[1,1], Vector2Matrix[1,2]), 
     min(Vector2Matrix[,1])-(3*max(Vector2Matrix[,2])), 
     max(Vector2Matrix[,1])+(3*max(Vector2Matrix[,2])),
     col="red", main="Vector 2 for the 2 different species",
     ylab="density")
curve(dnorm(x, Vector2Matrix[2,1], Vector2Matrix[2,2]), 
      add=TRUE, col="blue")

[Plot: fitted normal densities of the two independent components, by species (chunk plotmeanVarICA2D)]

How Does a Random Forest Model Compare, Given ICA?

library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
irisRF <- randomForest(iris3DICA[,],
                          iris2Class[,5], importance=TRUE,
                          proximity=TRUE)
print(irisRF)
## 
## Call:
##  randomForest(x = iris3DICA[, ], y = iris2Class[, 5], importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 3%
## Confusion matrix:
##            versicolor virginica class.error
## versicolor         47         3        0.06
## virginica           0        50        0.00
irisRF <- randomForest(iris2Class[,1:4],
                          iris2Class[,5], importance=TRUE,
                          proximity=TRUE)
print(irisRF)
## 
## Call:
##  randomForest(x = iris2Class[, 1:4], y = iris2Class[, 5], importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 6%
## Confusion matrix:
##            versicolor virginica class.error
## versicolor         47         3        0.06
## virginica           3        47        0.06

This is pretty good! Remember, though, that the independent components were generated from all of the two-class data.

…And a Quadratic Classifier?

This is yet one more way to look at the data and the decision surface. In the case of a quadratic classifier the decision surfaces are quadrics. They can also be shown with interactive 3D graphics in R via rgl, which, though fun, is not necessarily any more illuminating than the pair-wise plots shown below.
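
Something like the following would give that interactive 3D view (a sketch; rgl must be installed, and the coloring mirrors the pairs plot above):

library(rgl) # interactive 3D graphics
plot3d(iris3DICA, col = c("red","blue")[unclass(iris2Class$Species)],
       xlab = "ICAV1", ylab = "ICAV2", zlab = "ICAV3", size = 5)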

grouping <- factor(c(rep("E", 50),rep("I",50)))
partimat(grouping~iris3DICA[,1]+iris3DICA[,2]+iris3DICA[,3],
         method="qda",prec= 300,plot.matrix=TRUE)

[Plot: pairwise QDA decision surfaces for the three independent components (chunk decisionsurface)]

#Notice that one of the variable combinations is really worthless.
#As an aside one can move the decision surface by changing the 'prior' argument
partimat(grouping~iris3DICA[,1]+iris3DICA[,2]+iris3DICA[,3],
         method="rda",prec= 300,plot.matrix=TRUE,prior=c(1.75,0.25)/2)

[Plot: pairwise RDA decision surfaces with the modified priors (chunk decisionsurface)]
