Naive Bayes and Independent Component Analysis
Here is some fun with Naive Bayes modeling and Independent Component Analysis (ICA). Naive Bayes modeling assumes that the features are independent, and the method works well, or at least reasonably well, even when there is some dependency among the features (see, e.g., Survey of Improving Naive Bayes for Classification). This is nothing new. For instance, ICA as a transformation can be found in the predictive modeling framework of the R caret package as a preprocessing step. It may or may not be appropriate, of course, depending on the data.
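For instance, it might look something like the following sketch (my illustration, not the workflow used below, assuming the caret and fastICA packages are installed, with n.comp chosen arbitrarily):
library(caret) # provides preProcess(), which wraps fastICA for method="ica"
data(iris)
# Estimate the ICA transformation on the four numeric features...
pp <- preProcess(iris[, 1:4], method = "ica", n.comp = 2)
# ...and apply it to obtain the independent components as new predictors
icaFeatures <- predict(pp, iris[, 1:4])
head(icaFeatures)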
What is tougher to understand is how well this works with fixed-length, binary vector data. Empirically, it appears to do fine with some, possibly a lot of, dependency among the binary features (see Independent Component Analysis for Binary Data: An Experimental Study).
So, naturally, if a modeler suspects the binary feature data to be highly dependent, transforming the features to a set of independent components might well be advised. In the simplest methods, the binary features are transformed to continuous independent components, as in the citation above. Transforming them optimally to binary independent components is NP-hard, so approximations are used, and these tend to be more computationally intensive. Here are some very worthy attempts at this problem: ICA-based Binary Feature Construction, and Generalized Independent Component Analysis Over Finite Alphabets.
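As a toy illustration of the simplest route – continuous components extracted from binary features – here is a minimal sketch of my own with synthetic, deliberately redundant bits (not real fingerprints):
library(fastICA)
set.seed(1)
z <- matrix(rbinom(200 * 4, 1, 0.5), ncol = 4) # 4 independent coin-flip features
bits <- cbind(z, z[, 1] * z[, 2], 1 - z[, 3], z[, 2], z[, 4]) # 4 redundant columns bolted on
S <- fastICA(bits, n.comp = 3)$S # continuous components from 0/1 inputs
apply(S, 2, range) # the extracted components are continuous, not binary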
I was first drawn to this problem upon seeing that naive Bayes was being used to model fixed-length bit-vector data (“fingerprints”), having become quite popular a few years back in chemoinformatics. The construction of such fingerprints (molecular feature fingerprints), however, is rife with dependency, as it turns out (and it is not hard to show). This is something I know from having used them extensively and, indeed, having designed several.
I am not going to launch into binary data just yet (see the next post). Instead, I will use a simple continuous data set to see what kind of effect ICA has on naive Bayes and several other classifiers. Below there is also a comparison with PCA, which addresses correlation rather than dependence. The results show that, with these data, independence has a greater impact on the modeling than correlation does – as chance would have it.
Libraries and data
library(fastICA) # For ICA functions
library(ica) # More specific ICA functions
library(e1071) # For naive bayes
library(klaR) # For Naive Bayes and partimat (plotting features pair-wise to show 2D decision surface)
## Loading required package: MASS
library(AppliedPredictiveModeling) # For modelling tools
data(iris) # The hackneyed standby. There could be some dependency among these features, but how much and why is not necessarily something that would be known beforehand.
Naive Bayes with Raw Iris Data (no holdout or validation)
Here is a peek at a fit on all of the data – we know that with validation the performance will degrade.
predictActStrings <- list("predicted","actual")
irisNBclassifier <- NaiveBayes(iris[,1:4],iris[,5])
table(predict(irisNBclassifier,iris[,1:4])$class,iris[,5],
dnn=predictActStrings)
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
Two classes have a small number of misclassifications.
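As a small convenience, the overall percent error can be read off any of these confusion tables with a one-liner (my helper, not part of the original workflow):
# Percent misclassified: 1 minus the diagonal fraction of the table
errorRate <- function(confusionTable)
  100 * (1 - sum(diag(confusionTable)) / sum(confusionTable))
errorRate(table(predict(irisNBclassifier, iris[, 1:4])$class, iris[, 5]))
# 6 errors out of 150 cases, i.e. 4 percent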
Scaling
Throw some scaling in as is typical.
#scale the data, in the event that the scales of the predictor values are significantly different (common)
scaledIris <- scale(iris[,1:4])
irisNBclassifierScaled <- NaiveBayes(scaledIris[,1:4],iris[,5])
table(predict(irisNBclassifierScaled,scaledIris[,1:4])$class,iris[,5],
dnn=predictActStrings)
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 47 3
## virginica 0 3 47
Scaling does not really change things.
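That is expected: NaiveBayes fits one Gaussian per feature and class, and rescaling a feature shifts and scales every class's mean and standard deviation identically, so the per-class likelihoods change by a common factor that cancels in the posterior. A quick sanity check of my own:
# The class posteriors from the raw and scaled fits should agree
rawPost <- predict(irisNBclassifier, iris[, 1:4])$posterior
scaledPost <- predict(irisNBclassifierScaled, scaledIris[, 1:4])$posterior
all.equal(rawPost, scaledPost)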
Example with 3-Dimensional ICA
Here again is a peek at a fit on all of the data, now with ICA, first with 3 dimensions (one less than the raw data). Then 2 dimensions, and then 1, will be looked at.
irisICA3 <- fastICA(scaledIris[,1:4],3)$S
irisNBclassifier3 <- NaiveBayes(irisICA3,iris[,5])
table(predict(irisNBclassifier3,irisICA3)$class,iris[,5],
dnn=predictActStrings)
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 3
## virginica 0 2 47
3D ICA is ever so slightly better.
Example with 2-Dimensional ICA
irisICA2 <- fastICA(scaledIris[,1:4],2)$S
irisNBclassifier2 <- NaiveBayes(irisICA2,iris[,5])
table(predict(irisNBclassifier2,irisICA2)$class,iris[,5],
dnn=predictActStrings)
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 45 11
## virginica 0 5 39
2D ICA is worse.
Example with 1-Dimensional ICA
irisICA1 <- fastICA(scaledIris[,1:4],1)$S
irisNBclassifier1 <- NaiveBayes(irisICA1,iris[,5])
table(predict(irisNBclassifier1,irisICA1)$class,iris[,5],
dnn=predictActStrings)
## actual
## predicted setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 44 5
## virginica 0 6 45
1D ICA is, oddly, better than 2D, though not as good as 3 dimensions (components).
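fastICA hunts for maximally non-Gaussian directions, so one quick diagnostic is the excess kurtosis of each extracted component – roughly 0 for a Gaussian, far from 0 otherwise. An aside of my own, using kurtosis() from the already-loaded e1071 package:
# Excess kurtosis of the three independent components
apply(irisICA3, 2, kurtosis)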
Remove Setosa and Create a Two-Class Problem
As a class, setosa is well separated, so let's just get rid of it for simplicity. We will ignore scaling as well.
iris2Class <- iris[iris[,5] != "setosa",]
iris2Class[,5] <- factor(iris2Class[,5])
iris2ClassNBclassifier <- NaiveBayes(iris2Class[,1:4],iris2Class[,5])
table(predict(iris2ClassNBclassifier,iris2Class[,1:4])$class,iris2Class[,5],
dnn=predictActStrings)
## actual
## predicted versicolor virginica
## versicolor 47 3
## virginica 3 47
Removing the class plays no role in the misclassification of the other two.
Cross-Validate with Repeated Random Sub-Sampling Validation
Now comes the fun part, where the model is cross-validated, in this case with repeated random sub-sampling. There are other equally good validation methods, each with its own pros and cons, which I will omit here.
We'll need a little function where we can change various parameters.
randomSS <- function(i, # iterator
                     trainpc, # fraction of the data used for training
                     lenData, # total number of rows in the data
                     numpred, # indices of the predictor columns
                     predictors, # predictor variables
                     classVar, # dependent (class) variable
                     predictActStrings # dimension labels for the confusion table
                     ){
  trainIndices <- sort(sample(lenData, trainpc*lenData))
  trainSet <- cbind(predictors[trainIndices,numpred],classVar[trainIndices])
  testSet <- cbind(predictors[-trainIndices,numpred],classVar[-trainIndices])
  trainNBclassifier <- NaiveBayes(trainSet[,numpred],as.factor(trainSet[,ncol(trainSet)]))
  ConfusionTable <- table(predict(trainNBclassifier,testSet[,numpred])$class,
                          testSet[,ncol(testSet)], dnn=predictActStrings)
  # Percent misclassified: the two off-diagonal cells over the size of the test set
  100*(ConfusionTable[[2]] + ConfusionTable[[3]])/(lenData-length(trainIndices))
}
First, with all four columns of the (unscaled) two-class data…
set.seed(13) # Let's make things deterministic
# The next three variables won't be changed throughout
lenData <- length(iris2Class[,1])
K <- 100 # Iterations (the number of repeats)
trainpc <- 0.66 # Take 66 percent of the data for the training set
# We will however change the predictor (or component) columns used
numpred <- 1:4 # Predictor column indices
#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
iris2Class,iris2Class[,ncol(iris2Class)],
predictActStrings)
print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 7.176471
sd(ErrorVector)
## [1] 3.6996
Try PCA for a comparison.
PCA tries to ferret out correlation, not dependence.
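The distinction matters: variables can be completely dependent and still have essentially zero linear correlation. A standard toy example:
set.seed(42)
x <- rnorm(10000)
y <- x^2 # y is a deterministic function of x...
cor(x, y) # ...yet the linear correlation is near zero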
PCA2Class <- prcomp(iris2Class[,1:4])
iris2ClassPCA <- as.data.frame(predict(PCA2Class,iris2Class))
plot(PCA2Class) #Plot the scree plot (the relative variability of the components from first (largest is left most) to last (smallest is right most))
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
iris2ClassPCA,iris2Class[,ncol(iris2Class)],
predictActStrings)
mean(ErrorVector)
## [1] 6.970588
sd(ErrorVector)
## [1] 3.868965
The scree plot shows that most of the variability is in the first principal component, with only a small contribution from the second and third, and very little from the fourth. Nevertheless, the model performs ever so slightly better.
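For the record, the proportions behind the scree plot can also be printed directly (the same information as the plot, just numeric):
# Proportion of total variance captured by each principal component
round(PCA2Class$sdev^2 / sum(PCA2Class$sdev^2), 3)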
We can try it again with just the top three components.
iris2ClassPCA <- iris2ClassPCA[,1:3]
numpred <- 1:3
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
iris2ClassPCA,iris2Class[,ncol(iris2Class)],
predictActStrings)
mean(ErrorVector)
## [1] 6.323529
sd(ErrorVector)
## [1] 3.794165
This is a bit better. Let's see what happens if we take just the first two principal components.
iris2ClassPCA <- iris2ClassPCA[,1:2]
numpred <- 1:2
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
iris2ClassPCA,iris2Class[,ncol(iris2Class)],
predictActStrings)
mean(ErrorVector)
## [1] 6.735294
sd(ErrorVector)
## [1] 3.745957
Not as good.
Cross-Validate ICA 3D without PCA
# Components computed from ALL of the two-class data (note: this leaks test
# information into the transform; see the discussion further below)
iris2ClassICA3 <- fastICA(as.matrix(iris2Class[,1:4]),3)$S
numpred <- 1:3 # Component column indices
#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2ClassICA3,iris2Class[,ncol(iris2Class)],
                      predictActStrings)
print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 17
sd(ErrorVector)
## [1] 5.095639
Cross-Validate ICA 2D without PCA
# Again, components computed from ALL of the two-class data
iris2ClassICA2 <- fastICA(as.matrix(iris2Class[,1:4]),2)$S
numpred <- 1:2 # Component column indices
#Iterate through K repeats
ErrorVector <- sapply(1:K,randomSS,trainpc,lenData,numpred,
                      iris2ClassICA2,iris2Class[,ncol(iris2Class)],
                      predictActStrings)
print("The mean and standard deviation of the classification error, (0-1)")
## [1] "The mean and standard deviation of the classification error, (0-1)"
mean(ErrorVector)
## [1] 31.08824
sd(ErrorVector)
## [1] 5.781292
Transforming the testing data in relation to the training data
Quite different results! But I have been cheating. By calculating ICA on all of the data, I am including information about the holdout test data in the transformation. Instead, we can calculate the ICA for the training data only, retrieve the estimated mixing matrix, and use its inverse to generate the ICA values for the testing predictor variables. That is, if \( X \) is the pre-processed data, \( A \) is the mixing matrix, and \( S \) is the matrix of ICA components, then \[ X = SA. \] Given that, we can compute \[ S' = X'A^{-1}, \] where \( X' \) holds the testing predictor variables and \( S' \) is the corresponding matrix of estimated ICA values for the testing data, derived from the training data via the estimated mixing matrix \( A \). There is one more detail, however. In fastICA the matrix \( X \) is mean centered first. Those means are means of the training data, and they must be used to center the testing data in order to calculate \( S' \). First off, we will use all of the features.
As it turns out, \( K \), the pre-whitening matrix, and \( W \), the un-mixing matrix, can be used instead of \( A^{-1} \), such that \[ S' = X'KW. \] Code for both ways is included. One can play around with system.time() for any performance difference, but the two routes are roughly equivalent for this small problem.
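Both identities are easy to sanity-check on a single full-rank fastICA fit before wiring them into the cross-validation (my own check; fastICA returns the centered data in its X slot, and with as many components as features \( X = SA \) holds exactly and \( A \) is invertible):
icaFit <- fastICA(as.matrix(iris2Class[, 1:4]), 4) # full rank, so A is square
all.equal(icaFit$S %*% icaFit$A, icaFit$X, check.attributes = FALSE) # X = SA
all.equal(icaFit$X %*% icaFit$K %*% icaFit$W, icaFit$S, check.attributes = FALSE) # S = XKW
all.equal(solve(icaFit$A), icaFit$K %*% icaFit$W, check.attributes = FALSE) # hence A^{-1} = KW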
# We need to alter the randomSS function to accommodate doing ICA within each fold.
randomSSCV <- function(i, # iterator
                       dimen, # number of ICA dimensions
                       splitsize, # size of the training set
                       lenData, # total number of rows in the data
                       numpred, # indices of the predictor columns
                       predictors, # predictor variables
                       classVar, # dependent (class) variable
                       predictActStrings # dimension labels for the confusion table
                       ){
trainIndices <- sort(sample(lenData, splitsize))
trainSet <- predictors[trainIndices,numpred]
#needed to mean center testing data to training data means
means <- apply(trainSet, 2, mean)
ICA <- fastICA(trainSet, dimen)
trainSet <- cbind(ICA$S,classVar[trainIndices])
#Note, deriving the testSet S, requires that the raw test set data is mean
#centered to the training raw data means, hence the centering call to "scale"
#testSet <- cbind(scale(predictors[-trainIndices,numpred],center=means)
# %*% solve(ICA$A),
# classVar[-trainIndices])
testSet <- cbind(scale(predictors[-trainIndices,numpred],center=means)
%*% (ICA$K %*% ICA$W),
classVar[-trainIndices])
trainNBclassifier <-
NaiveBayes(trainSet[,numpred],as.factor(trainSet[,dimen+1]))
ConfusionTable <- table(suppressWarnings(
predict(trainNBclassifier,testSet[,numpred])$class),
testSet[,dimen+1], dnn=predictActStrings)
100*(ConfusionTable[[2]] + ConfusionTable[[3]])/(lenData-length(trainIndices))
}
ErrorVector <- numeric(K)
dimen <- 4
numpred <- 1:dimen
splitsize <- floor(trainpc * lenData)
system.time(ErrorVector <- sapply(1:K,randomSSCV,dimen, splitsize,lenData,numpred,
iris2Class,iris2Class[,ncol(iris2Class)],
predictActStrings))
## user system elapsed
## 3.621 0.023 3.656
mean(ErrorVector)
## [1] 20.05882
sd(ErrorVector)
## [1] 11.18897
The same cross-validation written out as an explicit loop, this time using the inverse of the mixing matrix, solve(ICA$A), with all 4 features:
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:4])
for(i in 1:K){
trainIndices <- sort(sample(lenData, 0.66*lenData))
trainSet <- iris2Classmat[trainIndices,]
means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
ICA <- fastICA(trainSet[,1:4],4)
trainSet <- ICA$S
#Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
ConfusionTable <- table(suppressWarnings(
predict(trainNBclassifier,testSet)$class),
iris2Class[-trainIndices,5],
dnn=predictActStrings)
ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
(lenData-length(trainIndices))
}
mean(ErrorVector)
## [1] 18.94118
sd(ErrorVector)
## [1] 9.572309
Now the same loop with only the first 3 raw features and 3 components (keeping the mixing matrix square so it can be inverted):
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:3])
for(i in 1:K){
trainIndices <- sort(sample(lenData, 0.66*lenData))
trainSet <- iris2Classmat[trainIndices,]
means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
ICA <- fastICA(trainSet[,1:3],3)
trainSet <- ICA$S
#Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
ConfusionTable <- table(suppressWarnings(
predict(trainNBclassifier,testSet)$class),
iris2Class[-trainIndices,5],
dnn=predictActStrings)
ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
(lenData-length(trainIndices))
}
mean(ErrorVector)
## [1] 18.70588
sd(ErrorVector)
## [1] 7.736801
And with the first 2 raw features and 2 components:
ErrorVector <- numeric(K)
iris2Classmat <- as.matrix(iris2Class[,1:2])
for(i in 1:K){
trainIndices <- sort(sample(lenData, 0.66*lenData))
trainSet <- iris2Classmat[trainIndices,]
means <- apply(trainSet, 2, mean) #needed to mean center testing data to training data means
ICA <- fastICA(trainSet[,1:2],2)
trainSet <- ICA$S
#Note, deriving the testSet S, requires that the raw test set data is mean centered to the training raw data means, hence the centering call to "scale"
testSet <- scale(iris2Classmat[-trainIndices,],center=means) %*% solve(ICA$A) #solve for the inverse
trainNBclassifier <- NaiveBayes(trainSet,iris2Class[trainIndices,5])
ConfusionTable <- table(suppressWarnings(
predict(trainNBclassifier,testSet)$class),
iris2Class[-trainIndices,5],
dnn=predictActStrings)
ErrorVector[i] <- 100*(ConfusionTable[[2]] + ConfusionTable[[3]])/
(lenData-length(trainIndices))
}
mean(ErrorVector)
## [1] 36.64706
sd(ErrorVector)
## [1] 7.377203
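For convenience, here are the mean errors printed above, collected in one place (values copied from the output, nothing recomputed):
recap <- data.frame(
  setup = c("raw, 4 features", "PCA, 4 comp", "PCA, 3 comp", "PCA, 2 comp",
            "ICA on all data, 3 comp", "ICA on all data, 2 comp",
            "per-fold ICA (KW), 4 comp", "per-fold ICA (solve(A)), 4 comp",
            "per-fold ICA (solve(A)), 3 comp", "per-fold ICA (solve(A)), 2 comp"),
  meanError = c(7.18, 6.97, 6.32, 6.74, 17.00, 31.09, 20.06, 18.94, 18.71, 36.65))
recap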
A Look at the Two-Class Raw Data
The Correlation Matrix of the Two-Class Raw Data
cor(iris2Class[,1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 0.5538548 0.8284787 0.5937094
## Sepal.Width 0.5538548 1.0000000 0.5198023 0.5662025
## Petal.Length 0.8284787 0.5198023 1.0000000 0.8233476
## Petal.Width 0.5937094 0.5662025 0.8233476 1.0000000
Plot class distributions by raw data features
par(mfrow=c(1,4))
for(i in 1:4){
plot(density(iris2Class[iris2Class$Species == "versicolor",i]), col = "red",
main=names(iris2Class)[i])
lines(density(iris2Class[iris2Class$Species == "virginica",i]), col = "blue")
}
Pairs Plots of 3D ICA
iris3DICA <- fastICA(as.matrix(iris2Class[,1:4]),3)$S # 3 components from the 2-class data
pairs(iris3DICA,main="Iris Data 2 Class ICA 3D", pch=21,
      bg=c("red","blue")[unclass(iris2Class$Species)])
Distributions given ICA 3D features
Here we can look at the class separation by independent component.
VectorNames <- c("ICAV1","ICAV2","ICAV3")
par(mfrow=c(1,3))
for(i in 1:3){
xlimits <- c(min(iris3DICA[,i]),max(iris3DICA[,i]))
plot(density(iris3DICA[iris2Class$Species == "versicolor",i]),
xlim = xlimits,col = "red",main=VectorNames[i])
lines(density(iris3DICA[iris2Class$Species == "virginica",i]), col = "blue")
}
Mean and Standard Deviation of ICA 3D Features
meanVarTables <- NaiveBayes(iris3DICA,
iris2Class[,5])$tables
meanVarTables[[1]]
## [,1] [,2]
## versicolor 0.02231372 0.953892
## virginica -0.02231372 1.062963
Vector1Matrix <- meanVarTables[[1]]
Vector2Matrix <- meanVarTables[[2]]
Vector3Matrix <- meanVarTables[[3]]
Plot Mean and Standard Deviation of ICA 3D Features
par(mfrow=c(1,3))
plot(function(x) dnorm(x, Vector1Matrix[1,1], Vector1Matrix[1,2]),
min(Vector1Matrix[,1])-(3*max(Vector1Matrix[,2])),
max(Vector1Matrix[,1])+(3*max(Vector1Matrix[,2])),
col="red", main="Vector 1 for the 2 different species",
ylab="density")
curve(dnorm(x, Vector1Matrix[2,1], Vector1Matrix[2,2]),
add=TRUE, col="blue")
plot(function(x) dnorm(x, Vector2Matrix[1,1], Vector2Matrix[1,2]),
min(Vector2Matrix[,1])-(3*max(Vector2Matrix[,2])),
max(Vector2Matrix[,1])+(3*max(Vector2Matrix[,2])),
col="red", main="Vector 2 for the 2 different species",
ylab="density")
curve(dnorm(x, Vector2Matrix[2,1], Vector2Matrix[2,2]),
add=TRUE, col="blue")
plot(function(x) dnorm(x, Vector3Matrix[1,1], Vector3Matrix[1,2]),
min(Vector3Matrix[,1])-(3*max(Vector3Matrix[,2])),
max(Vector3Matrix[,1])+(3*max(Vector3Matrix[,2])),
col="red", main="Vector 3 for the 2 different species",
ylab="density")
curve(dnorm(x, Vector3Matrix[2,1], Vector3Matrix[2,2]),
add=TRUE, col="blue")
Mean and Standard Deviation of ICA 2D Features
iris2DICA <- fastICA(iris2Class[,1:4],2)$S
meanVarTables <- NaiveBayes(iris2DICA,
iris2Class[,5])$tables
meanVarTables[[1]]
## [,1] [,2]
## versicolor -0.853422 0.5374195
## virginica 0.853422 0.5153742
Vector1Matrix <- meanVarTables[[1]]
Vector2Matrix <- meanVarTables[[2]]
Plot Mean and Standard Deviation of ICA 2D Features
par(mfrow=c(1,2))
plot(function(x) dnorm(x, Vector1Matrix[1,1], Vector1Matrix[1,2]),
min(Vector1Matrix[,1])-(3*max(Vector1Matrix[,2])),
max(Vector1Matrix[,1])+(3*max(Vector1Matrix[,2])),
col="red", main="Vector 1 for the 2 different species",
ylab="density")
curve(dnorm(x, Vector1Matrix[2,1], Vector1Matrix[2,2]),
add=TRUE, col="blue")
plot(function(x) dnorm(x, Vector2Matrix[1,1], Vector2Matrix[1,2]),
min(Vector2Matrix[,1])-(3*max(Vector2Matrix[,2])),
max(Vector2Matrix[,1])+(3*max(Vector2Matrix[,2])),
col="red", main="Vector 2 for the 2 different species",
ylab="density")
curve(dnorm(x, Vector2Matrix[2,1], Vector2Matrix[2,2]),
add=TRUE, col="blue")
How Does a Random Forest Model Compare, Given ICA?
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
irisRF <- randomForest(iris3DICA[,],
iris2Class[,5], importance=TRUE,
proximity=TRUE)
print(irisRF)
##
## Call:
## randomForest(x = iris3DICA[, ], y = iris2Class[, 5], importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 3%
## Confusion matrix:
## versicolor virginica class.error
## versicolor 47 3 0.06
## virginica 0 50 0.00
irisRF <- randomForest(iris2Class[,1:4],
iris2Class[,5], importance=TRUE,
proximity=TRUE)
print(irisRF)
##
## Call:
## randomForest(x = iris2Class[, 1:4], y = iris2Class[, 5], importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 6%
## Confusion matrix:
## versicolor virginica class.error
## versicolor 47 3 0.06
## virginica 3 47 0.06
This is pretty good! Remember, though, that the independent components were generated from all of the data, so the out-of-bag error estimate for the ICA model is somewhat optimistic.
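Since importance=TRUE was set in both fits, the variable importances come along for free; a quick extra look at the most recent (raw-feature) fit:
round(importance(irisRF), 2) # mean decrease in accuracy and in Gini impurity
varImpPlot(irisRF, main = "Variable importance, 2-class raw features")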
…And a Quadratic Classifier?
This is yet one more way to look at the data and the decision surface. In the case of a quadratic classifier the decision surfaces are quadrics. They can be shown with 3D interactive graphics in R with rgl, and though fun, that is not necessarily any more illuminating than the pair-wise plots shown below.
grouping <- factor(c(rep("E", 50),rep("I",50)))
partimat(grouping~iris3DICA[,1]+iris3DICA[,2]+iris3DICA[,3],
method="qda",prec= 300,plot.matrix=TRUE)
#Notice one of the variable combinations is really worthless.
#As an aside one can move the decision surface by changing the 'prior' argument
partimat(grouping~iris3DICA[,1]+iris3DICA[,2]+iris3DICA[,3],
method="rda",prec= 300,plot.matrix=TRUE,prior=c(1.75,0.25)/2)