Monday, July 27, 2015

Time Series Clustering

Density-based Clustering using UCI Simulated Time Series Data


In the following, synthetically generated control chart time series are clustered with DBSCAN and Mclust to explore density and model-based clustering. The results are then compared with Ward's hierarchical clustering.
The description of the data can be found at http://kdd.ics.uci.edu/databases/synthetic_control/synthetic_control.data.html. A sampling of the time series clustered hierarchically, with dynamic time warping (DTW) as the measure of dissimilarity, can be found at http://www.rdatamining.com/examples/time-series-clustering-classification. See "Analysis of Similarity Measures in Time Series Clustering for the Discovery of Building Energy Patterns" by Iglesias and Kastner, http://www.mdpi.com/1996-1073/6/2/579, for a discussion of DTW.
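As a quick illustration of how DTW works (a toy example added here, not part of the original analysis), the dtw package aligns two series and returns an accumulated distance that tolerates shifts and stretches along the time axis:

library(dtw)
# Two toy series: a sine wave and a time-shifted copy of it.
query     <- sin(seq(0, 2 * pi, length.out = 60))
reference <- sin(seq(0, 2 * pi, length.out = 60) + 0.5)
alignment <- dtw(query, reference, keep = TRUE)
alignment$distance                 # accumulated DTW distance
plot(alignment, type = "twoway")   # visualize the warping between the series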
Model-based clustering with finite Gaussian mixtures, and density-based clustering more broadly, often work well when the structure of the data can be approximated by standard statistical distributions. Generally, these methods are used for low-dimensional spatial or continuous-valued data; for higher-dimensional data sets they become more problematic. Mclust represents traditional model-based clustering, whereas DBSCAN is a more general density-based method for which no specific statistical model need be given. It is essentially a heuristic, somewhat akin to leader-like or exclusion-region clustering algorithms, but more effective when the data contain considerable chaining. In the little exercise below, the data stretch the limit of what these algorithms can do without considerably more work: their clustering solutions are somewhat poor compared to the known classification, and even compared to simply using Ward's hierarchical clustering. Possibly more data manipulation, a different similarity measure, and parameter fine-tuning could produce better solutions with these methods.
In this data set there are 6 classes of control chart series, with 100 series in each class.
library(MASS)
library(dtw)  #So that dynamic time warping measure can be used.  Other options are for example various correlation measures. 
syncontrolData <- read.table("synthetic_control.data.txt", header = FALSE, sep = " ")

# Create class data in the last column.
classSize <- 100
numcols <- ncol(syncontrolData)
syncontrolData[, numcols + 1] <- unlist(lapply(1:6, rep, classSize))
The next step is to generate a dissimilarity matrix using dynamic time warping, and, for comparison, a correlation matrix of the time series, to see whether another measure might be used. Note that to do this we first have to remove the last column (the class assignments) and, for the correlation, take the transpose of the data as a matrix.
Generating the DTW proximity matrix takes some time. Writing it to a file once and reading it back in saves time when knitr regenerates the HTML from the .Rmd. This was done with the following code:
Ddtw <- dtwDist(syncontrolData[,-(numcols+1)], method="DTW")
write.table(as.matrix(Ddtw),"Ddtw.txt", quote=FALSE, sep = " ", row.names=FALSE, col.names=FALSE)
system("gzip Ddtw.txt") #If your are on Windows, my condolences ... as you will have to read the Windows specific instructions to system()
Below, it is read back in again.
Ddtw <- as.dist(read.table(gzfile("Ddtw.txt.gz"), sep = " "))
Dcormat <- abs(cor(t(as.matrix(syncontrolData[, -(numcols + 1)]))))
Model-based clustering has trouble in high dimensions, and the time series have 60 time steps. However, by generating the proximity matrix with either dynamic time warping or correlation, we can look at the effective dimension, which may be much smaller than 60. One way to check this is to see how many positive eigenvalues the proximity matrix has under classical multidimensional scaling (MDS); the number of ranked positive eigenvalues that contain most of the variance in the data can serve as the number of dimensions. The proximity matrix created with DTW shows a much lower dimension, about 3, whereas the one created with Pearson correlation (cor()) has many significant dimensions.
Below, the drop-off in eigenvalues for DTW under classical MDS is shown, along with the cumulative percentage of the positive eigenvalues.
showdim <- 15
par(mfrow = c(1, 2))
eigen <- cmdscale(Ddtw, eig = TRUE, k = 60)$eig
plot(1:showdim, eigen[1:showdim], xlab = "Rank", ylab = "Eigenvalues", type = "l")

positiveEigen <- sum(eigen > 0)
sumEigen <- sum(eigen[eigen > 0])
# Cumulative proportion of the total positive-eigenvalue variance captured
# by the first i eigenvalues.
positiveEigensums <- numeric(positiveEigen)
for (i in 1:positiveEigen) {
    positiveEigensums[i] <- sum(eigen[1:i])/sumEigen
}
plot(1:positiveEigen, 100 * positiveEigensums[1:positiveEigen], xlab = "Rank", 
    ylab = "Percent Eigenvalues", type = "l")
plot of chunk eigenDTW
You can see that only a few dimensions are needed to capture most of the variance; three dimensions capture roughly 85% of it.
Below, the same plots are shown for Pearson correlation. One problem with correlation is how to scale or transform the negative correlations, both for this data and for clustering in general. There are several ways; a common one is simply taking the absolute value. Here, 1 is added in addition to taking the absolute value, since otherwise the matrix returns nothing but negative eigenvalues. I tried several other transformations, none of which worked well.
par(mfrow = c(1, 2))
eigen <- cmdscale(Dcormat + 1, eig = TRUE, k = 60)$eig
plot(1:showdim, eigen[1:showdim], xlab = "Rank", ylab = "Eigenvalues", type = "l")
positiveEigen <- sum(eigen > 0)
sumEigen <- sum(eigen[eigen > 0])
positiveEigensums <- numeric(positiveEigen)
for (i in 1:positiveEigen) {
    positiveEigensums[i] <- sum(eigen[1:i])/sumEigen
}
plot(1:positiveEigen, 100 * positiveEigensums[1:positiveEigen], xlab = "Rank", 
    ylab = "Percent Eigenvalues", type = "l")
plot of chunk eigenPearson
You can see that many dimensions are needed to capture most of the variance with Pearson correlation; to get 85% we need over 60 dimensions. This defeats the purpose of using MDS with model-based clustering, so correlation is dropped in what follows.
Here are the ranges and distributions of the two proximity measures.
range(Ddtw)
## [1]  130.9 2971.1
hist(Ddtw)
plot of chunk proximityDistributions
range(Dcormat)
## [1] 1.557e-06 1.000e+00
hist(Dcormat)
plot of chunk proximityDistributions
We'll use just the DTW measure to perform the clustering with the two methods. Choosing eps and MinPts so as to get 6 clusters (corresponding to the number of actual classes) was done here by trial and error. There is a method for choosing the parameters by plotting the distribution of k-nearest-neighbor distances, where k is a guess at MinPts, and then looking for a natural break in that distribution. This was tried, but the break wasn't close to the eventual value of eps, despite testing several values of MinPts as k in the k-NN.
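Just for reference, here is a minimal sketch of that k-nearest-neighbor heuristic (an addition, applied here directly to the DTW distances; the clustering below is done in 3-D MDS coordinates, so the resulting eps values are not directly comparable):

# For each series, take the distance to its k-th nearest neighbor (k being a
# guess at MinPts), sort these values, and look for a "knee" in the curve.
k <- 5
Dmat <- as.matrix(Ddtw)
diag(Dmat) <- Inf                  # ignore self-distances
kdist <- apply(Dmat, 1, function(d) sort(d)[k])
plot(sort(kdist), type = "l", xlab = "Series sorted by k-NN distance",
    ylab = paste0(k, "-NN DTW distance"))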
library(fpc)
## Loading required package: cluster
## Loading required package: mclust
## Package 'mclust' version 4.3
## Loading required package: flexmix
## Loading required package: lattice

# isoMDS is seeded by cmdscale and can produce a somewhat better MDS
# embedding.

# MDSDdtw <- cmdscale(Ddtw, eig=TRUE, k=3)$points
MDSDdtw <- isoMDS(Ddtw, k = 3)$points
## initial  value 17.771580 
## final  value 17.771563 
## converged
dbscanDTW <- dbscan(MDSDdtw, eps = 100, MinPts = 5, method = "raw")
# Show cluster vector

# specify the actual class colors
classcolors <- syncontrolData[, numcols + 1]
par(mfrow = c(1, 2))
plot(MDSDdtw[, 1], MDSDdtw[, 2], xlab = "First Dimension  Actual Classes", ylab = "Second Dimension", 
    type = "n")
points(MDSDdtw[, 1], MDSDdtw[, 2], pch = classcolors, col = classcolors)
# 0s in the cluster vector represent noise or outliers and will appear black
# in the plot below.
plot(MDSDdtw[, 1], MDSDdtw[, 2], xlab = "First Dimension  DBSCAN Clustering", 
    ylab = "Second Dimension", col = 1 + dbscanDTW$cluster)
plot of chunk dbscanclustering
The clustering and the true classes don't match up very well, even though the parameters (eps = 100 and MinPts = 5) were chosen specifically to find 6 clusters. The colors don't correspond because the cluster labels can be a different permutation of the actual class labels. Sometimes the labels can simply be permuted to match, provided no single class has been split into two distinct clusters; such a relabeling is carried out for Ward's clustering at the end of this exercise.
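One quick way to see how the cluster labels line up with the true classes (a small addition to the original analysis) is to cross-tabulate them; cluster 0 is DBSCAN's noise group:

# Rows are DBSCAN cluster labels (0 = noise/outliers), columns are classes.
table(cluster = dbscanDTW$cluster, class = syncontrolData[, numcols + 1])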
Below the clusters and the true classes are shown in the 3 dimensions with 3D convex hulls.
library(rgl)
library(geometry)
## Loading required package: magic
## Loading required package: abind
## Given that most of the discriminating variance is in the first 3 or 4
## dimensions, why not plot the clusters in 3D.  Note how the cluster that
## appears as a single group in 2D separates into two clusters in 3D.  If
## you line up the 3D plot along the first and second dimensions, you will
## recover the 2D image shown above.

# Functions to draw the 2D and 3D hulls.
drawhullLinesXY <- function(j, hull, hpts, repmins, i) {
    rgl.lines(hull[hpts[j:(j + 1)], 1], hull[hpts[j:(j + 1)], 2], repmins, col = i + 
        1, lty = 2)
}
drawhullLinesYZ <- function(j, hull, hpts, repmins, i) {
    rgl.lines(repmins, hull[hpts[j:(j + 1)], 1], hull[hpts[j:(j + 1)], 2], col = i + 
        1, lty = 2)
}
drawhullLinesXZ <- function(j, hull, hpts, repmins, i) {
    rgl.lines(hull[hpts[j:(j + 1)], 1], repmins, hull[hpts[j:(j + 1)], 2], col = i + 
        1, lty = 2)
}

findHulls2D <- function(i, clustermembership, x, y, repmins, coordsflag) {
    hull <- cbind(x[clustermembership == i], y[clustermembership == i])
    hpts <- chull(hull)
    hpts <- c(hpts, hpts[1])
    if (coordsflag == 1) {
        sapply(1:(length(hpts) - 1), drawhullLinesXY, hull, hpts, repmins, i)
    } else if (coordsflag == 2) {
        sapply(1:(length(hpts) - 1), drawhullLinesYZ, hull, hpts, repmins, i)
    } else {
        sapply(1:(length(hpts) - 1), drawhullLinesXZ, hull, hpts, repmins, i)
    }
}

numClusters <- max(dbscanDTW$cluster)

# A function is created that will plot the dbscan clustering in 3D, and one
# that will plot the synthetic control classes for comparison.

# Because we set the minimum points (MinPts) to 5, we know that all
# clusters have at least 4 points, and therefore enough points to form a
# convex hull in 3D. Also, note that the points identified by 0s in
# `dbscanDTW$cluster` are noise or singletons, and therefore not in any
# cluster.
PlotClusters3D <- function(data, clusters, numClusters) {
    # Plot points as very small colored spheres.
    plot3d(data[, 1], data[, 2], data[, 3], xlab = "1st dim", ylab = "2nd dim", 
        zlab = "3rd dim", col = 1 + clusters)

    # Plot the 3D convex hulls with considerable transparency (alpha=.2)
    for (i in 1:numClusters) {
        Points3D <- cbind(data[clusters == i, 1], data[clusters == i, 2], data[clusters == 
            i, 3])
        ts.surf <- t(convhulln(Points3D))
        rgl.triangles(Points3D[ts.surf, 1], Points3D[ts.surf, 2], Points3D[ts.surf, 
            3], col = 1 + i, alpha = 0.2)
    }
    # Plot the 2D hulls on walls of bounding box.  Note the box has been equally
    # scaled, even though the range of each dimension from the MDS is
    # successively smaller.  This shows that in any combination of 2 dimensions,
    # the clusters overlap significantly
    sapply(1:numClusters, findHulls2D, clusters, data[, 1], data[, 2], rep(min(data[, 
        3]), 2), 1)
    sapply(1:numClusters, findHulls2D, clusters, data[, 2], data[, 3], rep(min(data[, 
        1]), 2), 2)
    sapply(1:numClusters, findHulls2D, clusters, data[, 1], data[, 3], rep(min(data[, 
        2]), 2), 3)
    rgl.bbox()
}

## interactive 3D plot -- to get two of them, call open3d() twice
firstfocus <- open3d()
# Here are the actual class assignments in 3D and their 3D convex hulls
PlotClusters3D(MDSDdtw, classcolors, 6)
# Here we plot the dbscan clustering
secondfocus <- open3d()
PlotClusters3D(MDSDdtw, dbscanDTW$cluster, numClusters)
If you are having trouble switching back and forth between the two interactive 3D devices on your system, type
rgl.set(firstfocus) 
or
rgl.set(secondfocus)
to set the focus, so that you can interact with that device.
Here is Mclust clustering the data in the 3 dimensions from isoMDS. The mclust package has its own plotting function that shows the various output statistics and the clusters as they would appear in a 2D pairs plot.
mclustDdtw <- Mclust(MDSDdtw)
## Warning: best model occurs at the min or max # of components considered
## Warning: optimal number of clusters occurs at max choice
summary(mclustDdtw)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm 
## ----------------------------------------------------
## 
## Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 9 components:
## 
##  log.likelihood   n df    BIC    ICL
##          -10781 600 89 -22132 -22164
## 
## Clustering table:
##   1   2   3   4   5   6   7   8   9 
## 100  61 100  86  42  32  75  53  51
plot(mclustDdtw)
plot of chunk mclust (four plots)
The warnings mean that the maximum number of components considered by default (9) was reached. With the default settings, Mclust doesn't distinguish 6 clusters particularly well over any larger number of clusters up to 9. Model-based hierarchical clustering is also used within Mclust; after the short aside below, we call it directly and extract the classes.
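As an aside (a sketch added here, not part of the original run), Mclust can also be forced to fit exactly six components through its G argument, which makes a direct comparison with the six known classes possible:

# Fit a Gaussian mixture with exactly 6 components and check the error rate.
mclustDdtw6 <- Mclust(MDSDdtw, G = 6)
summary(mclustDdtw6)
classError(mclustDdtw6$classification, truth = syncontrolData[, numcols + 1])$errorRate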
hcClasses <- hclass(hc(modelName = "VVV", data = MDSDdtw), 1:6)
# modelName specifies the covariance matrix model; in this case VVV allows
# ellipsoidal components with varying volume, shape, and orientation --
# about as general a mixture model as you can get.  There are ten such
# models for multivariate data.

clPairs(data = MDSDdtw, classification = hcClasses[, "6"])
plot of chunk hiermclust
# We can directly check the classification error with the actual classes
classError(hcClasses[, "6"], truth = syncontrolData[, numcols + 1])
## $misclassified
##   [1]  27  45  54  62  67  70  85 204 220 222 230 234 243 250 255 267 401
##  [18] 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418
##  [35] 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435
##  [52] 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452
##  [69] 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469
##  [86] 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486
## [103] 487 488 489 490 491 492 493 494 495 496 497 498 499 500 503 504 505
## [120] 506 507 508 510 511 512 514 517 518 519 521 524 530 532 533 536 541
## [137] 543 545 546 548 551 554 555 556 557 558 559 560 561 566 569 570 571
## [154] 573 575 577 578 579 581 582 585 586 589 590 593 596 598 599
## 
## $errorRate
## [1] 0.28
# compare that with the DBSCAN clustering with 6 clusters
classError(dbscanDTW$cluster, truth = syncontrolData[, numcols + 1])
## $misclassified
##   [1] 201 202 203 205 206 208 211 212 213 214 215 216 217 219 222 224 225
##  [18] 227 228 231 232 233 235 236 239 240 241 242 244 245 247 248 250 252
##  [35] 253 254 255 256 257 258 260 261 263 265 266 268 269 270 271 272 273
##  [52] 274 275 276 277 279 280 281 282 283 284 286 287 288 289 290 291 292
##  [69] 293 294 295 296 297 298 299 301 311 337 342 372 388 503 504 505 506
##  [86] 507 508 510 511 512 514 517 518 519 521 524 530 532 533 536 539 541
## [103] 543 545 546 548 551 554 555 556 557 558 560 561 566 569 570 571 572
## [120] 573 575 577 578 579 581 582 585 586 589 593 596 598 599
## 
## $errorRate
## [1] 0.2217
Here a simple hierarchical clustering with no parameters, Ward's method, is used on the same data. Ward's allows for some chaining and in some ways mimics a model-based approach, since the merging criterion is based on minimizing the within-cluster variance.
wardcut6 <- cutree(hclust(dist(MDSDdtw), "ward.D"), 6)
classError(wardcut6, truth = syncontrolData[, numcols + 1])
## $misclassified
##  [1] 201 202 206 211 214 217 219 222 224 227 236 250 252 255 256 260 263
## [18] 271 273 274 276 280 282 286 289 294 296 297 299 302 303 305 306 308
## [35] 309 317 318 324 327 329 330 331 333 334 336 339 341 345 346 349 351
## [52] 354 355 357 361 362 364 366 373 375 376 377 378 379 383 384 387 390
## [69] 392 395 396 397 400 460 488
## 
## $errorRate
## [1] 0.125
# Note, the cluster labels may differ from the original class labels, but
# classError just checks the cluster groups against the preponderant class
# in each.  E.g., the cluster label 3 corresponds to the class label 5:
# most of the points labeled 3 by the clustering have class label 5.  In
# this example Ward's does a good enough job that we can simply permute the
# labels.

# To change the labelings we must look at the cluster label permutation and
# make the necessary changes.
change4 <- which(wardcut6 == 4)
change5 <- which(wardcut6 == 5)
change3 <- which(wardcut6 == 3)
wardcut6[change3] <- 5
wardcut6[change4] <- 3
wardcut6[change5] <- 4
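For larger problems, permuting the labels by hand gets tedious. A small sketch of an alternative (an addition, not the approach used above) is to give each cluster the class label that occurs most often within it:

# Relabel each cluster with its majority class. Note this is not guaranteed
# to be a permutation: two clusters could map to the same class.
relabelByMajority <- function(clusters, truth) {
    tab <- table(clusters, truth)
    majority <- colnames(tab)[apply(tab, 1, which.max)]
    as.integer(majority[match(clusters, rownames(tab))])
}
# e.g. relabelByMajority(cutree(hclust(dist(MDSDdtw), "ward.D"), 6),
#                        syncontrolData[, numcols + 1])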
# Plotting in the first 2 dimensions, we can see that the clustering is
# getting much closer to the real class divisions than DBSCAN or Mclust.
# The class label coloring differs at times from the cluster label, but one
# can see that the general groupings are the same.
par(mfrow = c(1, 2))
plot(MDSDdtw[, 1], MDSDdtw[, 2], xlab = "First Dimension  Actual Classes", ylab = "Second Dimension", 
    type = "n")
points(MDSDdtw[, 1], MDSDdtw[, 2], pch = classcolors, col = classcolors)

plot(MDSDdtw[, 1], MDSDdtw[, 2], xlab = "First Dimension  Wards Clustering", 
    ylab = "Second Dimension", type = "n")
points(MDSDdtw[, 1], MDSDdtw[, 2], pch = wardcut6, col = wardcut6)
plot of chunk ward
Nearly half the error rate of DBSCAN! Parsimony wins, at least in this case. I tried single link, complete link, and average link, as well as other similarity measures, and all were much worse than even the Mclust attempt above (a quick check of the other linkages is sketched at the end of this post). Plotting the Ward-produced clusters in 3D and comparing with the earlier 3D class plot shows that Ward's comes a good deal closer to matching the clusters up with the classes.
## interactive 3D plot -- to get two of them, call open3d() twice
thirdfocus <- open3d()
PlotClusters3D(MDSDdtw, wardcut6, 6)
rgl.set(firstfocus) 
The image on the left shows the six clusters found by DBSCAN clustering. On the right are the true classes.
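Finally, here is the quick check of the other linkage methods mentioned above (a sketch added here; the error rates are left for the reader to reproduce):

# Error rates for single, complete, and average link on the same 3-D MDS
# coordinates, for comparison with Ward's.
sapply(c("single", "complete", "average"), function(m) {
    cl <- cutree(hclust(dist(MDSDdtw), method = m), 6)
    classError(cl, truth = syncontrolData[, numcols + 1])$errorRate
})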
