Monday, July 27, 2015

Anti-clustering

The Anti-Clustering Clustering

The Anti-Clustering Clustering

A set of points randomly sampled from a uniform distribution has faint weak clusters. This is a property of the fact that it is a sample, not a property necessarily of how the random points were generated (any decent pseudo-random number generator will work similarly in this connection). What is entirely pointless to cluster of course is points generated by a quasi-random sequence in N dimensions, where N is small (e.g., 1, 2, 3, …, 7). These points in low dimesions, though random in appearance, are spread out more uniformly, in a space filling way that leaves them neither to close nor far from their local neighbors. One way to see them is to plot a set of quasi-randomp points in 2D with their respective voronoi regions.

#First the libraries we'll need
library(deldir) #For plotting the voronoi regions
library(randtoolbox) #For generating quasi-random points from low discrepancy sequences
library(MASS) #For multidimensional scaling (MDS)
#Generate some quasi-random points using the Halton Sequence
qpoints <- halton(100,dim=40);dd<-deldir(qpoints[,1], qpoints[,2])
## 
##      PLEASE NOTE:  The components "delsgs" and "summary" of the 
##      object returned by deldir() are now DATA FRAMES rather than 
##      matrices (as they were prior to release 0.0-18). 
##      See help("deldir").
##  
##      PLEASE NOTE: The process that deldir() uses for determining
##      duplicated points has changed from that used in version
##      0.0-9 of this package (and previously). See help("deldir").
#Generate some points from a random uniform distribution.
rpoints <- cbind(runif(100), runif(100))
#Get 4 colors and repeat them to show the space filling properties of the points
ccc <- topo.colors(4)
ccc <- c(rep(ccc[1],25),rep(ccc[2],25),rep(ccc[3],25),rep(ccc[4],25))
#Plot the random points plot side by side with the quasi-random points plot
par(mfrow=c(1,2),cex=0.1) 
z <- deldir(rpoints[,1],rpoints[,2],rw=c(0,1,0,1))
w <- tile.list(z)
plot(w,fillcol=ccc,close=TRUE)

z <- deldir(qpoints[,1],qpoints[,2],rw=c(0,1,0,1))
w <- tile.list(z)
plot(w,fillcol=ccc,close=TRUE)

plot of chunk unnamed-chunk-2

Note how the uniform random points on the left are weakly clustered and the space filling in terms of the order of the points (the 4 coloring is 1-25, 26-50, 51-75, 76-100) is not as nicely spread out as the 4 coloring of the quasi-random points. Neither is a true 4-coloring of a planar map, though the quasi-random points fill the space so that there are fewer contiguous regions of the same color.

It is well-known that the property (low discrepancy) that makes quasi-random points so useful (quasi-Monte Carlo integration), begins to break down as the sequence is used to create more dimensions that just a few. There are several ways to show this, the most common is to simply plot the distribution of the pair-wise distances between points. There should be very few points that are close together and the distribution should then rise sharply.

par(mfrow=c(1,2),cex=1.0)
qpointsDis2D <- dist(qpoints[,1:2])
qpointsDis40D <- dist(qpoints)                    
hist(qpointsDis2D,freq=FALSE)
lines(density(qpointsDis2D))

hist(qpointsDis40D,freq=FALSE)
lines(density(qpointsDis40D))

plot of chunk PointDistributions The first histogram is of the distances in the 2D case; the second histogram is in the 40D case. Notice how in the second histogram, the distances don't rise sharply as in the first histogram. But there is another somewhat indirect way of looking at it. Multidimensional Scaling (MDS) algorithms try to project from many dimensions down to a smaller number of dimensions while preserving the distance relationships. This is an approximation of course, but let's see if we project the 40 dimensional quasi-random points down to 2D and see how well the low discrepancy property – that nicely spread out space filling points we plotted with the voronoi regions.

qpointMat <- dist(qpoints)
qpointMDS <- isoMDS(qpointMat) #Create the 2D projection (the default)
## initial  value 21.082259 
## iter   5 value 17.201207
## iter  10 value 16.687357
## iter  10 value 16.684123
## iter  10 value 16.684123
## final  value 16.684123 
## converged
#Scale the projection to be within the unit square. Add a fudge factor, epsilon, to make sure points don't lie on the [0,1] interval
epsilon <- 0.001
qmdspX <- (qpointMDS$points[,1] - min(qpointMDS$points[,1]) + epsilon)/(max(qpointMDS$points[,1]) - min(qpointMDS$points[,1]) + epsilon)

qmdspY <- (qpointMDS$points[,2] - min(qpointMDS$points[,2]) + epsilon)/(max(qpointMDS$points[,2]) - min(qpointMDS$points[,2]) + epsilon)

#Plot the 2D quasi-points again, then the 40D points
par(mfrow=c(1,2),cex=0.1)
z <- deldir(qpoints[,1],qpoints[,2],rw=c(0,1,0,1))
w <- tile.list(z)
plot(w,fillcol=ccc,close=TRUE)

z <- deldir(qmdspX,qmdspY,rw=c(0,1,0,1))
w <- tile.list(z)
plot(w,fillcol=ccc,close=FALSE)

plot of chunk Forty

Whoa! What happened? Well, it turns out that the order of the quasi-randomly generated points matters. The MDS projection suggests that depending on when the points are generated in the sequence (their rank) determines not only how close they are in many dimensions, but that that they are clustered with respect to their rank. Remember the coloring is by the rank ordering of the point generation.

Let's cluster the projection!

par(mfrow=c(1,1),cex=0.6)
quasiDendro <- hclust(dist(cbind(qmdspX,qmdspY)),"ward.D2")
plot(quasiDendro)

plot of chunk clustering

#Show the sequence and the clusters
cbind(1:100,cutree(quasiDendro,4))
##        [,1] [,2]
##   [1,]    1    1
##   [2,]    2    1
##   [3,]    3    1
##   [4,]    4    1
##   [5,]    5    1
##   [6,]    6    1
##   [7,]    7    1
##   [8,]    8    1
##   [9,]    9    1
##  [10,]   10    1
##  [11,]   11    1
##  [12,]   12    1
##  [13,]   13    1
##  [14,]   14    1
##  [15,]   15    2
##  [16,]   16    2
##  [17,]   17    2
##  [18,]   18    2
##  [19,]   19    2
##  [20,]   20    2
##  [21,]   21    2
##  [22,]   22    2
##  [23,]   23    2
##  [24,]   24    2
##  [25,]   25    2
##  [26,]   26    2
##  [27,]   27    2
##  [28,]   28    2
##  [29,]   29    2
##  [30,]   30    2
##  [31,]   31    2
##  [32,]   32    2
##  [33,]   33    2
##  [34,]   34    2
##  [35,]   35    2
##  [36,]   36    2
##  [37,]   37    2
##  [38,]   38    2
##  [39,]   39    2
##  [40,]   40    2
##  [41,]   41    2
##  [42,]   42    2
##  [43,]   43    2
##  [44,]   44    2
##  [45,]   45    2
##  [46,]   46    2
##  [47,]   47    2
##  [48,]   48    2
##  [49,]   49    2
##  [50,]   50    2
##  [51,]   51    2
##  [52,]   52    2
##  [53,]   53    3
##  [54,]   54    3
##  [55,]   55    3
##  [56,]   56    3
##  [57,]   57    3
##  [58,]   58    3
##  [59,]   59    3
##  [60,]   60    3
##  [61,]   61    3
##  [62,]   62    3
##  [63,]   63    3
##  [64,]   64    3
##  [65,]   65    3
##  [66,]   66    3
##  [67,]   67    3
##  [68,]   68    3
##  [69,]   69    3
##  [70,]   70    3
##  [71,]   71    3
##  [72,]   72    3
##  [73,]   73    3
##  [74,]   74    3
##  [75,]   75    3
##  [76,]   76    3
##  [77,]   77    3
##  [78,]   78    3
##  [79,]   79    3
##  [80,]   80    3
##  [81,]   81    3
##  [82,]   82    3
##  [83,]   83    4
##  [84,]   84    4
##  [85,]   85    4
##  [86,]   86    4
##  [87,]   87    4
##  [88,]   88    4
##  [89,]   89    4
##  [90,]   90    4
##  [91,]   91    4
##  [92,]   92    4
##  [93,]   93    4
##  [94,]   94    4
##  [95,]   95    4
##  [96,]   96    4
##  [97,]   97    4
##  [98,]   98    4
##  [99,]   99    4
## [100,]  100    4

The clustering is not quite the same as the coloring, but the clustering shows that the rank is largely preserved. That is a far cry from the simple 2D quasi-random points. Many more dimensions really messes up the low discrepancy. In a rather odd way though, it seems to show that an assumption about how data is generated, and therefore will behave, can be way off the mark.

No comments:

Post a Comment