October 18, 2012
Benchmarking distance calculation in R

A typical step in a lot of data mining methods is the calculation of a distance between entities. For example using the nearest-neighbor method it is crucial to do this calculation very efficiently because it is the most time-consuming step of the procedure. Just imagine you want to compute the Euclidean distance between a constantly changing database and incoming queries - each one consits of exactly one entity. Therefore you want to know how you can do the computation quick and easy. Bellow you see the output of some results using different packages and some self implemented approaches:

First we have to create some artificial data which matches nicely to the structure of the data in one of my projects - that is where the question about distance calculation came from. Typically the database consists of 3000 entities with 55 features

set.seed(1234)

# create some data
nFeature = 55
dataSet <- matrix(rnorm(3000 * nFeature), ncol = nFeature)
query <- matrix(rnorm(1 * nFeature), ncol = nFeature)

# parameter for distance calculations
K = 10  # number of nearest neighbors


The first package I stumbled upon was the yaImpute package - it contains the ann function which is based on the Approximate Nearest Neighbor Library (http://www.cs.umd.edu/~mount/ANN):

library(yaImpute)

ann_dist <- function(ref, target, k) {
res <- ann(ref = ref, target = target, tree.type = "kd", k = k, verbose = FALSE)
return(res)
}
resANN <- ann_dist(ref = dataSet, target = query, k = K)


The second package is pdist including the function pdist:

library(pdist)

pdist_wrapper <- function(ref, target, k) {
distAll <- pdist(X = ref, Y = target)
iNN <- order(distAll)[1:k]
return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]^2), nrow = 1), k = k))
# similar to ann the element knnIndexDist from the list is a vector which
# contains the indices of the nearest neighbors on position 1 to k and all
# distances afterwards (position k+1 to 2k)
}
resPDIST <- pdist_wrapper(ref = dataSet, target = query, k = K)


The next approach subtracts the query from every entity in the database and sums up the distance (call it rowDist):

rowDist <- function(ref, target, k) {
distAll <- colSums(apply(ref, 1, "-", target)^2)
iNN <- order(distAll)[1:k]
return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]), nrow = 1), k = k))
}
resRowDist <- rowDist(ref = dataSet, target = query, k = K)


And now the other way round - first you subtract every feature value from the corresponding feature vector of the database and second you sum up all parts of the overall distance (named colDist):

colDist <- function(ref, target, k) {
distAll <- rowSums(sapply(1:ncol(target), function(i) {
(ref[, i] - target[, i])^2
}))
iNN <- order(distAll)[1:k]
return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]), nrow = 1), k = k))
}
resColDist <- colDist(ref = dataSet, target = query, k = K)


My last approach is similar to rowDist but avoids calling apply (call it rowDistWithoutApply):

rowDistWithoutApply <- function(ref, target, k) {
dists = colSums((t(ref) - target[1, ])^2)
iNN = order(dists)[1:k]
return(list(knnIndexDist = matrix(c(iNN, dists[iNN]), nrow = 1), k = k))
}
resRowDistWA <- rowDistWithoutApply(ref = dataSet, target = query, k = K)


So let’s have a look at the results and check if all approaches give the same Nearest Neighbors and the correct results:

matrix(cbind(A = t(resANN$knnIndexDist), t(resPDIST$knnIndexDist),
t(resRowDist$knnIndexDist), t(resColDist$knnIndexDist),
t(resRowDistWA\$knnIndexDist)),
ncol = 5, dimnames = list(NULL,
c("ann", "pdist", "rowDist", "colDist", "rowDistWithoutApply")))

##           ann   pdist rowDist colDist rowDistWithoutApply
##  [1,]  526.00  526.00  526.00  526.00              526.00
##  [2,]  564.00  564.00  564.00  564.00              564.00
##  [3,] 2055.00 2055.00 2055.00 2055.00             2055.00
##  [4,]  402.00  402.00  402.00  402.00              402.00
##  [5,] 2703.00 2703.00 2703.00 2703.00             2703.00
##  [6,]  233.00  233.00  233.00  233.00              233.00
##  [7,]  274.00  274.00  274.00  274.00              274.00
##  [8,] 1749.00 1749.00 1749.00 1749.00             1749.00
##  [9,]  548.00  548.00  548.00  548.00              548.00
## [10,] 2358.00 2358.00 2358.00 2358.00             2358.00
## [11,]   57.50   57.50   57.50   57.50               57.50
## [12,]   58.10   58.10   58.10   58.10               58.10
## [13,]   59.72   59.72   59.72   59.72               59.72
## [14,]   60.27   60.27   60.27   60.27               60.27
## [15,]   61.33   61.33   61.33   61.33               61.33
## [16,]   62.47   62.47   62.47   62.47               62.47
## [17,]   62.49   62.49   62.49   62.49               62.49
## [18,]   63.07   63.07   63.07   63.07               63.07
## [19,]   63.10   63.10   63.10   63.10               63.10
## [20,]   63.18   63.18   63.18   63.18               63.18


Because there’s no mismatch between the five approaches we will step into the final part - benchmarking. Here the results from the ann function will be also used as baseline because my goal is to replace this function with something simpler and faster.

First let’s use the build in system.time() function to measure the time for 2000 repetitions:

iterN <- 2000L
matrix(cbind(system.time(for (i in 1:iterN) {
ann_dist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
pdist_wrapper(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
rowDist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
colDist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
rowDistWithoutApply(ref = dataSet, target = query, k = 50)
})[1:3]), ncol = 5, dimnames = list(c("user", "system", "elapsed"), c("ann",
"pdist", "rowDist", "colDist", "rowDistWithoutApply")))

##           ann pdist rowDist colDist rowDistWithoutApply
## user    15.07 17.23   63.87   19.11                8.83
## system   1.99  2.79    2.88    1.10                1.96
## elapsed 17.05 20.03   66.92   20.21               10.85


Second, for a more precisely measurement I use the package microbenchmark which uses some quite accurate timing functions

library(microbenchmark)
res <- microbenchmark(ann = ann_dist(ref = dataSet, target = query, k = 50),
pdist = pdist_wrapper(ref = dataSet, target = query, k = 50),
rowDist = rowDist(ref = dataSet, target = query, k = 50),
colDist = colDist(ref = dataSet, target = query, k = 50),
rowDistWithoutApply = rowDistWithoutApply(ref = dataSet, target = query, k = 50),
times = 2000L)
print(res)

## Unit: milliseconds
##                  expr    min     lq median     uq    max
## 1                 ann  5.836  7.095  8.279  9.553  88.08
## 2             colDist  4.400  6.746  7.656  8.859  81.09
## 3               pdist  2.828  5.035  5.867  7.259  84.60
## 4             rowDist 16.619 20.488 22.769 27.165 100.37
## 5 rowDistWithoutApply  2.097  3.202  4.435  5.360  78.61


Finally some nice plot is showing the results (there seems to be an interesting pattern, revealing that temporal “outliers” are building an individual cluster with much higher execution times than the average)

library(ggplot2)
plt <- qplot(y = time, data = res, colour = expr)
plt <- plt + scale_y_log10()
print(plt)


So happily we found a better solution - rowDistWithoutApply - for calculating a distance between a dataset and a query which is approximately two times faster (of course we have to be careful because there is no error checking) and does not need an additional library. The second winner is pdist from the pdist package followed by ann and colDist. rowDist performs worst.
So for my case this outcome remains quite stable but it’s interesting if this behavior is the general case or if there are any constraints. In this this sense any helpful comments are more than welcome.

1. memoclipstands reblogged this from things-about-r