June 22, 2014
Coloring the world – Extracting user specific color palettes from Tableau Workbooks

If you read some of my last blog posts you may have noticed that R got a new companion called Tableau. Tableau is an easy to use and powerful BI toolbox for visualizing all kinds of data and I suggest everybody give it a try. One of the things that I like very much is that it gives you all the options to create simple graphics instantly and, on the other hand, assemble multifunctional and "customized" dashboards using table calculations, R scripts and other more advanced features. But nobody is perfect. One thing in particular about Tableau's handling of user specific color schemas/palettes bothers me a little bit: right now, there is no easy way to save individual color palettes out of Tableau for reuse in other workbooks. This blog post shows how this can be achieved, if you don't fear executing a little Python code:

Imagine that you have a discrete attribute whose labels you want to color according to some specific schema. This is easy to do in Tableau Desktop. Just right-click on the dimension, click "Default Properties" and "Color…" and assign the colors to the corresponding labels. Double-clicking on a label gives you the option to enter the RGB code.

Define individual color palette

If the same attribute is used in different workbooks, you don't want to repeat the process every time - especially if there are a lot of labels or the colors are not part of one of the built-in color palettes. For this, Tableau offers the possibility to save color schemas in a user specific preference file (Preferences.tps). There is a well written KB article about how to store your palette as XML in the preference file. If you put the preference file in place, you can select your individual palette when editing the color of an attribute.

Choose color palette from preference file
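For reference, a minimal Preferences.tps following the format described in the KB article might look like this (the palette name and the colors are just placeholders):

<?xml version='1.0'?>
<workbook>
  <preferences>
    <color-palette name="My Palette" type="regular">
      <color>#1F77B4</color>
      <color>#FF7F0E</color>
      <color>#2CA02C</color>
    </color-palette>
  </preferences>
</workbook>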

For me this solution stops halfway. Imagine the common situation that you have assigned the colors manually in Tableau. Now you want to save your work, but there is no way to do this out of Tableau! Instead you have to click on every single label, memorize the color code and assemble an XML preference file piece by piece. I have experienced saving manually created color schemas as a standard feature in other types of visualization systems, for example geographical information systems (GIS).

Remark: Additionally, Tableau requires you to transform all of your decimal RGB codes from Tableau Desktop into hexadecimal RGB codes if you want to put them into the preference file. Of course, this is no rocket science (using for example Excel with its DEC2HEX function), but you may ask yourself if there shouldn't be a better user experience. Please Tableau, can you just agree on one common format?
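If you have R at hand anyway, the conversion is a one-liner as well (the RGB values below are just an example):

# convert a decimal RGB triple (0-255) into a hexadecimal color code
rgb(31, 119, 180, maxColorValue = 255)
## [1] "#1F77B4"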

But here is help in the form of a small Python script that can extract the desired information. Here is how it works: You may have noticed that if you view a workbook *.twb file in your text editor, you see XML code that defines the whole workbook. If you used customized coloring inside a workbook, the information about it has to be part of this XML code. Therefore, it should be possible to extract the information using a script that traverses the workbook. Because accessing the *.twb XML code directly is not supported by Tableau, there is no official documentation, so any information about how things are stored needs to be figured out in a trial-and-error manner. I recommend using a versioning system like Git together with a text editor to track any change: modify the workbook in Tableau, save it and inspect what changed with Git. Going this way, you will find out how the color information is saved: as long as you stick with the default coloring there is no info in the XML code, but when you change the default coloring an "encoding" tag appears as part of the data source, under which you find the list of colors.

XML structure of Tableau workbook file

Now here is the Python script to extract this information:

The structure is very simple – the script receives parameters from the command line, opens the workbook, extracts the color palette for every requested attribute and prints out the information so that it can easily be saved as a preference file. Depending on the given parameters the script will also write the preference file automatically at the end. Every color palette is saved as a categorical one. If you like to change that (for example to use it as a sequential palette), you have to edit the "type" attribute with a text editor. Additionally, behind every color the script prints the corresponding label as an XML comment. This serves as a help to edit the file afterwards or create new palettes outside Tableau.

To run it, just type

python Extract_Color_Schema_TWB.py --s 'Color_and_Sort_Test.twb' 'Region' 'Region (Copy)'

to extract the color palettes for the attributes "Region" and "Region (Copy)" from the workbook "Color_and_Sort_Test.twb". The parameter "--s" tells the script to save the results as "Preferences.tps" in your current directory. You can copy this file directly to your Tableau repository and after restarting Tableau the new color palettes should be available.

I hope this will help other users sharing the same problem (if you want to help, you may vote for this feature idea/proposal in the Tableau community).

Please send me some feedback if the script isn’t working for you (together with the problematic workbook). As explained, I assembled the knowledge about how color palettes are stored by testing different cases. It is not so unlikely that I missed something ;-).

This time you can find the script here on GitHub.

May 4, 2014
"The Winner Takes It All" - Tuning and Validating R Recommendation Models Inside Tableau

Tableau Dashboard for Tuning Recommendation Algorithm

Introduction

My last blog article showed how to build an interactive recommendation engine in Tableau using a simple model based on the cosine similarity measure. While this can be a good way to explore unknown data, it is wise to validate any model before using it for recommendation in practice, in order to get an estimate of how the model performs on unseen data. In the interactive approach the parameters were not questioned, but in reality we have the choice to modify the parameters and thereby drive the performance of the resulting model. That's why the current blog post shows how to prepare the data, set up a simple recommendation model in R and validate its performance over different sets of parameters.

Background

Most readers familiar with data analytics know that building, tuning and validating a model is in most cases not a sequential workflow, but is carried out in an iterative manner - going through these tasks in small cycles. In each cycle we think about how we should change the parameters, then modify the model and finally validate it. This means that the tools a data analyst uses should support this mode of work as well as possible - in particular, I want something that is capable of supporting all of these steps. That's why I want to give Tableau a try - as a frontend for developing, tuning and validating recommendation models, where the computational work is done by R in the background. If you want to start with a general overview of how R is integrated in Tableau, you can find some useful information and tutorials here, here, here and here.

If you didn't read the other two articles (here and here) on my blog explaining the dataset used, here is a short summary: The data are check-ins from the location based social network (LBSN) Foursquare, crawled over a couple of months during 2012 (the data show a sample for the city of Cologne in Germany). Every data point tells the location/venue, the user and a timestamp of when the user checked in at the specified location (plus some other "metadata" such as location type, latitude and longitude). LBSNs are a good use case for recommendation because users that share similar "movement patterns" in geographical space may have common interests, so recommendations about locations someone hasn't visited yet, but similar users already have, can be interesting for that person.

Recommendation Models and Workflow

I don't want to dive too deep into the theory of recommendation models because there is a wealth of good literature out there - for example the tutorial of the R package "recommenderlab" (which is used in this post) is a good way to start. Here, we use two very general modeling approaches, called "User-based Collaborative Filtering" and "Item-based Collaborative Filtering". The first one selects the recommended items based on the items of other, similar users, whereas the second one selects the recommended items as the ones that are similar to the items the user already has.

Let's think about how the overall workflow should look:
The first step is to transform the raw data into a format that can be used for recommendation modeling. A simple format that is often used is a table that contains a line for every venue and user together with the corresponding score. In our case we could use the number of check-ins a user made at a venue as score. But this can be misleading, because a user may visit a venue every day (e.g. the bus station on the way to work) without liking it very much or recommending it to a friend. If no explicit score is given, like in this case, we should think about using a "binary" score instead, where 1 indicates that the user has visited the venue and 0 that he has not.
The second step is then to feed the prepared data into some framework that can be used to test different algorithms and validate the results. I chose the recommenderlab package because within this framework we get everything we need (including the right data structures and functions for predicting future outcomes). The package is extensible, meaning that everybody can include their own algorithms if needed, but it also comes with a set of generic algorithms.
The third step is to parameterize the chosen algorithms in a way that allows us to modify them from our Tableau dashboard.
The fourth and last step is then to validate the resulting models for every specific set of parameters and bring the validation results back to Tableau for visualization. We use two visualizations for this – both well-known in the area of recommendation systems: (1) the ROC plot (showing the relation between the true-positive rate (hit rate or recall) and the false-positive rate (false alarm rate), together with the AUC value - the area under the curve - a single value describing model performance) and (2) the precision-recall plot (showing the relationship between precision and hit rate).
The validation scheme itself is based on a train/test data split, where the unseen data from the test sample is used to calculate the quality measures. In every run and for every model we determine the TOP-N recommendations, where we use twelve different values \(N \in \{1,2,3,4,5,6,7,8,9,10,15,20\}\). Based on a comparison of the TOP-N recommendations with the venues that the user really visited, hit rate, false alarm rate and precision can be calculated.
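To be explicit about these quality measures (standard definitions, stated here for convenience): with \(TP\) being recommended venues the user really visited in the test data, \(FP\) recommended venues he did not visit, \(FN\) visited venues that were not recommended and \(TN\) venues that were neither visited nor recommended, we have

\[ \text{TPR} = \text{recall} = \frac{TP}{TP+FN}, \qquad \text{FPR} = \frac{FP}{FP+TN}, \qquad \text{precision} = \frac{TP}{TP+FP}. \]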
In addition to "User-based Collaborative Filtering" and "Item-based Collaborative Filtering" we use two simpler modeling approaches as benchmarks (a random model and a model based on the global popularity of the venues) to show whether a "personalized" model can beat them.

Data Preparation

As mentioned, the data preparation is done completely in R. The reason is that R offers all the capabilities and flexibility to build up a small "mini ETL" that transforms the raw data into the desired format. For me, Tableau is not an option here because it is intended as a data visualization workbench, not a data preparation tool. The result is a CSV file that can be fed into Tableau, where all other steps for modeling and validation take place. The code for creating the CSV looks as follows:

# -------------------------------------- 
# filter parameters
# --------------------------------------
minNumberOfCheckinsPerVenue = 12
minNumberOfUsersPerVenue = 4

# -------------------------------------- 
# load required libraries
# --------------------------------------
require(data.table)

# -------------------------------------- 
# load Foursquare data
# --------------------------------------
fileName <- "DATA/Foursquare_Checkins_Cologne.csv"
dataset <- read.csv2(fileName, colClasses = c(rep("character", 7), 
	rep("factor",2), rep("character", 2)), 
	dec = ",", encoding = "UTF-8")


# how the first rows of the dataset look
head(dataset)

# -------------------------------------- 
# data preprocessing
# --------------------------------------
dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, format = "%Y-%m-%d %H:%M:%S")
dataset$LAT <- sub(",", ".", dataset$LAT)
dataset$LNG <- sub(",", ".", dataset$LNG)
dataset$LAT <- as.numeric(dataset$LAT)
dataset$LNG <- as.numeric(dataset$LNG)
dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H"))
venueDataset <- unique(dataset[, c("VENUEID", "LNG", 
	"LAT", "VENUE_NAME", "CATEGORY_NAME")])

# use data.table for aggregation
datasetDT <- data.table(dataset)
venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))), 
    by = list(VENUEID, USERID)]
# filter out artificial venues (too few users or check-ins)
tmpVenueDataset <- data.table:::merge.data.table(datasetDT[, 
	list(COUNT_USER_ALL = length(unique(USERID))),
		by = list(VENUEID)][COUNT_USER_ALL >= minNumberOfUsersPerVenue], 
	datasetDT[, list(COUNT_CHECKINS_ALL = length(unique(CHECKIN_ID))), 
		by = list(VENUEID)][COUNT_CHECKINS_ALL >= minNumberOfCheckinsPerVenue], 
	by = "VENUEID")
venueUserDataset <- data.table:::merge.data.table(venueUserDataset, 
	tmpVenueDataset, by = "VENUEID")
venueUserDataset <- data.frame(venueUserDataset)

# -------------------------------------- 
# merge scoring table with venue information and save as csv 
# --------------------------------------

# venueDataset contains all venues venueUserDataset contains all the
# relationships (ratings)
venueUserFullDataset <- merge(x = venueUserDataset, y = venueDataset, 
	by = "VENUEID")

# move IDs to 1:N range
venueUserFullDataset$USERID <- match(venueUserFullDataset$USERID, 
	unique(venueUserFullDataset$USERID))
venueUserFullDataset$VENUEID <- match(venueUserFullDataset$VENUEID, 
	unique(venueUserFullDataset$VENUEID))

write.csv2(venueUserFullDataset, 
	file = "DATA/Venue_Recommendation_Dataset_Example.csv",
	row.names = FALSE)

Integrating R-package “recommenderlab” in Tableau

The first step in Tableau is to connect to the CSV dataset. We focus on the two dimensions (USERID and VENUEID) that are absolutely necessary for creating our recommendations. Now it is time to think about how we have to approach the R interface of Tableau. It is important to know that Tableau requires the dimension of the input passed from Tableau to R to be the same as that of the output returned from R to Tableau. This means that if you specify the inputs for Tableau's R table calculation functions as a couple of data vectors, each of length 100, then Tableau expects to get exactly one vector with 100 values back (one exception: a single scalar value is also fine). This concept has two important consequences:

  1. Because we want to display the values of the defined quality measures graphically, the number of return values is significantly lower than the number of input tuples. For example, the dataset contains more than 5,000 tuples, but the output for the false alarm rate consists of only 4 x 12 values (12 values per model - one for every TOP-N list length).

  2. The result of calling R functions in Tableau is restricted to exactly one measure, but we need several of them - one for each quality metric. We could solve this by calling the whole modeling procedure several times, where each call gives back the values for one quality metric. But this would kick down performance significantly and is, therefore, not an option.

The solution to the first problem is simple: the interesting values are written at the beginning of the result vector. One can fill the rest of the result vector with NA values and filter them out afterwards in Tableau - what's left is the data we want.
The solution to the second issue is a little bit more complicated. One approach is to do the whole calculation once in a separate task (a calculated field) and save the result as an R object "in-memory". After that, we create additional calculated fields, one for every quality metric we need. Their only purpose is to extract the metrics from the "in-memory" result object – vector by vector. To make this work, we need to run the R computations inside a "shared session" of Rserve (Rserve is the middleware needed for the communication between Tableau and R). This means that all the calculations share the same context (variables, functions, objects, …), which is not the case for the Linux implementation: on Linux every calculation creates a new session. Therefore, the current implementation only works when Rserve is installed on a Windows machine. I'm sure there are options to bring this calculation to a Linux system as well – if you have an idea, please leave a comment.
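A stripped-down sketch of both ideas combined (all names and numbers are illustrative; the real calculated fields follow below):

# calculated field A: compute once, keep the padded metric vector in the shared Rserve session
fpr <- c(0.01, 0.05, 0.12)                  # toy metric values (only 3 "real" values)
n <- length(.arg1)                          # number of input rows coming from Tableau
fpr <- c(fpr, rep(NA, n - length(fpr)))     # NA-padding to match the expected output length
return(1)

# calculated field B: just read the padded vector back from the shared session
return(fpr)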

Having all this in mind, the next step is to create a calculated field that transforms the input from Tableau into a suitable data structure for recommender systems. As mentioned above we want to model the score as a binary one, so starting from a rating matrix with decimal scores, we use the binarize function to get the final 0/1 rating matrix:

# Tableau field: SCRIPT_INT('
require(recommenderlab)
rm(list = ls())
n <- length(.arg1)
userid <- .arg1
venueid <- .arg2
score <- .arg3
recMatrixTmp <- sparseMatrix(i = userid, j = venueid, 
	x = score, dimnames = list(as.character(sort(unique(userid))), 
	as.character(sort(unique(venueid)))))
recMatrix <- new("realRatingMatrix", data = recMatrixTmp)
recMatrix <- binarize(recMatrix, minRating = 1)
return(1)
# ',MAX([USERID]),MAX([VENUEID]),MAX([COUNT_CHECKINS]))

Subsequently, all the parameters that we want to use when optimizing the recommendation models need to be defined. For the user-based collaborative filtering approach these are the method used to calculate the similarity between users ("Similarity Function"), the number of users for calculating the recommendation ("NN"), whether we want to use weighted distances when aggregating the ratings of the most similar users ("Weighted (UBCF)") and whether we want to use only a sample of the given data ("Sample (UBCF)"). The parameters for the item-based collaborative filtering approach are the similarity method ("Similarity Function"), the number of most similar items that should be taken into account ("k") and whether we want to normalize the similarity matrix to remove user rating bias ("Normalize Similarity Matrix (ICBF)").
We also need two parameters for the validation scheme: how the data should be split into training and test data ("Train / Test Split") and how many iterations of the train/test cycle we need to calculate the quality metrics ("Evaluation Runs"). Of course more parameters are possible, but I limited the number because of the demo character of this implementation.

Overview of the defined parameters

The main workhorse of the dashboard is the calculated field "RComputeRecommendation(Multi)", which does all the modeling, validation and output shaping. I will first show the code and afterwards explain some parts in greater detail:

# Tableau field - SCRIPT_INT('
require(recommenderlab)

# extract parameters
seed <- 1234
trainProp <- .arg1[1]
simMethod <- .arg2[1]
evRuns <- .arg3[1]
NN <- .arg4[1]
WEIGHTED <- as.logical(.arg5[1])
if (.arg6[1] == "FALSE") {
    SAMPLE <- as.logical(.arg6[1])
} else {
    SAMPLE <- round(dim(recMatrix)[1] * trainProp * as.numeric(.arg6[1]))
}
kIBCF <- .arg7[1]
NORMALIZESIMMAT <- as.logical(.arg8[1])

# assemble parameterization
set.seed(seed)
es <- evaluationScheme(recMatrix, method = "split", 
	k = evRuns, given = 1, train = trainProp)
algorithms <- list(
	"random items" = list(name = "RANDOM", 
		param = list(normalize = "center")), 
	"popular items" = list(name = "POPULAR", 
		param = list(normalize = "center")), 
	"user-based CF" = list(name = "UBCF", 
		param = list(method = simMethod, nn = NN, 
        weighted = WEIGHTED, sample = SAMPLE)), 
	"item-based CF" = list(name = "IBCF", 
        param = list(method = simMethod, k = kIBCF, 
			normalize_sim_matrix = NORMALIZESIMMAT)))

# run validation and grab results
set.seed(seed + 1)
listN <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20)
ev <- evaluate(es, algorithms, n = listN)
resMat <- do.call("rbind", avg(ev))
resMat <- cbind(resMat, LIST_LENGTH = as.numeric(row.names(resMat)))
row.names(resMat) <- rep(names(avg(ev)), each = length(listN))

# calculate AUC
AUC <- as.numeric(by(resMat, rep(1:length(ev), 
	each = length(listN)), function(x) {
		tpr <- c(0, x[, "TPR"], 1)
		fpr <- c(0, x[, "FPR"], 1)
		i <- 2:length(tpr)
		return((fpr[i] - fpr[i - 1]) %*% (tpr[i] + tpr[i - 1])/2)
	}))
resMat <- cbind(resMat, AUC = rep(AUC, each = length(listN)))

# pull out quality metrics
recall <- as.numeric(c(resMat[, 6], rep(NA, n - length(resMat[, 6]))))
precision <- as.numeric(c(resMat[, 7], rep(NA, n - length(resMat[, 7]))))
tpr <- as.numeric(c(resMat[, 9], rep(NA, n - length(resMat[, 9]))))
fpr <- as.numeric(c(resMat[, 8], rep(NA, n - length(resMat[, 8]))))
listlength <- as.numeric(c(resMat[, 10], rep(NA, n - length(resMat[, 10]))))
type <- as.character(c(row.names(resMat), rep(NA, n - length(row.names(resMat)))))
config <- rep(sapply(algorithms, function(i) {
	as.character(paste(unlist(i), collapse = "/"))
}), each = length(listN))
algoConf <- as.character(c(config, rep(NA, n - length(config))))
auc <- as.numeric(c(resMat[, 11], rep(NA, n - length(resMat[, 11]))))
return(1)
# ',MAX([Train / Test Split]), MAX([Similarity Function]), MAX([Evaluation
# Runs]), MAX([NN (UBCF)]), MAX([Weighted (UBCF)]), MAX([Sample (UBCF)]),
# MAX([k (ICBF)]), MAX([Normalize Similarity Matrix (ICBF)]),
# [RInitDataStructure])

The first block stores all the parameter values from Tableau in their R counterparts. The only special thing is the "Sample (UBCF)" parameter, where the absolute size of the sample has to be calculated if sampling is enabled.
We also define a "seed" here, which is quite important to get equal quality metrics across different visualizations when sampling is involved. Because there will be two sheets at the end (ROC and precision-recall plot), showing two perspectives of the same model, we have to make sure that the computational results they show are the same.
Next we define the parameterization of the algorithms and afterwards run our validation scheme using the train/test split.
In principle that's all - everything that follows is about how we extract the results and bring them back to Tableau. As explained before, we create a vector of the same length as the input data for every quality metric. The first N elements contain the true values whereas the others only contain NA values. Later in the dashboard we filter out all the NA elements.

To be able to visualize the defined quality metrics, we use calculated fields - to be more precise, one calculated field for every metric: true-positive rate, false-positive rate, precision, recall, AUC, length of the TOP-N list (as label), type of the algorithm (as label) and parameterization (as label). Below you find the code for the false-positive rate as an example - this is how every calculated field is defined:

# Tableau field: SCRIPT_REAL('
cat("++++++RFieldFPR++", format(Sys.time(), "%Y/%m/%d_%H:%M:%S"), "+++++\n")
return(fpr)
# ',[RComputeRecommendation(Multi)])

It is really simple: the field just contains one line that is used for debugging purposes and then returns the proper metric. But take a close look at the field 'RComputeRecommendation(Multi)' (our computational workhorse), which is used as input but then never used inside the R calculation. The reason for this is that we have to force Tableau to recalculate the models (based on changed model parameters) before we extract the results. Prior to this finding I always ran into strange situations where parts of the metrics seemed to contain the results from before the parameter change. By using the calculated field as a parameter, we make sure that the extraction of the results takes place after the evaluation (I know it doesn't look nice - maybe someone knows how to handle this in a more gentle way).

Creating Visualizations for Model Tuning

Now we have everything we need to assemble the two visualizations showing the ROC plot and the precision-recall plot for the different models. To do so we place the true-positive rate on the row and the false-positive rate on the column shelf (respectively precision and recall for the precision-recall plot). Furthermore, we place the type-of-algorithm field on color and shape to visualize the curves for the different algorithms. The length of the TOP-N list will serve as label, and additional fields are used as part of the tooltips. It is also important to use one of the calculated fields as a filter and exclude all NULL values as described above. The last point is to deselect "Aggregate Measures" under "Analysis" to make all the single points visible.

Precision-recall plot

ROC plot

After we have created the two visualizations, it is time to place everything onto one single dashboard, arrange all the parameter fields and configure them to trigger changes in both plots. Because every parameter change will cause Tableau to recalculate the models, it can be wise to deselect the automatic update under "Dashboard" -> "Auto Update" and instead trigger an update manually by hitting F10. And then – happy tuning …

Tableau Dashboard for Tuning Recommendation Algorithm showing Tooltip

Summary

We now have a really nice dashboard that also enables users not familiar with R to find well suited recommendation models and compare the effect of different parameterizations. Because the whole implementation is based on only three attributes (User ID, Product/Venue ID and Score), it is easy to change the dataset while leaving most of the R calculations and the visualizations untouched.

A couple of words at the end about my experiences with the interface connecting Tableau and R. If you want the whole modeling cycle to take place in Tableau, you have to use some hacks to circumvent obstacles in the current implementation of the interface. One is the dependence between the dimension of input and output. Especially in the scenario given here, a data analyst is often not interested in the predictions for the training data. Instead, he wants to know more about quality metrics, which typically consist of only a small set of values. Using the "NA-padding" strategy is one way to bypass this issue.

The other point is that the output/result of an R table calculation can only contain one vector – in this case we can use the idea of storing the results as an R object and accessing it with different calculated fields (this should only work under Windows) or putting all the fields belonging to one entity into one string (together with a specific delimiter) and parsing it afterwards in Tableau (this option should also increase performance because we send less data back to Tableau).
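On the R side, the delimiter idea could look roughly like this (the variable names and the delimiter are just illustrative):

# pack all metrics belonging to one row into a single delimited string
packed <- paste(algorithm, listLength, tpr, fpr, precision, sep = "|")
# pad to the input length as before and return one string vector to Tableau
return(as.character(c(packed, rep(NA, n - length(packed)))))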

The last point is that if I want to use the calculated fields in different visualizations that are all part of one dashboard, Tableau will execute the code inside the fields for every sheet, which may introduce significant performance issues (e.g. calculating the model twice). An option to avoid this kind of problem is to shift some logic into R, in such a way that R checks the given parameters and only recalculates the model if something has changed.

As always feel free to comment on this post, especially if you have good ideas/alternatives on how to solve some of the mentioned points.

Please also find the workbook with the R code from this blog post attached: Tuning_Recommendation_Models_in_Tableau.twbx

February 2, 2014
"Show me the way to the next whiskey bar" (The Doors - Alabama Song) - Interactive Location Recommendation using Tableau

Since I started using Tableau I've been quite fascinated by the capabilities of this piece of software. Before Christmas I was looking at how I could build an interactive visualization that helps me to explore the relationships between different objects, showing which objects are very close to each other according to some similarity measure, or vice versa. This is one fundamental question in recommendation systems. It was a nice coincidence that at the same time Bora Beran published his blog post on how to visualize a correlation matrix, because his tutorial, together with a well written article from Jonathan Drummey about how to calculate a "No-SQL" cross product, served as a starting point for me to create a second version of my previous blog post about venue recommendation - but now inside Tableau and interactive!

This blog article is also based on data from the location based social network Foursquare, which was crawled over a couple of months during 2012 (the data show a sample from the city of Cologne). In principle every data point states a location/venue, a user and a timestamp of when the user checked in at the specified location (plus some other "metadata" such as location type, latitude and longitude). Based on this dataset the goal is to calculate and visualize the similarity of each pair of venues. The similarity can then be used to recommend locations based on venues a user visited before (please look into my older blog post for additional information about the dataset and location recommendation). A similarity measure that is quite common in the domain of recommendation systems is the so-called cosine similarity. Basically, the cosine similarity measures the angle between two vectors, where every vector represents one of the objects (here venues) someone wants to compare. Every element \(i\) of such a vector expresses the affinity of person \(i\) towards the object that is represented by that vector. In this example we measure the affinity by counting the number of times that person visited a specific venue. For example, if person 2 visited venue A three times, then element 2 of A's vector is 3, whereas it is 0 if person 2 never visited place A. In general the cosine similarity ranges between -1 and 1, where 1 means completely similar, 0 means no similarity and -1 means both venues are completely different. Because every score has to be zero or positive, the smallest possible value for the cosine similarity here is 0, which makes sense because using the described methodology a user cannot express any negative affinity for a place.
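Written as a formula, the cosine similarity of two venue vectors \(A\) and \(B\) is simply their normalized dot product:

\[ \text{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}} \]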

Prepare raw data with R

My first objective was to reproduce a visualization similar to Bora's correlation matrix, but based on the cosine similarity. The matrix itself should provide a nice overview to identify the most connected venues very quickly.

The first step therefore is to bring the data into the correct shape for Tableau. Because Tableau's abilities to reshape and enrich data are limited, this is a nice example of how preprocessing data with R can leverage advanced analytics in Tableau. The simple idea is to calculate the number of check-ins by venue and user (~score) and then use an inner join of this dataset with itself based on the USERID (through a "multiple tables" connection in Tableau). Then we can easily compare different venues by looking at both scores from a user using a set of Tableau table calculations. For the aggregation, the R code below shows the necessary steps: first transform the necessary fields, then extract a single dataset that consists of all venues and their properties like latitude, longitude, venue name and category name.

# -------------------------------------- 
# filter parameters
# --------------------------------------
minNumberOfCheckinsPerVenue = 10
minNumberOfUsersPerVenue = 2

# -------------------------------------- 
# load required libraries
# --------------------------------------
require(data.table)

# -------------------------------------- 
# load Foursquare data
# --------------------------------------
fileName <- "Foursquare_Checkins_Cologne.csv"
dataset <- read.csv2(fileName, colClasses = c(rep("character", 7), rep("factor", 
    2), rep("character", 2)), dec = ",", encoding = "UTF-8")

# how the first rows of the dataset look
head(dataset)

# -------------------------------------- 
# data preprocessing
# --------------------------------------
dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, 
    format = "%Y-%m-%d %H:%M:%S")
dataset$LAT <- sub(",", ".", dataset$LAT)
dataset$LNG <- sub(",", ".", dataset$LNG)
dataset$LAT <- as.numeric(dataset$LAT)
dataset$LNG <- as.numeric(dataset$LNG)
dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H"))
venueDataset <- unique(dataset[, c("VENUEID", "LNG", 
    "LAT", "VENUE_NAME", "CATEGORY_NAME")])

# use data.table for aggregation
datasetDT <- data.table(dataset)
venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))), 
    by = list(VENUEID, USERID)]
# filter out artificial venues (too few users or check-ins)
tmpVenueDataset <- data.table:::merge.data.table(
    datasetDT[, list(COUNT_USER_ALL = length(unique(USERID))), 
        by = list(VENUEID)][COUNT_USER_ALL >= minNumberOfUsersPerVenue], 
    datasetDT[, list(COUNT_CHECKINS_ALL = length(unique(CHECKIN_ID))), 
        by = list(VENUEID)][COUNT_CHECKINS_ALL >= minNumberOfCheckinsPerVenue], 
    by = "VENUEID")
venueUserDataset <- data.table:::merge.data.table(venueUserDataset, tmpVenueDataset, 
    by = "VENUEID")
venueUserDataset <- data.frame(venueUserDataset)

# -------------------------------------- 
# merge scoring table with venue information and save as csv 
# --------------------------------------

# venueDataset contains all venues
head(venueDataset)
## VENUEID LNG LAT VENUE_NAME 
## 1 4aef5d85f964a520dfd721e3 6.959 50.94 Koeln Hauptbahnhof 
## 24 4bade052f964a520506f3be3 6.949 50.93 Stadtbibliothek Koeln
## 25 4baf1998f964a52033eb3be3 6.964 50.93 Deutsches Sport & Olympia Museum
## 26 4baf428cf964a52024f43be3 6.962 50.92 Ubierschaenke 
## 27 4ba4f032f964a520dac538e3 6.849 50.92 OBI Baumarkt 
## 28 4bc210d92a89ef3b7925f388 6.927 50.95 Pfeiler Grill 
## CATEGORY_NAME 
## 1 Travel & Transport 
## 24 College & University 
## 25 Arts & Entertainment 
## 26 Nightlife Spot 
## 27 Shop & Service 
## 28 Food

# venueUserDataset contains all the relationships (ratings)
head(venueUserDataset)
## VENUEID USERID COUNT_CHECKINS COUNT_USER_ALL COUNT_CHECKINS_ALL
## 1 4a120aa1f964a5206d771fe3 594791 1 11 13
## 2 4a120aa1f964a5206d771fe3 24447 2 11 13
## 3 4a120aa1f964a5206d771fe3 139050 1 11 13
## 4 4a120aa1f964a5206d771fe3 74382 1 11 13
## 5 4a120aa1f964a5206d771fe3 2226845 2 11 13
## 6 4a120aa1f964a5206d771fe3 13255988 1 11 13

venueUserFullDataset <- merge(x = venueUserDataset, y = venueDataset, by = "VENUEID")
head(venueUserFullDataset)
write.csv2(venueUserFullDataset, file = "140116_venue_recommendation_dataset.csv", 
    row.names = FALSE)

We use the data.table package to easily calculate the score and two other aggregates (COUNT_USER_ALL and COUNT_CHECKINS_ALL) that are used to filter the dataset for "artificial" venues, i.e. places that were only visited by one person or had fewer than 10 check-ins. The last step then is to enrich the result with the venue information and save it as a simple CSV. Now we are ready to start with Tableau.

Creating a similarity matrix with Tableau

If a text file connection with "Multiple Tables" (the common key is the USERID) is used, it is possible to create a data structure that allows a direct comparison of the scores for each venue pair.

Creating multiple tables connection
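To illustrate what this self-join produces, here is a toy version in R (the data frame is made up; in the workbook Tableau performs the join itself via the multiple tables connection):

# a small score table joined to itself on USERID yields one row per user and venue pair
scores <- data.frame(USERID = c(1, 1, 2, 2),
                     VENUEID = c("A", "B", "A", "C"),
                     COUNT_CHECKINS = c(3, 2, 1, 4))
merge(scores, scores, by = "USERID", suffixes = c("", "_Y"))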

After that we rename some of the measures and dimensions, create a hierarchy based on VENUEID and VENUE_NAME and set the correct type for both coordinates. Afterwards it should look like this:

Overview dimensions and measures

To calculate the cosine similarity we need to precompute three things:

  • the dot product between the vectors of the two venues,
  • the Euclidean norm for the vector of the first venue and
  • the Euclidean norm for the vector of the second venue

Because of how we created our overall dataset (inner join), it is very easy to compute the dot product by multiplying the two scores we have per comparison (aka row).

Table calculation for computing the dot product

Both Euclidean norms are somewhat more complex and we need the help of Tableau's "domain completion" (explained here) to calculate them. We use this approach (changing SUM(<field>) into SUM(POWER(<field>, 2))) and create two table calculations. Keeping in mind that we want to create a matrix-like visualization, one of the created fields will compute the Euclidean norm down and one across the matrix. This can be configured after we have placed both fields onto the visualization shelf.

Table calculation for the Euclidean norm

Now we can compute the cosine similarity by putting all three things together:

Formula for the computation of the cosine similarity
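The same computation expressed in plain R as a quick cross-check (the two score vectors are made-up check-in counts of two venues over four users):

a <- c(3, 0, 1, 2)
b <- c(1, 0, 0, 2)
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
## [1] 0.83666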

After that let's start to build the visualization by bringing the venue hierarchy from the first table onto columns and the venue hierarchy from the second table onto rows.

Basic form of the similarity matrix

Now put the field Cosine Similarity on color (change the visualization type to Square) and change the table calculation for Sum of Squared Scores (horizontal) to compute across and for Sum of Squared Scores (vertical) to compute down.

Using the correct table calculation settings for the cosine similarity

That's it - you can also place the field on Label to show the scores and/or use the computed cosine similarity as a filter.

Complete similarity matrix

One step further - interactive exploration by bringing together venue similarity and spatial distance

The matrix shows a lot of information and gives you a nice overview of the overall pattern (here one can see that for most of the venues the similarity is rather small), but on the other hand it lacks clarity. You can sort by one column to bring up the most similar venues, but you still need to scroll up and down to take in all values, or from left to right to compare them with other venues. What's also not possible based on this visualization type is to examine, for example, whether spatial proximity is correlated with the computed similarity.

Putting both things together, I'd like to show a visualization that uses a map-based approach to show locations similar to a chosen venue (technically this corresponds to a visualization that displays either a row or a column of the matrix). Every time the user chooses a different venue, the map will update to show similar venues based on the calculated cosine similarity.

The first step is to create a parameter Reference Venue that contains all the venues the user can select as the "reference venue". For this we create a calculated field that combines VENUEID and VENUE_NAME (because we can't assume that the venue name is unique) and use it to "fill" the new parameter.

Creating the parameter for the reference venue

Next we need to adjust the calculation of the dot product. The idea is to set the dot product to zero if the selected venue is not involved (this forces the cosine similarity for such a pair of venues to be zero). After that, we can use the cosine similarity to adjust the size of every venue on the map - venues that are not relevant will "disappear".

Calculating the adjusted dot product
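Expressed as R-like pseudocode, the adjustment boils down to something like this (all names are illustrative; the actual implementation is the Tableau calculated field shown above):

# zero out the dot product for every venue pair that does not involve the reference venue
dotProduct <- ifelse(venueX == referenceVenue | venueY == referenceVenue,
                     scoreX * scoreY, 0)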

Last but not least we need to change the calculation of the cosine similarity a little bit:

Cosine similarity for map display

And that's it. Place both coordinates onto column and row and (Y_)VENUE_NAME and (Y_)VENUEID on the detail shelf. The calculated field for the cosine similarity should be placed on color and size. Remember to configure the table calculation in the same way as before.

Interactive exploration of venue similarity using a map visualization

You will see that as you change the reference venue with the dropdown field, the map is updated accordingly.

Visualizing object similarity for non-spatial data

Because the Foursquare dataset is about geographical data, it seems natural to use a map as the visualization method. But what if this spatial relationship is not given - for example, how could we display the cosine similarity for different products sold in an online shop? I think treemaps are perfectly suited to display such information. Similar to the map approach we use the cosine similarity for size and color. Using treemaps, it is easy to get an overview of the most similar venues as they are clustered around the upper left corner. The transformation from the map is simple - just drop the coordinates and add the venue name and cosine similarity to label.

Visualizing the cosine similarity using a treemap

There is one important thing left: if you download the attached workbook and play around with it, you may experience that the performance is rather low. Especially if one uses larger datasets, Tableau takes some time to render the matrix. It would be interesting to see if someone can speed up the calculation.

Please also find my workbook with all the examples from this blog post attached: Tableau_Venue_Recommendation.twbx

November 3, 2013
Dream Team - combining Tableau and R

Last quarter was a bit too busy to write new blog posts because of a new job. And changing jobs often comes along with changing the tools you work with. That was my way to Tableau. Tableau is one of the new stars in the BI/analytics world and definitely worth a look. The people at Tableau describe their tool as an instrument that combines interactive visual analytics, sophisticated visualization methods, an easy to use graphical user interface and a lot of connectors for dozens of different data sources, ranging from standard Excel to more "Big Data" like things such as different Hadoop distributions and NoSQL databases. What is really impressive is that you can create interactive visualizations in almost no time.

I'm currently using Tableau to carry out explorative analysis and reporting up to management level (creating high polish PDFs). I'm quite satisfied with Tableau's capabilities for this, even though Tableau focuses more on self-service BI. What I missed are sophisticated analytical methods like different algorithms for classification, regression, clustering et cetera. So I was really happy when Tableau announced the integration of R in their upcoming release, and I quickly enrolled in their beta tester program a few weeks ago. Time now to share some insights from my first experiments with the beta version of the upcoming Tableau release 8.1.

The first step is to get the connection between R and Tableau, which is based on Rserve, up and running. Following the approach given here it should be really easy, even on Windows. It is good to know that you can launch Rserve with "R CMD Rserve_d" instead of "R CMD Rserve" to grab some information from the communication between Tableau and R for debugging (here you can find another short comment about some more advanced debugging techniques). Alternatively you can launch Rserve from within R with:

library(Rserve)
Rserve()

If everything works fine you can test your connection from inside Tableau by selecting "Help -> Manage R Connections …". Be sure to replace "localhost" with "127.0.0.1" because of a bug in the current beta version (tested with 8.1 Beta 2 - may be fixed in newer versions) which will otherwise slow down performance.

The integration uses so-called table calculations to retrieve results from calculations within R. A table calculation could be used to state a calculation like "partition the sales data by region and calculate the overall revenue per product category". This means that the arguments passed from Tableau to R have to be specified as aggregates. The result of the R calculation then has to be either one value or the same number of values as in the partition (a vector). This sounds a little bit difficult and, to be honest, in the beginning it was difficult for me to understand the logic behind it. But let's try some examples to demonstrate what that means.

The first one is derived from the official Tableau blog and demonstrates the application of k-Means using the iris dataset.

Scatter plot of the iris dataset where the three different species could be identified by different colors

Our aim is to visualize the resulting clustering, which should give us some understanding of how well it matches the true classification. We use a calculated field to pass the data to R and get the resulting cluster ids back from the k-Means function. The code for the table calculation inside the calculated field is very simple:

SCRIPT_INT('result <- kmeans(x = data.frame(.arg1,.arg2,.arg3,.arg4), 3)
  result$cluster', 
  SUM([Sepal#Length]),SUM([Sepal#Width]),SUM([Petal#Length]),SUM([Petal#Width])
) 

SCRIPT_INT( R code, parameters ) is the Tableau table calculation function for running R code that returns an integer value or vector (Tableau also offers functions that return boolean, string and real values). It is important to know that this function only returns one value/vector and, as such, if you need more parts of the results from your R computation, you have to create additional calculated fields. In the code above we first apply data.frame to the given parameters (going from .arg1 to .argN, where N is the number of parameters) to create a data object for the k-Means function. The resulting cluster ids are then passed back to Tableau. As you see, all four parameters must be aggregated first (SUM(...)) before we can use them in our R calculation.

We can then use the calculated field to show the cluster assignment by assigning every cluster id a different shape. Because we want to see the results on a detailed level, we need to deactivate the default aggregation in Tableau by clicking on "Analysis -> Aggregate Measures". The resulting plot reveals that there is a huge mismatch between the clusters and the classes - not the best clustering solution.

Results from clustering the iris dataset with k-Means and k = 3

Now it would be nice to leverage the interactive features of Tableau and parameterize the R calculations in a way that lets us modify the calculation from within Tableau. This leads to my second example, visualizing the results of a PCA - showing a scatterplot of the objects in the space of two chosen components. For this we use the USArrests dataset (Violent Crime Rates by US State) and the standard R function princomp for the principal component analysis:

SCRIPT_REAL('ncomp <- max(.arg5)
  df <- data.frame(.arg1,.arg2,.arg3,.arg4)
  pc <- princomp(df, cor = TRUE)
  as.numeric(pc$scores[,ncomp])',
  SUM([Assault]),SUM([Murder]),SUM([Rape]),SUM([UrbanPop]),[No. Component - x axis]
)

We can extract the scores for every component from the result of princomp. Because we need the scores for the x and y axis, we create two calculated fields - one for each axis (above you can see the code for the calculated field that feeds the x axis). Together with the data we pass a parameter from Tableau to R that determines which component's scores we want (one parameter for every calculated field).

"Parameters and calculated fields for the PCA example"

Because R receives this parameter with the same length as the other inputs, we need to aggregate the parameter to yield a scalar value for selecting the right scores. Now it's beautiful to see how we can integrate these R calculations into Tableau's typical interactive workflow: drag one of the attributes to column and one to 'Shape' - now you can see how these features load on the selected components. If you activate the parameter control for each of the parameters, you can change the scores mapped on the x and y axis on the fly. Really a nice tool to explore the results of a principal component analysis.

"Dashboard that could be used to explore the results from a principal component analysis"

My last example demonstrates a use case that is common for me but that, in my opinion, also shows the limitations of the current R integration. Imagine that your personal data scientist sends you a fine tuned data mining model that you want to use for further predictions. You want to apply this model to your current dataset in Tableau and use the predictions as an additional feature for your visualizations to answer further questions. How can the new R integration in Tableau help us with that? Let's start and first create the model outside of Tableau, because this is likely the way the data scientist will do it (using tools from his own development environment). For this we create a simple classification model with rpart (decision trees) for predicting the species feature of our iris dataset and save the model in an RData file ready for sharing (we could also use a more standardized format like PMML, available in the corresponding package pmml).

require(rpart)
data(iris)
rpartModel <- rpart(Species ~ ., data = iris)
save(rpartModel, file = "C:/tmp/RpartModel.RData")

Now we can try to utilize the model from within Tableau. For this we create the following calculated field that loads the model and predicts the outcome for the given data. The code looks like this:

SCRIPT_STR('
  require(rpart)
  data(iris)
  load(file = "C:/tmp/RpartModel.RData")
  newiris <- data.frame(Sepal.Length = .arg1, 
                        Sepal.Width = .arg2, 
                        Petal.Length = .arg3, 
                        Petal.Width = .arg4)
  as.character(predict(rpartModel, newiris, type ="class"))',
  ATTR([Sepal#Length]),ATTR([Sepal#Width]),ATTR([Petal#Length]),ATTR([Petal#Width])
) 

It is not surprising that the visualization of the true class (color) versus the predictions (shape) shows good results. As before, you need to deselect "Aggregate Measures" for the plot.

"visualization of predicted vs true class label"

So far so good. But what if we want to use the predictions together with the class labels to create a simple confusion matrix? Let's try and plot both features against each other and color the results using an additional calculated field called Misclassification (IF ATTR([Species]) = [R Prediction] THEN 'No' ELSE 'Yes' END). What I achieved after some experiments are either disaggregated results (showing a "1" for each entry, colored according to "Misclassification" - which is not very helpful) or a single prediction for each partition of the aggregated data that gets multiplied by the number of data points inside each partition. Here are two pictures showing both possible results:

"Confusion matrix - disaggregated results"

"Confusion matrix - aggregated results"

Sadly, until now I didn't get things running the way they should be (maybe some of you have a helpful idea - if so, don't hesitate to leave a comment), which makes me think that some of my desired use cases are not possible with the current R integration in Tableau.

Finally I want to summarize what I have observed so far: I'm really excited whenever a commercial software company announces that it will integrate R into their software stack. What Tableau did with R is really useful, as the shown use cases hopefully demonstrated. But the integration is also not as simple as other things in Tableau and I still need more time to get a better picture of the things that work and the things that won't. Therefore it is perfect that Tableau has a very active community and I'm sure that within a few weeks we will see dozens of new possibilities for how this integration of R can produce fruitful results.

Please also find my workbook with all the examples from this blog post attached: Test_R_Integration_Tableau_8_1.twbx

June 23, 2013
Time Is on My Side - A Small Example for Text Analytics on a Stream

Introduction and Background

While my last posting was about recommendation in the context of location based social networks, there are also other interesting topics regarding the analysis of unstructured data. The most established one is probably Text Analytics/Mining, focusing on all sorts of text data. For me, coming from spatial analysis, this topic is relatively new, but I couldn't help noticing that more and more easy-to-use packages for analyzing various kinds of text sources entered the R ecosystem and a lot of postings and tutorials about this topic were created. Time for me to carry out some experiments - this post is about the outcomes. A last short notice: you will read about a lot of different techniques you might not directly relate to Text Analytics, like Shiny, rCharts and Data Stream Mining. That's true. But all of them can be connected with Text Analytics and I will show a few examples of that. Mostly I use these blog posts as a playground for exploring how different new techniques can be nicely fit together and how worthwhile they could be for my daily work. This posting is a good example of that and I hope you can benefit at least a little bit from it.

Social media services like Twitter are clearly a rich source for analyzing texts and also pose some clear challenges for simple methods (slang, spelling, …). Imagine you want to explore a special topic that people discuss on Twitter. Examining the terms used by the users could give you a first impression of the development of the discussion and also serve as a starting point for exploring different sub-themes or connected topics of the overall topic. A simple approach might look like this: one could extract all the terms/words out of some collected tweets and use them for exploratory data analysis. A potential drawback of such an approach is that the size of the data may be too large (for processing, computing or saving), especially when the topic is very popular and there are thousands or millions of corresponding tweets. To overcome this problem you could think about working only with a sample of terms. This should give you approximately the same information as working with all discovered terms/words while needing fewer resources. But taking a proper sample (where each term has the same probability of being selected) from an ongoing discussion (a stream of terms) is not trivial. You could wait until the discussion is over and then sample from the collected tweets, but this gives you no actionable knowledge (because the discussion is over) and you have to store all the tweets or terms (size!). The second approach could be to save all the extracted terms (size!) and take a new sample from all the saved terms each time a new word "arrives". This is neither efficient nor feasible, depending on the size of the data (imagine the discussion on Twitter lasts for weeks or months). I will show a well established sampling method to overcome these problems, but first let me state the concrete objective of my experiment:

Constantly crawl all tweets regarding a special topic while maintaining a proper sample of the terms (I don't want to save all the extracted words). In addition, create a prototypical tool to explore the discussion based on the sampled words.

Reservoir Sampling

So, first we have to think about the sampling method, which should work on a data stream of unknown length. Reservoir Sampling is such a method and it is very simple: first define the sample size \(R\) (~ size of the reservoir) and fill it with the first \(R\) items (~ terms/words) from the stream. Once every place in the reservoir is occupied, the following steps have to be performed for every new item on the stream:

  1. Take a random number from the uniform distribution between 0 and 1 and compare it with \(R/Iter\), where \(Iter\) is the index of the current item.
  2. If the random number is smaller, replace a randomly chosen term in the reservoir with the current item.

The statistics behind it ensure that at iteration \(K\) every term from the stream seen so far has the same probability of being part of the sample. Some more detailed explanations can be found here and here. The code for Reservoir Sampling is shown below.

rsize = 1000
iTerm = 0
wordReservoir = character(rsize)
...

termVec <- ...  # vector with words

for (i in 1:length(termVec)) {
    # iTerm counts all terms seen on the stream so far
    iTerm <- iTerm + 1
    if (iTerm <= rsize) {
        # still initializing the reservoir (R is 1-indexed)
        wordReservoir[iTerm] <- termVec[i]
    } else {
        pI <- runif(1)
        if (pI < (rsize/iTerm)) {
            # keep: replace a randomly chosen term with the current one
            itemRemove <- sample(1:rsize, size = 1)
            wordReservoir[itemRemove] <- termVec[i]
        }
    }
}

A Small Program for Crawling and Processing Tweets

The second step was to implement the crawler for the tweet stream. Because I had read a blog post some time ago about a new R package for working with streams, I decided to follow the proposed framework and implemented my own stream data source for working with Twitter. For that I had to create my own "DSD" class (Data Stream Data), which only needs a constructor function and a get_points() method. Although it is not really necessary to use the framework for my simple example, I think the small amount of extra implementation time was worthwhile, because in the future we could also use the algorithms offered by the stream package together with a Twitter stream. The code for the "DSD_Twitter" class is shown below (Note: You will need an account on Twitter and proper credentials to get it running - see also this new post):

# --------------------------------------------- 
# Content of file DSD_Twitter.R 
# ---------------------------------------------
DSD_Twitter <- function(searchTerm, lang = NULL,
                        streamMinLength = 20,
                        pathToCredentials = "twitterCredentials.RData") {
    require(twitteR)
    require(plyr)
    require(ROAuth)

    # Initialize OAuth
    options(RCurlOptions = list(verbose = FALSE, capath = system.file("CurlSSL", 
        "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))
    load(pathToCredentials)
    registerTwitterOAuth(cred)

    # streamMinLength ensures that we can read at least streamMinLength items
    # from the stream; .recentID stores the latest tweet ID read from the
    # stream, so we will not read the same tweet multiple times
    .recentID <<- min(twListToDF(searchTwitter(searchString = searchTerm, 
                                               n = streamMinLength))$id)

    l <- list(description = "Twitter Data Stream", searchTerm = searchTerm, 
        lang = lang)
    class(l) <- c("DSD_Twitter", "DSD_R", "DSD")
    l
}

get_points.DSD_Twitter <- function(x, n = 1, ...) {
    if (is.null(x$lang)) {
        tweets <- searchTwitter(searchString = x$searchTerm, n = n, sinceID = .recentID)
    } else {
        tweets <- searchTwitter(searchString = x$searchTerm, lang = x$lang, 
            n = n, sinceID = .recentID)
    }

    tweetsDF <- twListToDF(tweets)

    .recentID <<- max(tweetsDF$id)

    tweetsDF
}

The crawler itself is then only a simple endless loop which grabs the latest tweets at a fixed interval. Of more interest is how the tweets are processed in every iteration. After getting the next tweets from the data stream, all words/terms are extracted from the tweet messages, processed and transformed to their base form. These are typical steps in Text Mining and I will explain them a bit later. Then reservoir sampling is used to update the sample. In the last two steps a snapshot of the current reservoir is saved for further offline processing and a simple wordcloud is created showing the frequency distribution over the current sample. In this posting the wordcloud is only a placeholder for some more advanced analytical method working on the sample. The code for all this is as follows:

# --------------------------------------------- 
# Content of file Twitter_Stream_Mining.R 
# ---------------------------------------------
require(stream)
require(tm)
require(wordcloud)

source("DSD_Twitter.R")  # Data Stream Data class
source("TextMining_Functions.R")  # Functions used for text processing

rsize = 1000  # size of the reservoir
wordReservoir = character(rsize)
minToWait = 10
nIter <- 20

# sTerm is the search term and needs to be specified beforehand
dir.create(sTerm)
setwd(sTerm)

# initialize stream
mydatastream <- DSD_Twitter(searchTerm = paste("#", sTerm, sep = ""), 
                            streamMinLength = 50, 
                            lang = "de", 
                            pathToCredentials = "../twitterCredentials.RData")

iTerm = 0
while (TRUE) {
    res <- try(get_points(mydatastream, n = nIter), silent = TRUE)  # get next tweets
    if (!inherits(res, "try-error")) {
        # create vector of terms
        tdMatrix <- getTermDocumentMatrixFromTweets(tweetTextVector = res$text, 
            wordsToRemove = sTerm)
        termVec <- getTermVectorFromTermDocumentMatrix(tdMatrix)
        cat("+ ", length(termVec), " new terms (", format(Sys.time(), 
            "%H:%M:%S"), ")\n")

        # update sample
        for (i in 1:length(termVec)) {
            iTerm <- iTerm + 1
            if (iTerm <= rsize) {
                # still initializing the reservoir
                wordReservoir[iTerm] <- termVec[i]
            } else {
                # keep the new term with probability rsize/iTerm ...
                pI <- runif(1)
                if (pI < (rsize/iTerm)) {
                  # ... and let it replace a randomly chosen reservoir entry
                  itemRemove <- sample(1:rsize, size = 1)
                  wordReservoir[itemRemove] <- termVec[i]
                }
            }
        }

        # a simple wordcloud as placeholder for more advanced methods
        png(paste("WordCloud_", format(Sys.time(), "%Y-%m-%d_%H_%M_%S"), ".png", 
            sep = ""), width = 12, height = 8, units = "in", res = 300)
        tM <- table(wordReservoir[wordReservoir != ""])
        wordcloud(names(tM), tM, random.order = FALSE, colors = 
            brewer.pal(8, "Dark2"), min.freq = 1)
        dev.off()

        ## save snapshot
        write.table(x = wordReservoir, file = paste("TwSrMin_Terms_Reservoir_", 
            format(Sys.time(), "%Y-%m-%d_%H_%M_%S"), ".txt", sep = ""), append = FALSE, 
            row.names = FALSE, col.names = FALSE)
    } else {
        cat("+ There are no new tweets available (", format(Sys.time(), 
            "%H:%M:%S"), ")\n")
    }

    Sys.sleep(60 * minToWait)
}

Some words about the text processing. I tried to keep it very simple and followed the instructions from the tm package and the tutorials given here and here. I put everything into a separate file that can then be sourced by my main script. The first function contains most of the functionality: I transform the whole text to lowercase and remove punctuation and numbers. I also remove typical stop words, the search term and URLs (if I can find them). The final step is stemming, which yields the base form of every word. The result of the preprocessing is a so-called term-document matrix describing the frequency of the words over the tweets. Because we need this in the form of a "stream of words", the second function transforms the matrix into a simple character vector. The code for both functions is as follows:

# --------------------------------------------- 
# Content of file TextMining_Functions.R 
# ---------------------------------------------
getTermDocumentMatrixFromTweets <- function(tweetTextVector, lang = "de", enc = "UTF-8", 
    wordsToRemove = "") {
    require(tm)

    myCorpus <- Corpus(VectorSource(tweetTextVector, encoding = enc), 
	readerControl = list(language = lang))
    myCorpus <- tm_map(myCorpus, tolower)
    # remove punctuation
    myCorpus <- tm_map(myCorpus, removePunctuation)
    # remove numbers
    myCorpus <- tm_map(myCorpus, removeNumbers)
    tmpTerms <- row.names(as.matrix(TermDocumentMatrix(myCorpus, 
	control = list(minWordLength = 1))))
    httpUrlsRemove <- tmpTerms[grep("httptco", tmpTerms)]
    myCorpus <- tm_map(myCorpus, removeWords, c(stopwords("german"), httpUrlsRemove, 
	wordsToRemove))
    dictCorpus <- myCorpus
    myCorpus <- tm_map(myCorpus, stemDocument)
    # stem completion
    myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = dictCorpus)
    myDtm <- TermDocumentMatrix(myCorpus, control = list(minWordLength = 1))
    return(myDtm)
}

getTermVectorFromTermDocumentMatrix <- function(tdm) {
    tM <- rowSums(as.matrix(tdm))
    return(rep(names(tM), times = tM))
}

Then only two lines of code were needed to start the crawler:

sTerm = ... # Twitter Hashtag
source("Twitter_Stream_Mining.R")

Experiment

At the time I performed this experiment, the ongoing discussion about Deutsche Telekom, one of the world's largest telecommunication companies, provided a vibrant topic, so I decided to follow up on it. The cause of the discussion was an announcement by Deutsche Telekom to limit the bandwidth of all of their customers from 2016 onwards once a monthly data threshold is reached (you can read more about this for example here). The discussion was divided into two main threads: what this will mean for the most popular broadband services and whether it violates "network neutrality". The crawler was running for 8 days beginning on the fourth of May. The following three pictures show the status of the discussion after the first, the fourth and the eighth day (from the left).

Wordclouds after day 1, day 4 and day 8

The term clearly dominating the discussion is "drosselkom", a satirical blend of the words "drosseln" (to throttle / reduce speed) and "Telekom". The analyst should remove this term from the plots to gain a better picture of the situation (done for the following three pictures, corresponding to the three above).

Wordclouds after day 1, day 4 and day 8 (dominating term removed)

The User Interface

This also shows the demand for a graphical interface that allows the analyst to explore a specific snapshot in more detail. I will leave this for the future and will instead show how to build a simple tool to explore the occurrence of words over time (across all snapshots). I used the Shiny package to create a small web application. For this I first had to aggregate the information from the different snapshots into a "word-snapshot matrix" containing the frequency of each word in every snapshot. Because analyzing words over a long time span is not typically "real-time analytics", I implemented this preprocessing as a batch job. The following code shows how I created the matrix:

# --------------------------------------------- 
# Content of file Prepare_Batch_View.R 
# ---------------------------------------------
require(plyr)
require(reshape2)
require(ggplot2)

snapshotDirectory = "telekom"  # the same as the search term

resFileList <- list.files(snapshotDirectory)[grep("TwSrMin_Terms_Reservoir", 
    list.files(snapshotDirectory))]

wordMatrix <- matrix(character(0), nrow = 1000, ncol = length(resFileList))
for (i in 1:length(resFileList)) {
    wordReservoir <- read.table(file = paste(snapshotDirectory, "/", resFileList[i], 
        sep = ""), colClasses = c("character"))$V1
    reservoirLength <- length(wordReservoir)
    wordMatrix[1:reservoirLength, i] <- wordReservoir
}
# store the snapshot timestamps as epoch seconds (they are used as column names
# and converted back to POSIXct later in the Shiny app)
colnames(wordMatrix) <- as.numeric(as.POSIXct(substr(resFileList, 25, 43), 
    format = "%Y-%m-%d_%H_%M_%S", tz = "GMT"))
# first create key value pairs
wordKeyValueList <- melt(wordMatrix, na.rm = TRUE)
wordKeyValueList$Var1 <- NULL
colnames(wordKeyValueList) <- c("TIMESTAMP", "WORD")
# create word x snapshot matrix
occurrencePerWordAndTime <- dcast(wordKeyValueList, WORD ~ TIMESTAMP, length, fill = 0)

save(occurrencePerWordAndTime, 
    file = paste(snapshotDirectory, "/Result_Batch_Processing.RData", sep = ""))

The Shiny application consists of three files in a new subdirectory called "Shiny_Timeline". The first one is "global.R", which loads the prepared dataset and precomputes some information needed for creating the dynamic interface. The interface lets the user select a couple of words via a selection box in order to visualize their occurrence in the sample over time. The content of the selection box itself is a named list of terms ordered by their overall occurrence across all snapshots, which can be calculated directly from the "word-snapshot matrix". The corresponding code is part of "global.R":

# --------------------------------------------- 
# Content of file Shiny_Timeline/global.R 
# ---------------------------------------------
snapshotDirectory = "../telekom"

# load dataset
load(paste(snapshotDirectory, "/Result_Batch_Processing.RData", sep = ""))

# first column contains the term itself
indOrderWords <- order(rowSums(occurrencePerWordAndTime[, -1]), decreasing = TRUE)
# compute the input for the selection box
inputList <- as.list(as.character(occurrencePerWordAndTime$WORD)[indOrderWords])
names(inputList) <- paste(as.character(occurrencePerWordAndTime$WORD)[indOrderWords],
	" (", rowSums(occurrencePerWordAndTime[, -1])[indOrderWords], 
	")", sep = "")

The second file of the Shiny application is the interface description "ui.R". It offers three ways to interact: the user can select words, choose between two plots showing the same content but offering a different degree of interaction, and set the start and end date for the visualization. The user interface is dynamic, meaning that the user first has to select a word before being able to select any plot or modify the dates. The output consists of a text box showing all selected labels and two conditional panels showing the plots as requested.

# ---------------------------------------------
# Content of file Shiny_Timeline/ui.R
# ---------------------------------------------
shinyUI(
  pageWithSidebar(

    headerPanel("Term Analyzer"),

    sidebarPanel(
      selectInput(inputId = "word", 
                  label = "Term (number of occurrences over all snaphshots):", 
                  inputList, multiple = TRUE),
      uiOutput("triggerGGplot2"),
      uiOutput("triggerRChart"),
      uiOutput("triggerDateSlider")
    ),

    mainPanel(
      h4("Selected words"),
      textOutput("selectedWords"),
      conditionalPanel(
        condition = "input.ggplot2Plot == true",
        div(),
        h4("Plot showing term occurrence over time (using ggplot2)"),
        plotOutput("timeseries")
      ),
      conditionalPanel(
        condition = "input.rchartPlot == true",
        div(),
        h4("Plot showing term occurrence over time (using rCharts and nvd3)"),
        showOutput("timeseriesRChartsControls", "nvd3")
      )
    )
  )
)

The following screenshot shows both plots activated (click to enlarge).

User Interface

The "server.R" file provides the core functionality, especially the handling of the inputs and the creation of the two plots. The first plot is based on the ggplot2 package, the second one on the rCharts package. While ggplot2 is the de facto standard in R, rCharts offers some really nice interactive features: you can (de)select different time series inside the plot or, for example, hover over a time series to show its value. The only thing I didn't figure out was how to display dates on the x-axis like in the ggplot2 plot. So the last part of the code looks like this:

# --------------------------------------------- 
# Content of file Shiny_Timeline/server.R 
# ---------------------------------------------

# code to run: require(shiny); runApp('Shiny_Timeline')

require(wordcloud)
require(tm)
require(plyr)
require(reshape2)
require(ggplot2)
require(rCharts)


shinyServer(function(input, output) {

    getSelectedWords <- reactive({
        if (length(input$word) < 1) {
            return(NULL)
        } else {
            return(as.character(input$word))
        }
    })

    getDateRange <- reactive({
        dateRange <- as.numeric(input$daterange) * 24 * 60 * 60
        # because we need the complete day
        dateRange[2] <- dateRange[2] + 23 * 60 * 60 + 59 * 60 + 59
        return(dateRange)
    })

    output$selectedWords <- renderText({
        text <- getSelectedWords()
        if (length(text) == 0) {
            return("Nothing selected")
        } else {
            return(text)
        }
    })

    # dynamic ui components
    output$triggerGGplot2 <- renderUI({
        if (length(input$word) > 0) {
            checkboxInput("ggplot2Plot", "Show ggplot2 plot", FALSE)
        }
    })

    output$triggerRChart <- renderUI({
        if (length(input$word) > 0) {
            checkboxInput("rchartPlot", "Show rCharts plot", FALSE)
        }
    })

    output$triggerDateSlider <- renderUI({
        if (length(input$word) > 0) {
            dateSeq <- as.POSIXct(
                as.numeric(as.character(colnames(occurrencePerWordAndTime)[-1])), 
                origin = "1970-01-01", tz = "GMT")
            dateRangeInput("daterange", "Date range:", start = format(min(dateSeq), 
                "%Y-%m-%d"), end = format(max(dateSeq), "%Y-%m-%d"), 
                min = format(min(dateSeq), "%Y-%m-%d"), max = format(max(dateSeq), 
                "%Y-%m-%d"), language = "de", separator = " - ")
        }
    })

    output$timeseries <- renderPlot({
        indSelectedWords <- which(as.character(occurrencePerWordAndTime$WORD) %in% 
            getSelectedWords())
        dateRange <- getDateRange()
        indSelectedSnapShots <- which(
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) >= dateRange[1] & 
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) <= dateRange[2]) + 1
        tmp <- melt(occurrencePerWordAndTime[indSelectedWords, 
                c(1, indSelectedSnapShots)], id.vars = c("WORD"))
        colnames(tmp) <- c("WORD", "TIMESLOT", "VALUE")
        tmp$TIMESLOT <- as.POSIXct(as.numeric(as.character(tmp$TIMESLOT)), 
            origin = "1970-01-01", tz = "GMT")

        p <- ggplot(tmp, aes(TIMESLOT, VALUE, group = WORD, colour = WORD)) + 
            geom_path(alpha = 0.5) + scale_colour_hue() + ylab("occurrence") + 
            xlab("Snapshot (datetime)")
        print(p)
    })

    output$timeseriesRChartsControls <- renderChart({
        indSelectedWords <- which(as.character(occurrencePerWordAndTime$WORD) %in% 
            getSelectedWords())
        dateRange <- getDateRange()
        indSelectedSnapShots <- which(
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) >= dateRange[1] & 
            as.numeric(colnames(occurrencePerWordAndTime)[-1]) <= dateRange[2]) + 1
        tmp <- melt(occurrencePerWordAndTime[indSelectedWords, 
                c(1, indSelectedSnapShots)], id.vars = c("WORD"))
        colnames(tmp) <- c("WORD", "TIMESLOT", "VALUE")
        tmp$TIMESLOT  <- as.numeric(as.character(tmp$TIMESLOT))
        tmp$TIMESLOT  <- tmp$TIMESLOT - (min(tmp$TIMESLOT) - 1)

        p2 <- nPlot(VALUE ~ TIMESLOT, group = "WORD", data = tmp, type = "lineChart", 
            dom = "timeseriesRChartsControls", width = 800)
        p2$xAxis(axisLabel = "time (in seconds from the first snapshot)", width = 40)
        p2$yAxis(axisLabel = "occurrence", width = 40)
        return(p2)
    })
})
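Regarding the open question about date labels on the rCharts x-axis: if I remember correctly, rCharts can pass raw JavaScript to nvd3 axis options via its "#! ... !#" literal syntax, so something along the following lines might work. Treat it as an untested sketch, not part of the original server.R; it also assumes that TIMESLOT is kept as epoch seconds (i.e. the re-basing to "seconds from the first snapshot" above is skipped).

# Hypothetical tweak (not from the original post): hand a d3 date formatter to
# the nvd3 x-axis; assumes tmp$TIMESLOT holds epoch seconds
p2$xAxis(axisLabel = "Snapshot (datetime)",
         tickFormat = "#! function(d) {
             return d3.time.format('%Y-%m-%d %H:%M')(new Date(d * 1000));
         } !#")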

Interpretation

Running this small application gives you the ability to follow the development of the discussion in more detail. It is great to see how different subtopics enrich the discussion over time, as can be seen in the following picture with four terms selected.

User Interface

Some interpretation: Malte Götz is the name of a person who started a large online petition against Deutsche Telekom's plan. On the third of May he handed over all signatures to Deutsche Telekom in a public event. Because the crawler was started one day later and also needed some time to initialize, we can only see the tail end of the discussion on that event, and as the discussion goes on, the relative importance of this subtopic decreases. On the sixth of May the "Verbraucherzentrale Nordrhein-Westfalen", a large consumer protection agency, announced that it had sent a warning letter to Deutsche Telekom demanding it stop the plan to reduce bandwidth. You can see that this caused a lot of discussion on Twitter until the end of the observation period, with a peak around the seventh of May. On the same day the "re:publica" conference, with topics related to the digital society, opened in Berlin. Because the case is very close to the topics of the conference, you can see an emerging relation between the conference and the case of Deutsche Telekom. This is especially interesting because as an analyst you are able to see in "real time" how events "position" themselves relative to the topic under observation (whether they become an important part of the discussion or not). This is also true for the last term I want to mention - "wiwo". It is the abbreviation of a large business magazine called "Wirtschaftswoche". On May the eighth it published an article about the relation between Deutsche Telekom and German politics, in particular the agencies responsible for overseeing the telecommunication market. In my data I could not find any other news article that received comparably high attention in this Twitter discussion.

Conclusion

There are a lot of things I want to mention at the end - so I will keep it short:

  • You can easily get started with Text Mining in R, but as things get more complex the results will not always be as expected. For example, I tried some simple sentiment analysis, but slang, spelling and irony on Twitter made it impossible for me to get a useful outcome.
  • The user interface described in this post and a good search engine make it really easy to get a dynamic picture of a discussion on Twitter. Surely you can improve this by viewing topics instead of single terms; topic modeling could be a good starting point for this (see the sketch after this list).
  • Reservoir sampling helped me greatly in keeping the amount of data to handle small. And it is good to know this method ;-).
  • Shiny is my most important new discovery among all the techniques. You can easily develop prototypical web interfaces to show to colleagues and customers. Until now it felt like a restriction to show only static pictures and tables, but that is over now.
  • Graphics based on JavaScript like rCharts are great and look very nice. But the available documentation is "limited" compared to ggplot2 and there are a lot of different libraries you have to deal with. Consulting the original JavaScript code is necessary for customizing your charts.
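As a pointer for the topic modeling idea mentioned above, here is a minimal sketch (not from the original post), assuming the topicmodels package is installed. The object tdMatrix refers to the term-document matrix returned by getTermDocumentMatrixFromTweets(), and the number of topics is chosen arbitrarily:

require(topicmodels)

dtm <- t(tdMatrix)                         # LDA expects a document-term matrix
dtm <- dtm[rowSums(as.matrix(dtm)) > 0, ]  # drop tweets with no remaining terms
ldaModel <- LDA(dtm, k = 5)                # fit 5 topics (k chosen arbitrarily)
terms(ldaModel, 10)                        # top 10 terms per topic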

Now, that’s all for this posting and I hope you can get something out of it. If you have questions or any comments feel free to leave a message.

April 7, 2013
Venue Recommendation - A Simple Use Case Connecting R and Neo4j

Last month I attended the CeBIT trade fair in Hannover. Besides the so-called "shareconomy" there was another main topic across all exhibition halls - Big Data. This subject is not completely new and I think a lot of you also have experience with some of the tools associated with Big Data. But due to the great number of databases, frameworks and engines in this field, there will always be something new to learn. So two weeks ago I started my own experiments with a graph database called Neo4j. It is one of the NoSQL databases, which are intended to distribute computation across clusters of machines in a fault-tolerant way. What attracted me was reading that it is well suited for highly connected data and offers a declarative language for querying the graph database. Roughly speaking, a graph database consists of nodes and edges connecting nodes, both of which can be enriched with properties. Some introductions which helped me can be found here and here. The graph query language "Cypher" can then be used to query the data by traversing the graph. Cypher itself is a declarative "pattern-matching" language and should be easy to understand for everybody familiar with SQL; there is a well-arranged overview at this address. If you look at my older posts, you will see that most of them are about spatial data or data with at least some spatial dimension. This kind of data often has inherent relationships - for example streets connected in a street network, regions connected through a common border, places visited by people and so on. Thus, for my first experiment with Neo4j, I decided to connect one of the most discussed use cases from Big Data - Recommendation/Recommender Systems - with an attractive dataset about the Location Based Social Network Foursquare that I collected last year.

The main idea behind this simple "Spatial Recommendation Engine" is to utilize publicly available check-in data to recommend to users new types of places they have never visited before. Such a "check-in" consists of a user ID, a place (called venue) and a check-in time plus additional information (venue type, …). The following code shows the structure of the already preprocessed data:

options(width = 90)

# load required libraries
require(data.table)
require(reshape)
require(reshape2)
require(bitops)
require(RCurl)
require(RJSONIO)
require(plyr)

# load Foursquare data
fileName <- "DATA/Foursquare_Checkins_Cologne.csv"
dataset <- read.csv2(fileName, colClasses = c(rep("character", 7), 
    rep("factor", 2), rep("character", 2)), dec = ",", encoding = "UTF-8")

# how the first 10 elements look like
head(dataset)
##                 CHECKIN_ID        CHECKIN_DATE CHECKIN_TEXT   USERID
## 1 4ff0244de4b000351aa08c35 2012-07-01 11:19:57              15601925
## 2 50a66a8ee4b04d0625654fad 2012-11-16 16:32:14               7024136
## 3 50fbe6b6e4b03e4eab759beb 2013-01-20 12:44:38                193265
## 4 50647c22e4b011670f2a173e 2012-09-27 17:17:38              10795451
## 5 500fc5b9e4b0d630c79ab4f8 2012-07-25 11:08:57              13964243
## 6 50d09108e4b013668d5538f3 2012-12-18 15:51:36                126823
##                    VENUEID        VENUE_NAME   GKZ_ID              CATEGORY_ID
## 1 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 2 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 3 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 4 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 5 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 6 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
##        CATEGORY_NAME              LAT              LNG
## 1 Travel & Transport 50,9431986273333 6,95889741182327
## 2 Travel & Transport 50,9431986273333 6,95889741182327
## 3 Travel & Transport 50,9431986273333 6,95889741182327
## 4 Travel & Transport 50,9431986273333 6,95889741182327
## 5 Travel & Transport 50,9431986273333 6,95889741182327
## 6 Travel & Transport 50,9431986273333 6,95889741182327

The data was crawled last year as the basis for an academic paper in the field of Urban Computing (which will be presented in May at the AGILE Conference on Geographic Information Science in Brussels) and contains publicly available check-ins for Germany. It seems to me that this kind of data is ideally suited for doing recommendations in a graph database and avoids the use of well-known toy datasets. The key idea behind our recommendation system is the following: Starting with a person for whom we want to make a recommendation, we calculate the most similar users. A similar user is someone who rated venues in the same way the person of interest did. Because there is no explicit rating in the Foursquare data, we take the number of visits as the rating; the logic behind this is that the person either likes this place or it is important to him. So if both the person and a user give high "ratings" to venues visited by both (thus both are similar), then the person may also be interested in visiting other venues highly rated by the other user that the person has not seen yet. Technically speaking, this approach is collaborative filtering (user similarity is calculated from behavior) with implicit data collection (we have no explicit rating). Our data model therefore is straightforward: We take the venues and the users as nodes and turn all related attributes of both into corresponding node properties. Then we connect a user node and a venue node with a relationship whenever the user has visited this venue; the number of visits is encoded as a property of the relationship. For the recommender system we will use a combination of R and Cypher statements, the latter primarily for loading the data into Neo4j and traversing the graph. To send Cypher statements to Neo4j the REST API is of great value. We can then use the strengths of R to preprocess the data, catch the results and calculate the final recommendation list.

The following is a short overview of all the steps:

  1. Extracting all relevant information (venues, users, ratings) from the check-in data
  2. Loading the data into Neo4j
  3. Calculating similarities for a specific user and making a recommendation on-the-fly
  4. Plotting the results on a map

I assume that Neo4j is installed (it's very simple - look here) and that the graph database is empty. To empty it, delete the "graph.db" directory and then start Neo4j.

So our first step is to extract all venues, users and ratings from the check-in data.

# -------------------------------------- 
# data preprocessing
# --------------------------------------
dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, format = "%Y-%m-%d %H:%M:%S")
dataset$LAT <- sub(",", ".", dataset$LAT)
dataset$LNG <- sub(",", ".", dataset$LNG)
dataset$LAT <- as.numeric(dataset$LAT)
dataset$LNG <- as.numeric(dataset$LNG)
dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H"))
venueDataset <- unique(dataset[, c("VENUEID", "LNG", 
    "LAT", "VENUE_NAME", "CATEGORY_NAME")])

# use data.table for aggregation
datasetDT <- data.table(dataset)
venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))), 
    by = list(VENUEID, USERID)]
venueUserDataset <- data.frame(venueUserDataset)

# now unique(venueUserDataset$USERID) contains all user IDs,
head(unique(venueUserDataset$USERID))
## [1] "15601925" "7024136"  "193265"   "10795451" "13964243" "126823"
# venueDataset contains all venues and
head(venueDataset)
##                     VENUEID   LNG   LAT                       VENUE_NAME
## 1  4aef5d85f964a520dfd721e3 6.959 50.94                Köln Hauptbahnhof
## 24 4bade052f964a520506f3be3 6.949 50.93             Stadtbibliothek Köln
## 25 4baf1998f964a52033eb3be3 6.964 50.93 Deutsches Sport & Olympia Museum
## 26 4baf428cf964a52024f43be3 6.962 50.92                     Ubierschänke
## 27 4ba4f032f964a520dac538e3 6.849 50.92                     OBI Baumarkt
## 28 4bc210d92a89ef3b7925f388 6.927 50.95                    Pfeiler Grill
##           CATEGORY_NAME
## 1    Travel & Transport
## 24 College & University
## 25 Arts & Entertainment
## 26       Nightlife Spot
## 27       Shop & Service
## 28                 Food
# venueUserDataset contains all the relationships (aka ratings)
head(venueUserDataset)
##                    VENUEID   USERID COUNT_CHECKINS
## 1 4aef5d85f964a520dfd721e3 15601925              5
## 2 4aef5d85f964a520dfd721e3  7024136              1
## 3 4aef5d85f964a520dfd721e3   193265              1
## 4 4aef5d85f964a520dfd721e3 10795451              6
## 5 4aef5d85f964a520dfd721e3 13964243              6
## 6 4aef5d85f964a520dfd721e3   126823             11

The next step is to import all that data into Neo4j. We will do this by generating dynamic Cypher statements to create all the nodes and relationships, which will of course take some time. If you have more data, it's probably wiser to use the "Batch Importer", but this needs more development effort and will not be explained here. Neo4j's website describes a lot of possibilities for importing data from various sources into the graph database. All of our Cypher statements will be sent to Neo4j via the "query" method, which I got from here.

# Function for querying Neo4j from within R 
# from http://stackoverflow.com/questions/11188918/use-neo4j-with-r
query <- function(querystring) {
    h = basicTextGatherer()
    curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query", 
        postfields = paste("query", curlEscape(querystring), 
        sep = "="), writefunction = h$update, verbose = FALSE)
    result <- fromJSON(h$value())
    data <- data.frame(t(sapply(result$data, unlist)))
    names(data) <- result$columns
    return(data)
}
# -------------------------------------- 
# import all data into neo4j
# --------------------------------------
nrow(venueDataset)  # number of venues
## [1] 3352
length(unique(venueUserDataset$USERID))  # number of users
## [1] 3306
nrow(venueUserDataset)  # number of relationships
## [1] 11293
# venues (-> nodes)
for (i in 1:nrow(venueDataset)) {
    q <- paste("CREATE venue={name:\"", venueDataset[i, "VENUEID"], "\",txt:\"", 
        venueDataset[i, "VENUE_NAME"], "\",categoryname:\"",
	venueDataset[i, "CATEGORY_NAME"], "\",type:\"venue\",\nlng:", 
	venueDataset[i, "LNG"], ", lat:", venueDataset[i, "LAT"], 
	"} RETURN venue;", sep = "")
    data <- query(q)
}
# users (-> nodes)
for (i in unique(venueUserDataset$USERID)) {
    q <- paste("CREATE user={name:\"", i, "\",type:\"user\"} RETURN user;", sep = "")
    data <- query(q)
}
# number of checkins (-> relationships)
for (i in 1:nrow(venueUserDataset)) {
    q <- paste("START user=node:node_auto_index(name=\"", venueUserDataset[i, "USERID"], 
        "\"), venue=node:node_auto_index(name=\"", venueUserDataset[i, "VENUEID"], 
	"\") CREATE user-[:RATED {stars : ", 
	venueUserDataset[i, "COUNT_CHECKINS"], "}]->venue;", sep = "")
    data <- query(q)
}

So before we start with the recommender itself, I will discuss some of its details. The first part of the plan is to compute the similarities between a person and all other users who also visited at least one of the venues the person did. Based on these similarities we will then determine the recommendations. This means that we need a similarity measure first. In our case we will use the cosine similarity, a similarity measure typically used in text mining for high-dimensional data (which also fits our case). A maximum value of 1 means that both users rated all venues they visited in the same way ("the profiles of both are similar"). If you calculated the similarity in the traditional way, you would first have to build up a feature table of size \(m \times n\) (\(m\) ~ number of users and \(n\) ~ number of venues), where the value at \((i,j)\) represents the rating of user \(i\) for venue \(j\). This feature table would be huge and sparse because most users only visited a few venues. A graph is an efficient way to represent this, because only the ratings that actually exist have to be encoded as explicit relationships.
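For reference, the cosine similarity between the chosen person \(p\) and another user \(u\), computed over the venues \(k\) that both have rated (which is how the Cypher query below aggregates it), is

\[ \mathrm{cossim}(p,u) = \frac{\sum_{k} r_{p,k}\, r_{u,k}}{\sqrt{\sum_{k} r_{p,k}^2}\;\sqrt{\sum_{k} r_{u,k}^2}}, \]

where \(r_{p,k}\) and \(r_{u,k}\) are the check-in counts ("ratings") of person \(p\) and user \(u\) for venue \(k\).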

After we choose a person for whom we want to compute recommendations, we start by calculating all relevant similarities. To get more meaningful recommendations we exclude all venues of the category "Travel & Transport" and only take those users into account who have at least two visited venues in common with the chosen person. For the last part we have to use R because, if I'm right, Neo4j is at the moment unable to carry out "subselects".

# -------------------------------------- 
# simple venue recommendation 
# --------------------------------------
userName <- "7347011"  # chose username/ID
nOfRecommendations <- 20  # number of recommendations

# Determine similar users using the cosine similarity measure
q <- paste("START me=node:node_auto_index(name=\"", userName, "\")
    MATCH (me)-[r1]->(venue)<-[r2]-(simUser)
	WHERE venue.categoryname <> \"Travel & Transport\"
	RETURN id(me) as id1, id(simUser) as id2,
		sqrt(sum(r1.stars*r1.stars)) as mag1, 
		sqrt(sum(r2.stars*r2.stars)) as mag2,
		sum(r1.stars * r2.stars) as dotprod,
		sum(r1.stars * r2.stars)/ 
		(sqrt(sum(r1.stars*r1.stars)) * sqrt(sum(r2.stars*r2.stars))) as cossim,
		count(venue) as anz_venues
	ORDER BY count(venue) DESC;", 
    sep = "")
ans <- query(q)
simUser <- subset(ans, anz_venues >= 2)
head(simUser)
##    id1  id2   mag1   mag2 dotprod cossim anz_venues
## 1 3450 3518 16.371 16.643     196  0.7194         15
## 2 3450 3782  3.000  3.000       8  0.8889          6
## 3 3450 3382  2.828  2.828       7  0.8750          5
## 4 3450 4031  4.690  2.236      10  0.9535          5
## 5 3450 3860  2.236  7.483      12  0.7171          5
## 6 3450 3537  2.236 11.180      15  0.6000          5

The second query then selects all venues (call them recommendation candidates) rated by the similar users but not yet visited by the person. It returns the user ratings and the venue properties like name, type and the geographic coordinates.

# Query all venues from the similar users still not visited by the chosen user
q2 <- paste("START su=node(", paste(simUser$id2, collapse = ","), "), 
    me=node:node_auto_index(name=\"", userName, "\")
	MATCH su-[r]->v
	WHERE NOT(v<-[]-me) AND v.categoryname <> \"Travel & Transport\"
	RETURN id(su) as id_su, r.stars as rating, id(v) as id_venue, 
		v.txt as venue_name, v.lng as lng, v.lat as lat ORDER BY v.txt;", 
    sep = "")
ans2 <- query(q2)
head(ans2)
##   id_su rating id_venue            venue_name              lng              lat
## 1  3480      1     1297 . HEDONISTIC CRUISE .         6.926502        50.949425
## 2  3436      1     1269               30works 6.93428135142764 50.9401634603315
## 3  3480      1     1376             3DFACTORY  6.9209361076355 50.9508572326839
## 4  3381      1     2274    4 Cani della Citta         6.942126        50.938178
## 5  3369      1     1418     4010 Telekom Shop          6.94348         50.93852
## 6  3547      1     1418     4010 Telekom Shop          6.94348         50.93852

The last step is to determine the top X recommendations. For this we compute, for every recommendation candidate, a rating weighted by the similarity between the chosen person and each of the similar users who have already visited it, and pick the top X venues as our recommendations.
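Written out, the predicted rating used for ranking a candidate venue \(v\) (this is what the ddply call below computes) is

\[ \hat{r}_{p,v} = \frac{\sum_{u \in S_v} \mathrm{cossim}(p,u)\, r_{u,v}}{\sum_{u \in S_v} \mathrm{cossim}(p,u)}, \]

where \(S_v\) is the set of similar users who rated candidate venue \(v\) and \(r_{u,v}\) is the check-in count of user \(u\) for that venue.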

# Calculate top X recommendations
recommendationCandidates <- ans2
venueRecommendationCandidates <- merge(ans, recommendationCandidates, 
	by.x = "id2", by.y = "id_su")
venueRecommendationCandidates$rating <- 
	as.numeric(as.character(venueRecommendationCandidates$rating))
venueRecommendation <- ddply(venueRecommendationCandidates, 
	c("id_venue", "venue_name", "lng", "lat"), function(df) {
		sum(df$cossim * as.numeric(df$rating))/sum(df$cossim)
})
venueRecommendation <- 
	venueRecommendation[order(venueRecommendation[, 5], decreasing = TRUE), ]
venueRecommendation$lat <- as.numeric(as.character(venueRecommendation$lat))
venueRecommendation$lng <- as.numeric(as.character(venueRecommendation$lng))

# Our recommendations for the chosen user
venueRecommendation[c(1:nOfRecommendations), ]
##     id_venue                                     venue_name   lng   lat     V1
## 687      168                                     Wohnung 16 6.922 50.94 100.00
## 697      187                       Fork Unstable Media GmbH 6.966 50.93  52.00
## 152       56                  Pixum | Diginet GmbH & Co. KG 7.000 50.87  44.00
## 49       536     Fachhochschule des Mittelstands (FHM) Köln 6.939 50.94  37.00
## 635      154 Seminar für Politische Wissenschaft - Uni Köln 6.924 50.93  30.00
## 752      201      Sinn und Verstand Kommunikationswerkstatt 6.954 50.95  27.00
## 789      677                               Happyjibe's Loft 6.919 50.97  26.00
## 831      425                     PlanB. GmbH Office Cologne 6.963 50.93  26.00
## 484      586                               Praxis Dokter H. 6.989 50.93  25.00
## 666      303                      Bürogemeinschaft Eckladen 6.962 50.93  24.00
## 223      337                                Köln-Lindweiler 6.887 51.00  23.00
## 516      723                                    Health City 6.932 50.94  23.00
## 611      306     Kreuzung Zollstockgürtel / Vorgebirgstraße 6.945 50.90  23.00
## 201      784                            Paul-Humburg-Schule 6.922 50.99  21.00
## 412      122                                     unimatrix0 6.934 50.93  21.00
## 601      989                             Prinzessinnenküche 6.947 50.90  20.00
## 754      957                Reitergemeinschaft Kornspringer 7.076 50.97  19.00
## 371      233                                            MTC 6.939 50.93  17.56
## 721      188                           ESA-Besprechungsecke 6.976 50.96  17.00
## 694      973                                  fischerappelt 6.966 50.93  16.00

We close the coding section with a visualization of the already visited (red) and recommended (blue) venues on a map. This gives a first impression of how the venues and the recommendations are distributed in geographic space.

# -------------------------------------- 
# Plot all of the visited and recommended venues
# -------------------------------------- 
# get coordinates of the venues already visited
qUserVenues <- paste("START me=node:node_auto_index(name=\"", 
	userName, "\") MATCH me-[r]->v
	WHERE v.categoryname <> \"Travel & Transport\"
	RETURN r.stars, v.txt as venuename, v.type as type, v.lng as lng, v.lat as lat 
	ORDER BY r.stars, v.txt;", 
    sep = "")
userVenues <- query(qUserVenues)
userVenues$lng <- as.numeric(as.character(userVenues$lng))
userVenues$lat <- as.numeric(as.character(userVenues$lat))

# plot venues using the ggmap package
require(ggmap)
theme_set(theme_bw(16))
hdf <- get_map(location = c(lon = mean(userVenues$lng), 
	lat = mean(userVenues$lat)), zoom = 11)
ggmap(hdf, extent = "normal") + geom_point(aes(x = lng, y = lat), 
	size = 4, colour = "red", data = userVenues) + 
	geom_point(aes(x = lng, y = lat), size = 4, colour = "blue", 
		data = venueRecommendation[c(1:nOfRecommendations), ])

Recommendations and visited venues

Finally it is time to summarize what was done: We built a simple recommendation engine that suggests new places to users based on their past behavior, and the recommendations can be computed in near real time (there is no long-running batch job). What we left out is a rigorous evaluation of the recommendation performance, but because the recommendation only serves as a demo use case, this was not the topic of this posting. More importantly, I have to say that I'm really impressed by how easy it is to set up a Neo4j graph database and how simple it is to make the first steps with the query language Cypher. The SQL-like style of Cypher makes the determination of the most similar users straightforward. What's also interesting is the simplicity of the connection from R to Neo4j via the REST interface - nothing additional is needed. The outlook is even more promising due to the main advantages of NoSQL databases, like fast operations on huge datasets and easy, fault-tolerant scalability (not shown here). Even though the operations run fast on this moderately sized dataset, a broader test lies beyond this session. But maybe in one of the next postings …

February 24, 2013
The Wisdom of Crowds - Clustering Using Evidence Accumulation Clustering (EAC)

Today's blog post is about a problem known to most people who use clustering algorithms on datasets without given true labels (unsupervised learning). The challenge is the "freedom of choice" among a broad range of different clustering algorithms and the question of how to determine the right parameter values. The difficulty is the following: every clustering algorithm and even every set of parameters will produce a somewhat different solution, which makes it hard to decide which of the results should be kept. Because there is no reference when using clustering in an unsupervised fashion, the analyst has to decide whether the results describe genuine or artificial patterns. I will present a method which tackles this problem and is also very simple to apply. I think this makes it really interesting for a lot of practical problems and time-bounded projects. Like for most data analytics problems, the rule "There is No Free Lunch for the Data Miner" still holds, and hence the limitations of the approach will also be discussed. For illustration I use three datasets: The first two are artificial datasets whose purpose is to demonstrate the benefits and the limitations of the presented method by contrasting the results with those from "classical methods". For this we use the datasets named "Aggregation" (1) and "Spiral" (2); you can download them from http://cs.joensuu.fi/sipu/datasets/ together with some information about the true number of clusters. The third dataset is about clustering the McDonald's menu and has more of a "real world character": the clusters will consist of products featuring the same nutrition profile. You can find a different clustering of the same dataset in this blog post by Edwin Chen, which is also where the dataset originally comes from. At the end it will be interesting to compare our result with the one presented in that blog.

Our first step is to load all three datasets and do some data preprocessing (standardizing the features and omitting all tuples containing NAs). The plots for datasets 1 and 2 show some typical shapes that pose real challenges for most clustering algorithms.

require(ggplot2)
require(cluster)
require(reshape)

# Dataset 1 - Aggregation dataset (http://cs.joensuu.fi/sipu/datasets/)
dataAggregatione <- read.csv("DATA/Aggregation.txt", sep = "\t", 
	header = FALSE)
dataAggregationeScaled <- scale(dataAggregatione[, -3])  # normalize data
dataAggregatione <- data.frame(dataAggregationeScaled, 
	name = as.character(c(1:nrow(dataAggregationeScaled))))
rownames(dataAggregatione) <- dataAggregatione$name
ggplot(dataAggregatione, aes(V1, V2)) + geom_point()

Aggregation dataset

# Dataset 2 - Spiral dataset (http://cs.joensuu.fi/sipu/datasets/)
dataSpiral <- read.csv("DATA/spiral.txt", sep = "\t", header = FALSE)
dataSpiralScaled <- scale(dataSpiral[, -3])  # normalize data
dataSpiral <- data.frame(dataSpiralScaled, 
	name = as.character(c(1:nrow(dataSpiralScaled))))
rownames(dataSpiral) <- dataSpiral$name
ggplot(dataSpiral, aes(V1, V2)) + geom_point()

Spiral dataset

# Dataset 3 - the mcdonalds menu
# (https://github.com/echen/dirichlet-process)
dataMC <- read.csv("DATA/mcdonalds-normalized-data.tsv")
dataMC$total_fat <- as.numeric(dataMC$total_fat)
numericAttr <- c("total_fat", "cholesterol", "sodium", "dietary_fiber", 
	"sugars", "protein", "vitamin_a_dv",
	"vitamin_c_dv", "calcium_dv", 		
	"iron_dv", "calories_from_fat", 
	"saturated_fat", "trans_fat", "carbohydrates")
dataMC <- na.omit(dataMC)  # drop NAs
dataMCScaled <- scale(dataMC[, -ncol(dataMC)])  # normalize data
dataMC <- data.frame(dataMCScaled, name = dataMC$name)

Now let's start with the analytical part by describing a typical challenge in clustering. Most algorithms require some parameter \( k \) which determines the number of clusters (I will also use the word partitions synonymously) produced by the clustering/partitioning algorithm (e.g. k-means, spectral clustering) and hence the final partitioning. But it is inherent in the nature of the problem that k cannot be specified a priori, because in unsupervised clustering the true number of partitions is not known in advance. One approach to cope with this problem is to use relative validation criteria. The logic behind this is the following: the user computes a score for the partitions resulting from different parameterizations, where the score expresses the quality of a partition. At the end, the partition with the highest score is kept as the final result (whether a score should be minimized or maximized depends on its meaning, but for simplicity I assume that every score can be expressed such that a high value is associated with high clustering quality). The package "clusterCrit" contains the most relevant cluster validity indices from a broad range of research literature (e.g. "On the Comparison of Relative Clustering Validity Criteria", Vendramin et al.). In our first experiment I will use the "Dunn index", the "Calinski-Harabasz index" and the popular Silhouette measure together with the k-means algorithm on the two synthetic datasets. To account for the uncertainty about the right value of k, all possible values between 2 and 50 will be evaluated:

# ------------------------------- 
# Application of k-means
# -------------------------------
require(clusterCrit)

# Aggregation dataset
set.seed(1234)
# matrix to hold the score value
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(), 
	c("Dunn", "Calinski-Harabasz", "Silhouette")))  
for (k in 2:50) {
    cl <- kmeans(dataAggregatione[, c(1, 2)], k)
    vals[(k - 1), 1] <- as.numeric(intCriteria(
		as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster, 
		"Dunn"))
    vals[(k - 1), 2] <- as.numeric(intCriteria(
		as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster, 
		"Calinski_Harabasz"))
    vals[(k - 1), 3] <- as.numeric(intCriteria(
		as.matrix(dataAggregatione[, c(1, 2)]), cl$cluster, 
		"Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"], 
	vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"], 
	vals[bestCriterion(vals[, 4], "Silhouette"), "K"]), 
	ncol = 3, 
	dimnames = list(c("Aggregation"), 
		c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k
##             Dunn Calinski_Harabasz Silhouette
## Aggregation   46                45          4
# Spiral dataset
set.seed(1234)
vals <- matrix(rep(NA, 49 * 3), ncol = 3, dimnames = list(c(), 
	c("Dunn", "Calinski-Harabasz", "Silhouette")))
for (k in 2:50) {
    cl <- kmeans(dataSpiral[, c(1, 2)], k)
    vals[(k - 1), 1] <- as.numeric(intCriteria(
		as.matrix(dataSpiral[, c(1, 2)]), cl$cluster, 
		"Dunn"))
    vals[(k - 1), 2] <- as.numeric(intCriteria(
		as.matrix(dataSpiral[, c(1, 2)]), cl$cluster, 
		"Calinski_Harabasz"))
    vals[(k - 1), 3] <- as.numeric(intCriteria(
		as.matrix(dataSpiral[, c(1, 2)]), cl$cluster, 
		"Silhouette"))
}
vals <- data.frame(K = c(2:50), vals)
choosen_k <- matrix(c(vals[bestCriterion(vals[, 2], "Dunn"), "K"], 
	vals[bestCriterion(vals[, 3], "Calinski_Harabasz"), "K"], 
	vals[bestCriterion(vals[, 4], "Silhouette"), "K"]), 
	ncol = 3, 
	dimnames = list(c("Spiral"), 
		c("Dunn", "Calinski_Harabasz", "Silhouette")))
choosen_k
##        Dunn Calinski_Harabasz Silhouette
## Spiral   34                50         37

The first surprise is how much the selected values of k differ between the validity scores and between both datasets. Even though the special structure of the data is challenging for the k-means algorithm, the results indicate that the choice of the right validation criterion is non-trivial and has a significant impact on the results. If we plot the resulting clustering for the best value obtained with the Silhouette score, the figures show that the approach failed for both datasets, even in capturing the coarse structure:

# ------------------------------- 
# Plot results
# ------------------------------- 

# Aggregation dataset
kmeansResultsAggreation <- kmeans(x = dataAggregatione[, c(1, 2)], 
	centers = 3)$cluster
dataAggregatione$clusterSimpleKmeans <- as.character(kmeansResultsAggreation)
ggplot(dataAggregatione, aes(V1, V2)) + 
	geom_point(aes(colour = clusterSimpleKmeans)) +
	opts(legend.position = "none")

clustered Aggregation dataset (Silhouette)

# Spiral dataset
kmeansResultsSpiral <- kmeans(x = dataSpiral[, c(1, 2)], 
	centers = 37)$cluster
dataSpiral$clusterSimpleKmeans <- as.character(kmeansResultsSpiral)
ggplot(dataSpiral, aes(V1, V2)) + 
	geom_point(aes(colour = clusterSimpleKmeans)) + 
    opts(legend.position = "none")

clustered Spiral dataset (Silhouette)

Now let me introduce a method which overcomes the problem of choosing the right value of k and which gives better results even when a simple clustering algorithm like k-means is used. Even though this blog post focuses on the choice of the parameter k, the method is a general approach to "directly estimate" the final clustering without relying too much on a single set of more or less arbitrarily estimated parameters. It is called "Evidence Accumulation Clustering" (EAC) and you can find some deeper information here and here. The notion behind this method is to build partitions with different algorithms and parameterizations and to aggregate all solutions into one final partition, using every single partition as a vote on whether instances should be placed together. Hence, if two instances are placed together in most solutions, it is reasonable to assign them to the same cluster in the final partition. In this context the method can also be understood as a tool to enhance the validity of the resulting partition by reducing the impact of a single non-optimal clustering. The algorithmic part is simple. The first part is about creating different partitions and aggregating the "votes". For this we perform two steps in each iteration: first, we cluster all data points with k-means, using a value for \( k \) randomly sampled from a given interval. Second, we note whether two instances (\( i \) and \( j \)) are placed together inside one cluster; in this case we increment the corresponding entries (\( A(i,j) \) and \( A(j,i) \)) of our so-called association matrix \( A \) by 1. At the end of the first part we divide all entries by the number of iterations \( R \). In the second part the "votes" are aggregated through hierarchical clustering and we obtain our final partition by selecting an appropriate cutting point in the resulting dendrogram. For this we transform the association matrix \( A \) from the first part into a distance matrix and feed it into a single-linkage clustering. On the resulting dendrogram we calculate the so-called "maximal cluster lifetime", the longest "gap" between two successive merges, and use it as our cutting point to obtain the final partition. The formulas below summarize the aggregation step; the code that follows is a technically simple draft of the algorithm (e.g. no parallelization):
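In symbols, with \(n_{ij}\) denoting the number of the \(R\) partitions in which instances \(i\) and \(j\) end up in the same cluster,

\[ A(i,j) = \frac{n_{ij}}{R}, \qquad D(i,j) = 1 - A(i,j), \]

where \(D\) is the distance matrix fed into the single-linkage clustering (the code below accumulates the \(1/R\) increments directly instead of dividing at the end).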

# ------------------------------- 
# Evidence Accumulation Clustering
# -------------------------------
createCoAssocMatrix <- function(Iter, rangeK, dataSet) {
    nV <- dim(dataSet)[1]
    CoAssoc <- matrix(rep(0, nV * nV), nrow = nV)

    for (j in 1:Iter) {
        jK <- sample(c(rangeK[1]:rangeK[2]), 1, replace = FALSE)
        jSpecCl <- kmeans(x = dataSet, centers = jK)$cluster
        CoAssoc_j <- matrix(rep(0, nV * nV), nrow = nV)
        for (i in unique(jSpecCl)) {
            indVenues <- which(jSpecCl == i)
            CoAssoc_j[indVenues, indVenues] 
                        <- CoAssoc_j[indVenues, indVenues] + (1/Iter)
        }
        CoAssoc <- CoAssoc + CoAssoc_j
    }
    return(CoAssoc)
}

eac <- function(Iter, rangeK, dataset, hcMethod = "single") {
    CoAssocSim <- createCoAssocMatrix(Iter, rangeK, dataset)

    # transform from similarity into distance matrix
    CoAssocDist <- 1 - CoAssocSim  
    hclustM <- hclust(as.dist(CoAssocDist), method = hcMethod)
    # determine the cut
    cutValue <- hclustM$height[which.max(diff(hclustM$height))]  
    return(cutree(hclustM, h = cutValue))
}

Now let's try this method on both artificial datasets, using exactly the same range of possible values for k as in the first case above:

# ------------------------------- 
# Application of EAC
# -------------------------------

# Aggregation dataset
set.seed(1234)
EACResults_Aggregatione <- eac(Iter = 200, rangeK = c(2, 50), 
	dataset = dataAggregatione[, c(1, 2)], hcMethod = "single")
table(EACResults_Aggregatione)
## EACResults_Aggregatione
##   1   2   3   4   5 
## 170 307 232  45  34
dataAggregatione$clusterEAC <- as.character(EACResults_Aggregatione)
ggplot(dataAggregatione, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) +
	opts(legend.position = "none")

clustered Aggregation dataset (EAC)

# Spiral dataset
set.seed(1234)
EACResults_Spiral <- eac(Iter = 200, rangeK = c(2, 50), 
	dataset = dataSpiral[, c(1, 2)], hcMethod = "single")
table(EACResults_Spiral)
## EACResults_Spiral
##   1   2   3 
## 106 101 105
dataSpiral$clusterEAC <- as.character(EACResults_Spiral)
ggplot(dataSpiral, aes(V1, V2)) + geom_point(aes(colour = clusterEAC)) + 
	opts(legend.position = "none")

clustered Spiral dataset (EAC)

The new results clearly show real progress, even though the algorithm slightly underestimates the true number of clusters for the Aggregation dataset (five instead of seven). But as the discussion in the mentioned EAC paper shows, the algorithm has some problems with not so well separated clusters - nobody is perfect. The results for the Spiral dataset look flawless: all three clusters are identified even though we use a simple k-means algorithm with moderate overhead for building the EAC procedure. Let's close this post with an application of the described method to the mentioned McDonald's dataset. The preprocessing of the data and the plotting are taken from the code published by Edwin Chen.

# ------------------------------- 
# Clustering the McDonald's menu
# -------------------------------
set.seed(1234)
EACResults_MC <- eac(Iter = 1000, rangeK = c(2, 50), 
	dataset = dataMC[, numericAttr], 
    hcMethod = "single")
table(EACResults_MC)
## EACResults_MC
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14 
##  19 126  39   2   9  15  16   1  21  20   4  27  18   8
dataMC$cluster <- as.character(EACResults_MC)

# the following code snippet is taken from
# "http://blog.echen.me/2012/03/20/infinite-mixture-models-with-nonparametric
# -bayes-and-the-dirichlet-process/"

# ignore duplicate food items
x = ddply(dataMC, .(name), function(df) head(df, 1))
# for each cluster, take at most 5 items 
# (to avoid the plot being dominated by large clusters)
x = ddply(x, .(cluster), function(df) head(df, 5))
# Reorder names by cluster 
# (so we can get a plot where all points in a cluster are together)
x$name = factor(x$name, levels = x$name[order(x$cluster)], ordered = T)
# Turn this into a tall-thin matrix
m = melt(x, id = c("name", "cluster"))

m$name <- paste("(", m$cluster, ") ", m$name, sep = "")
qplot(variable, weight = value, data = m, color = cluster, fill = cluster, 
	geom = "bar", width = 1, 
	xlab = "nutritional variable", ylab = "z-scaled value", 
	main = "McDonald's food clusters") + 
    facet_wrap(~name, ncol = 5) + coord_flip() + scale_colour_hue("cluster") + 
    opts(axis.text.y = theme_text(size = 8), 
            axis.text.x = theme_text(size = 8)) + 
    scale_fill_hue("cluster")

McDonald's food clusters (EAC), at most five items per cluster

Here EAC gives us 14 partitions, 3 more than Edwin Chen found. In this example it is more difficult to verify the solution, so our validation is more of a qualitative one. The final plot shows at most five instances per partition together with their nutrition profiles. A first look at the results indicates that most of the clusters exhibit good separation (except cluster 2, which includes some more heterogeneous items). I will not discuss every cluster in detail, but here are some specific remarks. Clusters 3, 10, 12 and 14 show somewhat similar profiles, nevertheless:

  • cluster 12 (non-fat variants of coffee with syrup) is rich in carbohydrates and sugar
  • cluster 10 (non-fat variants of plain coffee) has more protein and calcium
  • cluster 3 is similar to cluster 12, but with less protein, and it contains more items with “extreme” profiles like the apple dippers
  • cluster 14 consists of fruit smoothies with a lot of dietary fiber and protein

Maybe it would be more plausible to shift the apple dippers into cluster 8, which only contains the apple slices.

So what can finally be said about the presented method? Although it has some limitations, it is an interesting and simple approach which gives much better results than single runs of an algorithm like k-means. The algorithm shown here could also be extended. One option would be to substitute the base method with a more sophisticated one (see the sketch below). Another idea is to use a form of weighted aggregation of the partitions - mentioned here.
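
As a small illustration of the first idea, here is a sketch (not used for the results above) which makes the base clusterer a parameter of the EAC procedure: any function returning a vector of integer cluster labels can be plugged in, for example pam from the cluster package. The function name eacGeneric and its default arguments are made up for this illustration.

# ------------------------------- 
# Sketch: EAC with a pluggable base clusterer
# -------------------------------
require(cluster)  # provides pam as an alternative base method

eacGeneric <- function(Iter, rangeK, dataset, hcMethod = "single", 
    baseClusterer = function(data, k) kmeans(data, centers = k)$cluster) {
    nV <- nrow(dataset)
    CoAssoc <- matrix(0, nrow = nV, ncol = nV)
    for (j in 1:Iter) {
        # sample k and cluster with the chosen base method
        jK <- sample(rangeK[1]:rangeK[2], 1)
        partition <- baseClusterer(dataset, jK)
        # accumulate the "votes" in the association matrix
        for (i in unique(partition)) {
            ind <- which(partition == i)
            CoAssoc[ind, ind] <- CoAssoc[ind, ind] + (1/Iter)
        }
    }
    # aggregate via hierarchical clustering and cut at the maximal cluster lifetime
    hclustM <- hclust(as.dist(1 - CoAssoc), method = hcMethod)
    cutValue <- hclustM$height[which.max(diff(hclustM$height))]
    return(cutree(hclustM, h = cutValue))
}

# e.g. with pam (k-medoids) as base method on the Spiral data:
# eacGeneric(Iter = 200, rangeK = c(2, 50), dataset = dataSpiral[, c(1, 2)],
#     baseClusterer = function(data, k) pam(data, k = k, cluster.only = TRUE))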

Like always, any helpful comment or idea is welcome.

December 13, 2012
Predictive Modeling using R and the OpenScoring-Engine – a PMML approach

On November 27th, a post caught my interest: Scott Mutchler presented a small framework for predictive analytics based on PMML (Predictive Model Markup Language) and a Java-based REST interface. PMML is an XML-based standard for the description and exchange of analytical models. The idea is that every piece of software which supports the corresponding PMML version (currently 4.1) can use such a model description to predict outcomes, regardless of how the model was created. Using PMML offers some big advantages, especially when you model in R and the final task is to bring the model into production afterwards. Often the IT staff is not willing to support yet another piece of software (like the open source R) beside the legacy systems, particularly when there are no service agreements, 24/7 support or guaranteed ‘High Availability’. R also lacks an “enterprise workflow” to send your model right into production to be used as a “prediction engine”. This is one of the key selling points of most commercial vendors in the field of analytical software (e.g. Revolution Enterprise, TIBCO Spotfire, …). Scott Mutchler’s OpenScoring is a first draft of such an engine and definitely worth keeping an eye on. While he mainly created the server side of the engine, my intention was to sketch the client side, so that you can call the servlet engine from within R. For this, two main functions have to be provided:

  1. Exporting a predictive model into XML
  2. Predicting a target variable, given a dataset and a model (specified in PMML)

A scenario of how all this could be used is depicted in the following overview picture:

Use Case Overview

So the first step was to get the engine running. I experienced some small problems, but Scott fixed them very fast and offered a new war file ready for deployment inside Tomcat. It can be obtained from http://code.google.com/p/openscoring/. If you still encounter problems you could also try my war file, which was built with Java version “1.6.0_26” and tested under Tomcat 7. After installing the war file via the management console, you can test it by sending a POST request in XML like the following to ‘http://localhost:8080/OpenScoring/Scoring’:

<?xml version="1.0" encoding="UTF-8"?>
<scoring_request>
  <pmml_url>http://www.dmg.org/pmml_examples/knime_pmml_examples/IrisNeuralNet.xml
  </pmml_url>
  <model_name></model_name>
  <csv_input_rows><![CDATA[sepal_length,sepal_width,petal_length,petal_width|
      5.1,3.5,1.4,0.2|7,3.2,4.7,1.4|6.3,2.9,5.6,1.8|]]>
  </csv_input_rows>
</scoring_request>

A nice tool for first experiments is the poster plugin for Firefox.
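
If you prefer to stay inside R for such a quick test, a minimal sketch using RCurl could look like the following (assuming the servlet runs locally under the URL above and the request XML from above is stored in a file named scoring_request.xml - both are just assumptions for this illustration):

require(RCurl)

# read the prepared scoring request (the file name is only an example)
requestXML <- paste(readLines("scoring_request.xml"), collapse = "\n")

# send it as a POST request and print the raw response
reader <- dynCurlReader()
curlPerform(url = "http://localhost:8080/OpenScoring/Scoring", 
    postfields = requestXML, post = 1L, writefunction = reader$update)
cat(reader$value())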

Now let’s turn to developing the client side. The implementation of the first functionality is pretty straightforward – build your model in the classical way and export it using the pmml package. You can find the supported model types in the package documentation or here. The following code shows a small example using decision trees and artificial data consisting only of numerical attributes (including the target):

require(ggplot2)  # for visualization
require(pmml)  # for storing the model in xml
require(rpart)  # use decision trees

# ------------------------------------------------------- 
# Example 1 - categorical target with numeric inputs
# -------------------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- rnorm(100, mean = 0, sd = 0.5)
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- rnorm(100, mean = 1, sd = 0.5)
rY <- rnorm(100, mean = 0, sd = 0.5)
dataset <- data.frame(X = c(gX, rX), Y = c(gY, rY), 
    Col = c(rep("red", 100), rep("green", 100)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp + scale_colour_manual(name = "Col", values = c("green", "red"), 
        labels = c("GREEN","RED"), breaks = c("green", "red"))

artificial dataset for example 1 (numerical inputs, categorical target)


# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

## export as pmml
# place to store the xml file
localFilenameAndPath = "/tmp/treeModel_numericalInput_categTarget.xml"  
# save model using pmml
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
     dataset = dataset), file = localFilenameAndPath) 

Most effort goes into the predict function. First, we transform the input data frame into strings, where every string represents one row of the data frame (plus the header as one additional string) and the fields are comma delimited; all strings are then combined into one string, separated by the pipe (‘|’) symbol (see the small illustration below). The second step is creating the XML structure, consisting of at least the main node (<scoring_request>) and two subordinated nodes (<pmml_url> and <csv_input_rows>). The node <pmml_url> contains the URL with schema ‘http://’, ‘file://’ or ‘ftp://’ and points to the location of your model. The node <csv_input_rows> contains the data formatted as described above, wrapped inside a CDATA node. Last but not least, the whole XML document needs to be transformed into a string. Third, we send the POST request to the server using curlPerform from the RCurl package. The last part is to extract the results from the server response. The format is the same as in the request, but it contains the predictions in one additional column, so a combination of strsplit, gsub and sapply gets the job done. Because the prediction is always returned as a character, we need to convert it properly; therefore the user has to specify a suitable transformation function as input to the prediction function.
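
Just to illustrate the expected format of the csv_input_rows payload, here is a tiny, self-contained example (the data frame is made up; the real function below builds the string in a slightly more general way):

# toy illustration of the csv_input_rows format
toyDF <- data.frame(sepal_length = c(5.1, 7), sepal_width = c(3.5, 3.2))
toyHeader <- paste(colnames(toyDF), collapse = ",")
toyRows <- apply(toyDF, 1, paste, collapse = ",")
paste(toyHeader, "|", paste(toyRows, collapse = "|"), "|", sep = "")
## [1] "sepal_length,sepal_width|5.1,3.5|7,3.2|"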

predictPMMLModel <- function(dataset,   # dataset for prediction 
        transformTargetAttribute,       # type of target attribute
        modelURL,                       # place where the scoring 
                                        # engine could find the pmml model
        applServerURL                   # servlet url
){
    require(XML)
    require(RCurl)

    header <- paste(colnames(dataset), collapse=",") # extract header
    # transformation to characters is necessary to avoid some “bad surprise” 
    # from R's handling of factor attributes
    datasetTransf <- data.frame(lapply(dataset, as.character), 
            stringsAsFactors=FALSE)
    dataString <- paste(header,"|", paste(do.call("rbind",
                            by(datasetTransf, 1:nrow(datasetTransf), function(row){
                                        paste(row, collapse=",")
                                    }, simplify = FALSE)), collapse ="|"), 
                            "|", sep = "")

    # create xml document
    xmlHeader <- xmlPINode(sys = 'xml', value = 'version="1.0" encoding="UTF-8"')
    xmlRequest <- xmlNode("scoring_request", 
            xmlNode("pmml_url", modelURL), 
            xmlNode("model_name"),
            xmlNode("csv_input_rows",xmlCDataNode(dataString)))

    # xml request as string
    fullXMLRequest <- paste(toString(xmlHeader),"\n", 
        gsub(" ", "", toString(xmlRequest, sep=""), fixed = TRUE))

    # http post request
    r = dynCurlReader()
    curlPerform(postfields = fullXMLRequest, url = applServerURL, 
            verbose = TRUE, post = 1L, writefunction = r$update)
    r$value()

    # parse results - !!caution: currently no error checking!!
    tmp <- xmlTreeParse(r$value())
    predictionString <- xmlValue(tmp[[1]][[1]][[4]])
    # extract predictions line by line
    predictionLines <- 
        strsplit(predictionString, split = "|", fixed = TRUE)[[1]][-1]
    predictions <- transformTargetAttribute(sapply(predictionLines, function(s){
                        gsub('\"','',
        tail(strsplit(s, ',', fixed = TRUE)[[1]], n=1))
                    }))
    names(predictions) <- NULL
    return(predictions)
}

Calling the final prediction function is then straightforward. The following examples cover different combinations of categorical/numerical input and target attributes; simple training-error summaries are given as well:

# prediction
prediction1 <- predictPMMLModel(dataset = dataset, 
    transformTargetAttribute = factor, 
    modelURL = paste("file://", localFilenameAndPath, sep = ""), 
    applServerURL = "http://localhost:8080/OpenScoring/Scoring")
table(dataset$Col, prediction1)  # tabulate results

# ------------------------------------------------------- 
# Example 2 - categorical target with mixed inputs
# -------------------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.7, 0.2, 0.1)))
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.1, 0.2, 0.7)))
rY <- rnorm(100, mean = 0, sd = 0.5)

# http://stackoverflow.com/questions/8229904/r-combining-two-factors
dataset <- data.frame(X = unlist(list(gX, rX)), Y = c(gY, rY), 
   Col = c(rep("red", 100), rep("green", 100)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp + scale_colour_manual(name = "Col", values = c("green", "red"), 
   labels = c("GREEN", "RED"), breaks = c("green", "red"))

artificial dataset for example 2 (mixed inputs, categorical target)


# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export as pmml
localFilenameAndPath = "/tmp/treeModel_mixedInput_categTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
    dataset = dataset), file = localFilenameAndPath)
# prediction
prediction2 <- predictPMMLModel(dataset = dataset, 
    transformTargetAttribute = factor, 
    modelURL = paste("file://", localFilenameAndPath, sep = ""), 
    applServerURL = "http://localhost:8080/OpenScoring/Scoring")

table(dataset$Col, prediction2)  # tabulate results

# ----------------------------------------------- 
# Example 3 - numerical target with mixed input 
# -----------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.7, 0.2, 0.1)))
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.1, 0.2, 0.7)))
rY <- rnorm(100, mean = 0, sd = 0.5)

dataset <- data.frame(X = unlist(list(gX, rX)), Y = c(gY, rY), 
   Col = c(rnorm(100, mean = -5, sd = 1), rnorm(100, mean = 5, sd = 1)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp

artificial dataset for example 3 (mixed inputs, numerical target)


# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export model as pmml
localFilenameAndPath = "/tmp/treeModel_mixedInput_numTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
   dataset = dataset), file = localFilenameAndPath)
# prediction
prediction3 <- predictPMMLModel(dataset = dataset, 
   transformTargetAttribute = as.numeric, 
   modelURL = paste("file://", localFilenameAndPath, sep = ""), 
   applServerURL = "http://localhost:8080/OpenScoring/Scoring")

modelResults <- data.frame(Y = dataset$Col, Y_hat = prediction3)
cor(modelResults$Y, modelResults$Y_hat)^2  # computing r squared

# ----------------------------------------------- 
# Example 4 - numerical target with numerical input
# -----------------------------------------------

# create first artificial dataset
set.seed(1234)
gX <- rnorm(100, mean = 0, sd = 0.5)
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- rnorm(100, mean = 1, sd = 0.5)
rY <- rnorm(100, mean = 0, sd = 0.5)
dataset <- data.frame(X = c(gX, rX), Y = c(gY, rY), 
   Col = c(rnorm(100, mean = -5, sd = 1), rnorm(100, mean = 5, sd = 1)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp

artificial dataset for example 4 (numerical inputs, numerical target)


# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export model as pmml
localFilenameAndPath = "/tmp/treeModel_numericalInput_numTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
   dataset = dataset), file = localFilenameAndPath)
# prediction
prediction4 <- predictPMMLModel(dataset = dataset, 
   transformTargetAttribute = as.numeric, 
   modelURL = paste("file://", localFilenameAndPath, sep = ""), 
   applServerURL = "http://localhost:8080/OpenScoring/Scoring")

modelResults <- data.frame(Y = dataset$Col, Y_hat = prediction4)
cor(modelResults$Y, modelResults$Y_hat)^2  # computing r squared

Finally, here are some thoughts about what has been built. We should begin with a contra point: PMML (at the moment) restricts you to a set of standard modeling techniques (depending on the PMML version and the implementations in the software used for modeling and prediction). I know some projects where this would be a serious concern. But on the other hand, I think a lot of analytic questions can be answered with the offered portfolio of approaches – looking at my own projects, it may be 50 to 70 percent. On the pro side there are a lot more things to mention. First, even if restricted to standard models, you can use the full power of R to build your model, experiment with it, optimize and tweak it, and visualize the results. Second, developer and user no longer need to use the same kind of software – the server and client can be written in any language that understands XML and PMML (only the server, if the client is not used for modeling) and knows how to communicate with web services. Third, as mentioned at the beginning of the post, you can now rely on “enterprise-proven” software (Tomcat) to run your model. This really is one of the points that matters in practice (“… thanks for your nice prototype, but now we have to bring it to java/c#/… ”). Although there are some limitations, I think the positive aspects (e.g. documentation, versioning of models) outweigh them, and I am really interested in any complementary and helpful comments.

November 19, 2012
Matching clustering solutions using the ‘Hungarian method’

Some time ago I stumbled upon a problem connected with the labels of a clustering. The partition an instance belongs to is usually labeled by an integer ranging from 1 to k, where k is the number of clusters. The task at that time was to plot a map of the results from clustering spatial polygons, where every cluster is represented by some color. Like in most projects, the analysis was performed multiple times and we used the plots to monitor the changes between iterations. But after rerunning the clustering algorithm (k-means in this case) the assignment between clusters and labels changed completely, even when using the same parameters. This is because there is no unique connection between a partition (a group of elements) and a specific label (e.g. “1”). So even when two solutions match perfectly, the assigned labels may change completely, and because the coloring relates to the labels, the graphical representations of two clusterings that differ only slightly can look completely different. The following R code shows a simple example of this:

require(spdep)
require(rgdal)
require(maptools)
require(ggplot2)
require(plyr)
library(grid)

# panel function for the plots
vplayout <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y) 

gpclibPermit()  # required for fortify method 
#(see https://github.com/hadley/ggplot2/wiki/plotting-polygon-shapefiles)
bh <- readShapePoly(system.file("etc/shapes/bhicv.shp", package = "spdep")[1])

# prepare input data
bh@data$id = rownames(bh@data)
bh.points = fortify(bh, region = "id")
bh.df = join(bh.points, bh@data, by = "id")

# clustering after standardization of data
dpad <- data.frame(scale(bh@data[, 5:8]))
set.seed(1234)
res1 <- kmeans(dpad, centers = 10)
set.seed(9999)
res2 <- kmeans(dpad, centers = 10)

# add cluster id to polygon layer
bh.df.cl = merge(bh.df, data.frame(id = bh@data$id, CL1 = res1$cluster, 
CL2 = res2$cluster), by = "id")

# plot
p1 <- ggplot(bh.df.cl) + aes(long, lat, group = group, fill = as.factor(CL1)) + 
    geom_polygon() + geom_path(color = "white") + coord_equal() + 
    scale_fill_brewer(palette = "Set3") + 
    theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"), legend.position = "none")
p2 <- ggplot(bh.df.cl) + aes(long, lat, group = group, fill = as.factor(CL2)) + 
    geom_polygon() + geom_path(color = "white") + coord_equal() + 
    scale_fill_brewer(palette = "Set3") + 
    theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"), legend.position = "none")
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = vplayout(1, 1))
print(p2, vp = vplayout(1, 2))

maps of the two clustering solutions before label matching

After searching the internet for a possible approach, my first results pointed in the direction of methods for cluster validation (subsequently I found out that this problem also shows up in consensus clustering). In the research paper “Stability-based validation of clustering solutions” by Lange et al., the authors describe a sampling-based approach for evaluating the stability of clustering solutions. For this they had to compare partitions from different runs over the data – exactly the question I described above. There, a method by Kuhn called the ‘Hungarian method’ for minimum weighted bipartite matching is mentioned, which solves the assignment of two different clustering solutions onto each other. As a result we can rearrange the labels of one clustering to match the other.

But what is the idea behind formulating this correspondence problem as an optimization exercise? You can relate this type of question to weighted bipartite graphs and subsets of them. In a bipartite graph, a matching is a subset of the edges such that no two edges in the subset share a common vertex. It is called a minimum weighted bipartite matching when the graph is weighted and the sum of the edge weights in the subset is minimal. For our problem this can be represented as a distance matrix whose dimension equals the number of clusters, where each entry measures the disagreement between two partitions (one constraint of this approach is that both clusterings must have the same number of partitions). So one clustering is represented by the columns and the other one by the rows, or vice versa. The disagreement can be calculated as follows: take the number of elements in the intersection of the two partitions and subtract it twice from the sum of the number of elements in both clusters. The idea behind this computation is that if all elements are in the intersection, the value is zero and it is very likely that these two partitions should be mapped onto each other; the higher the value, the more different the two partitions are. One approach for calculating this distance matrix in R looks like the following (here we use the function solve_LSAP from the package clue; some additional explanations can be found in the associated paper “A CLUE for CLUster Ensembles”):
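
To make the computation explicit (with \( A_i \) denoting the set of instances in cluster \( i \) of the first clustering and \( B_j \) the set of instances in cluster \( j \) of the second one), each entry of this distance matrix is \( d(i,j) = (|A_i| - |A_i \cap B_j|) + (|B_j| - |A_i \cap B_j|) = |A_i| + |B_j| - 2\,|A_i \cap B_j| \), which is exactly what the inner sapply in the code below computes.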

# labels from cluster A will be matched on the labels from cluster B
minWeightBipartiteMatching <- function(clusteringA, clusteringB) {
    require(clue)
    idsA <- unique(clusteringA)  # distinct cluster ids in a
    idsB <- unique(clusteringB)  # distinct cluster ids in b
    nA <- length(clusteringA)  # number of instances in a
    nB <- length(clusteringB)  # number of instances in b
    if (length(idsA) != length(idsB) || nA != nB) {
        stop("number of cluster or number of instances do not match")
    }

    nC <- length(idsA)
    tupel <- c(1:nA)

    # computing the distance matrix
    assignmentMatrix <- matrix(rep(-1, nC * nC), nrow = nC)
    for (i in 1:nC) {
        tupelClusterI <- tupel[clusteringA == i]
        solRowI <- sapply(1:nC, function(i, clusterIDsB, tupelA_I) {
            nA_I <- length(tupelA_I)  # number of elements in cluster I
            tupelB_I <- tupel[clusterIDsB == i]
            nB_I <- length(tupelB_I)
            nTupelIntersect <- length(intersect(tupelA_I, tupelB_I))
            return((nA_I - nTupelIntersect) + (nB_I - nTupelIntersect))
        }, clusteringB, tupelClusterI)
        assignmentMatrix[i, ] <- solRowI
    }

    # optimization
    result <- solve_LSAP(assignmentMatrix, maximum = FALSE)
    attr(result, "assignmentMatrix") <- assignmentMatrix
    return(result)
}

A simple example will illustrate the matching:

minWeightBipartiteMatching(
   c(1, 1, 2, 3, 3, 4, 4, 4, 2), 
   c(2, 2, 3, 1, 1, 4, 4, 4, 3)
)
## Optimal assignment:
## 1 => 2, 2 => 3, 3 => 1, 4 => 4
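
For this toy example the underlying distance matrix can also be inspected via the attribute attached by the function; the values shown below follow from the formula above (computed by hand here, so treat the printed matrix as illustrative):

attr(minWeightBipartiteMatching(
   c(1, 1, 2, 3, 3, 4, 4, 4, 2), 
   c(2, 2, 3, 1, 1, 4, 4, 4, 3)), "assignmentMatrix")
##      [,1] [,2] [,3] [,4]
## [1,]    4    0    4    5
## [2,]    4    4    0    5
## [3,]    0    4    4    5
## [4,]    5    5    5    0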

The resulting mapping for the example reads as follows: rename cluster 1 from the first partition to 2, cluster 2 to 3, cluster 3 to 1, and cluster 4 keeps its name.
The last step is to write some code that carries out the mapping automatically.

matching <- minWeightBipartiteMatching(res1$cluster, res2$cluster)
clusterA <- res1$cluster  # map the labels from cluster A
tmp <- sapply(1:length(matching), function(i) {
    clusterA[which(res1$cluster == i)] <<- matching[i]
})

clusterB <- res2$cluster
bh.df.cl.mwbm = merge(bh.df, data.frame(id = bh@data$id, CL1 = clusterA, CL2 = clusterB), 
    by = "id")

# plot
p1 <- ggplot(bh.df.cl.mwbm) + aes(long, lat, group = group, fill = as.factor(CL1)) + 
    geom_polygon() + geom_path(color = "white") + coord_equal() + 
    scale_fill_brewer(palette = "Set3") + 
    theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"), legend.position = "none")
p2 <- ggplot(bh.df.cl.mwbm) + aes(long, lat, group = group, fill = as.factor(CL2)) + 
    geom_polygon() + geom_path(color = "white") + coord_equal() + 
    scale_fill_brewer(palette = "Set3") + 
    theme(plot.margin = unit(c(0.2, 0.2, 0.2, 0.2), "cm"), legend.position = "none")
grid.newpage()
pushViewport(viewport(layout = grid.layout(1, 2)))
print(p1, vp = vplayout(1, 1))
print(p2, vp = vplayout(1, 2))

maps of the two clustering solutions after label matching

Looking at the final plot reveals that, apart from the inherent instability of the k-means method, the two clusterings look approximately identical and thus the mapping was done successfully. Hoping that this code will also help others, I’m looking forward to any helpful comment.

October 18, 2012
Benchmarking distance calculation in R

A typical step in a lot of data mining methods is the calculation of a distance between entities. For the nearest-neighbor method, for example, it is crucial to do this calculation very efficiently because it is the most time-consuming step of the procedure. Just imagine you want to compute the Euclidean distance between a constantly changing database and incoming queries, each of which consists of exactly one entity. So you want to know how to do the computation quickly and easily. Below you see the results of using different packages and some self-implemented approaches.

First we have to create some artificial data which nicely matches the structure of the data in one of my projects - that is where the question about distance calculation came from. Typically the database consists of 3000 entities with 55 features:

set.seed(1234)

# create some data
nFeature = 55
dataSet <- matrix(rnorm(3000 * nFeature), ncol = nFeature)
query <- matrix(rnorm(1 * nFeature), ncol = nFeature)

# parameter for distance calculations
K = 10  # number of nearest neighbors

The first package I stumbled upon was the yaImpute package - it contains the ann function which is based on the Approximate Nearest Neighbor Library (http://www.cs.umd.edu/~mount/ANN):

library(yaImpute)

ann_dist <- function(ref, target, k) {
    res <- ann(ref = ref, target = target, tree.type = "kd", k = k, verbose = FALSE)
    return(res)
}
resANN <- ann_dist(ref = dataSet, target = query, k = K)

The second package is pdist, which includes the function pdist:

library(pdist)

pdist_wrapper <- function(ref, target, k) {
    distAll <- pdist(X = ref, Y = target)
    iNN <- order(distAll)[1:k]
    return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]^2), nrow = 1), k = k))
    # similar to ann the element knnIndexDist from the list is a vector which
    # contains the indices of the nearest neighbors on position 1 to k and all
    # distances afterwards (position k+1 to 2k)
}
resPDIST <- pdist_wrapper(ref = dataSet, target = query, k = K)

The next approach subtracts the query from every entity in the database and sums up the squared differences (call it rowDist):

rowDist <- function(ref, target, k) {
    distAll <- colSums(apply(ref, 1, "-", target)^2)
    iNN <- order(distAll)[1:k]
    return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]), nrow = 1), k = k))
}
resRowDist <- rowDist(ref = dataSet, target = query, k = K)

And now the other way round: first you subtract each query feature value from the corresponding feature column of the database, and second you sum up all parts of the overall distance (named colDist):

colDist <- function(ref, target, k) {
    distAll <- rowSums(sapply(1:ncol(target), function(i) {
        (ref[, i] - target[, i])^2
    }))
    iNN <- order(distAll)[1:k]
    return(list(knnIndexDist = matrix(c(iNN, distAll[iNN]), nrow = 1), k = k))
}
resColDist <- colDist(ref = dataSet, target = query, k = K)

My last approach is similar to rowDist but avoids calling apply (call it rowDistWithoutApply):

rowDistWithoutApply <- function(ref, target, k) {
    dists = colSums((t(ref) - target[1, ])^2)
    iNN = order(dists)[1:k]
    return(list(knnIndexDist = matrix(c(iNN, dists[iNN]), nrow = 1), k = k))
}
resRowDistWA <- rowDistWithoutApply(ref = dataSet, target = query, k = K)

So let’s have a look at the results and check whether all approaches give the same nearest neighbors and distances:

matrix(cbind(A = t(resANN$knnIndexDist), 
     t(resPDIST$knnIndexDist), 
     t(resRowDist$knnIndexDist), 
     t(resColDist$knnIndexDist), 
     t(resRowDistWA$knnIndexDist)), 
     ncol = 5, dimnames = list(NULL, 
    c("ann", "pdist", "rowDist", "colDist", "rowDistWithoutApply")))
##           ann   pdist rowDist colDist rowDistWithoutApply
##  [1,]  526.00  526.00  526.00  526.00              526.00
##  [2,]  564.00  564.00  564.00  564.00              564.00
##  [3,] 2055.00 2055.00 2055.00 2055.00             2055.00
##  [4,]  402.00  402.00  402.00  402.00              402.00
##  [5,] 2703.00 2703.00 2703.00 2703.00             2703.00
##  [6,]  233.00  233.00  233.00  233.00              233.00
##  [7,]  274.00  274.00  274.00  274.00              274.00
##  [8,] 1749.00 1749.00 1749.00 1749.00             1749.00
##  [9,]  548.00  548.00  548.00  548.00              548.00
## [10,] 2358.00 2358.00 2358.00 2358.00             2358.00
## [11,]   57.50   57.50   57.50   57.50               57.50
## [12,]   58.10   58.10   58.10   58.10               58.10
## [13,]   59.72   59.72   59.72   59.72               59.72
## [14,]   60.27   60.27   60.27   60.27               60.27
## [15,]   61.33   61.33   61.33   61.33               61.33
## [16,]   62.47   62.47   62.47   62.47               62.47
## [17,]   62.49   62.49   62.49   62.49               62.49
## [18,]   63.07   63.07   63.07   63.07               63.07
## [19,]   63.10   63.10   63.10   63.10               63.10
## [20,]   63.18   63.18   63.18   63.18               63.18

Because there’s no mismatch between the five approaches, we can step into the final part - benchmarking. Here the results from the ann function will also be used as the baseline, because my goal is to replace this function with something simpler and faster.

First let’s use the built-in system.time() function to measure the time for 2000 repetitions:

iterN <- 2000L
matrix(cbind(system.time(for (i in 1:iterN) {
    ann_dist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
    pdist_wrapper(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
    rowDist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
    colDist(ref = dataSet, target = query, k = 50)
})[1:3], system.time(for (i in 1:iterN) {
    rowDistWithoutApply(ref = dataSet, target = query, k = 50)
})[1:3]), ncol = 5, dimnames = list(c("user", "system", "elapsed"), c("ann", 
    "pdist", "rowDist", "colDist", "rowDistWithoutApply")))
##           ann pdist rowDist colDist rowDistWithoutApply
## user    15.07 17.23   63.87   19.11                8.83
## system   1.99  2.79    2.88    1.10                1.96
## elapsed 17.05 20.03   66.92   20.21               10.85

Second, for a more precise measurement I use the package microbenchmark, which relies on quite accurate timing functions:

library(microbenchmark)
res <- microbenchmark(ann = ann_dist(ref = dataSet, target = query, k = 50), 
    pdist = pdist_wrapper(ref = dataSet, target = query, k = 50), 
    rowDist = rowDist(ref = dataSet, target = query, k = 50), 
    colDist = colDist(ref = dataSet, target = query, k = 50), 
    rowDistWithoutApply = rowDistWithoutApply(ref = dataSet, target = query, k = 50), 
    times = 2000L)
print(res)
## Unit: milliseconds
##                  expr    min     lq median     uq    max
## 1                 ann  5.836  7.095  8.279  9.553  88.08
## 2             colDist  4.400  6.746  7.656  8.859  81.09
## 3               pdist  2.828  5.035  5.867  7.259  84.60
## 4             rowDist 16.619 20.488 22.769 27.165 100.37
## 5 rowDistWithoutApply  2.097  3.202  4.435  5.360  78.61

Finally, a plot shows the results (there seems to be an interesting pattern: the timing “outliers” form an individual cluster with much higher execution times than the average):

library(ggplot2)
plt <- qplot(y = time, data = res, colour = expr)
plt <- plt + scale_y_log10()
print(plt)

plot of the microbenchmark timings per approach (log scale)

So happily we found a better solution - rowDistWithoutApply - for calculating the distances between a dataset and a query: it is approximately two times faster than ann (of course we have to be careful because there is no error checking) and does not need an additional library. The runner-up is pdist from the pdist package, followed by ann and colDist; rowDist performs worst.
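
Regarding the missing error checking: a minimally hardened variant could look like the following sketch (the checks are my addition and were not part of the benchmark above):

rowDistWithoutApplyChecked <- function(ref, target, k) {
    # basic input validation (not benchmarked above)
    stopifnot(is.matrix(ref), is.matrix(target), 
        ncol(ref) == ncol(target), nrow(target) == 1, 
        k >= 1, k <= nrow(ref))
    dists <- colSums((t(ref) - target[1, ])^2)
    iNN <- order(dists)[1:k]
    return(list(knnIndexDist = matrix(c(iNN, dists[iNN]), nrow = 1), k = k))
}
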
For my case this outcome remains quite stable, but it would be interesting to know whether this behavior holds in general or whether there are constraints. In this sense, any helpful comments are more than welcome.
