April 7, 2013
Venue Recommendation - A Simple Use Case Connecting R and Neo4j

Last month I attended the CeBIT trade fair in Hannover. Besides the so-called “shareconomy”, there was another main topic across all exhibition halls: Big Data. This subject is not completely new, and I think that many of you already have experience with some of the tools associated with Big Data. But due to the great number of databases, frameworks and engines in this field, there will always be something new to learn.

So two weeks ago I started my own experiments with a graph database called Neo4j. It belongs to the family of NoSQL databases, many of which are intended to distribute computation across clusters of machines in a fault-tolerant way. What attracted me was that I read that it is well suited for highly connected data and offers a declarative language for querying the graph database. Roughly speaking, a graph database consists of nodes and edges connecting the nodes. Both can also be enriched with properties. Some introductions that helped me can be found here and here. The graph query language “Cypher” can then be used to query the data by traversing the graph. Cypher itself is a declarative pattern-matching language and should be easy to understand for anyone familiar with SQL. There is a well-arranged overview under this address.

If you look at my older posts, you will see that most of them are about spatial data or data with at least some spatial dimension. This kind of data often has some inherent relationships, for example streets connected in a street network, regions connected through a shared border, places visited by people and so on. Thus, for my first experiment with Neo4j, I decided to connect one of the most discussed use cases from Big Data, recommender systems, with an attractive dataset about the location-based social network Foursquare that I collected last year.

The main idea behind this simple “Spatial Recommendation Engine” is to use publicly available check-in data to recommend places to users that they have never visited before. Such a “check-in” consists of a user ID, a place (called a venue), a check-in time and some additional information (venue type, etc.). The following code shows the structure of the already preprocessed data:

options(width = 90)

# load required libraries
require(data.table)
require(reshape)
require(reshape2)
require(bitops)
require(RCurl)
require(RJSONIO)
require(plyr)

# load Foursquare data
fileName <- "DATA/Foursquare_Checkins_Cologne.csv"
dataset <- read.csv2(fileName, colClasses = c(rep("character", 7), 
    rep("factor", 2), rep("character", 2)), dec = ",", encoding = "UTF-8")

# what the first 6 rows look like
head(dataset)
##                 CHECKIN_ID        CHECKIN_DATE CHECKIN_TEXT   USERID
## 1 4ff0244de4b000351aa08c35 2012-07-01 11:19:57              15601925
## 2 50a66a8ee4b04d0625654fad 2012-11-16 16:32:14               7024136
## 3 50fbe6b6e4b03e4eab759beb 2013-01-20 12:44:38                193265
## 4 50647c22e4b011670f2a173e 2012-09-27 17:17:38              10795451
## 5 500fc5b9e4b0d630c79ab4f8 2012-07-25 11:08:57              13964243
## 6 50d09108e4b013668d5538f3 2012-12-18 15:51:36                126823
##                    VENUEID        VENUE_NAME   GKZ_ID              CATEGORY_ID
## 1 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 2 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 3 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 4 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 5 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
## 6 4aef5d85f964a520dfd721e3 Köln Hauptbahnhof 05315000 4d4b7105d754a06379d81259
##        CATEGORY_NAME              LAT              LNG
## 1 Travel & Transport 50,9431986273333 6,95889741182327
## 2 Travel & Transport 50,9431986273333 6,95889741182327
## 3 Travel & Transport 50,9431986273333 6,95889741182327
## 4 Travel & Transport 50,9431986273333 6,95889741182327
## 5 Travel & Transport 50,9431986273333 6,95889741182327
## 6 Travel & Transport 50,9431986273333 6,95889741182327

The data was crawled last year as the basis for an academic paper in the field of Urban Computing (which will be presented in May at the AGILE Conference on Geographic Information Science in Brussels) and contains publicly available check-ins for Germany. It seems to me that this kind of data is ideally suited for recommendations in a graph database, and it avoids the use of well-known toy datasets.

The key idea behind our recommendation system is the following: Starting with a person for whom we want to make a recommendation, we calculate the most similar users. A similar user is someone who rated venues in the same way the person of interest did. Because there is no explicit rating in the Foursquare data, we take the number of visits as the rating. The logic behind this is that either the person likes this place or it is important to him. So if both the person and another user give high “ratings” to venues visited by both (thus both are similar), then the person may also be interested in visiting other venues that the other user rated highly and that the person has not seen yet. Technically speaking, this approach is called collaborative filtering (user similarity is calculated from behavior), while the data collection is implicit (we have no explicit rating).

Our data model is therefore straightforward: We take the venues and the users as nodes and turn their attributes into node properties. Then we connect a user node and a venue node with a relationship if the user has visited this venue. The number of visits is stored as a property of the relationship.

For the recommender system we will use a combination of R and Cypher statements, the latter primarily for loading the data into Neo4j and traversing the graph. The Cypher statements are sent to Neo4j through its REST API. We can then use R to preprocess the data, collect the results and calculate the final recommendation list.
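To make the data model a bit more tangible, here is a minimal sketch in base R with a made-up check-in log. The user and venue IDs (`u1`, `v1`, ...) are purely hypothetical and not taken from the Foursquare data; the sketch only illustrates how check-ins turn into nodes and weighted relationships.

```r
# toy check-in log: one row per check-in (hypothetical IDs)
checkins <- data.frame(
    USERID  = c("u1", "u1", "u1", "u2", "u2", "u3"),
    VENUEID = c("v1", "v1", "v2", "v1", "v3", "v3"),
    stringsAsFactors = FALSE
)

# nodes: one per distinct user and one per distinct venue
userNodes  <- unique(checkins$USERID)
venueNodes <- unique(checkins$VENUEID)

# relationships: one per (user, venue) pair that occurs in the log;
# the number of visits serves as the implicit rating ("stars")
checkins$stars <- 1
ratedEdges <- aggregate(stars ~ USERID + VENUEID, data = checkins, FUN = sum)
ratedEdges <- ratedEdges[order(ratedEdges$USERID, ratedEdges$VENUEID), ]
print(ratedEdges)
```

Only the five user/venue pairs that actually occur become relationships. All other combinations are simply absent from the graph.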

The following is a short overview of all the steps:

  1. Extracting all relevant information (venues, users, ratings) from the check-in data
  2. Loading the data into Neo4j
  3. Calculating similarities for a specific user and making a recommendation on-the-fly
  4. Plotting the results on a map

I assume that Neo4j is installed (it’s very simple - look here) and that the graph database is empty. To empty it, delete the “graph.db” directory and then start Neo4j.

So our first step is to extract all venues, users and ratings from the check-in data.

# -------------------------------------- 
# data preprocessing
# --------------------------------------
dataset$CHECKIN_DATE <- as.POSIXct(dataset$CHECKIN_DATE, format = "%Y-%m-%d %H:%M:%S")
dataset$LAT <- sub(",", ".", dataset$LAT)
dataset$LNG <- sub(",", ".", dataset$LNG)
dataset$LAT <- as.numeric(dataset$LAT)
dataset$LNG <- as.numeric(dataset$LNG)
dataset$HOUR24 <- as.numeric(format(dataset$CHECKIN_DATE, "%H"))
venueDataset <- unique(dataset[, c("VENUEID", "LNG", 
    "LAT", "VENUE_NAME", "CATEGORY_NAME")])

# use data.table for aggregation
datasetDT <- data.table(dataset)
venueUserDataset <- datasetDT[, list(COUNT_CHECKINS = length(unique(CHECKIN_ID))), 
    by = list(VENUEID, USERID)]
venueUserDataset <- data.frame(venueUserDataset)

# now unique(venueUserDataset$USERID) contains all user IDs,
head(unique(venueUserDataset$USERID))
## [1] "15601925" "7024136"  "193265"   "10795451" "13964243" "126823"
# venueDataset contains all venues and
head(venueDataset)
##                     VENUEID   LNG   LAT                       VENUE_NAME
## 1  4aef5d85f964a520dfd721e3 6.959 50.94                Köln Hauptbahnhof
## 24 4bade052f964a520506f3be3 6.949 50.93             Stadtbibliothek Köln
## 25 4baf1998f964a52033eb3be3 6.964 50.93 Deutsches Sport & Olympia Museum
## 26 4baf428cf964a52024f43be3 6.962 50.92                     Ubierschänke
## 27 4ba4f032f964a520dac538e3 6.849 50.92                     OBI Baumarkt
## 28 4bc210d92a89ef3b7925f388 6.927 50.95                    Pfeiler Grill
##           CATEGORY_NAME
## 1    Travel & Transport
## 24 College & University
## 25 Arts & Entertainment
## 26       Nightlife Spot
## 27       Shop & Service
## 28                 Food
# venueUserDataset contains all the relationships (aka ratings)
head(venueUserDataset)
##                    VENUEID   USERID COUNT_CHECKINS
## 1 4aef5d85f964a520dfd721e3 15601925              5
## 2 4aef5d85f964a520dfd721e3  7024136              1
## 3 4aef5d85f964a520dfd721e3   193265              1
## 4 4aef5d85f964a520dfd721e3 10795451              6
## 5 4aef5d85f964a520dfd721e3 13964243              6
## 6 4aef5d85f964a520dfd721e3   126823             11

The next step is to import all that data into Neo4j. We do this by generating Cypher statements dynamically to create all the nodes and relationships. This will of course take some time. If you have more data, it is probably wiser to use the “Batch Importer”, but this needs more development and will not be explained here. Neo4j’s website describes many ways to import data from various sources into the graph database. All of our Cypher statements are sent to Neo4j via the “query” function, which I got from here.

# Function for querying Neo4j from within R 
# from http://stackoverflow.com/questions/11188918/use-neo4j-with-r
query <- function(querystring) {
    # collects the body of the HTTP response
    h <- basicTextGatherer()
    # POST the URL-encoded Cypher statement to the Cypher plugin endpoint
    curlPerform(url = "localhost:7474/db/data/ext/CypherPlugin/graphdb/execute_query", 
        postfields = paste("query", curlEscape(querystring), 
        sep = "="), writefunction = h$update, verbose = FALSE)
    # parse the JSON answer and turn the rows into a data frame
    result <- fromJSON(h$value())
    data <- data.frame(t(sapply(result$data, unlist)))
    names(data) <- result$columns
    return(data)
}
# -------------------------------------- 
# import all data into neo4j
# --------------------------------------
nrow(venueDataset)  # number of venues
## [1] 3352
length(unique(venueUserDataset$USERID))  # number of users
## [1] 3306
nrow(venueUserDataset)  # number of relationships
## [1] 11293
# venues (-> nodes)
for (i in 1:nrow(venueDataset)) {
    q <- paste("CREATE venue={name:\"", venueDataset[i, "VENUEID"], "\",txt:\"", 
        venueDataset[i, "VENUE_NAME"], "\",categoryname:\"",
        venueDataset[i, "CATEGORY_NAME"], "\",type:\"venue\",\nlng:", 
        venueDataset[i, "LNG"], ", lat:", venueDataset[i, "LAT"], 
        "} RETURN venue;", sep = "")
    data <- query(q)
}
# users (-> nodes)
for (i in unique(venueUserDataset$USERID)) {
    q <- paste("CREATE user={name:\"", i, "\",type:\"user\"} RETURN user;", sep = "")
    data <- query(q)
}
# number of checkins (-> relationships)
for (i in 1:nrow(venueUserDataset)) {
    q <- paste("START user=node:node_auto_index(name=\"", venueUserDataset[i, "USERID"], 
        "\"), venue=node:node_auto_index(name=\"", venueUserDataset[i, "VENUEID"], 
        "\") CREATE user-[:RATED {stars : ", 
        venueUserDataset[i, "COUNT_CHECKINS"], "}]->venue;", sep = "")
    data <- query(q)
}
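One thing to keep in mind when building Cypher statements with paste: a venue name that itself contains a double quote would terminate the string literal too early and break the statement. The following is only a sketch of one possible safeguard. The helpers `escapeCypher` and `buildVenueQuery` are hypothetical names that do not appear in the code above, the venue name is made up, and I assume that the Cypher version in use accepts backslash-escaped quotes inside string literals.

```r
# hypothetical helper: escape backslashes and double quotes so the value
# can be embedded in a double-quoted Cypher string literal
escapeCypher <- function(x) {
    x <- gsub("\\", "\\\\", x, fixed = TRUE)
    gsub("\"", "\\\"", x, fixed = TRUE)
}

# hypothetical helper: same CREATE statement as in the venue loop, with escaping
buildVenueQuery <- function(id, name, category, lng, lat) {
    paste("CREATE venue={name:\"", escapeCypher(id),
        "\",txt:\"", escapeCypher(name),
        "\",categoryname:\"", escapeCypher(category),
        "\",type:\"venue\",lng:", lng, ", lat:", lat,
        "} RETURN venue;", sep = "")
}

# made-up venue name containing double quotes
q <- buildVenueQuery("4aef5d85f964a520dfd721e3", "Bar \"Zum Test\"",
    "Food", 6.959, 50.94)
cat(q, "\n")
```

The resulting statement can be passed to the query function in the same way as the statements generated in the loops.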

Before we start with the recommender itself, I will discuss some of its details. The first part of the plan is to compute the similarities between a person and all other users who also visited at least one of the venues the person did. Based on these similarities we then determine the recommendations. This means that we need a similarity measure first. In our case we will use the cosine similarity, a similarity measure typically used in text mining for high-dimensional data (which also fits our case). A maximum value of 1 means that both users rated all venues they visited in the same way (“the profiles of both are similar”). If you calculated the similarity in the traditional way, you would first have to build a feature table of size \(m \times n\) (\(m\) ~ number of users and \(n\) ~ number of venues) where the entry \((i,j)\) represents the rating of user \(i\) for venue \(j\). This feature table would be huge and sparse because most users only visited a few venues. A graph is an efficient way to represent this, because only the ratings that actually exist have to be encoded as explicit relationships.
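As a small illustration of the similarity measure, here is a sketch of the cosine similarity in plain R. The function name `cosineSim` and the visit counts are made up for this example. The vectors contain only venues visited by both users, which mirrors the graph pattern used in the similarity query further down.

```r
# cosine similarity of two rating vectors of equal length
cosineSim <- function(a, b) {
    sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# hypothetical visit counts of two users for the same three shared venues
me    <- c(2, 1, 3)
other <- c(4, 2, 6)    # exactly twice the counts of 'me'
simProportional <- cosineSim(me, other)   # proportional profiles: 1 (up to rounding)

# a user with a different visiting pattern gets a lower similarity
other2 <- c(3, 1, 1)
simDifferent <- cosineSim(me, other2)

print(c(simProportional, simDifferent))
```

The same arithmetic can be checked against the second row of the result table of the similarity query: a dot product of 8 and two magnitudes of 3 give 8 / (3 * 3) ≈ 0.8889.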

After we choose a person for whom we want to compute recommendations, we start by calculating all of the relevant similarities. To get more meaningful recommendations, we exclude all venues of the venue type “Travel & Transport” and only take those users into account who have at least two visited venues in common with the chosen person. For the last part we have to use R because, if I’m right, Neo4j is unable at the moment to carry out “subselects”.

# -------------------------------------- 
# simple venue recommendation 
# --------------------------------------
userName <- "7347011"  # choose username/ID
nOfRecommendations <- 20  # number of recommendations

# Determine similar users using the cosine similarity measure
# (anz_venues = number of venues shared with the chosen user)
q <- paste("START me=node:node_auto_index(name=\"", userName, "\")
    MATCH (me)-[r1]->(venue)<-[r2]-(simUser)
    WHERE venue.categoryname <> \"Travel & Transport\"
    RETURN id(me) as id1, id(simUser) as id2,
        sqrt(sum(r1.stars*r1.stars)) as mag1, 
        sqrt(sum(r2.stars*r2.stars)) as mag2,
        sum(r1.stars * r2.stars) as dotprod,
        sum(r1.stars * r2.stars)/ 
        (sqrt(sum(r1.stars*r1.stars)) * sqrt(sum(r2.stars*r2.stars))) as cossim,
        count(venue) as anz_venues
    ORDER BY count(venue) DESC;", 
    sep = "")
ans <- query(q)
# keep only users with at least two venues in common with the chosen user
simUser <- subset(ans, anz_venues >= 2)
head(simUser)
##    id1  id2   mag1   mag2 dotprod cossim anz_venues
## 1 3450 3518 16.371 16.643     196  0.7194         15
## 2 3450 3782  3.000  3.000       8  0.8889          6
## 3 3450 3382  2.828  2.828       7  0.8750          5
## 4 3450 4031  4.690  2.236      10  0.9535          5
## 5 3450 3860  2.236  7.483      12  0.7171          5
## 6 3450 3537  2.236 11.180      15  0.6000          5

The second query then selects all venues (call them recommendation candidates) that were rated by similar users but have not yet been visited by the person. It returns the user ratings and the venue properties such as name, type and the geographic coordinates.

# Query all venues from the similar users still not visited by the chosen user
q2 <- paste("START su=node(", paste(simUser$id2, collapse = ","), "), 
    me=node:node_auto_index(name=\"", userName, "\")
    MATCH su-[r]->v
    WHERE NOT(v<-[]-me) AND v.categoryname <> \"Travel & Transport\"
    RETURN id(su) as id_su, r.stars as rating, id(v) as id_venue, 
        v.txt as venue_name, v.lng as lng, v.lat as lat ORDER BY v.txt;", 
    sep = "")
ans2 <- query(q2)
head(ans2)
##   id_su rating id_venue            venue_name              lng              lat
## 1  3480      1     1297 . HEDONISTIC CRUISE .         6.926502        50.949425
## 2  3436      1     1269               30works 6.93428135142764 50.9401634603315
## 3  3480      1     1376             3DFACTORY  6.9209361076355 50.9508572326839
## 4  3381      1     2274    4 Cani della Citta         6.942126        50.938178
## 5  3369      1     1418     4010 Telekom Shop          6.94348         50.93852
## 6  3547      1     1418     4010 Telekom Shop          6.94348         50.93852
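The candidate selection of the query above can also be illustrated without a database. This is a sketch on a made-up relationship table with hypothetical user and venue names. It shows the same logic as the graph pattern: take everything rated by the similar users, then drop the venues the chosen person already visited.

```r
# hypothetical toy relationships (user, venue, number of visits)
rated <- data.frame(
    user  = c("me", "me", "su1", "su1", "su2", "su2"),
    venue = c("v1", "v2", "v1", "v3", "v2", "v4"),
    stars = c(3, 1, 2, 5, 1, 2),
    stringsAsFactors = FALSE
)
similarUsers <- c("su1", "su2")

# venues the chosen person has already visited
visitedByMe <- rated$venue[rated$user == "me"]

# candidates: rated by a similar user AND not yet visited by the person
candidates <- rated[rated$user %in% similarUsers &
    !(rated$venue %in% visitedByMe), ]
print(candidates)
```

In this toy example only `v3` and `v4` remain as candidates, because `v1` and `v2` are already known to the chosen person.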

The last step is to determine the top X recommendations. To do so, we compute a weighted rating for every recommendation candidate over all of the similar users that have already visited it, using the similarity between each user and the chosen person as the weight, and pick the top X venues as our recommendations.

# Calculate top X recommendations
recommendationCandidates <- ans2
venueRecommendationCandidates <- merge(ans, recommendationCandidates, 
    by.x = "id2", by.y = "id_su")
venueRecommendationCandidates$rating <- 
    as.numeric(as.character(venueRecommendationCandidates$rating))
venueRecommendation <- ddply(venueRecommendationCandidates, 
    c("id_venue", "venue_name", "lng", "lat"), function(df) {
        sum(df$cossim * as.numeric(df$rating))/sum(df$cossim)
})
venueRecommendation <- 
    venueRecommendation[order(venueRecommendation[, 5], decreasing = TRUE), ]
venueRecommendation$lat <- as.numeric(as.character(venueRecommendation$lat))
venueRecommendation$lng <- as.numeric(as.character(venueRecommendation$lng))

# Our recommendations for the chosen user
venueRecommendation[c(1:nOfRecommendations), ]
##     id_venue                                     venue_name   lng   lat     V1
## 687      168                                     Wohnung 16 6.922 50.94 100.00
## 697      187                       Fork Unstable Media GmbH 6.966 50.93  52.00
## 152       56                  Pixum | Diginet GmbH & Co. KG 7.000 50.87  44.00
## 49       536     Fachhochschule des Mittelstands (FHM) Köln 6.939 50.94  37.00
## 635      154 Seminar für Politische Wissenschaft - Uni Köln 6.924 50.93  30.00
## 752      201      Sinn und Verstand Kommunikationswerkstatt 6.954 50.95  27.00
## 789      677                               Happyjibe's Loft 6.919 50.97  26.00
## 831      425                     PlanB. GmbH Office Cologne 6.963 50.93  26.00
## 484      586                               Praxis Dokter H. 6.989 50.93  25.00
## 666      303                      Bürogemeinschaft Eckladen 6.962 50.93  24.00
## 223      337                                Köln-Lindweiler 6.887 51.00  23.00
## 516      723                                    Health City 6.932 50.94  23.00
## 611      306     Kreuzung Zollstockgürtel / Vorgebirgstraße 6.945 50.90  23.00
## 201      784                            Paul-Humburg-Schule 6.922 50.99  21.00
## 412      122                                     unimatrix0 6.934 50.93  21.00
## 601      989                             Prinzessinnenküche 6.947 50.90  20.00
## 754      957                Reitergemeinschaft Kornspringer 7.076 50.97  19.00
## 371      233                                            MTC 6.939 50.93  17.56
## 721      188                           ESA-Besprechungsecke 6.976 50.96  17.00
## 694      973                                  fischerappelt 6.966 50.93  16.00
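To see what the weighted rating in the ddply call does, here is a small worked example in base R. The venues, similarities and ratings are made up; only the formula sum(cossim * rating) / sum(cossim) is taken from the code above.

```r
# hypothetical candidates: two similar users rated venue "A", one rated venue "B"
cand <- data.frame(
    venue  = c("A", "A", "B"),
    cossim = c(0.9, 0.6, 0.8),
    rating = c(4, 1, 2),
    stringsAsFactors = FALSE
)

# similarity-weighted mean rating per venue
weightedRating <- sapply(split(cand, cand$venue), function(df) {
    sum(df$cossim * df$rating) / sum(df$cossim)
})
print(sort(weightedRating, decreasing = TRUE))
```

For venue A this gives (0.9 * 4 + 0.6 * 1) / (0.9 + 0.6) = 2.8, for venue B simply 2. Note that for a venue rated by only one similar user the weight cancels out, so the score equals that user's number of visits. This may explain why venues with very high visit counts from a single user can end up at the top of the list.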

We will close the coding section with a visualization of the already visited (red) and the recommended (blue) venues on a map. This gives a first impression of how the venues and the recommendations are distributed in geographic space.

# -------------------------------------- 
# Plot all of the visited and recommended venues
# -------------------------------------- 
# get coordinates of the venues already visited
qUserVenues <- paste("START me=node:node_auto_index(name=\"", 
    userName, "\") MATCH me-[r]->v
    WHERE v.categoryname <> \"Travel & Transport\"
    RETURN r.stars, v.txt as venuename, v.type as type, v.lng as lng, v.lat as lat 
    ORDER BY r.stars, v.txt;", 
    sep = "")
userVenues <- query(qUserVenues)
userVenues$lng <- as.numeric(as.character(userVenues$lng))
userVenues$lat <- as.numeric(as.character(userVenues$lat))

# plot venues using the ggmap package
require(ggmap)
theme_set(theme_bw(16))
hdf <- get_map(location = c(lon = mean(userVenues$lng), 
    lat = mean(userVenues$lat)), zoom = 11)
ggmap(hdf, extent = "normal") + 
    geom_point(aes(x = lng, y = lat), size = 4, colour = "red", 
        data = userVenues) + 
    geom_point(aes(x = lng, y = lat), size = 4, colour = "blue", 
        data = venueRecommendation[c(1:nOfRecommendations), ])

Recommendations and visited venues

Finally, it is time to summarize what was done: We built a simple recommendation engine that can be used to recommend new places for users to visit. The recommendation is based on their past behavior and can be computed in near real-time (there is no long-running batch job). What we left out is a rigorous evaluation of the recommendation performance. But because recommendation only serves as a demo use case, this was not the topic of this post.

More importantly, I have to say that I’m really impressed by how easy it is to set up a Neo4j graph database and how simple it is to take the first steps with the query language Cypher. The SQL-like style of Cypher makes determining the most similar users straightforward. What’s also interesting is the simplicity of the connection from R to Neo4j via the REST interface. Apart from the RCurl and RJSONIO packages used in the query function, nothing else is needed. The outlook is even more promising due to the main advantages of NoSQL databases, like fast operations on huge datasets or easy and fault-tolerant scalability (not shown here). Even though the operations run fast on this moderately sized dataset, a broad test lies beyond the scope of this session. But maybe in one of the next posts …
