Improved Recommendation Engine

I have been working on an improved recommendation engine for Twitter users you should follow, however the basic concept is easily used for other recommendations.

Simple Algorithm

While there are many different ways to come up with recommendations, a common one is to look at the the most common friends-of-friends that you have.

This algorithm is trivial to implement, in SQL you would use something like:

SELECT f2.follows_user_id, COUNT( * ) 
FROM following AS f1
INNER JOIN following AS f2 ON f1.follows_user_id = f2.user_id
WHERE f1.user_id =16308660
GROUP BY f2.follows_user_id
ORDER BY COUNT( * ) DESC 

and in Chyper you would use something like:

MATCH (n:Member { id:'James'})-[fr:FRIENDS_WITH]->(f)-[fofr:FRIENDS_WITH]->(fof) 
RETURN fof.name, count(fofr) 
ORDER BY count(fofr) DESC

Drawbacks

The algorithm is a very broad brush approach and ignores the fact that within the community of people that I follow, there are clusters of users that might have the same interests, such as Neo4j, Web Development, Gardening or just all work for the same company.

While this broad brush approach is useful to find the most popular people in the network globally, such as Stephen Fry or Richard Branson , it will miss people who are popular within small clusters, but not globally, whom I might also be interested in following.

Simple Social Network

In the example above, Alice and Bob form a cluster (as they follow each other), and they both follow S. This means I am probably interested in following S, even though globally they would rank quite low (below Q and R).

Improving the Algorithm

What I wanted was a way to find the users I should follow within these clusters. First I needed to cluster the people that I follow, you can see how I did this in my post here.

Once I had the different clusters, I updated the algorithm to the following:
1) Find the users in each cluster
2) Find the most popular friends-of-friends in each cluster
3) Rank these users by the percentage of followers in the cluster and the size of the cluster.

In SQL this looks something like

SELECT c.cluster_id, f.follows_user_id, c.cluster_size, count(*) as c, count(*)/c.cluster_size as p
FROM cluster as c
RIGHT JOIN cluster_member as cm ON c.cluster_id = cm.cluster_id
RIGHT JOIN following as f ON f.user_id = cm.member
GROUP BY c.cluster_id, f.follows_user_id
ORDER BY count(*) / cluster_size DESC, `c`.`cluster_size` DESC, count(*) DESC

However that SQL is very inefficant so I used a modified version below;

REATE TEMPORARY TABLE IF NOT EXISTS cluster_method ENGINE=MEMORY 
as (
SELECT c.cluster_id, f.follows_user_id, c.cluster_size, count(*) as c, count(*)/c.cluster_size as p
FROM cluster as c
RIGHT JOIN cluster_member as cm ON c.cluster_id = cm.cluster_id
RIGHT JOIN following as f ON f.user_id = cm.member
WHERE c.cluster_size >= 5
GROUP BY c.cluster_id, f.follows_user_id
HAVING count(*) / cluster_size > 0.6
ORDER BY `c`.`cluster_size` DESC, count(*) DESC);

DELETE FROM cluster_method WHERE follows_user_id IN
(SELECT follows_user_id FROM following WHERE user_id = $twitterUserId);

SELECT follows_user_id as id, MAX(p) as p
FROM cluster_method 
WHERE follows_user_id <> 16308660 
GROUP BY follows_user_id
ORDER BY MAX(p) DESC, cluster_size DESC;

In Chypher it is much simpler and looks like this:

MATCH (n:Member { id:'James'})-[cr:MEMBER_OF]->(c)<-[cr1:MEMBER_OF]-(f)-[fofr:FRIENDS_WITH]->(fof) 
RETURN c.name, fof.name, count(cr1), count(fofr) 
ORDER BY count(fofr)/count(cr1) DESC, count(cr1) DESC

This new algorithm is not perfect, one issue is that with smaller cluster it is easier for every member to follow a particular user do potentially the ranking needs to be tweaked to account for this. One thing I did in the production implementation was ignore sub-communities with a size less that 5. You also need to do the extra step of determining the sub-communities up front, however even with these issues this algorithm helped me find interesting people to follow that I would otherwise not have found such as the awesome Nanogirl

Note: In both examples I haven’t accounted for users that you already follow, you can remove these easily once you have the list of recommended users to follow. I didn’t show it in the examples above as the method is the same in both cases and I wanted to make the examples clearer.