For an upcoming project I needed to group the people I follow into communities. While there are many algorithms available for grouping nodes in a graph (in this case a node is a user and an edge represents a follow), they didn’t work for me for 3 reasons:
- My graph is directed
- I wanted users to be able to be part of more than one community
- I needed to be able to cluster quickly and with limited memory
After much experimentation, I did eventually come up with a reasonably good method, which is a modified form of hierarchical clustering. While there is room for improvement, it does find the main communities which is what I needed.
In this post I will use mostly psuedo-code, but you can download a working implementation from GitHub.
The algorithm is as follows:
- Load users and who they follow into an array with the following structure:
following[userId] = array(followsUserId1, followsUserId2, followsUserId3);
- Invert the array so that users are listed with the people who follow them. This is so that we can find people who are similar because the same people follow them. The reason for this is that 2 people might follow the same users, but those users they follow are from different communities e.g. a tech person and a sports person. But the people who are following a person are usually following someone because they are members of a particular community.
This also solves a secondary problem where lots of people might follow someone because they are an important member of a community, but that person may not follow very many people.
follows[userId] = array(followedByUserId1, followedByUserId2, followedByUserId3);
- Next is the first of two loops (in the code on GitHub you can skip the second loop). This loop creates the initial clusters by doing the following:
for each user find the closest user based on common followers use these 2 users to seed a cluster set StillAddingUsers = true while StillAddingUsers find the closest user that is not in the cluster if closest user is close enough add user to cluster recalculate the common followers of the cluster else set StillAddingUsers = false
- Now that we have a collection of clusters, I then do a merge step. The reason for this as the above algorithm will produce several clusters that are very similar to each other. For example id User1 and User2 are similar, then you will have a cluster from when we looped over User1 and another from when we looped over User2.
set StillMergingClusters = true while StillMergingClusters StillMergingClusters = false //to prevent constant looping once we are done for each cluster find cluster with highest number of common users if number common users > minimum value (set in config) merge these clusters reset for each loop
- The final step is to store the clusters in the database so that we can refer back to them later.
As I mentioned at the start of this post, this algorithm is not perfect, but was good enough for what I needed. If you have any suggestions for improvement, drop me a line on twitter or submit a PR on GitHub.
Code example is available on GitHub.