Example of using Markov Clustering in PHP

This post will go step by step through using my Markov Clustering (MCL) php library to cluster twitter users. For more information on the library see my post here.

You can download the library from GitHib at https://github.com/jrowlandsnz/mcl-php.

For this example we are going to cluster the following social graph from twitter (but this could easily be any social graph). We are also going to assume that all the follows are reciprocal.

 

twitter_mcl_example

 

You can see there are two obvious clusters, but there is one connection between them. The test will be to make sure we finish with two clusters, one of Alice, Bridget, Charles and Doug. The second will be Mark, Nancy and Oliver.

The data is stored in a MySQL database with the following structure:

user
user_id
screen_name
follows
user_id
follows_user_id

All sample code is available on github. The steps involved to use the MCL class are outlines below:

  1. Include the Matrix and MCL classes
    include_once('../Matrix.php');
    include_once('../MCL.php');
    
  2. Connect to Database

    //connect to database
    $dbServer = 'localhost';
    $dbUser = 'test_user';
    $dbPassword = 'test_password';
    $dbName = 'twitter';
    
    $dbConn = mysqli_connect($dbServer, $dbUser, $dbPassword, $dbName);
    if (mysqli_connect_errno()) {
        echo "Unable to connect to the database server at this time";
        exit();
    }
    
  3. Declare some variables to hold our data

    $matrixKey = array();
    $dbData = array();
    $filePrefix = 'twitter_test_data'; //optional, used to output data files
    
  4. Query the database

    This query is very simple, for each of the rows in the following table, join on the user table to get the screen name. You can skip this step if you want to use the id’s and then do the lookup later. I prefer to use the screen_names so the clusters are more readable at the end.

    $sql = "SELECT u1.screen_name as user_name, u2.screen_name as follows_user_name
            FROM `following` f
    	INNER JOIN user u1 ON u1.user_id = f.user_id
    	INNER JOIN user u2 ON u2.user_id = f.follows_user_id";
    
  5. Pull the data and create an array of matrix keys.

    We will build an array of all of the screen_names so we can use these as the row/column labels in the matrix, this will make the matrix much easier to interpret. If we didn’t do this we would have no way of correlating a row or column in the matrix to a specific user.

    while($row = $result->fetch_assoc()) {
    	//get a list of all the keys that will be in the array
    	$key1 = array_search("".$row['user_name'],$matrixKey,TRUE);
    	if($key1 === false) {
    		$matrixKey[] = "".$row['user_name'];
    	}
    	
    	$key2 = array_search("".$row['follows_user_name'],$matrixKey,TRUE);
    	if($key2 === false) {
    		$matrixKey[] = "".$row['follows_user_name'];
    	}
    		
                    //store the data so we don't need to query the db again
    		$dbData[] = $row; 
    	}
    	asort($matrixKey);
    }
    
  6. Now that we know how big our matrix has to be, we will create the matrix and then fill with zeroes so that we know each element has a value and we don’t need to check later. This is more efficient to do than checking if every element has a value on each of the multiplication’s.

    matrixData = array();
    for($i = 0;  $i < sizeof($matrixKey); $i++) {
    		
    	$rowOfZero = array();
    	for($j = 0;  $j < sizeof($matrixKey); $j++) {		
    		$rowOfZero[] = 0;
    	}
    	$matrixData[] = $rowOfZero;
    }
    
  7. We can now load the data of who follows who into the matrix. I am going to use an array of the database data that I created earlier, but you could query the database again if you need to.

    $matrix = new Matrix($matrixData);
    		
    //now populate with data from database
    foreach($dbData as $row) {
    	$rowIndex = array_search("".$row['user_name'], $matrixKey, TRUE);
    	if($rowIndex === FALSE) {
    		throw new Exception("Matrix Key not present");
    	} 
    	
    	$colIndex = array_search("".$row['follows_user_name'], $matrixKey, TRUE);
    	if($colIndex === FALSE) {
    		throw new Exception("Matrix Key not present");
    	} 
    	//need to convert from array index to matrix index 		
            $rowIndex = $rowIndex + 1; 
    	$colIndex = $colIndex + 1;		
    	$matrix->setElement($rowIndex, $colIndex, 1);
    }
    
  8. We can finally actually do the clustering, which is nice and easy.

    The inflation values will make strong connections stronger and weaker connections weaker.
    The power value determines the amount of flow between different cluster.
    You should try different values to determine which works best for your data.

    $mcl = new MCL($matrix);
    $mcl->dataFilePrefix = $filePrefix;
    $mcl->matrixKeyArray = $matrixKey;
    $mcl->inflationValue = 2;
    $mcl->powerValue = 2;
    $mcl->cluster(true);
    $clusters = $mcl->interpret();	
    
  9. Finally we can view the output:

    //print the clusters to the screen
    echo '<pre>';
    print_r($clusters);
    echo '</pre>';
    

    which will display as

    Array
    (
        [0] => Array
            (
                [0] => Alice
                [1] => Doug
                [2] => Bridget
                [3] => Charles
            )
    
        [1] => Array
            (
                [0] => Mark
                [1] => Nancy
                [2] => Oliver
            )
    
    )
    

    Exactly as we expected.

I hope that you found this useful, contact me on twitter, I am @jrowlands and let me know how it went.

All code is available at https://github.com/jrowlandsnz/mcl-php.