Topic Specific PageRank simulation in F#

The standard PageRank can be simulated by a random walker (or many of them) surfing the graph of web pages. The random walker is following a random link on the current page with equal probability for each of these links. The PageRank of each page is then proportional to the visits of the walker to the specific page. To avoid dead ends and other issues, the random walker is allowed to teleport to any node in the graph with a small teleport probability at each step.

In Topic Specific PageRank (also know as Personalised Page Rank), the random walker is biased by teleporting to a set of topic-specific relevant nodes. The topic is determined by the context of the search query.

The goal of this post is not to explain PageRank in more detail, but to get this simple model and code it in F# to get actual numbers for a small network of pages. So, given a network of pages, and a subset of relevant pages, how do we compute the PageRank of all pages in the graph?

I start with a couple of helper functions, one that gives an element of a list in random with equal probability, and the second one that uses a function to assign weights/probabilities to each item.

open System

let randomChoice (rnd : Random) (items : 'a list) =
    items.[rnd.Next(List.length items)]

let randomWeightedChoice (rnd : Random) weightSelector (items : 'a list) =

    let weights = items |> List.map weightSelector
    let weightSum = weights |> List.sum

    let rec firstWithAccumulationGreaterThan value items =
        let rec loop value items =
            match items with
            | [] -> raise (ArgumentException("The list of items should never be empty."))
            | (x,w)::tl ->
                if w > value 
                then x 
                else loop (value - w) tl
        loop value items

    let random = rnd.NextDouble() * weightSum

    firstWithAccumulationGreaterThan random (List.zip items weights)

I am going to use the first one when I choose the link to follow on a page. The second one will be used when I choose what page to teleport to, as I will attach a weight to each page in the teleport set. I define two record types, one for a link and one for a teleport node with its attached weight. I identify each node with an integer Id.

    type Link = { From : int; To : int}

    type TeleportNode = { Id : int; Weight : float }

Next I define a record type that contains the state of the random walker, i.e. the current node and the number of visits for each page visited up to now.

type WalkerState = 
    { CurrentNode : int; 
      // NodeId, Number of times visited
      Visits : Map<int, float> }
    member this.visitsTo nodeId =
        match Map.tryFind nodeId this.Visits with
        | None -> 0.0
        | Some(v) -> v
    member this.addVisitTo nodeId = 
        { this with Visits = this.Visits |> Map.add nodeId (this.visitsTo nodeId + 1.0)}
    member this.visit nodeId =
        { this with CurrentNode = nodeId }.addVisitTo nodeId
    member this.getPageRankEstimate () =
        let totalVisits = 
            this.Visits |> Map.toSeq |> Seq.map (fun (k, v) -> v) |> Seq.sum |> float
        this.Visits |> Map.map (fun k v -> v / totalVisits)

I have defined a few functions on the WalkerState. As records are immutable, operations that change the state of the random walker, like visiting a new page, return a new WalkerState object.

Now it is time to define the algorithm that walks the graph. I provide a list of links, the set of teleport nodes, the teleport probability (between 0 and 1), the number of steps to take before reporting the result, and finally I also pass in/inject the random number generator as I might want to have more control on it (like control its seed).

let walkGraphWithTeleportations (rnd : Random) (links : Link list) (teleports : TeleportNode list) 
                                (teleportProbability : float) (steps : int) = 

    // Discover all the nodes from the list of links.
    let allNodes = Seq.append (links |> Seq.map (fun x -> x.From)) 
                              (links |> Seq.map (fun x -> x.To))
                   |> Set.ofSeq
                   |> List.ofSeq
    let nodeCount = allNodes |> List.length |> float

    let teleportNodes = teleports |> List.map (fun x -> x.Id)
    // Store the weights in a map so that you can look them up efficiently.
    let teleportWeights = teleports |> List.map (fun x -> (x.Id, x.Weight)) |> Map.ofList

    // A Map of all linked pages from a given page.
    let linksFrom = links |> Seq.groupBy (fun x -> x.From)
                          |> Seq.map (fun (from, tos) -> 
                              (from, tos 
                                     |> Seq.map (fun x -> x.To) 
                                     |> Seq.toList))
                          |> Map.ofSeq

    let randomTeleportNode rnd = 
        randomWeightedChoice rnd (fun i -> teleportWeights.[i]) teleportNodes

    let takeSingleStep (rnd : Random) (walker : WalkerState) =
        let neighbours = 
            match linksFrom.TryFind(walker.CurrentNode) with
            | None -> []
            | Some(ns) -> ns
        let nextNode = 
            if (List.isEmpty neighbours) || rnd.NextDouble() < teleportProbability
            then
                randomTeleportNode rnd
            else
                randomChoice rnd neighbours
        walker.visit nextNode

    let rec loop (times : int) (rnd : Random) (walker : WalkerState) =
        if times <= 0 then walker
        else
            let newWalker = takeSingleStep rnd walker
            loop (times - 1) rnd newWalker

    let startingNode = randomTeleportNode rnd
    let initialWalkerState = { CurrentNode = startingNode; Visits = Map.empty }

    let finalWalkerState = loop steps rnd initialWalkerState

    allNodes 
    |> List.fold (fun acc node -> 
           // Add 0.0 PageRank for pages not visited but they are part of the graph.
           if acc |> Map.containsKey node then acc else acc |> Map.add node 0.0) 
        (finalWalkerState.getPageRankEstimate ())

The stopping condition here is just reaching the specified number of steps, and it should be good enough assuming I use a number big enough to reach the stable solution.

Note that, to do this properly, I would perform the random walker steps in iterations. At each iteration I would measure how different the solution is. When this difference drops below some value epsilon I would stop. For measuring the difference I would need some measure of distance between two solutions, which could simply be the sum over nodes of the square of the difference in PageRank (between the values in each solution).

As an example I are going to use the following graph

Web page graph

The links in these graph are:

let links = 
    [ { From = 1; To = 3 }
      { From = 1; To = 4 }
      { From = 2; To = 5 }
      { From = 2; To = 6 }
      { From = 3; To = 7 }
      { From = 4; To = 7 }
      { From = 4; To = 8 }
      { From = 5; To = 9 }
      { From = 6; To = 9 }
      { From = 6; To = 10 }
      { From = 8; To = 9 }
      { From = 8; To = 11 }
      { From = 9; To = 8 }
      { From = 9; To = 11 }
      { From = 11; To = 7 }
      { From = 11; To = 10 } ]

First I use a single page {1} as the set of teleport nodes

let teleportNodes = [ { Id = 1; Weight = 1.0 } ]

and I run the simulation for a million steps with teleport probability 0.25

let rnd = new Random(7919)
let teleportProbability = 0.25
let steps = 1000000
let result = walkGraphWithTeleportations rnd links teleportNodes teleportProbability steps

result
|> Map.toSeq 
|> Seq.iter (fun (node, pr) -> printfn "%d -> %f" node pr)

The output I get is

1 -> 0.392301
2 -> 0.000000
3 -> 0.146919
4 -> 0.147492
5 -> 0.000000
6 -> 0.000000
7 -> 0.178075
8 -> 0.064714
9 -> 0.024405
10 -> 0.012689
11 -> 0.033405

And depicting the PageRank with colours I get

PageRank for teleport node {1}

where PageRank from 0 to 1 corresponds to colours from dark blue (for cold) to dark red (for hot). Pages that are closest to the page 1 become relevant and get higher PageRank than those further away, no matter how linked the latter are.

For comparison, if I add all the nodes to the teleport set I are going to get the standard PageRank algorithm. Running the same code with the modification

let teleportNodes = 
    [1 .. 11] 
    |> List.map (fun i -> { Id = i; Weight = 1.0 })

I get the output

1 -> 0.041582
2 -> 0.041389
3 -> 0.057081
4 -> 0.057222
5 -> 0.057457
6 -> 0.056575
7 -> 0.160136 
8 -> 0.118842
9 -> 0.150491
10 -> 0.116490
11 -> 0.142735

which looks like this

PageRank for teleport nodes all

where most linked pages get the highest PageRank.

Next I set the teleport nodes to {2}

let teleportNodes = [ { Id = 2; Weight = 1.0 } ]

and the output I get is

1 -> 0.000000
2 -> 0.332265
3 -> 0.000000
4 -> 0.000000 
5 -> 0.124476
6 -> 0.124816
7 -> 0.031238
8 -> 0.060944
9 -> 0.163396
10 -> 0.078754
11 -> 0.084111

which corresponds to the graph

PageRank for teleport node {2}

Now, I use {1,8} as the teleport set and assign weight to 8 three times bigger than the one of 8. The weight in the two previous examples was irrelevant because I had a single node, but this time it makes a difference.

let teleportNodes = [ { Id = 1; Weight = 1.0 }
                      { Id = 8; Weight = 3.0 } ]

The output we get is

1 -> 0.095392
2 -> 0.000000
3 -> 0.035732
4 -> 0.035732
5 -> 0.000000
6 -> 0.000000
7 -> 0.107830
8 -> 0.347664
9 -> 0.130148
10 -> 0.067629
11 -> 0.179873

and I see that pages next to node 8 get higher PageRank than those next to 1.

PageRank for teleport nodes {1,8}

Finally, I swap node 1 with 2 in the teleport set

let teleportNodes = [ { Id = 2; Weight = 1.0 }
                      { Id = 8; Weight = 3.0 } ]

and the output I get is

1 -> 0.000000
2 -> 0.091306
3 -> 0.000000
4 -> 0.000000
5 -> 0.034185
6 -> 0.034412
7 -> 0.070448
8 -> 0.335009
9 -> 0.163838
10 -> 0.083237
11 -> 0.187565

which looks like this

PageRank for teleport nodes {2,8}

 

comments powered by Disqus