Problem Detail: I have a weighted directed graph (it’s sparse, 35,000 nodes and 19 million edges) and would like to calculate similarity scores for pairs of nodes. SimRank would be ideal for this purpose, except that it applies to unweighted graphs. It’s easy to adapt the matrix representation of SimRank to use weighted edges: $mathbf{S} = max{C cdot (mathbf{A}^T cdot mathbf{S} cdot mathbf{A}),mathbf{I}}$ Just replace the entries in the adjacency matrix $mathbf{A}$ with the edge weights. But I don’t see how to derive the “iteration to a fixed-point” method from this formula. $s_{k+1}(a, b) = frac{C}{left|I(a)right|left|I(b)right|} cdot sum_{i in I(a)} sum_{j in I(b)} s_{k}(i, j)$ Is the following method correct? $s_{k+1}(a, b) = C cdot sum_{i in I(a)} sum_{j in I(b)} s_{k}(i, j) cdot w(i, a) cdot w(j, b)$ My reasoning is that the original method is equivalent to this: $s_{k+1}(a, b) = C cdot sum_{i in I(a)} sum_{j in I(b)} s_{k}(i, j) cdot frac{1}{left|I(a)right|} cdot frac{1}{left|I(b)right|}$ so I replaced the trivially equal weights with the actual ones I have. Confirmation would be nice, though. If there is a superior method to SimRank, I would like to hear about it. (I’m concerned with improvements to accuracy, not performance; this is a small data set.) Edit: A different similarity metric that’s nearly as accurate but performs a lot better would actually be useful too. My initial tests of SimRank are very slow, and the potential optimizations are not simple to code.
Asked By : Remy
Answered By : Remy
The paper “SimRank++: Query Rewriting through Link Analysis of the Click Graph” describes a weighted SimRank algorithm that is similar to my attempt. It uses this formula to update similarity scores: $s_{k+1}(a, b) = text{evidence}(a, b) cdot C cdot sum_{i in I(a)} sum_{j in I(b)} s_{k}(i, j) cdot w(i, a) cdot w(j, b) cdot text{spread}(i) cdot text{spread}(j)$ $text{evidence}(a, b) = sum_{i=1}^{left|I(a) cap I(b)right|} frac{1}{2^i}$ $text{spread}(i) = e^{-text{Var}({w(i, a) ~|~ a in O(i)})}$ Note that the weights should be normalized to sum to 1, which I was already doing. SimRank++ also incorporates evidence scores (which account for the quantity of incoming edges) and spread (which accounts for variance of outgoing edge weights).
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/44098