8 Link Analysis

PageRank: link analysis

Early web search:
extract keywords --> match against queries --> pages ranked by occurrences of keywords

Term Spam (TS):
disguise a page as something it is not about

PageRank vs. TS:
Rank pages by linkage too.
How many pages point to a page = # of incoming links => importance
Spammers respond with link spam (fake incoming pointers)

Manually identify some important pages and assign them page ranks, then let ranks propagate to other nodes.

Random surfer model 
A random surfer starts from any page and follows its outgoing links at random
(both where the surfer comes from and where it goes next are random)
PageRank = Prob(a random surfer lands on the page)
More robust than the manual approach: a collective voting scheme

Limiting distribution 
Exists when the graph is strongly connected: every node can reach any other node in the graph.
In particular, the graph cannot have dead ends or spider traps.
Example graph: A ---> B, C, D;  B ---> A, D;  C ---> A;  D ---> B, C
nodes = pages, edges = hyperlinks between pages


Dead end:
A page with no outgoing edges (a single node can be one)
Absorbs PageRank and sinks it
PageRank -> 0 for any page that can reach the dead end (including the dead end itself)
Example: A -> B, C, D;  B -> A, D;  C;  D -> B, C. Any node can reach C, but C has no outgoing pointer.


Spider trap
A group of pages with no edges going out of the group
Absorbs all PageRank and circulates it within the group
Rank (of spider-trap nodes) -> 1, others -> 0
The surfer cannot leave once trapped
A trap can have 1+ nodes

Example: A -> B, C, D;  B -> A, D;  C -> C;  D -> B, C. Wherever the surfer starts, it eventually reaches C (e.g., A -> B -> D -> C -> C -> ...) and loops there forever.

Transition Matrix
M
M[i][j] = Prob(j -> i); column index = source page, row index = destination page


Prob(location of surfer):
a column vector v
whose entries sum to 1
Next distribution: v1 = Mv0, v2 = Mv1, ..., vi = Mv(i-1)
Stabilizes: v = Mv from some step k on; v does not change any more

Mv = v  
values in each column of M sum to 1, as do the entries of v; M is a stochastic matrix
The largest eigenvalue of a stochastic matrix is 1,
and the corresponding eigenvector is the principal eigenvector
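
A minimal numpy sketch (mine, not from the notes) that builds the column-stochastic M for the example graph above and takes one surfer step:

import numpy as np

# Transition matrix for A -> B,C,D; B -> A,D; C -> A; D -> B,C
# (column = source page, row = destination page)
#             A    B    C    D
M = np.array([[0,   1/2, 1,   0],    # A
              [1/3, 0,   0,   1/2],  # B
              [1/3, 0,   0,   1/2],  # C
              [1/3, 1/2, 0,   0]])   # D
assert np.allclose(M.sum(axis=0), 1)   # stochastic: every column sums to 1

v0 = np.full(4, 1/4)   # surfer equally likely to start anywhere
v1 = M @ v0            # next distribution; entries still sum to 1
print(v1)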

Methods 
1. Power iteration: start with an initial v, repeatedly multiply by M, stop when v' ≈ v (see the sketch below)
2. Gaussian elimination: (M − I)v = 0 ==> reduce M − I to an upper triangular matrix ==> solve the linear system (diagonal elements -> 1, entries below the diagonal -> 0) PRACTICE
3. Eigenvector: Mv = λv ---> (M − λI)v = 0 ---> det(M − λI) = 0 PRACTICE
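
A rough check of methods 1 and 3 on the same example matrix (a sketch; numpy assumed):

import numpy as np

M = np.array([[0, 1/2, 1, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])

# Method 1: power iteration until v stops changing
v = np.full(4, 1/4)
for _ in range(1000):
    v_next = M @ v
    if np.allclose(v_next, v, atol=1e-12):
        break
    v = v_next

# Method 3: eigen-decomposition; take the eigenvector for eigenvalue 1
vals, vecs = np.linalg.eig(M)
principal = np.real(vecs[:, np.argmax(np.real(vals))])
principal /= principal.sum()          # rescale so the entries sum to 1

print(np.allclose(v, principal))      # True; both give (1/3, 2/9, 2/9, 2/9)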

Laplace expansion
A determinant can be expanded along any row or any column; an n×n matrix has 2n possible expansions.
Let B be an n×n matrix. M_ij is the order-(n−1) minor of B for row i and column j (delete row i and column j, take the determinant of what remains).
Cofactor: C_ij is the product of the minor M_ij and (−1)^(i+j).
det(B) = b_i1*C_i1 + ... + b_in*C_in   (expansion along row i)
det(B) = b_1j*C_1j + ... + b_nj*C_nj   (expansion along column j)
I.e., sum the products of the elements of one row (or column) with their cofactors; the sign (−1)^(i+j) is already carried by the cofactor.
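
A short recursive sketch of the expansion along the first row (illustration only; for real work use numpy.linalg.det):

def det(B):
    """Determinant of a square matrix by Laplace expansion along row 0."""
    n = len(B)
    if n == 1:
        return B[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in B[1:]]  # delete row 0, column j
        total += (-1) ** j * B[0][j] * det(minor)       # cofactor sign (-1)^(0+j)
    return total

print(det([[1, 2], [3, 4]]))   # 1*4 - 2*3 = -2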

Taxation

Spider trap vs. dead end

Spider Trap:
C's rank is absorbed but nothing is lost from the system; M is still stochastic.

Dead end: M becomes sub-stochastic; the values in each column add up to at most 1 (exactly 0 for the dead end's column).
C's rank never contributes to new ranks; it is simply lost.
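
The contrast is easy to see numerically (a sketch using the dead-end and spider-trap variants of the example graph):

import numpy as np

# Dead end: C has no out-links, so column C is all zeros (sub-stochastic)
M_dead = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
                   [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])
# Spider trap: C links only to itself, so M stays stochastic
M_trap = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
                   [1/3, 0, 1, 1/2], [1/3, 1/2, 0, 0]])

v_dead = v_trap = np.full(4, 1/4)
for _ in range(200):
    v_dead, v_trap = M_dead @ v_dead, M_trap @ v_trap

print(v_dead.sum())   # -> 0: all PageRank leaks out through the dead end
print(v_trap)         # -> (0, 0, 1, 0): the trap node C absorbs everything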

Taxation
Deals with spider traps & dead ends by allowing the surfer to jump to a random page: "teleporting"
  • v' = βMv + (1 − β)e/n
β: the damping factor, e.g., set to .8
e: an n-dimensional vector of all 1's
n: # of nodes in the graph
Tax rate = 1 − β = 20%
The surfer does a random jump 20% of the time (a sketch follows below).

This in effect makes the graph "strongly connected"
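
A minimal sketch of the taxed update (the function name is mine, not the notes'):

import numpy as np

def pagerank_taxed(M, beta=0.8, iters=200):
    """Iterate v' = beta*M*v + (1-beta)*e/n, starting from the uniform v."""
    n = M.shape[0]
    v = np.full(n, 1 / n)
    for _ in range(iters):
        v = beta * (M @ v) + (1 - beta) / n   # scalar (1-beta)/n = (1-beta)e/n
    return v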


Taxation on Dead End

  • v' = βMv + (1 − β)e/n
For the dead end, consider the loss/gain balance from the perspective of the whole network:
0.8 * C's rank is lost down the sink
0.2 * (A + B + C + D) is lost to the tax
0.2 * 1 is gained from teleporting (fixed)


When stabilized...
The rank for node C remains at .128:
C = .8*(A/3 + D/2) + .2*(1/4) = .128
The overall rank does not change either:
Loss: .8*C due to the dead end + .2*(A+B+C+D) due to taxation = C + .2*(A+B+D) = .2
Gain: .2 from teleporting

Net loss is 0 when loss & gain balance out (checked in the sketch below).
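
Checking these numbers with the taxed update (a self-contained sketch; same dead-end graph, β = .8):

import numpy as np

# Dead-end graph: A -> B,C,D; B -> A,D; C (dead end); D -> B,C
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])
v = np.full(4, 1/4)
for _ in range(200):
    v = 0.8 * (M @ v) + 0.2 / 4

A, B, C, D = v
print(round(C, 3))                        # 0.128
print(round(0.8 * C + 0.2 * v.sum(), 3))  # total loss 0.2 = teleport gain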

Taxation on Spider Trap

C keeps absorbing PageRank,
but there is no loss in total PageRank (= 1).

Consider gains and losses from the perspective of the absorbing node C.
Taxation effect on C:
Loses 20% of its PageRank: the (1 − β) tax
Gains 80% of the contributions from A and D: βMv
Gains 1/4 of the teleport contribution: (1 − β)e/n = 20% * 1/4, fixed
  • v' = βMv + (1 − β)e/n



At stabilization, 80% of the contributions from A & D + 1/4 of the teleport contribution = 20% of C's rank
(i.e., its taxation loss), as the sketch below confirms.
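
The matching check for the spider trap (C -> C keeps M stochastic):

import numpy as np

# Spider-trap graph: A -> B,C,D; B -> A,D; C -> C; D -> B,C
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 1, 1/2], [1/3, 1/2, 0, 0]])
v = np.full(4, 1/4)
for _ in range(200):
    v = 0.8 * (M @ v) + 0.2 / 4

A, B, C, D = v
print(round(v.sum(), 6))                     # 1.0: total PageRank preserved
print(np.isclose(0.2 * C,                    # C's taxation loss equals
                 0.8 * (A/3 + D/2) + 0.05))  # its gains from A, D, teleport: True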

PageRank in Spark 


bin/spark-submit pagerank.py pagerank_data.txt 10 

pagerank_data.txt 

1 2
1 3 
1 4
2 1 
3 1
4 1 
output
1 has rank: 1.73800730412.
4 has rank: 0.753997565294. 
3 has rank: 0.753997565294. 
2 has rank: 0.753997565294. 

code: 
from __future__ import print_function
import re
import sys
from operator import add
from pyspark.sql import SparkSession

def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)


def parseNeighbors(urls):
    """Parses a urls pair string into urls pair."""
    parts = re.split(r'\s+', urls)
    return parts[0], parts[1]


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: pagerank <file> <iterations>", file=sys.stderr)
        exit(-1)

    print("WARN: This is a naive implementation of PageRank and is given as an example!\n" +
          "Please refer to PageRank implementation provided by graphx",
          file=sys.stderr)

    # Initialize the spark context.
    spark = SparkSession\
        .builder\
        .appName("PythonPageRank")\
        .getOrCreate()

    # Loads in input file. It should be in format of:
    #     URL         neighbor URL
    #     URL         neighbor URL
    #     URL         neighbor URL
    #     ...
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    # Loads all URLs from input file and initialize their neighbors.
    links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()

    # Initializes the rank of every URL that has out-links to one.
    ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))

    # Calculates and updates URL ranks continuously using PageRank algorithm.
    for iteration in range(int(sys.argv[2])):
        # Calculates URL contributions to the rank of other URLs.
        contribs = links.join(ranks).flatMap(
            lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))

        # Re-calculates URL ranks based on neighbor contributions.
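        # Note: this example taxes with beta = 0.85 and adds a constant 0.15
        # (not (1-beta)/n as in the formula above), so ranks sum to n, not 1.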
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    # Collects all URL ranks and dump them to console.
    for (link, rank) in ranks.collect():
        print("%s has rank: %s." % (link, rank))

    spark.stop()

HITS Algorithm 
Hyperlink-Induced Topic Search
• Each page has 2 scores
– Authority: pointed to by many good hub pages <---
– Hub: pointing to many good authority pages --->
• E.g.,
– Hub: a department page listing all courses

– Authority: individual course pages


Link/Adjacency Matrix L

L^T is the un-normalized M (transition matrix)
h = λLa
a = μL^T h


  1. Start with h = a vector of all 1's
  2. Compute a = L^T h, then scale it so that the largest component is 1
  3. Compute h = La and scale it similarly
  4. Repeat steps 2-3 until h and a do not change much

The power iteration approach converges:
h to the principal eigenvector of LL^T,
a to the principal eigenvector of L^T L
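
A compact sketch of this loop (assuming a small 0/1 link matrix L with L[i][j] = 1 when page i links to page j):

import numpy as np

def hits(L, iters=50):
    """HITS power iteration: a = L^T h, h = L a, each scaled so max = 1."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h
        a = a / a.max()   # scale so the largest component is 1
        h = L @ a
        h = h / h.max()
    return h, a

# Toy graph: page 0 links to pages 1 and 2; page 3 links to page 1
L = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
h, a = hits(L)
print(h, a)   # page 0 is the strongest hub, page 1 the strongest authority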

Nodes U, V, W (with no incoming edges) get Authority = 0, and it stays 0




