8 Link Analysis

PageRank: link analysis

Early web search:
extract keywords --> match against queries --> pages ranked by occurrences of keywords

Term Spam (TS):
disguise a page as something it is not about

PageRank vs. TS:
Rank pages by linkage too.
How many pages point to a page = # of incoming links => importance
Spammers respond with link spam (fake incoming pointers)

Manually identify some important pages and assign them page ranks, then let ranks propagate to other nodes.

Random surfer model 
A random surfer starts from any page and follows its outgoing links at random
(both where the surfer comes from and where it goes next are random)
PageRank = Prob(a random surfer lands on the page)
More robust than the manual approach: a collective voting scheme

Limiting distribution 
Exists when the graph is strongly connected: every node can reach any other node in the graph.
In particular, the graph cannot have dead ends or spider traps.
Example graph: A ---> B, C, D;  B ---> A, D;  C ---> A;  D ---> B, C
nodes = pages, edges = hyperlinks between pages


Dead end:
A page with no outgoing edges (a single node can be one)
Absorbs PageRank and sinks it
PageRank -> 0 for any page that can reach the dead end (including the dead end itself)
Example: A -> B, C, D;  B -> A, D;  C;  D -> B, C. Any node can reach C, but C has no outgoing pointer.


Spider trap
A group of pages with no edges going out of the group
Absorbs all PageRank and circulates it within the group
Rank (of spider-trap nodes) -> 1, others -> 0
The surfer cannot leave once trapped
A trap can have 1+ nodes

Example: A -> B, C, D;  B -> A, D;  C -> C;  D -> B, C. Wherever the surfer starts, it eventually reaches C (e.g., A -> B -> D -> C -> C -> ...) and loops there forever.

Transition Matrix
M
M[i][j] = Prob(j -> i); column index = source page, row index = destination page


Prob(location of surfer):
a column vector v
whose entries sum to 1
Next distribution: v1 = Mv0, v2 = Mv1, ..., vi = Mv(i-1)
Stabilizes: v = Mv from some step k on; v does not change any more

Mv = v  
values in each column of M sum to 1, as do the entries of v; M is a stochastic matrix
The largest eigenvalue of a stochastic matrix is 1,
and the corresponding eigenvector is the principal eigenvector
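
A minimal numpy sketch (mine, not from the notes) that builds the column-stochastic M for the example graph above and takes one surfer step:

import numpy as np

# Transition matrix for A -> B,C,D; B -> A,D; C -> A; D -> B,C
# (column = source page, row = destination page)
#             A    B    C    D
M = np.array([[0,   1/2, 1,   0],    # A
              [1/3, 0,   0,   1/2],  # B
              [1/3, 0,   0,   1/2],  # C
              [1/3, 1/2, 0,   0]])   # D
assert np.allclose(M.sum(axis=0), 1)   # stochastic: every column sums to 1

v0 = np.full(4, 1/4)   # surfer equally likely to start anywhere
v1 = M @ v0            # next distribution; entries still sum to 1
print(v1)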

Methods 
1. Power iteration: start with an initial v, repeatedly multiply by M, stop when v' ≈ v (see the sketch below)
2. Gaussian elimination: (M − I)v = 0 ==> reduce M − I to an upper triangular matrix ==> solve the linear system (diagonal elements -> 1, entries below the diagonal -> 0) PRACTICE
3. Eigenvector: Mv = λv ---> (M − λI)v = 0 ---> det(M − λI) = 0 PRACTICE
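
A rough check of methods 1 and 3 on the same example matrix (a sketch; numpy assumed):

import numpy as np

M = np.array([[0, 1/2, 1, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])

# Method 1: power iteration until v stops changing
v = np.full(4, 1/4)
for _ in range(1000):
    v_next = M @ v
    if np.allclose(v_next, v, atol=1e-12):
        break
    v = v_next

# Method 3: eigen-decomposition; take the eigenvector for eigenvalue 1
vals, vecs = np.linalg.eig(M)
principal = np.real(vecs[:, np.argmax(np.real(vals))])
principal /= principal.sum()          # rescale so the entries sum to 1

print(np.allclose(v, principal))      # True; both give (1/3, 2/9, 2/9, 2/9)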

Laplace expansion
A determinant can be expanded along any row or any column; an n×n matrix has 2n possible expansions.
Let B be an n×n matrix. M_ij is the order-(n−1) minor of B for row i and column j (delete row i and column j, take the determinant of what remains).
Cofactor: C_ij is the product of the minor M_ij and (−1)^(i+j).
det(B) = b_i1*C_i1 + ... + b_in*C_in   (expansion along row i)
det(B) = b_1j*C_1j + ... + b_nj*C_nj   (expansion along column j)
I.e., sum the products of the elements of one row (or column) with their cofactors; the sign (−1)^(i+j) is already carried by the cofactor.
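
A short recursive sketch of the expansion along the first row (illustration only; for real work use numpy.linalg.det):

def det(B):
    """Determinant of a square matrix by Laplace expansion along row 0."""
    n = len(B)
    if n == 1:
        return B[0][0]
    total = 0
    for j in range(n):
        minor = [row[:j] + row[j+1:] for row in B[1:]]  # delete row 0, column j
        total += (-1) ** j * B[0][j] * det(minor)       # cofactor sign (-1)^(0+j)
    return total

print(det([[1, 2], [3, 4]]))   # 1*4 - 2*3 = -2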

Taxation

Spider trap vs. dead end

Spider Trap:
C's rank is absorbed but nothing is lost from the system; M is still stochastic.

Dead end: M becomes sub-stochastic; the values in each column add up to at most 1 (exactly 0 for the dead end's column).
C's rank never contributes to new ranks; it is simply lost.
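
The contrast is easy to see numerically (a sketch using the dead-end and spider-trap variants of the example graph):

import numpy as np

# Dead end: C has no out-links, so column C is all zeros (sub-stochastic)
M_dead = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
                   [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])
# Spider trap: C links only to itself, so M stays stochastic
M_trap = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
                   [1/3, 0, 1, 1/2], [1/3, 1/2, 0, 0]])

v_dead = v_trap = np.full(4, 1/4)
for _ in range(200):
    v_dead, v_trap = M_dead @ v_dead, M_trap @ v_trap

print(v_dead.sum())   # -> 0: all PageRank leaks out through the dead end
print(v_trap)         # -> (0, 0, 1, 0): the trap node C absorbs everything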

Taxation
Deals with spider traps & dead ends by allowing the surfer to jump to a random page: "teleporting"
  • v' = βMv + (1 − β)e/n
β: the damping factor, e.g., set to .8
e: an n-dimensional vector of all 1's
n: # of nodes in the graph
Tax rate = 1 − β = 20%
The surfer does a random jump 20% of the time (a sketch follows below).

This in effect makes the graph "strongly connected"
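
A minimal sketch of the taxed update (the function name is mine, not the notes'):

import numpy as np

def pagerank_taxed(M, beta=0.8, iters=200):
    """Iterate v' = beta*M*v + (1-beta)*e/n, starting from the uniform v."""
    n = M.shape[0]
    v = np.full(n, 1 / n)
    for _ in range(iters):
        v = beta * (M @ v) + (1 - beta) / n   # scalar (1-beta)/n = (1-beta)e/n
    return v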


Taxation on Dead End

  • v' = βMv + (1 − β)e/n
For the dead end, consider the loss/gain balance from the perspective of the whole network:
0.8 * C's rank is lost down the sink
0.2 * (A + B + C + D) is lost to the tax
0.2 * 1 is gained from teleporting (fixed)


When stabilized...
The rank for node C remains at .128:
C = .8*(A/3 + D/2) + .2*(1/4) = .128
The overall rank does not change either:
Loss: .8*C due to the dead end + .2*(A+B+C+D) due to taxation = C + .2*(A+B+D) = .2
Gain: .2 from teleporting

Net loss is 0 when loss & gain balance out (checked in the sketch below).
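
Checking these numbers with the taxed update (a self-contained sketch; same dead-end graph, β = .8):

import numpy as np

# Dead-end graph: A -> B,C,D; B -> A,D; C (dead end); D -> B,C
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])
v = np.full(4, 1/4)
for _ in range(200):
    v = 0.8 * (M @ v) + 0.2 / 4

A, B, C, D = v
print(round(C, 3))                        # 0.128
print(round(0.8 * C + 0.2 * v.sum(), 3))  # total loss 0.2 = teleport gain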

Taxation on Spider Trap

C keeps absorbing PageRank,
but there is no loss in total PageRank (= 1).

Consider gains and losses from the perspective of the absorbing node C.
Taxation effect on C:
Loses 20% of its PageRank: the (1 − β) tax
Gains 80% of the contributions from A and D: βMv
Gains 1/4 of the teleport contribution: (1 − β)e/n = 20% * 1/4, fixed
  • v' = βMv + (1 − β)e/n



At stabilization, 80% of the contributions from A & D + 1/4 of the teleport contribution = 20% of C's rank
(i.e., its taxation loss), as the sketch below confirms.
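
The matching check for the spider trap (C -> C keeps M stochastic):

import numpy as np

# Spider-trap graph: A -> B,C,D; B -> A,D; C -> C; D -> B,C
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 1, 1/2], [1/3, 1/2, 0, 0]])
v = np.full(4, 1/4)
for _ in range(200):
    v = 0.8 * (M @ v) + 0.2 / 4

A, B, C, D = v
print(round(v.sum(), 6))                     # 1.0: total PageRank preserved
print(np.isclose(0.2 * C,                    # C's taxation loss equals
                 0.8 * (A/3 + D/2) + 0.05))  # its gains from A, D, teleport: True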

PageRank in Spark 


bin/spark-submit pagerank.py pagerank_data.txt 10 

pagerank_data.txt 

1 2
1 3 
1 4
2 1 
3 1
4 1 
output
1 has rank: 1.73800730412.
4 has rank: 0.753997565294. 
3 has rank: 0.753997565294. 
2 has rank: 0.753997565294. 

code: 
from __future__ import print_function
import re
import sys
from operator import add
from pyspark.sql import SparkSession

def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)


def parseNeighbors(urls):
    """Parses a urls pair string into urls pair."""
    parts = re.split(r'\s+', urls)
    return parts[0], parts[1]


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: pagerank <file> <iterations>", file=sys.stderr)
        exit(-1)

    print("WARN: This is a naive implementation of PageRank and is given as an example!\n" +
          "Please refer to PageRank implementation provided by graphx",
          file=sys.stderr)

    # Initialize the spark context.
    spark = SparkSession\
        .builder\
        .appName("PythonPageRank")\
        .getOrCreate()

    # Loads in input file. It should be in format of:
    #     URL         neighbor URL
    #     URL         neighbor URL
    #     URL         neighbor URL
    #     ...
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    # Loads all URLs from input file and initialize their neighbors.
    links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()

    # Initializes the rank of every URL that has out-links to one.
    ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))

    # Calculates and updates URL ranks continuously using PageRank algorithm.
    for iteration in range(int(sys.argv[2])):
        # Calculates URL contributions to the rank of other URLs.
        contribs = links.join(ranks).flatMap(
            lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))

        # Re-calculates URL ranks based on neighbor contributions.
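        # Note: this example taxes with beta = 0.85 and adds a constant 0.15
        # (not (1-beta)/n as in the formula above), so ranks sum to n, not 1.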
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    # Collects all URL ranks and dump them to console.
    for (link, rank) in ranks.collect():
        print("%s has rank: %s." % (link, rank))

    spark.stop()

HITS Algorithm 
Hyperlink-Induced Topic Search
• Each page has 2 scores
– Authority: pointed to by many good hub pages <---
– Hub: pointing to many good authority pages --->
• E.g.,
– Hub: a department page listing all courses

– Authority: individual course pages


Link/Adjacency Matrix L

L^T is the un-normalized M (transition matrix)
h = λLa
a = μL^T h


  1. Start with h = a vector of all 1's
  2. Compute a = L^T h, then scale it so that the largest component is 1
  3. Compute h = La and scale it similarly
  4. Repeat steps 2-3 until h and a do not change much

The power iteration approach converges:
h to the principal eigenvector of LL^T,
a to the principal eigenvector of L^T L
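
A compact sketch of this loop (assuming a small 0/1 link matrix L with L[i][j] = 1 when page i links to page j):

import numpy as np

def hits(L, iters=50):
    """HITS power iteration: a = L^T h, h = L a, each scaled so max = 1."""
    h = np.ones(L.shape[0])
    for _ in range(iters):
        a = L.T @ h
        a = a / a.max()   # scale so the largest component is 1
        h = L @ a
        h = h / h.max()
    return h, a

# Toy graph: page 0 links to pages 1 and 2; page 3 links to page 1
L = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
h, a = hits(L)
print(h, a)   # page 0 is the strongest hub, page 1 the strongest authority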

Nodes U, V, W (with no incoming edges) get Authority = 0, and it stays 0




