8 Link Analysis
Pagerank—link analysis
Early web search:
extract keywords from pages --> match keywords against queries --> rank pages by occurrences of the query keywords
Term Spam (TS):
Disguise a page as something it is not about
PageRank vs. TS:
Rank pages by linkage too
How many pages point to a page = # incoming links => importance
New problem: link spam (faking incoming pointers)
Manually identify some important pages and assign them page ranks, then let ranks propagate to other nodes.
Random surfer model
A random surfer starts from any page and follows its outgoing links at random
Both the starting page and each next page are chosen at random
PageRank = Prob(a random surfer lands on the page)
More robust than the manual approach; a collective voting scheme
Limiting distribution
Exists if the graph is strongly connected: every node can reach any other node in the graph
I.e., the graph has no dead ends & no spider traps
Example (strongly connected): A ---> B, C, D; B ---> A, D (reaches C via A or D); C ---> A (reaches B, C, D via A); D ---> B, C (reaches A via B)
nodes = pages, edges = hyperlinks btw pages
Dead end:
A node (page) with no outgoing edges
Absorbs PageRank and sinks it
PageRank -> 0 for any page that can reach the dead end (including itself)
Example: A -> B, C, D; B -> A, D; C (no out-links); D -> B, C – every node can reach C, but C has no outgoing pointer
Spider trap
A group of pages with no edges going out of the group
Absorbs all PageRank, which then circulates within the group
Rank (of the spider-trap nodes) -> 1, others -> 0
The surfer cannot leave once trapped
Can have 1+ nodes
Example: no matter which node the surfer starts from, it eventually ends up at C
e.g., the walk A --> B --> D --> C --> C --> ... gets stuck at C
Transition Matrix
M
M[i][j] = Prob(j -> i): column index = source page j, row index = destination page i
Prob(location of surfer) is kept as a
column vector v
Next distribution: v1 = Mv0, v2 = Mv1, ..., vi = Mv(i-1)
Stabilizes: from some step k on, v = Mv, and v doesn't change any more
Mv = v
The values in each column of M sum to 1, and the values of v sum to 1; M is a stochastic matrix
The largest eigenvalue of a stochastic matrix is 1,
and the corresponding eigenvector is the principal eigenvector
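For concreteness, a small numpy sketch (not from the lecture; the graph is the strongly connected example above) that builds M and checks that it is stochastic:
import numpy as np

# Pages ordered A, B, C, D; M[i][j] = 1/out-degree(j) if j links to i, else 0.
M = np.array([
    [0,   1/2, 1, 0  ],   # into A: from B (1/2) and C (1)
    [1/3, 0,   0, 1/2],   # into B: from A (1/3) and D (1/2)
    [1/3, 0,   0, 1/2],   # into C: from A (1/3) and D (1/2)
    [1/3, 1/2, 0, 0  ],   # into D: from A (1/3) and B (1/2)
])
print(M.sum(axis=0))      # [1. 1. 1. 1.] -> every column sums to 1, stochastic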
Methods (for solving Mv = v)
1. Power iteration: start with an initial v, repeatedly multiply by M, stop when v' ≈ v (see the sketch after this list)
2. Gaussian elimination: (M-I)v = 0 ==> reduce to an upper triangular matrix ==> solve the linear system
(elements on the diagonal -> 1, below the diagonal -> 0) PRACTICE
3. Eigenvector: Mv = λv ---> (M-λI)v = 0 ----> det(M-λI) = 0 PRACTICE
Laplace expansion (for computing the determinant)
Expand the determinant along one row or one column; an n*n matrix has 2n possible expansions
Let B be an n*n matrix; Mij is the order-(n-1) minor of B obtained by deleting row i and column j
Cofactor: Cij is the product of the (i,j) minor Mij and (-1)^(i+j)
det(B) = bi1*Ci1 + ... + bin*Cin (expansion along row i)
det(B) = b1j*C1j + ... + bnj*Cnj (expansion along column j)
I.e., sum the elements of one row (or one column) times their cofactors; each term carries the sign (-1)^(i+j)
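A sketch of methods 1 and 3 on the same example matrix (numpy; a toy illustration under the graph assumed above, not the course solution):
import numpy as np

M = np.array([[0, 1/2, 1, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])

# Method 1: power iteration, starting from the uniform distribution.
v = np.ones(4) / 4
for _ in range(100):
    v_next = M @ v
    if np.allclose(v_next, v, atol=1e-12):   # stop when v' ~ v
        break
    v = v_next
print("power iteration:", v)                 # -> [1/3, 2/9, 2/9, 2/9]

# Method 3: eigenvector for the largest eigenvalue (which is 1).
vals, vecs = np.linalg.eig(M)
w = np.real(vecs[:, np.argmax(np.real(vals))])
print("eigenvector:    ", w / w.sum())       # rescaled to sum to 1; same distribution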
Taxation
Spider trap vs. dead end
Spider trap:
C's rank absorbs the others', but nothing is lost; M is still stochastic
Dead end: M becomes sub-stochastic; the values in each column add up to at most 1 (0 for the dead-end column)
C's rank never contributes to new ranks; it is simply lost
Deal with spider traps & dead ends: allow the surfer to jump to a random page – "teleporting"
– v′ = βMv + (1−β)e/n
– β, damping factor, e.g., set to .8
– e, an n-dimensional vector of all 1's
– n, # of nodes in the graph
Tax rate = 1 − β = 20%
Do a random jump 20% of the time
• This in effect makes the graph "strongly connected" (see the sketch below)
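A sketch of the taxed update on the spider-trap version of the graph, assuming the trap variant is A->B,C,D; B->A,D; C->C; D->B,C (C links only to itself) and β = .8; without taxation, C's rank would go to 1:
import numpy as np

# Assumed spider-trap variant: A->B,C,D; B->A,D; C->C; D->B,C.
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 1, 1/2], [1/3, 1/2, 0, 0]])
beta, n = 0.8, 4
v = np.ones(n) / n
for _ in range(100):
    v = beta * (M @ v) + (1 - beta) / n   # v' = beta*M*v + (1-beta)*e/n
print(v.round(3))   # C gets the largest share (~.64 for this graph) but no longer absorbs everything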
Taxation on Spider Trap
– v′ = βMv + (1−β)e/n
For the dead end, consider the loss/gain balance over the whole net:
0.8 * C's rank is lost = sunk into the dead end ⤵️
0.2 * (A + B + C + D) = tax ⬇️
0.2 * 1 gained from teleporting – fixed
When stabilized...
• Rank for node C remains .128:
– C = .8*(A/3 + D/2) + .2*(1/4) = .128
• Overall rank does not change either:
– Loss: .8*C due to the dead end + .2*(A+B+C+D) due to taxation = C + .2*(A+B+D) = .2
– Gain: .2 from teleporting
Net loss will be 0 when loss & gain balance out (checked numerically in the sketch below)
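A sketch that checks these numbers, assuming the dead-end variant of the graph is A->B,C,D; B->A,D; D->B,C with C having no out-links (so the column of M for C is all zeros), and β = .8:
import numpy as np

# Assumed dead-end variant: column for C is all zeros, so M is sub-stochastic.
M = np.array([[0, 1/2, 0, 0], [1/3, 0, 0, 1/2],
              [1/3, 0, 0, 1/2], [1/3, 1/2, 0, 0]])
beta, n = 0.8, 4
v = np.ones(n) / n
for _ in range(200):
    v = beta * (M @ v) + (1 - beta) / n
A, B, C, D = v
print(round(C, 3))                          # ~ .128
loss = beta * C + (1 - beta) * v.sum()      # rank sunk at C + tax
gain = (1 - beta) * 1                       # teleporting adds .2 in total
print(round(loss, 3), round(gain, 3))       # both ~ .2, so the net loss is 0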
Output (from running the Spark code below):
• 1 has rank: 1.73800730412.
• 4 has rank: 0.753997565294.
• 3 has rank: 0.753997565294.
• 2 has rank: 0.753997565294.
code:
from __future__ import print_function
import re
import sys
from operator import add

from pyspark.sql import SparkSession


def computeContribs(urls, rank):
    """Calculates URL contributions to the rank of other URLs."""
    num_urls = len(urls)
    for url in urls:
        yield (url, rank / num_urls)


def parseNeighbors(urls):
    """Parses a urls pair string into a urls pair."""
    parts = re.split(r'\s+', urls)
    return parts[0], parts[1]


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: pagerank <file> <iterations>", file=sys.stderr)
        sys.exit(-1)

    print("WARN: This is a naive implementation of PageRank and is given as an example!\n" +
          "Please refer to the PageRank implementation provided by graphx",
          file=sys.stderr)

    # Initialize the Spark session.
    spark = SparkSession\
        .builder\
        .appName("PythonPageRank")\
        .getOrCreate()

    # Loads the input file. It should be in the format:
    #     URL    neighbor URL
    #     URL    neighbor URL
    #     ...
    lines = spark.read.text(sys.argv[1]).rdd.map(lambda r: r[0])

    # Loads all URLs from the input file and initializes their neighbor lists.
    links = lines.map(lambda urls: parseNeighbors(urls)).distinct().groupByKey().cache()

    # Initializes the rank of every URL that has out-links to one.
    ranks = links.map(lambda url_neighbors: (url_neighbors[0], 1.0))

    # Calculates and updates URL ranks iteratively using the PageRank algorithm.
    for iteration in range(int(sys.argv[2])):
        # Calculates URL contributions to the rank of other URLs.
        contribs = links.join(ranks).flatMap(
            lambda url_urls_rank: computeContribs(url_urls_rank[1][0], url_urls_rank[1][1]))

        # Re-calculates URL ranks based on neighbor contributions.
        ranks = contribs.reduceByKey(add).mapValues(lambda rank: rank * 0.85 + 0.15)

    # Collects all URL ranks and dumps them to the console.
    for (link, rank) in ranks.collect():
        print("%s has rank: %s." % (link, rank))

    spark.stop()
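To try the script (the file name and iteration count below are just placeholders), pass a whitespace-separated edge list and an iteration count, e.g. spark-submit pagerank.py links.txt 10. Note that this example uses the simplified update rank * 0.85 + 0.15, i.e., it adds the teleport share per page rather than as (1−β)e/n, and it does not treat dead ends specially.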
HITS Algorithm
• Hyperlink-Induced Topic Search
• Each page has 2 scores
– Authority: <--- pointed to by many good hub pages
– Hub: points to many good authority pages --->
• E.g.
– Hub: department pages listing all courses
– Authority: individual course pages
Link/Adjacency Matrix
L is the link/adjacency matrix: L[i][j] = 1 if page i links to page j; Lᵀ is the un-normalized M (transition matrix)
h = λLa, a = μLᵀh
– Start with h = a vector of all 1's
– Compute a = Lᵀh, then scale it so that the largest component is 1
– Compute h = La and scale it the same way
– Repeat steps 2, 3 until h and a do not change much (see the sketch at the end of this section)
Power iteration approach converges
– h into the principal eigenvector of LLᵀ
– a into the principal eigenvector of LᵀL
U, V, W have no incoming edges
=> their authority = 0, and stays so
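A minimal numpy sketch of the hub/authority iteration on an assumed toy graph (pages 0-3; L[i][j] = 1 if page i links to page j). Page 3 has no incoming edges, so, as noted above, its authority stays 0:
import numpy as np

# Assumed toy link matrix: 0->1,2; 1->2; 2->0; 3->2. No page links to page 3.
L = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])

h = np.ones(4)            # step 1: h = all 1's
for _ in range(50):
    a = L.T @ h           # step 2: a = L^T h ...
    a = a / a.max()       # ... scaled so the largest component is 1
    h = L @ a             # step 3: h = L a, scaled the same way
    h = h / h.max()
print("authority:", a.round(3))   # a[3] stays 0 (no incoming edges)
print("hub      :", h.round(3))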