博文

目前显示的是 五月 7, 2017的博文

8 Link Analysis

图片
Pagerank—link analysis  Early web search: extract keywords-->queries match -->page ranked by  occurrences of keywords Term Spam: (TS) Disguise a page as sth it is not about  Pagerank: vs TS Rank pages by linkage too How many pages point to a page= # incoming links => importance Link spam (fade incoming pointers) Manually identify some important pages and assign them page ranks , then let ranks propagate to other nodes. Random surfer model  Random surfer start from any page follow its outgoing links randomly   来,去,都是随机的 Pagerank = Prob ( a random surfer lands on the page ) More robust than manual approach , a collective voting scheme Limiting distribution  Graph is strongly connected , a node can reach any other node in the graph But cannot have dead end & spider traps A--->B, C, D;    B--->A, D,(A,D->C);     C--->A(->B,C,D);   ...

1 Map reduce problems

图片
PROBLEMS Join   Matrix multiplication 2 / 1 phase  Matrix vector multiplication  POINTS Map reduce input output Memory requirement  Reducer size  Communication cost P1 Join:  R ( A , B )  S ( A , C )   1. Map r ( a , b ) => ( a , ( 'R' , b ));  s ( a , c ) => ( a , ( 'S' , c ))   2. Reduce ( a , [( 'R' , b ), ( 'S' , c1 ), ( 'S' , c2 )]) => ( a , ( b , c1 )), ( a , ( b , c2 ))   Dangling tuples: Key with values from only 1 relation  ( a , [( 'R' , b )]) => left dangling ; ( a , [( 'S' , c )]) => right dangling  P2 Matrix-multiplication  C mXn = A mXp B pXn  2 phase: a. phase1  product  Map : Extract key from A ’ s c , B ’ s r A [ i , k ] => ( k , ( 'A' , i , A [ i , k ])) ; B [ k , j ] => ( k , ( 'B' , j , B [ k , j ]))   Communication cost : mp + np  Reduce...

3 Mining transactional data

图片
Mining transactional data —frequent item-sets chapter 6  Frequent itemsets: sup ( I ) >= s Support of I: sup ( I ) = # baskets that contains all items in I Support ratio = sup ( I ) /n n = # baskets Maximally frequent:   I is frequent , none of I ’ s supersets is frequent Applications:  1. Baskets—transactions// documents 2. Items—products// words 3. Frequent item sets: products frequently bought together//  Words that occur together frequently in many documents Association rules: I —>j ( I: a set of items , J: an item ) I: a set of items , j: an item If all of items in I appear in some baskets , then j likely appears in that baskets too Confidence ( I->j ) = sup ( I U{j} ) /sup ( I ) Finding rules with high confidence: I —>j: both of sup ( I U{j} ) and sup ( I ) should be high  For frequent item sets For each of them j , determine if J-{j}—>j has high confidence  for j belongs t...