Stepping

博文

目前显示的是五月 7, 2017的博文

8 Link Analysis

五月 07, 2017

Pagerank—link analysis Early web search: extract keywords-->queries match -->page ranked by occurrences of keywords Term Spam: (TS) Disguise a page as sth it is not about Pagerank: vs TS Rank pages by linkage too How many pages point to a page= # incoming links => importance Link spam (fade incoming pointers) Manually identify some important pages and assign them page ranks , then let ranks propagate to other nodes. Random surfer model Random surfer start from any page follow its outgoing links randomly 来，去，都是随机的 Pagerank = Prob ( a random surfer lands on the page ) More robust than manual approach , a collective voting scheme Limiting distribution Graph is strongly connected , a node can reach any other node in the graph But cannot have dead end & spider traps A--->B, C, D; B--->A, D,(A,D->C); C--->A(->B,C,D); ...

继续阅读

1 Map reduce problems

五月 07, 2017

PROBLEMS Join Matrix multiplication 2 / 1 phase Matrix vector multiplication POINTS Map reduce input output Memory requirement Reducer size Communication cost P1 Join: R ( A , B ) S ( A , C ) 1. Map r ( a , b ) => ( a , ( 'R' , b )); s ( a , c ) => ( a , ( 'S' , c )) 2. Reduce ( a , [( 'R' , b ), ( 'S' , c1 ), ( 'S' , c2 )]) => ( a , ( b , c1 )), ( a , ( b , c2 )) Dangling tuples: Key with values from only 1 relation ( a , [( 'R' , b )]) => left dangling ; ( a , [( 'S' , c )]) => right dangling P2 Matrix-multiplication C mXn = A mXp B pXn 2 phase: a. phase1 product Map : Extract key from A ’ s c , B ’ s r A [ i , k ] => ( k , ( 'A' , i , A [ i , k ])) ; B [ k , j ] => ( k , ( 'B' , j , B [ k , j ])) Communication cost : mp + np Reduce...

继续阅读

3 Mining transactional data

五月 07, 2017

Mining transactional data —frequent item-sets chapter 6 Frequent itemsets: sup ( I ) >= s Support of I: sup ( I ) = # baskets that contains all items in I Support ratio = sup ( I ) /n n = # baskets Maximally frequent: I is frequent , none of I ’ s supersets is frequent Applications: 1. Baskets—transactions// documents 2. Items—products// words 3. Frequent item sets: products frequently bought together// Words that occur together frequently in many documents Association rules: I —>j ( I: a set of items , J: an item ) I: a set of items , j: an item If all of items in I appear in some baskets , then j likely appears in that baskets too Confidence ( I->j ) = sup ( I U{j} ) /sup ( I ) Finding rules with high confidence: I —>j: both of sup ( I U{j} ) and sup ( I ) should be high For frequent item sets For each of them j , determine if J-{j}—>j has high confidence for j belongs t...

继续阅读