10 Adwords model and Graph mining

背景

搜索应用往往从web advertising受益颇多

搜索广告的有效性源于:将搜索查询广告进行匹配matching的adwords模型
关注广告匹配的优化算法,贪心算法,online algorithms

另一个有趣问题是,在线商店里选择广告商品,collaboratve filtering技术,发现行为相似的用户向用户推荐商品,推荐系统

简化模型

1. 匹配:简化模型,搜索查询和投标词语完全一致,广告显示
google yahoo microsoft 为广告商提供broad matching功能

2.基于点击的付费:简化模型,当用户点击广告,广告商必须按广告的出价付费,first-price auction机制,
实际模型,search engine采用second-price auction机制,每个广告上大雨按照紧随其后的出价付费

一次广告的revenue = 每个选出广告的价值之和,
一条广告的value = 对应查询的出价*广告点击率

一个online alg的绩效为一个月(time unit)的总收益

广告类型

  1. Direct placement 直接投放:ebaycaiglist,广告商免费付费或委托方式投放
  2. Display ads展示广告:广告商按每展示一次的固定费率付费,即使同意用户对网页的二次下载,也会导致一次不同的广告展示,视为二次展示
  3. Search ads搜索广告:包含在搜索结果中,广告商为某些查询进行投标bid获得在搜索结果中展示广告的权利,但他们只在广告点击的情况下付$ (impression of search ads)
  4. Online storeamazon在线商店在上下文中显示广告,在线商店选出,最大化顾客对商品感兴趣概率(recommend system

off-line algorithm

通常算法:首先准备所需的所有数据,算法以任意次序访问,最后输出结果,称为off line
有时不能获取全部数据,只保存有限stream data,有要求时必须应答整个流的查询
greedy (may only assign 1 specific query) 
optimal(non-greedy,all queries are assigned

online algorithm

Stream processing的极端形式:对每个流元素到达之后就以output形式对查询应答,对未来一无所知时对当前元素决策。
为广告分配查询,搜索到达,选择广告显示。
click-throughrate考察。当点击率低的时候,广告越less competitive

实际上,无法保证online algoffline alg的效果总是一样好
e.g. A Chesterfieldbid price 10 cents BChesterfield”,”sofa bid 20 cents separatelybudget 100 $ 
Assume case like: many sofa query, less chesterfield
Search engine want to maximize the benefits 

Greedy algorithm

最大化当前输入元素和历史信息的某个函数,对每个input元素作出决策

competitive rate

最高期望时,存在某个小于1的数c,对于任意输入,一个具体的在线算法结果至少时最优offlinec倍,c越小,在线算法的下界越小。
profit of alg / profit of optimal alg
profit = bid price * assigned number of queries <= budget

广告匹配问题

maximal matching bipartite graph
simple 1-1 matching of queries and advertisers:
  1. assume each query appear only once!
  2. profits = # of matches of bids & queries
  3. assume all bids have same value, 1$
  4. budget = # of bids placed, a: 2, b: 2, c: 1
e.g. matching = {(1,a), (2,b), (3,d)} 
profit = size of matching = # matches = 3

maximal matching: if adding more edges, it cannot be a matching
optimal matching: matching with largest size, maximize the optimal
perfect matching: every node engage in matching---must be maximal 


最大匹配的offline alg
对于n node graph,已经有接近On^2)的alg
流程:按照任意次序考虑边,考虑边(xy)若xy都不是已有匹配中边的端点,则加入,否则跳过

贪心匹配算法

--for each edge, add to matching if it can be 
consider edges arriving in some order:
which queries arrive first & which bidders are considered first for the queries
e.g. left-->right 
(1, a) (1, c)
(2, b)
(3, b) (3, d) 
(4, a) 

matching: (1, a)(2, b) (3, d) or (1, a), (3, b)
all matchings found by greedy are maximal, it cannot add any edge to be matching
every maximal matching can be found by greedy, when choose right order

Adwords问题的贪心算法

给出一个比greedy更好的算法
简化:
  1. 每条查询-一个广告
  2. 所有广告预算相等
  3. 所有广告点击率相等
  4. 出价只有0/1(每个广告价值,出价*点击率相等)
不管查询顺序如何,greedoptimum的收益比最少1/2

Minimum Competitive Ratio CR

CR of greedy 给出1/2,竞争率不大于12
|Mg|/|Mo| 1/2     CR minimum= 最坏情况
prove

L:left nodes in Mo but not in Mg
L’: left nodes in Mo and also in Mg
step 1:
|Mo| =|L| + |L’| |L| +|Mg| (since |L’| |Mg|)
step 2:
R: right nodes with edges to nodes in L  

|L| |R| (each query much have at least 1 adiverser bid on)

Nodes in R must be in Mg|R| |Mg| (note it is possible |R| < |Mg|
step 3:

|Mo| |L| +|Mg|, |L| |R|, |R| |Mg|
=>|Mo| 2|Mg|, i.e., |Mg|/|Mo| 1/2 


Many to 1 matching 

bidders can bid on multiple queries, multiple same queries 
 m - 1 right-left 

optimal offline: 3*q1-->A1.3*q2-->A2. profit = 6

greedy: each query, pick any bidder that bids on it
Worst case: q ’s assigned to A profit = 3

Balance: Assign query to bidder with largest remaining budget

  1. Each advertiser has a budget: B = N!
  2. N advertisers/bidders: A1, A2, ..., A
  3. Optimal Offline Solution: Profit = NB
  4. Advertiser Ai bids on queries: q1, q2, ..., qi, Each bid has the same amount 1$
  5. Queries arrive in N boundsi-th round has B occurrences of q
  6. Every advertiser got B/N q1’s on average  then B/N-1. B/N-2...
  7. Algorithm stops at some round j where Aj (and all Aj+1 ..., AN) run out of budget

Adwords balance贪心算法:

提供较高的竞争率
问题背景:
  1. 众多广告商为搜索查询设定的bid price集合
  2. 每个广告商查询对所对应的点击率
  3. 每个广告商的预算
  4. 每个搜索查询显广告数上限


对每个搜索查询,算法给出一系列广告上的应答结果集合,满足条件:
  1. 集合大小不超过 每个搜索查询显示广告数上限
  2. 该集合每个广告商都对本条搜索查询出价
  3. 每个广告上必须生于足够的预算为广告点击付费

Balance 3/4 competitive rate

1-1/e)随着广告商数不断增长
将查询分配给出价最高且剩余预算最多的广告上
若广告商的剩余预算相等,随意选择
下界:>= 3/4
assume每个查询都会按照最优算法结果分配相应广告商
否则,可以删除这些查询,此时最优算法收益不受影响
balance会降低
查询序列仅由最优算法分配的广告组成,获得最低竞争率

When query q arrives, competitiveness of Ai:
𝜓𝑖=𝑥𝑖 (1−𝑒−𝑓𝑖)
𝑥𝑖: bid of Ai on q
fi: fraction of Ai’s budget left 
Algorithm Balance*:Pick advertiser with largest 𝜓= largest competitiveness
Property: Balance* has competitive ratio 1-1/e = .63 
If click-through rate of an ad is low, then advertiser is less competitive for the query q
adjust:𝜓𝑖 =𝑐𝑖𝑥𝑖 (1−𝑒−𝑓𝑖)–---𝑐𝑖: click-through rate of Ai’s ad for query q 

Simrank Introduction 

Random walk on a social graph :users, tags, web pages

--Measure similarity of nodes using random walk with restart
chance that two random surfers will land on the same page 

e.g. sim(T1, T2) > sim(T1, T3)– Starting from Tmore likely to visit T

Restart:
Going back to the stating node
Limiting distribution does not depend on starting node

Limitation : computational expensive

using idea of finding similar users

Tags = sets of web pages, users--– Tags are similar if their sets are
E.g. T1 = {W1U1}; T2 = {W1, W2, W3, U1, U2} ; T3 = {W2, U1, U2} 
Jaccard(T1, T2) = 2/5 = .4
Jaccard(T1, T3) = 1⁄4 = .25

Teleporting to initial node(assume the i-th node):
from ith node
Damping factor 𝛽 
𝒗= 𝛽𝑀𝒗 + 1 − 𝛽 𝒆𝑖
ei -- vector of all 0 s except ith row element = 1

Laplacian Matrix L = D-A

D degree: how many edges on this node

A Ajacency: 


L symmetric matrix

Eigenvalue Eigenvector of L:

Lx = 𝜆x

Smallest 𝜆 = 0, every eigenvalue of 𝐿 is non-negative
 Since𝐿𝟏=0𝟏,𝟏= [11...1𝑇]

Spectral theorem 

M is a symmetric matrix, then
M has exactly n (not necessarily distinct) eigenvalues (real numbers)
their corresponding eigenvectors can be chosen to form an orthonormal basis 















评论

此博客中的热门博文

8 Link Analysis

1 Map reduce problems

NoSql and AWS DynamoDB practices