10 Adwords model and Graph mining
背景
搜索应用往往从web advertising受益颇多
搜索广告的有效性源于:将搜索查询和广告进行匹配matching的adwords模型
关注广告匹配的优化算法,贪心算法,online algorithms
另一个有趣问题是,在线商店里选择广告商品,collaboratve filtering技术,发现行为相似的用户向用户推荐商品,推荐系统
简化模型
1. 匹配:简化模型,搜索查询和投标词语完全一致,广告显示
google yahoo microsoft 为广告商提供broad matching功能
2.基于点击的付费:简化模型,当用户点击广告,广告商必须按广告的出价付费,first-price auction机制,
实际模型,search engine采用second-price auction机制,每个广告上大雨按照紧随其后的出价付费
一次广告的revenue = 每个选出广告的价值之和,
一条广告的value = 对应查询的出价*广告点击率
一个online alg的绩效为一个月(time unit)的总收益
广告类型
- Direct placement 直接投放:ebay,caiglist,广告商免费付费或委托方式投放
- Display ads展示广告:广告商按每展示一次的固定费率付费,即使同意用户对网页的二次下载,也会导致一次不同的广告展示,视为二次展示
- Search ads搜索广告:包含在搜索结果中,广告商为某些查询进行投标bid获得在搜索结果中展示广告的权利,但他们只在广告点击的情况下付$ (impression of search ads)
- Online store:amazon在线商店在上下文中显示广告,在线商店选出,最大化顾客对商品感兴趣概率(recommend system)
off-line algorithm
通常算法:首先准备所需的所有数据,算法以任意次序访问,最后输出结果,称为off line
有时不能获取全部数据,只保存有限stream data,有要求时必须应答整个流的查询
greedy (may only assign 1 specific query)
optimal(non-greedy,all queries are assigned)
online algorithm
Stream processing的极端形式:对每个流元素到达之后就以output形式对查询应答,对未来一无所知时对当前元素决策。
为广告分配查询,搜索到达,选择广告显示。
click-throughrate考察。当点击率低的时候,广告越less competitive
实际上,无法保证online alg和offline alg的效果总是一样好
e.g. A “Chesterfield”bid price 10 cents, B”Chesterfield”,”sofa” bid 20 cents separately,budget 100 $
Assume case like: many sofa query, less chesterfield
Search engine want to maximize the benefits
Greedy algorithm
最大化当前输入元素和历史信息的某个函数,对每个input元素作出决策competitive rate
最高期望时,存在某个小于1的数c,对于任意输入,一个具体的在线算法结果至少时最优offline的c倍,c越小,在线算法的下界越小。
profit of alg / profit of optimal alg
profit = bid price * assigned number of queries <= budget
广告匹配问题
maximal matching bipartite graph
simple 1-1 matching of queries and advertisers:
- assume each query appear only once!
- profits = # of matches of bids & queries
- assume all bids have same value, 1$
- budget = # of bids placed, a: 2, b: 2, c: 1
e.g. matching = {(1,a), (2,b), (3,d)}
profit = size of matching = # matches = 3
maximal matching: if adding more edges, it cannot be a matching
optimal matching: matching with largest size, maximize the optimal
perfect matching: every node engage in matching---must be maximal
最大匹配的offline alg
对于n node graph,已经有接近O(n^2)的alg
流程:按照任意次序考虑边,考虑边(x,y)若x,y都不是已有匹配中边的端点,则加入,否则跳过
贪心匹配算法
--for each edge, add to matching if it can be
consider edges arriving in some order:
which queries arrive first & which bidders are considered first for the queries
e.g. left-->right
(1, a) (1, c)
(2, b)
(3, b) (3, d)
(4, a)
(2, b)
(3, b) (3, d)
(4, a)
matching:
(1, a)(2, b)
(3, d) or (1, a), (3, b)
all matchings found by greedy are maximal, it cannot add any edge to be matching
every maximal matching can be found by greedy, when choose right order
every maximal matching can be found by greedy, when choose right order
Adwords问题的贪心算法
给出一个比greedy更好的算法
简化:
- 每条查询-一个广告
- 所有广告预算相等
- 所有广告点击率相等
- 出价只有0/1(每个广告价值,出价*点击率相等)
不管查询顺序如何,greed/optimum的收益比最少1/2
Minimum Competitive Ratio CR
CR of greedy 给出1/2,竞争率不大于1/2
|Mg|/|Mo| ≥ 1/2 CR minimum= 最坏情况
prove
prove
L:left nodes in Mo but not in Mg
L’: left nodes in Mo and also in Mg
step 1:
|Mo| =|L| + |L’| ≤ |L| +|Mg| (since |L’| ≤|Mg|)
step 2:
R: right nodes with edges to nodes in L
L’: left nodes in Mo and also in Mg
step 1:
|Mo| =|L| + |L’| ≤ |L| +|Mg| (since |L’| ≤|Mg|)
step 2:
R: right nodes with edges to nodes in L
|L| ≤ |R| (each query much have at least 1 adiverser bid on)
Nodes in R must be in Mg. ⇒|R| ≤ |Mg| (note it is possible |R| < |Mg|)
step 3:
step 3:
|Mo| ≤ |L| +|Mg|, |L| ≤ |R|, |R| ≤ |Mg|
=>|Mo| ≤ 2|Mg|, i.e., |Mg|/|Mo| ≥ 1/2
Many to 1 matching
bidders can bid on multiple queries, multiple same queries
m - 1 right-left
optimal offline: 3*q1-->A1.3*q2-->A2. profit = 6
greedy: each query, pick any bidder that bids on it
Worst case: q ’s assigned to A
1 profit = 3
Balance: Assign query to bidder with largest remaining budget
Balance: Assign query to bidder with largest remaining budget
- Each advertiser has a budget: B = N!
- N advertisers/bidders: A1, A2, ..., AN
- Optimal Offline Solution: Profit = NB
- Advertiser Ai bids on queries: q1, q2, ..., qi, Each bid has the same amount 1$
- Queries arrive in N bounds– i-th round has B occurrences of qi
- Every advertiser got B/N q1’s on average then B/N-1. B/N-2...
- Algorithm stops at some round j where Aj (and all Aj+1 ..., AN) run out of budget
Adwords balance贪心算法:
提供较高的竞争率
问题背景:
- 众多广告商为搜索查询设定的bid price集合
- 每个广告商查询对所对应的点击率
- 每个广告商的预算
- 每个搜索查询显示广告数上限
对每个搜索查询,算法给出一系列广告上的应答结果集合,满足条件:
- 集合大小不超过 每个搜索查询显示广告数上限
- 该集合每个广告商都对本条搜索查询出价
- 每个广告上必须生于足够的预算为广告点击付费
Balance 3/4 competitive rate
(1-1/e)随着广告商数不断增长
将查询分配给出价最高且剩余预算最多的广告上
若广告商的剩余预算相等,随意选择
下界:>= 3/4
assume每个查询都会按照最优算法结果分配相应广告商
否则,可以删除这些查询,此时最优算法收益不受影响
但balance会降低
当查询序列仅由最优算法分配的广告组成,获得最低竞争率
When query q arrives, competitiveness of Ai:
–𝜓𝑖=𝑥𝑖 (1−𝑒−𝑓𝑖)
– 𝑥𝑖: bid of Ai on q
– fi: fraction of Ai’s budget left
–𝜓𝑖=𝑥𝑖 (1−𝑒−𝑓𝑖)
– 𝑥𝑖: bid of Ai on q
– fi: fraction of Ai’s budget left
• Algorithm Balance*:Pick advertiser with largest 𝜓= largest competitiveness
• Property: Balance* has competitive ratio 1-1/e = .63
• Property: Balance* has competitive ratio 1-1/e = .63
If click-through rate of an ad is low, then advertiser is less competitive for the query q
• adjust:𝜓𝑖 =𝑐𝑖𝑥𝑖 (1−𝑒−𝑓𝑖)–---𝑐𝑖: click-through rate of Ai’s ad for query q
• adjust:𝜓𝑖 =𝑐𝑖𝑥𝑖 (1−𝑒−𝑓𝑖)–---𝑐𝑖: click-through rate of Ai’s ad for query q
Simrank Introduction
Random walk on a social graph :users, tags, web pages
--Measure similarity of nodes using random walk with restart
chance that two random surfers will land on the same page
e.g. sim(T1, T2) > sim(T1, T3)– Starting from T1 more likely to visit T2
Restart:
Going back to the stating node
Limiting distribution does not depend on starting node
Limitation : computational expensive
using idea of finding similar users
Tags = sets of web pages, users--– Tags are similar if their sets are
E.g. T1 = {W1, U1}; T2 = {W1, W2, W3, U1, U2} ; T3 = {W2, U1, U2}
L symmetric matrix
Eigenvalue Eigenvector of L:
Lx = 𝜆x
Smallest 𝜆 = 0, every eigenvalue of 𝐿 is non-negative
Since𝐿𝟏=0𝟏,𝟏= [11...1𝑇]
Since𝐿𝟏=0𝟏,𝟏= [11...1𝑇]
Spectral theorem
M is a symmetric matrix, then
– M has exactly n (not necessarily distinct) eigenvalues (real numbers)
– their corresponding eigenvectors can be chosen to form an orthonormal basis
– M has exactly n (not necessarily distinct) eigenvalues (real numbers)
– their corresponding eigenvectors can be chosen to form an orthonormal basis
评论
发表评论