(Incomplete) Data Mining--Clustering Basics

Hierarchical Clustering
Cluster radius: max distance from a point to the centroid
Cluster diameter: max distance between any 2 points
Stop rules:
  1. If the diameter/radius of the next (merged) cluster > threshold
  2. If the density of the next cluster < threshold
  3. Stop at a jump in the avg. radius/diameter of the clusters

Density: # points per unit volume
Volume: some power of the diameter/radius
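A minimal NumPy sketch of these quantities for one cluster; using the data dimensionality d as the volume exponent is an assumption (any consistent power works):

```python
# Minimal sketch: radius, diameter, and density of a single cluster,
# assuming `points` is an (n, d) NumPy array of Euclidean coordinates.
import numpy as np

def cluster_stats(points):
    centroid = points.mean(axis=0)
    # Radius: max distance from any point to the centroid.
    radius = np.max(np.linalg.norm(points - centroid, axis=1))
    # Diameter: max distance between any two points.
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.max(np.linalg.norm(diffs, axis=2))
    # Density: points per unit "volume"; here volume = diameter ** d (an assumed convention).
    d = points.shape[1]
    density = len(points) / (diameter ** d) if diameter > 0 else float("inf")
    return radius, diameter, density

print(cluster_stats(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))
```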
Branches: HAC (Hierarchical Agglomerative Clustering, starting from single-node clusters) and HDC (Hierarchical Divisive Clustering)

Non-Euclidean space:
different rules for computing distance
Jaccard distance for sets
Cosine distance for vectors
Edit distance for strings
Hamming distance for bit vectors
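Minimal Python sketches of the four distances, as a reference (inputs and names are illustrative):

```python
import numpy as np

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A ∩ B| / |A ∪ B|
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(x, y) -> float:
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def edit_distance(s: str, t: str) -> int:
    # Classic dynamic-programming Levenshtein distance (single rolling row).
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[-1]

def hamming_distance(x: str, y: str) -> int:
    # Number of positions at which two equal-length bit strings differ.
    return sum(a != b for a, b in zip(x, y))

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine_distance([1, 0], [0, 1]))          # 1.0
print(edit_distance("kitten", "sitting"))       # 3
print(hamming_distance("10110", "10011"))       # 2
```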

The centroid loses its meaning; use a clustroid instead.
Clustroid = a point in the cluster that
minimizes:
  1. The sum of (squared) distances to the other points, or
  2. The maximum distance to another point
i.e., a point not too far from all the other points.

Criteria: sum, sum of squares, or max of distances; the point with the minimum value is the clustroid.
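A minimal sketch of choosing the clustroid under one of these criteria, using Jaccard distance on sets as the non-Euclidean example (the sample sets are illustrative):

```python
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def clustroid(points, dist, criterion="sumsq"):
    # Aggregate the distances from a candidate to all other points, then take the minimizer.
    agg = {"sum": sum,
           "sumsq": lambda ds: sum(d * d for d in ds),
           "max": max}[criterion]
    return min(points, key=lambda p: agg(dist(p, q) for q in points if q is not p))

# Example: a cluster of sets.
sets = [frozenset(s) for s in ({1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {5, 6})]
print(clustroid(sets, jaccard_distance))   # {1, 2, 3, 4} minimizes the sum of squares
```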

K-means

Cast k-means as a coordinate descent algorithm; then we know that it is indeed converging to a local optimum.

Pick k points as the initial centroids of k clusters
Repeat until the centroids stabilize:
A) for each point p, find the closest centroid and add p to that centroid's cluster
B) re-compute the centroids

Complexity ~ linear in n, with a constant factor of k per iteration
N data points
  1. Pick k points as centroids of k clusters: O(k)
  2. Repeat until the centroids stabilize: O(I*n*k) for I iterations
A) for each point, find the closest centroid and add it to that centroid's cluster: O(n*k) - n points, k clusters
B) re-compute the centroids: O(k*(n/k)) = O(n)

Adjust the centroids only after all points have been assigned!
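A minimal NumPy sketch of the loop above (k, tolerance, and iteration cap are illustrative):

```python
# Minimal sketch of Lloyd's k-means loop described above.
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # O(k) initial pick
    for _ in range(max_iter):                                     # I iterations
        # A) assign every point to its closest centroid: O(n * k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # B) recompute all centroids only after every point is assigned: O(n)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids, atol=tol):       # centroids stabilized
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```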

Two choices: the initial centroids and k, the number of clusters.
Initial centroids:
  1. Pick points as far away from each other as possible (see the sketch after this list)
-pick the first point randomly
-repeatedly pick a point far away from the ones already chosen
  2. Produce k clusters by a hierarchical method, cut off in the dendrogram, and select 1 point from each cluster
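A minimal sketch of strategy 1, picking points far away from each other:

```python
# Pick the first centroid at random, then repeatedly pick the point
# farthest from the centroids chosen so far.
import numpy as np

def far_apart_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]             # first centroid: a random point
    for _ in range(k - 1):
        # distance from each point to its nearest already-chosen centroid
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        centroids.append(X[np.argmax(d)])             # farthest such point
    return np.array(centroids)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8])
print(far_apart_init(X, k=2))
```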

Elbow curve:
Run k-means for k = 1, 2, 4, 8, ..., 2^(m-1), doubling the number of clusters at each clustering
Stop at k = 2v when the cohesion changes little from k = v -> 2v
Elbow -> the right k lies in [v/2, v]
k = 2^(m-1) = 2v, so m = 2 + log2(v)

Little change:
From v -> 2v, the rate of decrease r = |c(2v) - c(v)| / (v * c(v)) < threshold, where c(k) is the cohesion with k clusters

Binary search on [v/2, v]:
On [x, y], take the midpoint z = floor((x+y)/2), bisecting into [x, z] and [z, y]
If there is little change from z to y, search in [x, z]
Vice versa (otherwise search in [z, y])

# of clusterings in the binary search = log2(v/2) = log2(v) - 1
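A minimal sketch of the whole elbow search: double k until the cohesion barely improves, then binary-search inside [v/2, v]. The cohesion measure here (average squared distance to the assigned centroid) and the threshold value are assumptions:

```python
import numpy as np

def cohesion(X, k, seed=0, iters=50):
    # Plain Lloyd iterations, then average squared distance to the assigned centroid.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

def little_change(c_lo, c_hi, steps, threshold=0.05):
    # rate of decrease per unit increase in k: r = |c_hi - c_lo| / (steps * c_lo)
    return abs(c_hi - c_lo) / (steps * c_lo) < threshold

def elbow_k(X, threshold=0.05):
    v = 1
    while not little_change(cohesion(X, v), cohesion(X, 2 * v), v, threshold):
        v *= 2                        # k = 1, 2, 4, 8, ...
    x, y = max(1, v // 2), v          # elbow lies in [v/2, v]
    while y - x > 1:                  # about log2(v) - 1 extra clusterings
        z = (x + y) // 2
        if little_change(cohesion(X, z), cohesion(X, y), y - z, threshold):
            y = z                     # little change from z to y: search [x, z]
        else:
            x = z                     # otherwise search [z, y]
    return y

X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
print(elbow_k(X))
```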

kMeans-data.txt (a sketch of these steps follows the list)
  1. Parse the input and find the closest center:
-parseVector: line -> np.array
  2. Generate the initial centers:
-takeSample(False, k, 1)
  3. Parse and cache the input dataset:
-cache
  4. Assign each point to its closest center:
-e.g. center 0: point (0,0,0) and point (1,1,1)
  5. Get statistics for each center (pointStats):
-a key-value pair for each center: center index -> (sum, count)
  6. Compute the coordinates of the new centers:
-sum of coordinates / count, via mapValues
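These steps follow the standard PySpark k-means example; here is a minimal sketch under that assumption (the file path, K, and the convergence threshold are illustrative):

```python
import numpy as np
from pyspark import SparkContext

def parse_vector(line):
    # Parse one space-separated text line into a NumPy array.
    return np.array([float(x) for x in line.split()])

def closest_center(p, centers):
    # Index of the center nearest to point p (squared Euclidean distance).
    return min(range(len(centers)), key=lambda i: np.sum((p - centers[i]) ** 2))

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansSketch")
    K, converge_dist = 3, 1e-4

    # Parse and cache the input dataset.
    data = sc.textFile("kMeans-data.txt").map(parse_vector).cache()

    # Generate initial centers by sampling without replacement.
    centers = data.takeSample(False, K, 1)

    temp_dist = float("inf")
    while temp_dist > converge_dist:
        # Assign each point to its closest center; emit (centerIndex, (sum, count)).
        closest = data.map(lambda p: (closest_center(p, centers), (p, 1)))
        point_stats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

        # New center = coordinate-wise sum / count.
        new_centers = point_stats.mapValues(lambda sc_: sc_[0] / sc_[1]).collectAsMap()

        # Stop when the centers move less than the convergence threshold.
        temp_dist = sum(np.sum((centers[i] - new_centers[i]) ** 2) for i in new_centers)
        for i, c in new_centers.items():
            centers[i] = c

    print("Final centers:", centers)
    sc.stop()
```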

BFR (Bradley-Fayyad-Reina) algorithm
Pick k initial centroids from the 1st chunk
Load 1 chunk of data into memory at a time
For each chunk, its points are either:
a. Assigned to existing clusters
b. Formed into new mini-clusters
c. Retained

a. Assign to existing clusters - when
the point is sufficiently close to the centroid of the cluster.
Update the summary of the cluster.
-Cluster summary: N, SUM, SUMSQ (in each dimension)
Centroid: SUM/N
Variance: SUMSQ/N - (SUM/N)^2 -> Var(x) = E(x^2) - E(x)^2
Standard deviation: sqrt of the variance
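A minimal sketch of such a summary, keeping only N, SUM, and SUMSQ per dimension:

```python
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)      # per-dimension sum of coordinates
        self.sumsq = np.zeros(dim)    # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Var(x) = E(x^2) - E(x)^2, computed per dimension
        return self.sumsq / self.n - (self.sum / self.n) ** 2

    def std(self):
        return np.sqrt(self.variance())

cs = ClusterSummary(dim=2)
for p in np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]):
    cs.add(p)
print(cs.centroid(), cs.std())
```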

Mahalanobis distance
Normalized distance for multi-dimensional data:
how many sigmas a point is away from the centroid.
Assume no correlation among the different dimensions,
each dimension normally distributed.
d(p, c) = sqrt( sum_i [ (p_i - c_i) / sigma_i ]^2 )

~68% of points lie within 1 sigma of the mean
~95% of points lie within 2 sigmas of the mean
~99.7% of points lie within 3 sigmas of the mean

Multivariate normal distribution
Non-correlated dimensions -> diagonal covariance matrix
The (squared) Mahalanobis distance is the value used for the test:
"Sufficiently close" = Mahalanobis distance < threshold; assign the point to the centroid with the smallest distance
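A minimal sketch of this assignment rule under the diagonal-covariance assumption (the 3-sigma threshold is illustrative):

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    # sqrt( sum_i ((p_i - c_i) / sigma_i)^2 ): number of std devs from the centroid
    return np.sqrt(np.sum(((point - centroid) / sigma) ** 2))

def assign(point, clusters, threshold=3.0):
    """clusters: list of (centroid, sigma) pairs; threshold is in units of sigma (assumed)."""
    dists = [mahalanobis(point, c, s) for c, s in clusters]
    best = int(np.argmin(dists))
    # "Sufficiently close" = Mahalanobis distance below the threshold.
    return best if dists[best] < threshold else None

clusters = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
            (np.array([10.0, 10.0]), np.array([2.0, 2.0]))]
print(assign(np.array([1.0, -1.5]), clusters))   # 0
print(assign(np.array([50.0, 50.0]), clusters))  # None (mini-cluster / retained candidate)
```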

b. Form new mini-clusters
After step a, some points remain unassigned to any cluster (the sigmas for the 1st chunk are based on a sample). Cluster them hierarchically with a proper stopping condition.
Assume the axes of the clusters align with the axes of the space.
Merge mini-clusters if the variance of the merged cluster is small enough.
No mini-cluster is merged into an existing cluster.

Discard set: points close enough to an existing cluster
Compression set: points close enough to each other to form mini-clusters, but not close enough to any existing cluster
Retained set: isolated points retained for later rounds

CURE (Clustering Using REpresentatives)
For large-scale data:
Pick a sample, cluster it hierarchically, and determine k clusters from the dendrogram
Scan the data and assign each point to the closest cluster

Distance between a point p and a cluster C:
distance from p to the closest representative of C
Representatives - a small set of points in the cluster, far away from each other
Move the representatives a fixed fraction of the distance toward the centroid
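A minimal sketch of the representative mechanics: pick a few far-apart points per cluster, move each a fixed fraction toward the centroid, and assign points by closest representative (the fraction 0.2 and the count of 4 representatives are illustrative):

```python
import numpy as np

def pick_representatives(points, m=4, fraction=0.2):
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(m, len(points)):
        d = np.min(np.linalg.norm(points[:, None] - np.array(reps)[None], axis=2), axis=1)
        reps.append(points[np.argmax(d)])
    # Move each representative a fixed fraction of the way toward the centroid.
    return np.array([r + fraction * (centroid - r) for r in reps])

def assign_point(p, clusters_reps):
    # Distance to a cluster = distance to its closest representative.
    return int(np.argmin([np.min(np.linalg.norm(reps - p, axis=1))
                          for reps in clusters_reps]))

c1 = np.random.randn(50, 2)
c2 = np.random.randn(50, 2) + np.array([8.0, 0.0])
reps = [pick_representatives(c1), pick_representatives(c2)]
print(assign_point(np.array([7.5, 0.5]), reps))   # expected: 1
```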

Higher dimensions:
Cosine similarity between random vectors -> 0 (nearly orthogonal)
Euclidean: all points have nearly the same distance between each other
Subspace clustering 
