(Incomplete) Data Mining--Clustering Basics

Hierarchical Clustering
Cluster radius: max distance from a point to the centroid
Cluster diameter: max distance between any 2 points
Stop rules:
  1. If the diameter/radius of the next (merged) cluster > threshold
  2. If the density of the next cluster < threshold
  3. Stop at a jump in the avg. radius/diameter of the clusters

Density: # points per unit volume
Volume: some power of the diameter/radius
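A minimal NumPy sketch of these quantities for one cluster; using the data dimensionality d as the volume exponent is an assumption (any consistent power works):

```python
# Minimal sketch: radius, diameter, and density of a single cluster,
# assuming `points` is an (n, d) NumPy array of Euclidean coordinates.
import numpy as np

def cluster_stats(points):
    centroid = points.mean(axis=0)
    # Radius: max distance from any point to the centroid.
    radius = np.max(np.linalg.norm(points - centroid, axis=1))
    # Diameter: max distance between any two points.
    diffs = points[:, None, :] - points[None, :, :]
    diameter = np.max(np.linalg.norm(diffs, axis=2))
    # Density: points per unit "volume"; here volume = diameter ** d (an assumed convention).
    d = points.shape[1]
    density = len(points) / (diameter ** d) if diameter > 0 else float("inf")
    return radius, diameter, density

print(cluster_stats(np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])))
```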
Branches: HAC (Hierarchical Agglomerative Clustering, starting from single-node clusters) and HDC (Hierarchical Divisive Clustering)

Non-Euclidean space:
different rules for computing distance
Jaccard distance for sets
Cosine distance for vectors
Edit distance for strings
Hamming distance for bit vectors
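Minimal Python sketches of the four distances, as a reference (inputs and names are illustrative):

```python
import numpy as np

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A ∩ B| / |A ∪ B|
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(x, y) -> float:
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def edit_distance(s: str, t: str) -> int:
    # Classic dynamic-programming Levenshtein distance (single rolling row).
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[-1]

def hamming_distance(x: str, y: str) -> int:
    # Number of positions at which two equal-length bit strings differ.
    return sum(a != b for a, b in zip(x, y))

print(jaccard_distance({1, 2, 3}, {2, 3, 4}))   # 0.5
print(cosine_distance([1, 0], [0, 1]))          # 1.0
print(edit_distance("kitten", "sitting"))       # 3
print(hamming_distance("10110", "10011"))       # 2
```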

The centroid loses its meaning; use a clustroid instead.
Clustroid = a point in the cluster that
minimizes:
  1. The sum of (squared) distances to the other points, or
  2. The maximum distance to another point
i.e., a point not too far from all the other points.

Criteria: sum, sum of squares, or max of distances; the point with the minimum value is the clustroid.
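A minimal sketch of choosing the clustroid under one of these criteria, using Jaccard distance on sets as the non-Euclidean example (the sample sets are illustrative):

```python
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def clustroid(points, dist, criterion="sumsq"):
    # Aggregate the distances from a candidate to all other points, then take the minimizer.
    agg = {"sum": sum,
           "sumsq": lambda ds: sum(d * d for d in ds),
           "max": max}[criterion]
    return min(points, key=lambda p: agg(dist(p, q) for q in points if q is not p))

# Example: a cluster of sets.
sets = [frozenset(s) for s in ({1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {5, 6})]
print(clustroid(sets, jaccard_distance))   # {1, 2, 3, 4} minimizes the sum of squares
```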

K-means

Cast k-means as a coordinate descent algorithm; then we know that it is indeed converging to a local optimum.

Pick k points as the initial centroids of k clusters
Repeat until the centroids stabilize:
A) for each point p, find the closest centroid and add p to that centroid's cluster
B) re-compute the centroids

Complexity ~ linear in n, with a constant factor of k per iteration
N data points
  1. Pick k points as centroids of k clusters: O(k)
  2. Repeat until the centroids stabilize: O(I*n*k) for I iterations
A) for each point, find the closest centroid and add it to that centroid's cluster: O(n*k) - n points, k clusters
B) re-compute the centroids: O(k*(n/k)) = O(n)

Adjust the centroids only after all points have been assigned!
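A minimal NumPy sketch of the loop above (k, tolerance, and iteration cap are illustrative):

```python
# Minimal sketch of Lloyd's k-means loop described above.
import numpy as np

def kmeans(X, k, tol=1e-6, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # O(k) initial pick
    for _ in range(max_iter):                                     # I iterations
        # A) assign every point to its closest centroid: O(n * k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # B) recompute all centroids only after every point is assigned: O(n)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids, atol=tol):       # centroids stabilized
            return new_centroids, labels
        centroids = new_centroids
    return centroids, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, labels = kmeans(X, k=2)
print(centroids)
```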

Two choices: the initial centroids and k, the number of clusters.
Initial centroids:
  1. Pick points as far away from each other as possible (see the sketch after this list)
-pick the first point randomly
-repeatedly pick a point far away from the ones already chosen
  2. Produce k clusters by a hierarchical method, cut off in the dendrogram, and select 1 point from each cluster
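A minimal sketch of strategy 1, picking points far away from each other:

```python
# Pick the first centroid at random, then repeatedly pick the point
# farthest from the centroids chosen so far.
import numpy as np

def far_apart_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]             # first centroid: a random point
    for _ in range(k - 1):
        # distance from each point to its nearest already-chosen centroid
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        centroids.append(X[np.argmax(d)])             # farthest such point
    return np.array(centroids)

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 8])
print(far_apart_init(X, k=2))
```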

Elbow curve:
Run k-means for k = 1, 2, 4, 8, ..., 2^(m-1), doubling the number of clusters at each clustering
Stop at k = 2v when the cohesion changes little from k = v -> 2v
Elbow -> the right k lies in [v/2, v]
k = 2^(m-1) = 2v, so m = 2 + log2(v)

Little change:
From v -> 2v, the rate of decrease r = |c(2v) - c(v)| / (v * c(v)) < threshold, where c(k) is the cohesion with k clusters

Binary search on [v/2, v]:
On [x, y], take the midpoint z = floor((x+y)/2), bisecting into [x, z] and [z, y]
If there is little change from z to y, search in [x, z]
Vice versa (otherwise search in [z, y])

# of clusterings in the binary search = log2(v/2) = log2(v) - 1
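A minimal sketch of the whole elbow search: double k until the cohesion barely improves, then binary-search inside [v/2, v]. The cohesion measure here (average squared distance to the assigned centroid) and the threshold value are assumptions:

```python
import numpy as np

def cohesion(X, k, seed=0, iters=50):
    # Plain Lloyd iterations, then average squared distance to the assigned centroid.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

def little_change(c_lo, c_hi, steps, threshold=0.05):
    # rate of decrease per unit increase in k: r = |c_hi - c_lo| / (steps * c_lo)
    return abs(c_hi - c_lo) / (steps * c_lo) < threshold

def elbow_k(X, threshold=0.05):
    v = 1
    while not little_change(cohesion(X, v), cohesion(X, 2 * v), v, threshold):
        v *= 2                        # k = 1, 2, 4, 8, ...
    x, y = max(1, v // 2), v          # elbow lies in [v/2, v]
    while y - x > 1:                  # about log2(v) - 1 extra clusterings
        z = (x + y) // 2
        if little_change(cohesion(X, z), cohesion(X, y), y - z, threshold):
            y = z                     # little change from z to y: search [x, z]
        else:
            x = z                     # otherwise search [z, y]
    return y

X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
print(elbow_k(X))
```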

kMeans-data.txt (a sketch of these steps follows the list)
  1. Parse the input and find the closest center:
-parseVector: line -> np.array
  2. Generate the initial centers:
-takeSample(False, k, 1)
  3. Parse and cache the input dataset:
-cache
  4. Assign each point to its closest center:
-e.g. center 0: point (0,0,0) and point (1,1,1)
  5. Get statistics for each center (pointStats):
-a key-value pair for each center: center index -> (sum, count)
  6. Compute the coordinates of the new centers:
-sum of coordinates / count, via mapValues
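These steps follow the standard PySpark k-means example; here is a minimal sketch under that assumption (the file path, K, and the convergence threshold are illustrative):

```python
import numpy as np
from pyspark import SparkContext

def parse_vector(line):
    # Parse one space-separated text line into a NumPy array.
    return np.array([float(x) for x in line.split()])

def closest_center(p, centers):
    # Index of the center nearest to point p (squared Euclidean distance).
    return min(range(len(centers)), key=lambda i: np.sum((p - centers[i]) ** 2))

if __name__ == "__main__":
    sc = SparkContext(appName="KMeansSketch")
    K, converge_dist = 3, 1e-4

    # Parse and cache the input dataset.
    data = sc.textFile("kMeans-data.txt").map(parse_vector).cache()

    # Generate initial centers by sampling without replacement.
    centers = data.takeSample(False, K, 1)

    temp_dist = float("inf")
    while temp_dist > converge_dist:
        # Assign each point to its closest center; emit (centerIndex, (sum, count)).
        closest = data.map(lambda p: (closest_center(p, centers), (p, 1)))
        point_stats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))

        # New center = coordinate-wise sum / count.
        new_centers = point_stats.mapValues(lambda sc_: sc_[0] / sc_[1]).collectAsMap()

        # Stop when the centers move less than the convergence threshold.
        temp_dist = sum(np.sum((centers[i] - new_centers[i]) ** 2) for i in new_centers)
        for i, c in new_centers.items():
            centers[i] = c

    print("Final centers:", centers)
    sc.stop()
```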

BFR (Bradley-Fayyad-Reina) algorithm
Pick k initial centroids from the 1st chunk
Load 1 chunk of data into memory at a time
For each chunk, its points are either:
a. Assigned to existing clusters
b. Formed into new mini-clusters
c. Retained

a. Assign to existing clusters - when
the point is sufficiently close to the centroid of the cluster.
Update the summary of the cluster.
-Cluster summary: N, SUM, SUMSQ (in each dimension)
Centroid: SUM/N
Variance: SUMSQ/N - (SUM/N)^2 -> Var(x) = E(x^2) - E(x)^2
Standard deviation: sqrt of the variance
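A minimal sketch of such a summary, keeping only N, SUM, and SUMSQ per dimension:

```python
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)      # per-dimension sum of coordinates
        self.sumsq = np.zeros(dim)    # per-dimension sum of squared coordinates

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Var(x) = E(x^2) - E(x)^2, computed per dimension
        return self.sumsq / self.n - (self.sum / self.n) ** 2

    def std(self):
        return np.sqrt(self.variance())

cs = ClusterSummary(dim=2)
for p in np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 5.0]]):
    cs.add(p)
print(cs.centroid(), cs.std())
```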

Mahalanobis distance
Normalized distance for multi-dimensional data:
how many sigmas a point is away from the centroid.
Assume no correlation among the different dimensions,
each dimension normally distributed.
d(p, c) = sqrt( sum_i [ (p_i - c_i) / sigma_i ]^2 )

~68% of points lie within 1 sigma of the mean
~95% of points lie within 2 sigmas of the mean
~99.7% of points lie within 3 sigmas of the mean

Multivariate normal distribution
Non-correlated dimensions -> diagonal covariance matrix
The (squared) Mahalanobis distance is the value used for the test:
"Sufficiently close" = Mahalanobis distance < threshold; assign the point to the centroid with the smallest distance
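A minimal sketch of this assignment rule under the diagonal-covariance assumption (the 3-sigma threshold is illustrative):

```python
import numpy as np

def mahalanobis(point, centroid, sigma):
    # sqrt( sum_i ((p_i - c_i) / sigma_i)^2 ): number of std devs from the centroid
    return np.sqrt(np.sum(((point - centroid) / sigma) ** 2))

def assign(point, clusters, threshold=3.0):
    """clusters: list of (centroid, sigma) pairs; threshold is in units of sigma (assumed)."""
    dists = [mahalanobis(point, c, s) for c, s in clusters]
    best = int(np.argmin(dists))
    # "Sufficiently close" = Mahalanobis distance below the threshold.
    return best if dists[best] < threshold else None

clusters = [(np.array([0.0, 0.0]), np.array([1.0, 1.0])),
            (np.array([10.0, 10.0]), np.array([2.0, 2.0]))]
print(assign(np.array([1.0, -1.5]), clusters))   # 0
print(assign(np.array([50.0, 50.0]), clusters))  # None (mini-cluster / retained candidate)
```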

b. Form new mini-clusters
After step a, some points remain unassigned to any cluster (the sigmas for the 1st chunk are based on a sample). Cluster them hierarchically with a proper stopping condition.
Assume the axes of the clusters align with the axes of the space.
Merge mini-clusters if the variance of the merged cluster is small enough.
No mini-cluster is merged into an existing cluster.

Discard set: points close enough to an existing cluster
Compression set: points close enough to each other to form mini-clusters, but not close enough to any existing cluster
Retained set: isolated points retained for later rounds

CURE (Clustering Using REpresentatives)
For large-scale data:
Pick a sample, cluster it hierarchically, and determine k clusters from the dendrogram
Scan the data and assign each point to the closest cluster

Distance between a point p and a cluster C:
distance from p to the closest representative of C
Representatives - a small set of points in the cluster, far away from each other
Move the representatives a fixed fraction of the distance toward the centroid
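A minimal sketch of the representative mechanics: pick a few far-apart points per cluster, move each a fixed fraction toward the centroid, and assign points by closest representative (the fraction 0.2 and the count of 4 representatives are illustrative):

```python
import numpy as np

def pick_representatives(points, m=4, fraction=0.2):
    centroid = points.mean(axis=0)
    # Start from the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(m, len(points)):
        d = np.min(np.linalg.norm(points[:, None] - np.array(reps)[None], axis=2), axis=1)
        reps.append(points[np.argmax(d)])
    # Move each representative a fixed fraction of the way toward the centroid.
    return np.array([r + fraction * (centroid - r) for r in reps])

def assign_point(p, clusters_reps):
    # Distance to a cluster = distance to its closest representative.
    return int(np.argmin([np.min(np.linalg.norm(reps - p, axis=1))
                          for reps in clusters_reps]))

c1 = np.random.randn(50, 2)
c2 = np.random.randn(50, 2) + np.array([8.0, 0.0])
reps = [pick_representatives(c1), pick_representatives(c2)]
print(assign_point(np.array([7.5, 0.5]), reps))   # expected: 1
```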

Higher dimensions:
Cosine similarity between random vectors -> 0 (nearly orthogonal)
Euclidean: all points have nearly the same distance between each other
Subspace clustering 
