(Incomplete) Data Mining--Clustering Basics
Hierarchical Clustering
Cluster radius: max distance from a point to the centroid
Cluster diameter: max distance between any 2 points
(see the sketch at the end of this section)
Stop rules:
- If the diameter/radius of the next (merged) cluster > threshold
- If the density of the next cluster < threshold
- Stop at a jump in the avg. radius/diameter of the clusters
Density: # points per unit volume
Volume: some power of the diameter/radius
Branches: HAC (Hierarchical Agglomerative Clustering) and HDC (Hierarchical Divisive Clustering); HAC starts from single-point clusters and merges them.
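A minimal sketch of the two size measures for Euclidean points (NumPy; function names are illustrative):

```python
# Sketch: cluster radius and diameter for Euclidean points.
import numpy as np

def cluster_radius(points):
    # max distance from any point to the centroid
    centroid = points.mean(axis=0)
    return np.max(np.linalg.norm(points - centroid, axis=1))

def cluster_diameter(points):
    # max distance between any 2 points in the cluster
    diffs = points[:, None, :] - points[None, :, :]
    return np.max(np.linalg.norm(diffs, axis=-1))

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(cluster_radius(pts), cluster_diameter(pts))   # ~0.745, ~1.414
```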
Non-Euclidean space: different rules for computing distances
- Jaccard for sets
- Cosine for vectors
- Edit for strings
- Hamming for binary vectors
(see the sketches below)
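Minimal sketches of the four distances (plain Python; the edit distance is shown as Levenshtein, one common choice):

```python
# Sketch: the four non-Euclidean distances listed above.
import math

def jaccard_distance(a: set, b: set) -> float:
    # 1 - |A ∩ B| / |A ∪ B|
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(u, v) -> float:
    # 1 - cos(angle between u and v)
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def edit_distance(s: str, t: str) -> int:
    # Levenshtein distance via a single-row dynamic program
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[-1]

def hamming_distance(x: str, y: str) -> int:
    # number of positions where two equal-length strings differ
    return sum(cx != cy for cx, cy in zip(x, y))
```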
The centroid loses its meaning; use the clusteroid instead.
Clusteroid = a point in the cluster that minimizes one of:
- Sum of (squared) distances to the other points
- Maximum distance to any other point
i.e. a point not too far from all other points.
Indicators: sum, sum-sq, max; the point with the minimum indicator value is the clusteroid (see the sketch below).
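A small sketch of picking the clusteroid under a chosen indicator (helper names and the `indicator` argument are illustrative):

```python
# Sketch: clusteroid = point with the minimum indicator value (sum, sum-sq, or max).
def clusteroid(points, distance, indicator="sum_sq"):
    def score(p):
        dists = [distance(p, q) for q in points if q is not p]
        if indicator == "sum":
            return sum(dists)
        if indicator == "sum_sq":
            return sum(d * d for d in dists)
        if indicator == "max":
            return max(dists)
        raise ValueError(indicator)
    return min(points, key=score)

# usage with the Jaccard distance over sets
sets = [{1, 2, 3}, {2, 3}, {2, 3, 4}, {9}]
jac = lambda a, b: 1 - len(a & b) / len(a | b)
print(clusteroid(sets, jac, indicator="sum"))   # -> {2, 3}
```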
K-means
Cast k-means as a coordinate descent algorithm; then we know that it indeed converges to a local optimum.
Pick k points as the initial centroids of k clusters
Repeat until the centroids stabilize:
A) for each point p, find the closest centroid and add p to that centroid's cluster
B) re-compute the centroids
Complexity ~ linear in the number of data points, with a constant factor of k per iteration.
For n data points:
- Pick k points as centroids of k clusters: O(k)
- Repeat until the centroids stabilize: O(I*n*k) for I iterations
  A) for each point, find the closest centroid and add the point to that centroid's cluster: O(n*k) for n points, k clusters
  B) re-compute the centroids: O(k * n/k) = O(n)
Adjust the centroids only after all points have been assigned! (see the sketch below)
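A minimal in-memory sketch of the loop above; the centroids are re-computed only after every point has been assigned (NumPy; the iteration cap and stopping test are illustrative):

```python
# Sketch: plain k-means; step A assigns all points, step B then updates the centroids.
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # A) assign each point to its closest centroid: O(n*k)
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # B) re-compute each centroid from its assigned points: O(n)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stabilized
            break
        centroids = new_centroids
    return centroids, labels
```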
Two choices to make: the initial centroids and k, the # of clusters.
Initial centroids:
- Pick points as far away from each other as possible (see the sketch below):
  - pick the first one randomly
  - repeatedly pick a point far away from the already-chosen ones
- Or produce k clusters by a hierarchical method, cut off the dendrogram, and select 1 point from each cluster
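A sketch of the "pick points far away from each other" initialization, i.e. farthest-first traversal (illustrative):

```python
# Sketch: farthest-first choice of initial centroids.
import numpy as np

def farthest_first_centroids(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [points[rng.integers(len(points))]]    # pick the first one randomly
    while len(centroids) < k:
        # distance from each point to its nearest already-chosen centroid
        d = np.min(np.linalg.norm(
            points[:, None, :] - np.array(centroids)[None, :, :], axis=2), axis=1)
        centroids.append(points[int(np.argmax(d))])     # farthest from the chosen ones
    return np.array(centroids)
```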
Elbow curve:
Run k-means for k = 1, 2, 4, 8, ..., 2^(m-1): double the # of clusters at each clustering.
Stop at k = 2v when the cohesion changes little from k = v to k = 2v.
The elbow then lies in [v/2, v].
k = 2^(m-1) = 2v, so m = 2 + log2(v).
"Little change" from v to 2v: rate of decrease r = |c(2v) - c(v)| / (v * c(v)) < threshold.
Binary search on [v/2, v]:
- For the current interval [x, y], take the midpoint z = floor((x+y)/2), bisecting it into [x, z] and [z, y]
- If the cohesion changes little in [z, y], search in [x, z]; vice versa
- # of clusterings needed = log2(v/2) = log2(v) - 1
(see the sketch below)
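A sketch of the doubling-then-binary-search scheme; `cohesion(k)` stands for whichever cohesion measure is used (e.g. total within-cluster sum of squares) and the threshold value is an assumption:

```python
# Sketch: double k until the cohesion changes little, then binary-search in [v/2, v].
def little_change(c_small, c_large, steps, threshold=0.01):
    # rate of decrease per unit increase of k: |c(2v) - c(v)| / (v * c(v))
    return abs(c_large - c_small) / (steps * c_small) < threshold

def choose_k(cohesion, k_max=1024):
    # Phase 1: k = 1, 2, 4, 8, ...; double the # of clusters at each run
    k, prev = 1, cohesion(1)
    while k * 2 <= k_max:
        cur = cohesion(k * 2)
        if little_change(prev, cur, steps=k):
            break                       # stopped at 2v with v = k; elbow in [v/2, v]
        k, prev = k * 2, cur
    # Phase 2: binary search on [v/2, v]
    x, y = max(1, k // 2), k
    while y - x > 1:
        z = (x + y) // 2                # midpoint bisects [x, y] into [x, z] and [z, y]
        if little_change(cohesion(z), cohesion(y), steps=y - z):
            y = z                       # little change in [z, y] -> search [x, z]
        else:
            x = z                       # otherwise search [z, y]
    return y
```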
kMeans-data.txt
- Parse the input and find the closest center:
  - parseVector: line -> np.array
- Generate the initial centers:
  - takeSample(False, k, 1)
- Parse and cache the input dataset:
  - cache()
- Assign each point to its closest center:
  - e.g. center 0: point (0,0,0) and point (1,1,1)
- Get statistics for each center (pointStats):
  - a key-value pair per center: (center index, (sum, count))
- Compute the coordinates of the new centers:
  - sum of coordinates / count, via mapValues
(see the sketch below)
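A compact sketch of the job these bullets describe, modeled on the well-known Spark Python k-means example (the input path, k, and convergence threshold below are assumptions):

```python
# Sketch: distributed k-means on kMeans-data.txt with PySpark.
import numpy as np
from pyspark import SparkContext

def parse_vector(line):
    # parseVector: line -> np.array
    return np.array([float(x) for x in line.split()])

def closest_point(p, centers):
    # index of the center closest to p (squared Euclidean distance)
    dists = [float(np.sum((p - c) ** 2)) for c in centers]
    return int(np.argmin(dists))

if __name__ == "__main__":
    spark_ctx = SparkContext(appName="KMeansSketch")
    k, converge_dist = 3, 1e-4

    # parse and cache the input dataset
    data = spark_ctx.textFile("kMeans-data.txt").map(parse_vector).cache()
    # generate the initial centers
    centers = data.takeSample(False, k, 1)

    temp_dist = float("inf")
    while temp_dist > converge_dist:
        # assign each point to its closest center: (center index, (point, 1))
        closest = data.map(lambda p: (closest_point(p, centers), (p, 1)))
        # pointStats: per-center (sum of points, count)
        point_stats = closest.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # new center coordinates: sum of coordinates / count, via mapValues
        new_centers = point_stats.mapValues(lambda s: s[0] / s[1]).collect()

        temp_dist = sum(float(np.sum((centers[i] - c) ** 2)) for i, c in new_centers)
        for i, c in new_centers:
            centers[i] = c

    print("Final centers:", centers)
    spark_ctx.stop()
```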
BFR (Bradley-Fayyad-Reina) algorithm
Pick k initial centroids from the 1st chunk.
Load 1 chunk of data into memory at a time.
For each chunk, its points are either:
a. Assigned to existing clusters
b. Formed into new mini-clusters
c. Retained
a. Assign to existing clusters: when the point is sufficiently close to the centroid of a cluster.
Then update the summary of that cluster.
- Cluster summary: N, SUM, SUMSQ (in each dimension)
  Centroid: SUM/N
  Variance: SUMSQ/N - (SUM/N)^2
  (i.e. Var(x) = E(x^2) - E(x)^2)
  Standard deviation: sqrt of the variance
(see the sketch below)
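A small sketch of a BFR-style cluster summary and the statistics derived from it (class and method names are illustrative):

```python
# Sketch: BFR cluster summary (N, SUM, SUMSQ per dimension) and derived statistics.
import numpy as np

class ClusterSummary:
    def __init__(self, dim):
        self.n = 0                  # N
        self.sum = np.zeros(dim)    # SUM per dimension
        self.sumsq = np.zeros(dim)  # SUMSQ per dimension

    def add(self, point):
        # assigning a point only updates the three running totals
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n                                  # SUM / N

    def variance(self):
        return self.sumsq / self.n - (self.sum / self.n) ** 2     # E[x^2] - E[x]^2

    def std(self):
        return np.sqrt(self.variance())
```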
Mahalanobis distance
Normalizes the distance for multi-dimensional data: how many sigmas away from the centroid a point is.
Assumes no correlation among the different dimensions, and that each dimension is normally distributed.
d(p, c) = sqrt( sum_i [ (p_i - c_i) / sigma_i ]^2 )
68% of points lie within 1 sigma of the mean
95% within 2 sigmas
99.7% within 3 sigmas
Multivariate normal distribution with non-correlated dimensions: a diagonal covariance matrix, so the squared Mahalanobis distance is just the sum above.
Sufficiently close: Mahalanobis distance < threshold; assign the point to the centroid with the smallest distance.
(see the sketch below)
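A sketch of the "sufficiently close" test with the diagonal-covariance Mahalanobis distance (the threshold is passed in; 3 sigmas would be one common choice):

```python
# Sketch: normalized Mahalanobis distance and the assignment test.
import numpy as np

def mahalanobis(point, centroid, sigma):
    # sqrt( sum_i ((p_i - c_i) / sigma_i)^2 ): how many sigmas away in each dimension
    return np.sqrt(np.sum(((point - centroid) / sigma) ** 2))

def assign(point, clusters, threshold=3.0):
    """clusters: list of (centroid, sigma) pairs; returns a cluster index or None."""
    dists = [mahalanobis(point, c, s) for c, s in clusters]
    best = int(np.argmin(dists))
    # sufficiently close: distance < threshold; pick the centroid with smallest distance
    return best if dists[best] < threshold else None
```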
b. Form new mini-clusters:
After step a, there are points not assigned to any cluster (for the 1st chunk, sigma is estimated from a sample). Cluster them with a hierarchical method and a proper stopping condition.
Assume the axes of the clusters align with the axes of the space.
Merge two mini-clusters if the variance of the merged cluster is small enough.
No mini-cluster is merged into the existing clusters.
Discard set: points close enough to an existing cluster
Compression set: points close enough to each other to form mini-clusters, but not close enough to any existing cluster
Retained set: isolated points retained for later rounds
Clustering Using REpresentatives (CURE)
For large-scale data.
Pick a sample, cluster it hierarchically, and determine k clusters from the dendrogram.
Representatives: a small set of points in each cluster, chosen far away from each other.
Move the representatives a fixed fraction of the distance toward the centroid.
Scan the data and assign each point to the closest cluster.
Distance between point p and cluster C: the distance from p to the closest representative in C.
(see the sketch below)
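A sketch of the CURE assignment step: representatives are moved a fixed fraction of the distance toward the centroid, and a point's distance to a cluster is the distance to the closest representative (the 20% default fraction is an assumption):

```python
# Sketch: CURE representatives and point-to-cluster distance.
import numpy as np

def shrink_representatives(reps, centroid, fraction=0.2):
    # move each representative a fixed fraction of the way toward the centroid
    return reps + fraction * (centroid - reps)

def distance_to_cluster(p, reps):
    # distance from point p to cluster C = distance to the closest representative in C
    return np.min(np.linalg.norm(reps - p, axis=1))

def assign_point(p, clusters):
    # clusters: list of representative arrays; assign p to the closest cluster
    return int(np.argmin([distance_to_cluster(p, reps) for reps in clusters]))
```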
Higher dimensions
- Cosine similarity: the cosine between random vectors tends to 0
- Euclidean: all points end up at similar distances from each other
- Subspace clustering as a remedy
(see the sketch below)
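A quick sketch showing both effects on random vectors: cosines shrink toward 0 and pairwise distances concentrate as the dimension grows (the sample sizes are arbitrary):

```python
# Sketch: cosine -> 0 and distance concentration in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 100, 10_000):
    x = rng.standard_normal((200, d))
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = unit @ unit.T                          # pairwise cosines
    sq = np.sum(x ** 2, axis=1)                  # pairwise distances via the Gram matrix
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * x @ x.T, 0))
    mask = ~np.eye(len(x), dtype=bool)
    print(d,
          "mean |cos|:", round(float(np.abs(cos[mask]).mean()), 3),
          "dist std/mean:", round(float(dist[mask].std() / dist[mask].mean()), 3))
```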