Data Mining 2: Representation-based Clustering
What is Clustering
Grouping a set of data objects into clusters
- Cluster: a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
Clustering = unsupervised classification: no predefined classes
Usage
- Get insight into data distribution
- Preprocessing step for other algorithms
Application of Clustering
Pattern Recognition and Image Processing
Spatial Data Analysis
WWW
Biology
Information retrieval
Marketing
City-planning
Social networks
Basic Concept of Representative-based Approaches
Goal: Construct a partition of a database of objects into a set of clusters minimizing an objective function
- Exhaustively enumerating all possible partitions into sets in order to find the global minimum is too expensive (exponential time)
- Example objective: minimize the max distance (Manhattan, Euclidean, Geodesic, …) among the points in the cluster
Heuristic methods:
- Choose representatives for clusters e.g. randomly
- Improve these initial representatives iteratively:
- Assign each object to the cluster it fits best in the current clustering
- Compute new cluster representatives based on these assignments
- Repeat until the change in the objective function from one iteration to the next drops below a threshold
Types of cluster representatives
- K-means: Each cluster is represented by the center of the cluster
- K-medoid: Each cluster is represented by one of its objects.
K-means Clustering
Overview
Basic Ideas
For a given $k$, form $k$ groups so that the sum of the (squared) distances between the mean of each group (its centroid) and the group's elements is minimal.
Basic Notions
Objects are points in a d-dimensional vector space
Centroid $\mu_C$: mean of all points in cluster $C$: $\mu_C = \frac{1}{|C|} \sum_{x \in C} x$.
Measure of the compactness (aka total distance) of a cluster $C$: $TD^2(C) = \sum_{x \in C} dist(x, \mu_C)^2$
Measure of the compactness of a clustering $\mathcal{C} = \{C_1, \dots, C_k\}$: $TD^2 = \sum_{i=1}^{k} TD^2(C_i)$
The ideal clustering minimizes this objective function.
Lloyd’s Algorithm
Initialize: Partition the objects randomly into $k$ non-empty subsets.
- Centroid Update: Compute the centroids of the clusters of the current partition. The centroid is the center of a cluster.
- Cluster Assignment: Assign each object to the cluster with the nearest representative.
Repeat the two steps above until the representatives no longer change (see the sketch below).
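A minimal NumPy sketch of Lloyd's algorithm; the function name, the initialization by sampling $k$ data points (rather than a random partition), and the convergence check are illustrative choices:

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm: alternate cluster assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialize: here, pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Cluster assignment: each object joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update: mean of all points currently assigned to each cluster;
        # empty clusters keep their old centroid.
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Stop once the representatives no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```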
Pros and Cons
Advantages 👍
Relatively efficient: $O(tkn)$, where $n$ is the # of objects, $k$ is the # of clusters, and $t$ is the # of iterations.
Normally $k, t \ll n$.
Easy implementation
Disadvantages 👎
- Applicable only when a mean is defined
- Need to specify $k$ in advance
- Sensitive to noisy data and outliers
- Clusters are forced to have convex shapes.
- Result and runtime are very dependent on initial partition; often terminates at a local optimum.
Variants of k-means exist, e.g. ISODATA (eliminates small clusters, merges and splits clusters)
Choice of Parameter $k$
Idea for a method:
- Determine a clustering for each value of $k$ in a candidate range
- Choose the best clustering
How to measure the quality of a clustering?
- The measure has to be independent of $k$!
- The compactness measures of a clustering, $TD^2$ and $TD$, decrease monotonically with increasing $k$, so they cannot be used to compare clusterings across different values of $k$.
The Silhouette Coefficient
A quality measure for clusterings that is independent of $k$
Basic idea:
Elements within a cluster should be similar to their representative
Elements in different clusters should be dissimilar
$a(o)$: average distance between object $o$ and the objects in its own cluster.
$b(o)$: average distance between object $o$ and the objects in its “second-closest” cluster.
The silhouette of $o$ is: $s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}$, with $s(o) = 0$ if $o$ is the only object in its cluster.
How good is the assignment of $o$ to its cluster?
$s(o) \approx -1$: bad; on average, $o$ is closer to the members of its second-closest cluster.
$s(o) \approx 0$: $o$ lies between two clusters.
$s(o) \approx 1$: good assignment of $o$ to its cluster.
The silhouette coefficient $s_{\mathcal{C}}$ of a clustering is the average silhouette of all objects:
- $0.7 < s_{\mathcal{C}} \leq 1.0$ implies strong structure.
- $0.5 < s_{\mathcal{C}} \leq 0.7$ implies medium structure.
- $0.25 < s_{\mathcal{C}} \leq 0.5$ implies weak structure.
- $s_{\mathcal{C}} \leq 0.25$ implies no structure.
$s_{\mathcal{C}}$ is used for determining the best value of $k$, e.g. as in the sketch below.
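A direct NumPy translation of these definitions, assuming Euclidean distance and integer cluster labels; the function name is illustrative, and `lloyd_kmeans` refers to the sketch above:

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Average s(o) = (b(o) - a(o)) / max(a(o), b(o)) over all objects o."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    s = np.zeros(len(X))
    for o in range(len(X)):
        own = labels == labels[o]
        if own.sum() <= 1:
            continue  # convention: s(o) = 0 if o is alone in its cluster
        # a(o): average distance to the other members of o's own cluster.
        a = D[o, own].sum() / (own.sum() - 1)
        # b(o): smallest average distance to any other cluster ("second-closest").
        b = min(D[o, labels == c].mean()
                for c in np.unique(labels) if c != labels[o])
        s[o] = (b - a) / max(a, b)
    return s.mean()

# Choose k by maximizing s_C over a candidate range, e.g.:
# best_k = max(range(2, 10), key=lambda k: silhouette_coefficient(X, lloyd_kmeans(X, k)[1]))
```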
Kernel K-means
Linearity of K-means: K-means assumes the boundaries between clusters are linear (hyperplanes).
❓ What if the boundaries are not linear? Use Kernel K-means.
Basic Idea
Use kernels to project points into another space
Separate the points with $k$-means in the new space
Recall the k-means objective: $\min \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2$
Map every point $x$ into a different space via a function $\phi$
Recall that a kernel is a dot product in that space: $K(x, y) = \langle \phi(x), \phi(y) \rangle$
Thus, squared distances to a feature-space centroid $\mu_i^{\phi} = \frac{1}{|C_i|} \sum_{y \in C_i} \phi(y)$ need only kernel evaluations:
$\|\phi(x) - \mu_i^{\phi}\|^2 = K(x, x) - \frac{2}{|C_i|} \sum_{y \in C_i} K(x, y) + \frac{1}{|C_i|^2} \sum_{y, z \in C_i} K(y, z)$
Kernel K-means can generate non-linear boundaries; a sketch follows below.
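A sketch of kernel k-means over a precomputed $n \times n$ kernel matrix, using the expansion above so that $\phi$ is never materialized; the random-partition initialization and function name are illustrative:

```python
import numpy as np

def kernel_kmeans(K, k, max_iter=100, seed=0):
    """Kernel k-means on a precomputed kernel matrix K (n x n)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)  # random initial partition
    for _ in range(max_iter):
        dist = np.full((n, k), np.inf)
        for i in range(k):
            members = labels == i
            size = members.sum()
            if size == 0:
                continue  # empty cluster: leave its distance at infinity
            # ||phi(x) - mu_i||^2 = K(x,x) - 2/|C_i| sum_y K(x,y) + 1/|C_i|^2 sum_{y,z} K(y,z)
            dist[:, i] = (np.diag(K)
                          - 2.0 * K[:, members].sum(axis=1) / size
                          + K[np.ix_(members, members)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

For example, an RBF kernel `K = np.exp(-gamma * D_sq)` (with `D_sq` the matrix of squared Euclidean distances) lets this detect ring-shaped clusters that plain k-means cannot separate.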
Pros and Cons
Advantages 👍
- Allow detection of clusters with arbitrary shape
- Works with any kernel
Disadvantages 👎
- Inefficient
- Like K-means: need to specify the number of clusters in advance.
- Requires the choice of a kernel
K-medoid
Motivation
K-means assumes Euclidean distance, while K-medoid is more general:
- Motivated by the L1 norm (Manhattan distance)
- Also works in spaces where a mean is not defined
- The only requirement is the ability to compute pairwise distances
More robust to noise 👍
Lp Norms
General $L_p$ metric (Minkowski distance): $d_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
$p = 2$: Euclidean distance, used in K-means
$p = 1$: Manhattan distance, used in K-medoids (a small sketch follows)
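These norms are straightforward to compute; a small sketch with an illustrative function name:

```python
import numpy as np

def minkowski(x, y, p):
    """General L_p distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

# p = 2 gives the Euclidean distance (k-means),
# p = 1 gives the Manhattan distance (k-medoids).
```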
Basic Idea
Find $k$ representatives in the dataset such that the sum of the distances between the representatives and the objects assigned to them is minimal.
Medoid: representative object “in the middle”
Notions
Requires arbitrary objects and a distance function
Medoid $m_C$: a representative object in a cluster $C$
Measure for the compactness of a cluster $C$: $TD(C) = \sum_{x \in C} dist(x, m_C)$
Measure for the compactness of a clustering: $TD = \sum_{i=1}^{k} TD(C_i)$
PAM Algorithm
Given $k$ (a runnable sketch follows the steps):
- Select $k$ objects arbitrarily as medoids; assign each remaining object to the cluster with the nearest representative, and compute $TD_{current}$.
- For each pair of medoid $M$ and non-medoid $N$, compute the value $TD_{N \leftrightarrow M}$, i.e. the value of $TD$ after swapping $M$ with $N$.
- Select the pair for which $TD_{N \leftrightarrow M}$ is minimal.
- If $TD_{N \leftrightarrow M}$ is smaller than $TD_{current}$:
- Swap $N$ with $M$
- Set $TD_{current} := TD_{N \leftrightarrow M}$
- Go back to step 2.
- Else stop
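A compact PAM sketch over a precomputed distance matrix; it follows the best-swap steps above, while names and tie-breaking are illustrative:

```python
import numpy as np

def pam(D, k, seed=0):
    """PAM on a precomputed n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = list(rng.choice(n, size=k, replace=False))

    def td(meds):
        # TD: every object contributes its distance to the nearest medoid.
        return D[:, meds].min(axis=1).sum()

    current = td(medoids)
    improved = True
    while improved:
        improved = False
        best = (current, None, None)
        for mi in range(k):                       # each medoid M = medoids[mi]
            for N in range(n):                    # each non-medoid N
                if N in medoids:
                    continue
                cand = medoids[:mi] + [N] + medoids[mi + 1:]
                t = td(cand)
                if t < best[0]:
                    best = (t, mi, N)
        if best[1] is not None:                   # swap only if TD decreases
            current, mi, N = best
            medoids[mi] = N
            improved = True
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels
```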
CLARA & CLARANS
CLARA
- Additional parameter: numlocal
- Draws numlocal samples of the data set
- Applies PAM on each sample
- Returns the best of these sets of medoids as output
CLARANS
- two additional parameters: maxneighbor and numlocal
- At most maxneighbor many pairs (M, N) are evaluated in the algorithm.
- The first pair (M, N) for which $TD_{N \leftrightarrow M}$ is smaller than the current $TD$ is swapped, instead of searching for the minimal pair.
- Finding a local minimum with this procedure is repeated numlocal times.
Efficiency: runtime(CLARANS) < runtime(CLARA) < runtime(PAM)
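A CLARA sketch reusing the `pam()` function above; the default sample size follows the commonly cited $40 + 2k$ heuristic, and all names are illustrative:

```python
import numpy as np

def clara(X, k, numlocal=5, sample_size=None, seed=0):
    """CLARA: run PAM on numlocal random samples, keep the best medoid set."""
    rng = np.random.default_rng(seed)
    n = len(X)
    sample_size = sample_size or min(n, 40 + 2 * k)
    best_td, best_medoids = np.inf, None
    for _ in range(numlocal):
        idx = rng.choice(n, size=sample_size, replace=False)
        # Distance matrix of the sample only.
        Ds = np.linalg.norm(X[idx][:, None] - X[idx][None, :], axis=2)
        local_medoids, _ = pam(Ds, k, seed=int(rng.integers(1 << 30)))
        medoids = idx[np.array(local_medoids)]
        # Evaluate the candidate medoid set on the FULL data set.
        td = np.linalg.norm(X[:, None] - X[medoids][None, :], axis=2).min(axis=1).sum()
        if td < best_td:
            best_td, best_medoids = td, medoids
    return best_medoids, best_td
```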
Pros and Cons
Advantages 👍
- Applicable to arbitrary objects + distance functions
- Not as sensitive to noisy data and outliers as K-means
Disadvantages 👎
- Inefficient
- Need to specify the number of clusters in advance
- Result and runtime for CLARA and CLARANS may vary greatly depending on the randomization.
Expectation Maximization
Consider points from a d-dimensional Euclidean vector space
Each cluster is represented by a probability distribution
Typically: mixture of Gaussian distributions
A single Gaussian representing a cluster $C$ is described by:
- the center $\mu_C$, the mean of all points in the cluster
- the $d \times d$ covariance matrix $\Sigma_C$ for the points in the cluster
The density for the clustering is the mixture density $p(x) = \sum_{i=1}^{k} w_i \, p(x \mid \mu_i, \Sigma_i)$, where $w_i$ is the weight of cluster $C_i$.
❓ How do we find the parameters $\theta = (w_1, \mu_1, \Sigma_1, \dots, w_k, \mu_k, \Sigma_k)$? Maximize the log-likelihood (MLE):
$\theta^* = \arg\max_{\theta} L(\theta)$, where $L(\theta) = \sum_{j=1}^{n} \log p(x_j \mid \theta)$
EM algorithm
- Assume we have already computed the current parameters $\theta$
- E-step: compute the probability $P(C_i \mid x_j)$ of each cluster $C_i$ conditioned on each point $x_j$
- M-step: update the parameters $\theta$ based on these probabilities (sketched below).
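A sketch of this loop for a Gaussian mixture, using SciPy for the densities; the initialization and the small ridge added to the covariances for numerical stability are illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=50, seed=0):
    """EM for a Gaussian mixture: returns weights w, means mu, covariances Sigma."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.full(k, 1.0 / k)                       # uniform initial weights
    mu = X[rng.choice(n, size=k, replace=False)]  # k data points as initial means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: responsibilities gamma[j, i] = P(C_i | x_j).
        gamma = np.column_stack([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                                 for i in range(k)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities.
        Nk = gamma.sum(axis=0)
        w = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            Sigma[i] = (gamma[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return w, mu, Sigma
```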
Pros and Cons
Advantages 👍
- Flexible and powerful probabilistic model
- Captures overlapping clusters
Disadvantages 👎
- Converges only to a local optimum
- High computational effort
- Both result and runtime strongly depend on
- the initial assignment
- a proper choice of the parameter $k$
- A modification is needed to obtain a hard partitioning (e.g., assign each object to the cluster with the highest probability).