Data Mining 4: Hierarchical and Subspace Clustering
When the data has many dimensions or the clusters have different densities, methods like DBSCAN and k-means might fail.
Hierarchical clustering partitions the data into progressively smaller, nested clusters.
- Hierarchical agglomerative clustering: uses single-link, complete-link, and average-link strategies to merge clusters into progressively larger ones.
- OPTICS: orders points by density-reachability to reveal the cluster structure
- BIRCH: summarizes the data into compact clustering features and then clusters the summaries
Subspace clustering finds different clusters in different subsets of the dimensions in which the data lies.
- CLIQUE: a clustering algorithm that divides the space into a grid and exploits an apriori-style monotonicity principle on density to find dense subspaces
- SUBCLU: Similar to CLIQUE but uses DBSCAN
- PROCLUS: a projected clustering algorithm that finds the best dimensions for each cluster.
Agglomerative Hierarchical Clustering
Dendrogram: one of the main visualization tools for hierarchical clustering.
- A tree, in which the root is a cluster containing all the points in the dataset and the leaves are clusters of single points.
- Can be constructed bottom-up (agglomerative), by progressively merging smaller clusters into larger ones, or top-down (divisive), by splitting larger clusters.
- The height of a branch in the dendrogram shows the distance at which the two clusters below it were merged.
- A cut-off value, drawn as a line across the dendrogram, identifies a flat partition of the data points.
Agglomerative hierarchical clustering takes a bottom-up approach.
- Small clusters are progressively merged to form larger clusters.
- Requires the choice of a distance function between clusters.
Pseudo-code of Agglomerative Hierarchical Clustering
Input: dataset D, cluster distance d
C ← {{x} : x ∈ D}  // initially, every point is a cluster of its own
while |C| > 1 do
- for each pair C_i, C_j ∈ C do
- compute d(C_i, C_j)
(C_i*, C_j*) ← arg min d(C_i, C_j)  // merge the clusters with the minimum distance
C ← (C \ {C_i*, C_j*}) ∪ {C_i* ∪ C_j*}  // create a new level in the dendrogram
- for each C_k ∈ C do recompute d(C_k, C_i* ∪ C_j*)
return the dendrogram
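A minimal sketch of this procedure using SciPy's hierarchical-clustering routines (assuming SciPy is available); the `method` argument corresponds to the cluster-distance strategies defined below.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy dataset: two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# Agglomerative clustering; method can be 'single', 'complete' or 'average',
# matching the single-, complete- and average-link distances defined below.
Z = linkage(X, method="single")                    # (n-1) x 4 merge table = the dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")    # cut-off yielding 2 flat clusters
print(labels)
```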
Distance Among Clusters
The choice of the distance between clusters determines the quality of the clustering (formal definitions below).
- Single-link distance: the minimum distance between a point of one cluster and a point of the other
- Complete-link distance: the maximum distance between a point of one cluster and a point of the other
- Average-link distance: the average distance over all pairs of points, one from each cluster
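In symbols, for two clusters C_i and C_j and a point-to-point distance d:

$$
d_{sl}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} d(x, y), \qquad
d_{cl}(C_i, C_j) = \max_{x \in C_i,\, y \in C_j} d(x, y), \qquad
d_{al}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)
$$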
OPTICS
DBSCAN performs poorly when the clusters have different densities.
OPTICS, short for Ordering Points To Identify the Clustering Structure, addresses this problem.
DBSCAN uses a fixed value of ε, so a point might have a very large ε-neighborhood. Observation: instead of fixing ε, we can define, for each point, the minimum distance at which it becomes a core point. This distance is called the core distance.
Core distance of a point p: the smallest distance ε′ ≤ ε such that the ε′-neighborhood of p contains at least MinPts points; undefined if p is not a core point at distance ε.
Reachability distance of a point p w.r.t. a point o: the smallest distance at which p is directly density-reachable from o, i.e., the maximum of the core distance of o and d(o, p); undefined if o is not a core point.
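Written out, with N_ε(o) denoting the ε-neighborhood of o:

$$
\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o) =
\begin{cases}
\text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts} \\
\text{distance from } o \text{ to its } \mathit{MinPts}\text{-th nearest neighbor} & \text{otherwise}
\end{cases}
$$

$$
\text{reach-dist}_{\varepsilon,\mathit{MinPts}}(p, o) =
\begin{cases}
\text{undefined} & \text{if } |N_\varepsilon(o)| < \mathit{MinPts} \\
\max\{\text{core-dist}_{\varepsilon,\mathit{MinPts}}(o),\, d(o, p)\} & \text{otherwise}
\end{cases}
$$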
OPTICS is built upon the idea of core distance and reachability distance.
- OPTICS maintains a control list (a priority queue) that stores, for each point not yet processed, the shortest reachability distance seen so far
- OPTICS visits each point once and moves to the next point based on the smallest reachability distance
- OPTICS outputs the order of points, the core and the reachability distance of all points in a reachability plot.
Pseudo Code of OPTICS
for each p ∈ D do
if p is not processed then
- insert (p, ∞) into the control list CL
while CL ≠ ∅ do
select the first element o of CL  // the one with the smallest reachability distance
retrieve N_ε(o) and compute core-dist(o)
mark o as processed and write (o, core-dist(o), reach-dist(o)) to the output file
if o is a core object at a distance ≤ ε then
for each p ∈ N_ε(o) do
if processed(p) = false then
compute r ← max{core-dist(o), d(o, p)}
if p ∉ CL then
- insert (p, r) into CL
else if p ∈ CL and r < reach-dist(p) then
- update (p, r) in CL
return the ordered list of points with their core and reachability distances
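A compact Python sketch of the same procedure (a naive O(n²) version that uses a plain dictionary as the control list; all names are illustrative, not a reference implementation):

```python
import numpy as np

def optics(X, eps, min_pts):
    """Naive OPTICS sketch: returns the cluster ordering plus the reachability
    and core distance of every point (np.inf where undefined)."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    processed = np.zeros(n, dtype=bool)
    reach = np.full(n, np.inf)
    core = np.full(n, np.inf)
    order = []

    def core_distance(p, neighbors):
        # distance to the MinPts-th nearest neighbour (neighbourhood includes p itself)
        if len(neighbors) < min_pts:
            return np.inf
        return np.sort(D[p, neighbors])[min_pts - 1]

    def update(p, neighbors, control):
        # push/refresh the reachability distance of p's unprocessed neighbours
        if core[p] == np.inf:
            return
        for q in neighbors:
            if not processed[q]:
                new_reach = max(core[p], D[p, q])
                if new_reach < reach[q]:
                    reach[q] = new_reach
                    control[q] = new_reach

    for p in range(n):
        if processed[p]:
            continue
        neighbors = np.flatnonzero(D[p] <= eps)
        core[p] = core_distance(p, neighbors)
        processed[p] = True
        order.append(p)
        control = {}                              # control list: point -> reachability
        update(p, neighbors, control)
        while control:
            q = min(control, key=control.get)     # smallest reachability distance first
            del control[q]
            q_neighbors = np.flatnonzero(D[q] <= eps)
            core[q] = core_distance(q, q_neighbors)
            processed[q] = True
            order.append(q)
            update(q, q_neighbors, control)
    return order, reach, core
```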
Finding a density-based (DBSCAN-like) clustering w.r.t. any ε′ ≤ ε and MinPts from the output ordering is easy (see the sketch below):
- scan the points in the output order; whenever the reachability distance of the next point exceeds ε′, start a new cluster if the point's core distance is at most ε′, otherwise mark it as noise
- otherwise, assign the point to the current cluster
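A sketch of this extraction, reusing the `order`, `reach` and `core` arrays returned by the `optics` sketch above:

```python
def extract_dbscan(order, reach, core, eps_prime):
    """Derive a flat, DBSCAN-like labelling (-1 = noise) from the OPTICS output."""
    labels = [-1] * len(order)
    cluster_id = -1
    for p in order:
        if reach[p] > eps_prime:
            if core[p] <= eps_prime:       # p starts a new cluster
                cluster_id += 1
                labels[p] = cluster_id
            # otherwise p is noise w.r.t. eps_prime
        else:
            labels[p] = cluster_id         # p continues the current cluster
    return labels
```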
Running time. OPTICS's running time is bounded by the running time of DBSCAN: O(n log n) with a spatial index and O(n²) without. OPTICS only maintains additional data structures (the control list and the output ordering), which do not affect the overall time complexity.
Analysis of reachability plot. The reachability plot shows both reachability and the core distance. Contiguous areas in the plot with large reachability and core-distance indicate regions with low density. At the same time, low reachability-distance indicates high-density areas. A jump in the plot occurs when the next point in order falls outside the current cluster.
OPTICS is relatively insensitive to the parameter settings; the results are good as long as the parameters are chosen large enough.
Pros and Cons
👍
- Doesn’t require the number of clusters to be known in advance.
- Very robust
- Computes a complete hierarchy of clusters
- Good visualization
- A flat partition can be derived, e.g. by cutting the dendrogram or the reachability plot
👎
- No backtracking, greedy splits and merges
- May not scale well: many objects must be examined when splitting and merging. The runtime of the standard hierarchical methods is O(n² log n); OPTICS without spatial indexing is O(n²)
BIRCH
Balanced Iterative Reducing and Clustering using Hierarchies.
It first summarizes the data and then clusters the summarized data with any clustering algorithm.
Can cluster huge datasets that do not fit into memory, with only very small loss in accuracy.
Incremental algorithm for accommodating new data points.
Two main phases of BIRCH
- (Summarize) Scan the dataset to build an in-memory Clustering Feature tree. The CF-tree is a multi-level compression that preserves the inherent clustering structure.
- Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
BIRCH scales linearly in the size of the dataset. To improve quality, BIRCH can optionally scan the data multiple times.
Fast, but can only deal with numerical data.
Basic Idea of BIRCH
- Constructs a partitioning of the data into micro-clusters using an efficient index-like structure
- The micro-clusters are representations of multiple points described by Clustering Features
- CFs are organized hierarchically in a balanced tree
- A standard clustering algorithm is then applied to the leaf entries of the CF-tree.
Clustering Features
A clustering feature (CF) of a micro-cluster C is a triple CF = (N, LS, SS) consisting of
- N: the number of points in the micro-cluster C
- LS: the linear sum of the data points in C
- SS: the (component-wise) square sum of the data points in C
Note that both LS and SS are d-dimensional vectors. The CF representation, although quite simple, is enough to compute a number of statistics about a cluster, such as its centroid, measures of compactness (radius and diameter), and distance measures between clusters.
Additivity theorem: CFs satisfy an additivity property that allows them to be computed incrementally and on unions of micro-clusters. If two clusters are merged, the CF of the new cluster is simply the element-wise sum of the respective CFs: CF_1 + CF_2 = (N_1 + N_2, LS_1 + LS_2, SS_1 + SS_2).
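A minimal sketch of a clustering feature and the statistics it supports (class and method names are illustrative):

```python
import numpy as np

class ClusteringFeature:
    """CF = (N, LS, SS) for a micro-cluster of d-dimensional points."""

    def __init__(self, points):
        points = np.atleast_2d(points)
        self.n = len(points)
        self.ls = points.sum(axis=0)         # linear sum, a d-dimensional vector
        self.ss = (points ** 2).sum(axis=0)  # component-wise square sum, also a vector

    def merge(self, other):
        """Additivity theorem: the CF of the union is the element-wise sum."""
        merged = ClusteringFeature(np.empty((0, len(self.ls))))
        merged.n = self.n + other.n
        merged.ls = self.ls + other.ls
        merged.ss = self.ss + other.ss
        return merged

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # root of the average squared distance of the points to the centroid
        return np.sqrt(np.sum(self.ss / self.n - (self.ls / self.n) ** 2))

rng = np.random.default_rng(0)
a = ClusteringFeature(rng.normal(0, 1, (50, 2)))
b = ClusteringFeature(rng.normal(4, 1, (30, 2)))
print(a.merge(b).centroid(), a.merge(b).radius())
```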
CF-tree
The CF-tree is an index similar to a B+ tree that contains the CF representations of the micro-clusters.
The tree-structure can be easily updated without scanning the entire data, by virtue of a nice data organization.
A CF-tree with parameters B (branching factor), L (leaf capacity), and T (threshold) is a height-balanced tree such that
- An internal node contains at most B entries
- A leaf node contains at most L CF entries
- The diameter (or radius) of each entry in a leaf node is at most the threshold T, according to a user-defined distance metric
- Leaf nodes are connected via prev and next pointers
The tree is constructed incrementally, one point at a time:
- Transform the point into a CF vector (the CF of a single-point cluster)
- Insert the CF into the closest leaf entry, in the same way as an insertion into a B+ tree
- If the threshold T is violated by the insertion, split the leaf node and reorganize the tree, analogously to a B+ tree
The BIRCH Algorithm
- Scan all data points and build an in-memory CF-tree, using the given amount of memory and recycling space as needed.
- (optional) Condense the tree to a desirable size by building a smaller CF-tree.
- Global clustering: apply any clustering algorithm to the CF vectors of the leaf entries (see the usage sketch below).
- (optional) Refine the clusters with additional passes over the data.
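A usage sketch with scikit-learn's Birch implementation (assuming scikit-learn is available), which follows this scheme: build a CF-tree, then run a global clustering on the leaf entries. The parameter values here are arbitrary.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

# Toy dataset: three groups of 2-D points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, (200, 2)) for c in (0, 5, 10)])

# threshold plays the role of T, branching_factor that of B; n_clusters is the
# global clustering step applied to the CF-tree leaf entries (here k-means, k=3).
birch = Birch(threshold=0.8, branching_factor=50, n_clusters=KMeans(n_clusters=3))
labels = birch.fit_predict(X)
print(np.bincount(labels))
```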
Pros and Cons
👍
- Compression factor can be tuned to the available memory
- Efficient construction of a micro-clustering
- Good clustering results when iterative partitioning algorithms such as k-means and k-medoid are applied to the leaf entries of the CF-tree only
👎
- Only applicable to data from a Euclidean vector space
- Handles only numeric data
- Sensitive to the order of data records
- Entries are limited by disk page size
- Different parameters to tune
Subspace Clustering
Problem with high dimensional data: curse of dimensionality
A general phenomenon that affects the data points as the number of dimensions increases:
- Sparsification: the data points grow further apart as the dimensionality increases
- Incomparability: the probability that a point falls into a sphere of a given radius becomes negligible, and the distances of the points from a center (or from one another) become more and more similar as the dimensionality increases
A cluster in low dimension might not appear as such in higher dimension.
Subspace clustering attempts to tackle this problem by clustering the points in different subsets of the dimensions. Intuitively, in 3D it is like projecting the points on some of the coordinates.
The selection of the coordinates onto which to project is the main difference among subspace clustering algorithms.
Another major difference is the clustering algorithm employed in the subspaces.
CLIQUE
One of the first algorithms for subspace clustering.
Density-based and grid-based: divide the space into cubic regions, prune regions that contain fewer than a certain number of points (the density threshold) in low dimensions, and repeat the process on subspaces obtained by combining dimensions that contain at least one dense region.
The naive approach computes a clustering for every subset of the dimensions. However, with d dimensions this means considering 2^d subspaces, which is prohibitive.
CLIQUE exploits the monotonicity of density: a region that is dense in a set of dimensions is dense in every subset of those dimensions. A cluster is the union of adjacent dense regions. Therefore, we can construct a greedy algorithm based on pruning regions that are not dense.
The idea is to start from one-dimensional regions and iteratively include dimensions whose regions were not pruned in the previous step.
The greedy algorithm in CLIQUE first finds 1-dimensional candidate regions, then generates 2-dimensional candidates from the 1-dimensional ones, prunes the candidates using the monotonicity of density, and repeats the process with increasing dimensionality until no candidate is generated.
Pseudo Code of CLIQUE
Input: dataset D, region size ξ (number of intervals per dimension), density threshold τ
C_1 ← dense candidate regions in 1 dimension; k ← 1
repeat
- k ← k + 1
- Generate all candidate k-dimensional cells from C_{k-1}
- C_k ← the candidates that contain at least τ points  // prune cells with fewer than τ points
until C_k = ∅
Compute clusters from each C_k
Summarize the clusters
// In the last two steps, CLIQUE computes the clusters from the dense regions and summarizes them.
Compute clusters from the union of adjacent candidate regions. For each candidate subspace of dimension k, build a graph with a node for each dense region and an edge between two regions if they are adjacent. The connected components of the graph are the clusters. This step takes, for each candidate subspace, time at most quadratic in the number of its dense regions.
Summarize each cluster with a minimal number of inequalities, by defining boundaries on the regions: since each region is described by a set of dimensions, its boundaries are intervals of values in those dimensions that describe the region's hyper-rectangle.
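A minimal sketch of CLIQUE's bottom-up search for dense units (candidate generation plus pruning only; the connected-component and summarization steps are omitted). The equi-width discretization and all names are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def clique_dense_units(X, xi, tau):
    """Return, per dimensionality k, the dense units found by CLIQUE's
    apriori-style bottom-up search. A unit is a frozenset of
    (dimension, interval-index) pairs."""
    n, d = X.shape
    # equi-width discretization into xi intervals per dimension
    mins, maxs = X.min(axis=0), X.max(axis=0)
    bins = np.minimum(((X - mins) / (maxs - mins + 1e-12) * xi).astype(int), xi - 1)

    def support(unit):
        mask = np.ones(n, dtype=bool)
        for dim, interval in unit:
            mask &= bins[:, dim] == interval
        return int(mask.sum())

    # 1-dimensional dense units
    dense = {1: {frozenset({(j, i)}) for j in range(d) for i in range(xi)
                 if support({(j, i)}) >= tau}}
    k = 1
    while dense[k]:
        candidates = set()
        for u, v in combinations(dense[k], 2):
            cand = u | v
            # join step: the two units must differ in exactly one dimension
            if len(cand) == k + 1 and len({dim for dim, _ in cand}) == k + 1:
                # prune step: every k-dimensional face must itself be dense
                if all(frozenset(f) in dense[k] for f in combinations(cand, k)):
                    candidates.add(cand)
        dense[k + 1] = {c for c in candidates if support(c) >= tau}
        k += 1
    return {k: units for k, units in dense.items() if units}
```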
Pros and Cons
👍
- Automatic detection of subspaces with clusters
- Automatic detection of clusters
- No assumptions about the distribution of data
- Insensitivity to the order of the data
- Good scalability w.r.t. the number of data points
👎
- Accuracy of result depends on the number of partitions
- Requires heuristics to avoid constructing all subspaces
SUBCLU
CLIQUE might miss clusters due to the early pruning of grid regions that a cluster only partially overlaps (each such region can fall below the density threshold).
SUBCLU remedies CLIQUE's deficiency by using DBSCAN as the basis to find density-based clusters in a subspace. Without entering into details, the definitions of core objects, density-reachability, and density-connectivity easily adapt to subspaces. Similarly, SUBCLU exploits the monotonicity of density-connectivity:
- If two points are density-connected in a k-dimensional subspace, they are also density-connected in all projections of that subspace onto fewer dimensions.
The algorithm iteratively applies DBSCAN at each step, generating (k+1)-dimensional candidate subspaces from the k-dimensional ones as in CLIQUE (a simplified sketch follows).
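A simplified sketch reusing the apriori-style candidate generation from the CLIQUE sketch above, but testing each candidate subspace with scikit-learn's DBSCAN. Note that the actual SUBCLU restricts each DBSCAN run to the points already clustered in a lower-dimensional subspace; this sketch simply re-runs it on all points:

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

def subclu(X, eps, min_pts):
    """Return, per dimensionality k, the subspaces (frozensets of dimensions)
    in which DBSCAN finds at least one cluster."""
    n, d = X.shape

    def has_cluster(dims):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X[:, sorted(dims)])
        return labels.max() >= 0          # only -1 means nothing but noise

    found = {1: {frozenset({j}) for j in range(d) if has_cluster({j})}}
    k = 1
    while found[k]:
        candidates = set()
        for s, t in combinations(found[k], 2):
            cand = s | t
            # join step + monotonicity pruning, exactly as in CLIQUE
            if len(cand) == k + 1 and all(
                    frozenset(sub) in found[k] for sub in combinations(cand, k)):
                candidates.add(cand)
        found[k + 1] = {c for c in candidates if has_cluster(c)}
        k += 1
    return {k: subs for k, subs in found.items() if subs}
```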
Pros and Cons
👍
- Automatic detection of subspaces containing clusters
- Automatic detection of clusters
- No assumptions on data distributions (density-based approach)
👎
- Parameter setting highly affects clustering results
- Inherits the drawbacks of DBSCAN, such as the sensitivity to clusters with different densities.
- Takes at least quadratic time in the number of points (the cost of DBSCAN without an index), and typically more, depending on the maximum dimensionality of the subspaces that contain clusters.
Projected Clustering
It tackles the curse of dimensionality differently from subspace clustering.
Main idea: project each point individually into the one subspace in which the point is clustered.
The goal of projected clustering is to correctly identify the right projections, so that each point is assigned to exactly one cluster. Projected clustering avoids rediscovering the same clusters in multiple subspaces but, at the same time, might miss clusters.
PROCLUS
A representative method for projected clustering.
Top-down approach that iteratively refines clusters and projections starting from an initial clustering of the data in the full dimension.
PROCLUS uses a k-medoid algorithm to find the clusters.
Projected cluster: a set of objects C and a set of dimensions D, such that the objects in C are closely clustered in the subspace defined by D.
Closely clustered: a cluster is closely clustered if it is compact under the Manhattan segmental distance,
which measures the average Manhattan distance between objects over the projected dimensions of the cluster.
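Concretely, relative to a set of dimensions D the Manhattan segmental distance between two points x and y is

$$
d_D(x, y) = \frac{\sum_{i \in D} |x_i - y_i|}{|D|}
$$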
PROCLUS requires the specification of the number of clusters k and of the average number of dimensions l per cluster. PROCLUS follows a multi-step approach in three phases.
Initialization: selects the initial set of medoids.
Select a sample of representative data points from the entire dataset; the sample should ideally characterize the entire dataset and contain at least one medoid from each of the k clusters. In practice, since finding such an ideal sample is hard, PROCLUS greedily selects well-separated points.
Iterative: refines the medoids and computes the projections.
The main body of PROCLUS, which determines the best set of dimensions for a set of candidate medoids:
- Choose a random set of k medoids from the representative points
- Determine the optimal set of dimensions for each medoid
- Assign all objects to the nearest medoid
- Update the current clustering if it is better than the previous one, until termination
The iteration continues from the second step, replacing bad medoids with random medoids from the sample of representatives, until the clusters do not change anymore.
First issue: how to determine an optimal set of dimensions for each medoid?
PROCLUS proposes a score that measures the reduction in dispersion of cluster i on a dimension j, as opposed to the full dimensionality d.
For each medoid m_i, calculate the average distance X_{i,j} along dimension j of the objects that lie in the hyper-sphere centered at m_i whose radius is the distance from m_i to its nearest medoid. The dispersion of cluster i is the average Y_i of the X_{i,j} over all dimensions j. The deviation X_{i,j} - Y_i is negative if cluster i is more compact along dimension j than it is on average over the full dimensionality.
The score is the normalized deviation
$$
Z_{i,j} = \frac{X_{i,j} - Y_i}{\sigma_i}, \qquad
\sigma_i = \sqrt{\frac{\sum_{j=1}^{d} (X_{i,j} - Y_i)^2}{d - 1}}
$$
where σ_i is the empirical standard deviation of the X_{i,j} in cluster i. The dimensions with the smallest (most negative) scores are selected for each medoid.
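A sketch of this dimension-selection step under the description above (the per-medoid localities and the greedy pick of the k·l smallest scores are illustrative assumptions of how the pieces fit together):

```python
import numpy as np

def find_dimensions(X, medoids, localities, l):
    """PROCLUS-style dimension selection (sketch).
    X: data matrix; medoids: indices of the k medoids;
    localities: for each medoid, the indices of the points in its locality L_i;
    l: average number of dimensions per cluster.
    Returns, for each medoid, the list of selected dimensions."""
    k, d = len(medoids), X.shape[1]
    Z = np.empty((k, d))
    for i, (m, L) in enumerate(zip(medoids, localities)):
        X_ij = np.abs(X[L] - X[m]).mean(axis=0)        # avg distance per dimension
        Y_i = X_ij.mean()                              # dispersion over all dimensions
        sigma_i = np.sqrt(((X_ij - Y_i) ** 2).sum() / (d - 1))
        Z[i] = (X_ij - Y_i) / sigma_i                  # normalized deviation
    # greedily pick the k*l smallest scores, with at least 2 dimensions per medoid
    dims = [list(np.argsort(Z[i])[:2]) for i in range(k)]
    taken = {(i, j) for i in range(k) for j in dims[i]}
    rest = sorted((Z[i, j], i, j) for i in range(k) for j in range(d)
                  if (i, j) not in taken)
    for _, i, j in rest[: max(0, k * l - 2 * k)]:
        dims[i].append(j)
    return dims
```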