When the data is high-dimensional or the clusters have different densities, methods like DBSCAN and $k$-means may fail.

Hierarchical clustering partitions the space into progressively smaller clusters.

  • Hierarchical agglomerative clustering: single-link, complete-link, and average-link strategies merge clusters to form larger clusters.
  • OPTICS: orders density-reachable points to reveal the cluster structure.
  • BIRCH: summarizes the data into compact cluster features before clustering.

Subspace clustering finds different clusters in different subsets of the dimensions in which the data lies.

  • CLIQUE: divides the space into a grid and exploits an Apriori-style principle on density to find dense subspaces.
  • SUBCLU: similar to CLIQUE, but uses DBSCAN to find dense clusters in each subspace.
  • PROCLUS: a projected clustering algorithm that finds the best dimensions for each cluster.

Agglomerative Hierarchical Clustering

Dendrogram: one of the main visualization tools for hierarchical clustering.

  • A tree in which the root is a cluster containing all the points in the dataset $\mathcal D$ and the leaves are clusters of single points.
  • Can be constructed bottom-up (agglomerative) by progressively merging smaller clusters into larger ones, or top-down (divisive) by splitting larger clusters.
  • The height of a branch in the dendrogram shows the distance at which the two clusters were merged.
  • A cut-off line across the dendrogram identifies a partition of the data points.

Agglomerative hierarchical clustering takes a bottom-up approach.

  • Small clusters are progressively merged to form larger clusters.
  • Requires the choice of a distance function between clusters.

Pseudo-code of Agglomerative Hierarchical Clustering

Input: dataset $\mathcal D$, clustering distance $dist$

  • $t \leftarrow 1$

  • $\mathcal C^t \leftarrow \{\{\mathbb x\} : \mathbb x \in \mathcal D\}$

  • while $\mathcal C^t \neq \{\mathcal D\}$ do

    • for each $C_i, C_j \in \mathcal C^t$ do
      • Compute $dist(C_i, C_j)$

    // Merge the two clusters with the minimum distance

    • $C_i, C_j \leftarrow \arg\min_{C_i, C_j \in \mathcal C^t} dist(C_i, C_j)$

    // Create a new level in the dendrogram

    • $\mathcal C^{t+1} \leftarrow (\mathcal C^t \setminus \{C_i, C_j\}) \cup \{C_i \cup C_j\}$
    • $t \leftarrow t + 1$
  • return $\mathcal C$
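A minimal Python sketch of this procedure with single-link as the cluster distance; it recomputes pairwise distances naively, so it illustrates the pseudo-code rather than being efficient (all names are illustrative, not from the original notes):

```python
import numpy as np
from itertools import combinations

def point_dist(x, y):
    # Euclidean distance between two points (assumed numeric vectors)
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def single_link(Ci, Cj):
    # Single-link distance: minimum pairwise distance between two clusters
    return min(point_dist(x, y) for x in Ci for y in Cj)

def agglomerative(points, cluster_dist=single_link):
    # Start with every point as a singleton cluster
    clusters = [[p] for p in points]
    levels = []  # (merge distance, merged cluster): one entry per dendrogram level
    while len(clusters) > 1:
        # Find the pair of clusters at minimum distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        d = cluster_dist(clusters[i], clusters[j])
        merged = clusters[i] + clusters[j]
        # Remove C_i and C_j and add their union (new dendrogram level)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append((d, merged))
    return levels

# Example usage
for d, cluster in agglomerative([(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9)]):
    print(round(d, 3), cluster)
```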

Distance Among Clusters

The choice of the distance between clusters determines the quality of the clustering.

  • Single-link distance $dist_{sl}(C_i,C_j)=\min_{\mathbb x\in C_i,\mathbb y\in C_j}dist(\mathbb x,\mathbb y)$: the minimum distance between points of the two clusters
  • Complete-link distance $dist_{cl}(C_i,C_j)=\max_{\mathbb x\in C_i,\mathbb y\in C_j}dist(\mathbb x,\mathbb y)$: the maximum distance between points of the two clusters
  • Average-link distance $dist_{al}(C_i,C_j)=\frac{1}{|C_i||C_j|}\sum_{\mathbb x\in C_i}\sum_{\mathbb y\in C_j}dist(\mathbb x,\mathbb y)$: the average distance between points of the two clusters
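In practice these linkage strategies are available off the shelf, e.g. in SciPy; a short usage sketch (the data and the cut-off value are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])

# 'single', 'complete' and 'average' correspond to the three linkage strategies above
Z = linkage(X, method='average')

# Cut the dendrogram at distance 1.0 to obtain a flat partition
labels = fcluster(Z, t=1.0, criterion='distance')
print(labels)
```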

OPTICS

DBSCAN performs poorly when the clusters have different densities.

OPTICS, short for Ordering Points To Identify the Clustering Structure, addresses this problem.

DBSCAN uses a fixed value of $\epsilon$, so a point might have a very large $\epsilon$-neighborhood $N_\epsilon(\mathbb x)$. Observation: for each point we can define the minimum distance at which it becomes a core point. This distance is called the core distance $cdist$.

Core distance: the smallest distance s.t. $\mathbb x$ is a core point

$$cdist(\mathbb x)=\begin{cases} MinPts\text{-th smallest distance } dist(\mathbb x,\mathbb y),\ \mathbb y\in N_\epsilon(\mathbb x) & \text{if } |N_\epsilon(\mathbb x)|\ge MinPts \\ ? & \text{otherwise} \end{cases}$$

Reachability distance: the smallest distance at which $\mathbb x$ is directly density-reachable from a core point $\mathbb y$.

$$rdist(\mathbb x,\mathbb y)=\begin{cases} dist(\mathbb x,\mathbb y) & \text{if } dist(\mathbb x,\mathbb y)\ge cdist(\mathbb y) \\ cdist(\mathbb y) & \text{if } dist(\mathbb x,\mathbb y)<cdist(\mathbb y) \\ ? & \text{if } dist(\mathbb x,\mathbb y)>\epsilon \end{cases}$$
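A small NumPy sketch of these two definitions, with None playing the role of the undefined value $?$ (function names are illustrative):

```python
import numpy as np

def core_distance(X, i, eps, min_pts):
    # Distances from point i to all points (including itself)
    d = np.linalg.norm(X - X[i], axis=1)
    neighborhood = np.sort(d[d <= eps])
    if len(neighborhood) < min_pts:
        return None  # not a core point at radius eps
    # MinPts-th smallest distance within the eps-neighborhood
    return neighborhood[min_pts - 1]

def reachability_distance(X, i, j, eps, min_pts):
    # Reachability of point i from point j
    cdist_j = core_distance(X, j, eps, min_pts)
    d = float(np.linalg.norm(X[i] - X[j]))
    if cdist_j is None or d > eps:
        return None  # undefined
    return max(cdist_j, d)
```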

OPTICS is built upon the idea of core distance and reachability distance.

  • OPTICS maintains a basic data structure that stores the shortest reachability-distance seen so far
  • OPTICS visits each point once and moves to the next point based on the smallest reachability distance
  • OPTICS outputs the ordering of the points together with the core distance and reachability distance of each point, which can be visualized in a reachability plot.

Pseudo Code of OPTICS

  • for each $\mathbb x\in\mathcal D$ do

    • if $\mathbb x$ is not processed then

      • insert $(\mathbb x, ?)$ into $CL$
    • while $CL\neq\emptyset$ do

      • select the first element $(\mathbb x, rdist)\in CL$

      • retrieve $N_\epsilon(\mathbb x)$ and $cdist\leftarrow cdist(\mathbb x)$

      • $\mathbb x.processed\leftarrow$ True

      • write $(\mathbb x, rdist, cdist)$ to the output

      • if $\mathbb x$ is a core object at a distance $\le\epsilon$ then

        • for each $\mathbb p\in N_\epsilon(\mathbb x)$ do

          • if $\mathbb p.processed =$ false then

            • $rdist_p\leftarrow rdist(\mathbb p, \mathbb x)$

            • if $(\mathbb p, \_)\notin CL$ then

              • insert $(\mathbb p, rdist_p)$ into $CL$

            • else if $(\mathbb p, old\_rdist)\in CL$ and $rdist_p < old\_rdist$ then

              • update $(\mathbb p, rdist_p)$ in $CL$
  • return the ordered output with the reachability and core distances

Finding a density-based clustering w.r.t. $\epsilon^*\le\epsilon$ and $MinPts$ is easy:

  • start a cluster at a point $\mathbb x$ with $cdist(\mathbb x)\le\epsilon^*$ and $rdist(\mathbb x,\_)>\epsilon^*$
  • continue adding the subsequent points in the ordering while $rdist(\mathbb x,\_)\le\epsilon^*$
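scikit-learn's OPTICS exposes exactly these quantities (ordering, core distances, reachability) and provides cluster_optics_dbscan to extract a flat clustering for a smaller $\epsilon^*$; a brief sketch on synthetic data (parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

rng = np.random.default_rng(0)
# Two blobs with different densities plus some noise (illustrative data)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 1.0, (100, 2)),
               rng.uniform(-2, 8, (20, 2))])

optics = OPTICS(min_samples=5, max_eps=2.0).fit(X)

# Reachability plot values in visiting order
reachability = optics.reachability_[optics.ordering_]

# Extract a flat, DBSCAN-like clustering for a smaller eps* <= max_eps
labels = cluster_optics_dbscan(reachability=optics.reachability_,
                               core_distances=optics.core_distances_,
                               ordering=optics.ordering_,
                               eps=0.5)
print(np.unique(labels))
```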

Running time. OPTICS's running time is bounded by the running time of DBSCAN, $\Omega(n^{4/3})$. OPTICS only maintains additional data structures, which do not affect the overall time complexity.

Analysis of reachability plot. The reachability plot shows both reachability and the core distance. Contiguous areas in the plot with large reachability and core-distance indicate regions with low density. At the same time, low reachability-distance indicates high-density areas. A jump in the plot occurs when the next point in order falls outside the current cluster.

OPTICS is relatively insensitive to the parameter settings; results are good as long as the parameters are large enough.

Pros and Cons

👍

  • Doesn’t require the number of clusters to be known in advance.
  • Very robust
  • Computes a complete hierarchy of clusters
  • Good visualization
  • A flat partition can be derived from the dendrogram or the reachability plot

👎

  • No backtracking: splits and merges are greedy and cannot be undone
  • May not scale well: many objects must be examined to decide splits and merges. The runtime of standard hierarchical methods is $O(n\log n)$; OPTICS without a spatial index is $O(n^2)$

BIRCH

Balanced Iterative Reducing and Clustering using Hierarchies.

It first summarizes the data and then clusters the summarized data with any clustering algorithm.

Can cluster huge datasets that do not fit into memory, with only very small loss in accuracy.

Incremental algorithm for accommodating new data points.

The 2 main phases of BIRCH

  1. (Summarize) Scan the dataset $\mathcal D$ to build an in-memory Clustering Feature (CF) tree. The CF-tree is a multi-level compression of the data that preserves its inherent clustering structure.
  2. Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.

BIRCH scales linearly, $O(n)$, in the size of the dataset. To improve quality, BIRCH can optionally scan the data multiple times.

Fast, but can only deal with numerical data.

Basic Idea of BIRCH

  • Constructs a partitioning of the data into micro-clusters using an efficient index-like structure
  • The micro-clusters are representations of multiple points described by Clustering Features
  • CFs are organized hierarchically in a balanced tree
  • A standard clustering algorithm is then applied to the leaf entries of the CF-tree.

Clustering Features

A clustering feature $CF=(N, LS, SS)$ of a micro-cluster $C\subseteq\mathcal D$ consists of

  • $N=|C|$: number of points in the micro-cluster
  • $LS=\sum_{\mathbb x\in C}\mathbb x$: linear sum of the data points in $C$
  • $SS=\sum_{\mathbb x\in C}\mathbb x^2$: element-wise square sum of the data points in $C$

Note that both $LS$ and $SS$ are vectors. The CF representation, although quite simple, is enough to compute a number of statistics about a cluster, such as centroids, measures of compactness, and distances between clusters.

Additivity Theorem: CFs satisfy an additivity property that allows them to be computed incrementally on unions of micro-clusters. If two micro-clusters are merged, the CF of the new cluster is simply the element-wise sum of the respective CFs.
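A minimal sketch of a CF and the statistics it supports (centroid, and a radius-style compactness measure); the class name and helper methods are illustrative, not part of the original notes:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    n: int            # N: number of points
    ls: np.ndarray    # LS: linear sum of the points
    ss: np.ndarray    # SS: element-wise square sum of the points

    @staticmethod
    def of_point(x):
        x = np.asarray(x, dtype=float)
        return CF(1, x.copy(), x * x)

    def merge(self, other):
        # Additivity: the CF of the union is the element-wise sum
        return CF(self.n + other.n, self.ls + other.ls, self.ss + other.ss)

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # Root mean squared distance of the points from the centroid (compactness)
        var = self.ss / self.n - self.centroid() ** 2
        return float(np.sqrt(np.maximum(var, 0).sum()))

# Example usage: merge two single-point CFs
a = CF.of_point([1.0, 2.0]).merge(CF.of_point([2.0, 2.0]))
print(a.centroid(), a.radius())
```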

CF-tree

The CF-tree is an index structure similar to a B+-tree that contains the CF representations of the micro-clusters.

The tree-structure can be easily updated without scanning the entire data, by virtue of a nice data organization.

A CF-tree with parameters $B, L, T$ is a tree-like structure s.t.

  • An internal node contains at most $B$ entries $[CF_i, child_i]$
  • A leaf node contains at most $L$ entries $[CF_i]$
  • The diameter or radius of all entries in a leaf node is at most $T$ according to a user-defined distance metric
  • Leaf nodes are connected via prev and next pointers

The tree is constructed bottom-up in the following manner:

  • Transform each point $\mathbb x$ into a CF vector $CF_{\mathbb x}=(1,\mathbb x,\mathbb x^2)$
  • Insert $\mathbb x$ in the same way as in a B+-tree.
  • If the threshold $T$ is violated by the insertion, split the leaf node and reorganize the tree, analogously to a B+-tree.

The BIRCH Algorithm

  1. Scan all data points and build an in-memory CF-tree using the given amount of memory, recycling space as needed.

  2. (optional) Condense the tree to a desirable size by building a smaller CF-tree.

  3. Global clustering: apply any clustering algorithm to the CF vectors of the leaf entries.

  4. (optional) Refine the clusters with additional passes over the data.
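For reference, scikit-learn ships a BIRCH implementation whose parameters roughly map onto these phases (threshold ≈ $T$, branching_factor ≈ $B$, plus a final global clustering step); a short usage sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (500, 2)), rng.normal(5, 0.5, (500, 2))])

# Phases 1-2: build the CF-tree; phase 3: global clustering into n_clusters groups
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = birch.fit_predict(X)

# Centroids of the leaf micro-clusters (subcluster centers)
print(birch.subcluster_centers_.shape, np.bincount(labels))
```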

Pros and Cons

👍

  • Compression factor can be tuned to the available memory
  • Efficient construction of a micro-clustering in $O(n)$
  • Good clustering results when iterative partitioning algorithms such as $k$-means and $k$-medoid are applied to the leaf entries of the CF-tree

👎

  • Only for data from a Euclidean vector space
  • Handles only numeric data
  • Sensitive to the order of the data records
  • Node entries are limited by the disk page size
  • Several parameters to tune

Subspace Clustering

Problem with high-dimensional data: the curse of dimensionality.

As the number of dimensions increases, two general phenomena affect the data points:

  • Sparsification: the data points grow further apart as the dimension increases
  • Incomparability: the probability that a point falls into a sphere of a given radius becomes negligible, so the distances of points from a center (or from one another) concentrate and become almost indistinguishable as the dimension grows
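A quick numerical illustration of this concentration effect, using random points in the unit hypercube (a sketch; the exact numbers depend on the sample):

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))                      # 1000 random points in the d-dimensional unit cube
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point to all others
    # The relative spread (max - min) / min shrinks as d grows
    print(d, round((dists.max() - dists.min()) / dists.min(), 3))
```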

A cluster in a low-dimensional subspace might not appear as such in higher dimensions.

Subspace clustering attempts to tackle this problem by clustering the points in different subsets of the dimensions. Intuitively, in 3D it is like projecting the points on some of the coordinates.

The selection of the coordinates onto which to project is the main difference between subspace clustering algorithms.

Another major difference is the clustering algorithm employed in the subspaces.

CLIQUE

One of the first algorithms for subspace clustering.

Density-based: divide the space into a grid of $n/h$ cubic regions, prune the regions that contain fewer than $\xi$ points (the density threshold) in low dimensions, and repeat the process on the subspaces obtained by aggregating dimensions that contain at least one dense region.

The naive approach computes a clustering for every subset of the dimensions. However, this approach considers $O(2^d)$ subspaces, which is prohibitive.

CLIQUE exploits the monotonicity of density: a region that is dense in $d$ dimensions is dense in every subset of those $d$ dimensions. A cluster is a union of adjacent dense regions. Therefore, we can construct a greedy algorithm that prunes regions that are not dense.

The idea starts with an initially empty set of dimensions and iteratively includes dimensions that were not pruned in the previous step.

The greedy algorithm in CLIQUE first finds the 1-dimensional candidate regions $R_1$, then generates 2-dimensional candidates $R_2$ from the 1-dimensional ones, prunes the candidates using the monotonicity of density, and repeats the process with increasing dimension until no candidate is generated.

Pseudo Code of CLIQUE

Input: dataset $\mathcal D$, region size $h$, density threshold $\xi$

  • $R_1\leftarrow$ candidate dense regions in 1 dimension

  • $d\leftarrow 2$

  • repeat

    • $R_d\leftarrow$ generate all candidate $d$-dimensional cells from $R_{d-1}$

    // Prune cells with fewer than $\xi$ points

    • $R_d\leftarrow\{R\in R_d : |R|\ge\xi\}$
    • $d\leftarrow d+1$
  • until $R_d=\emptyset$

  • $\mathcal C\leftarrow$ compute clusters from each $R_i$

  • Summarize the clusters

// In the last two steps, CLIQUE computes the clusters from the dense regions and summarizes them.

Compute clusters from the union of adjacent candidate regions. Build a graph for each candidate subspace in dimension $d$: the graph has a node for each dense region and an edge between two regions if they are adjacent. The connected components of the graph are the clusters. This step takes $O(2dn)$ time for each candidate subspace.

Summarize each cluster with a minimal number of inequalities that define boundaries on its regions. Since each region is described by a number of dimensions, its boundaries are the values along those dimensions that delimit the region's hypercube.
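A rough Python sketch of CLIQUE's candidate generation and pruning, restricted to 1- and 2-dimensional subspaces (grid size, threshold, and data are illustrative; the full algorithm iterates to higher dimensions and extracts connected components of adjacent dense cells as described above):

```python
import numpy as np
from itertools import combinations, product
from collections import defaultdict

def cell_counts(X, h, dims):
    # Count how many points fall into each grid cell of side h,
    # looking only at the coordinates in `dims`
    counts = defaultdict(int)
    for x in X:
        counts[tuple(int(np.floor(x[d] / h)) for d in dims)] += 1
    return counts

def clique_candidates_2d(X, h, xi):
    """Candidate generation up to 2 dimensions: find dense 1-D cells,
    combine them into candidate 2-D cells (Apriori-style), and keep
    only the candidates that are actually dense."""
    d = X.shape[1]
    dense_1d = {j: {c[0] for c, n in cell_counts(X, h, (j,)).items() if n >= xi}
                for j in range(d)}
    dense_2d = {}
    for i, j in combinations(range(d), 2):
        if not dense_1d[i] or not dense_1d[j]:
            continue  # monotonicity: a dense 2-D cell needs dense 1-D projections
        counts = cell_counts(X, h, (i, j))
        candidates = product(dense_1d[i], dense_1d[j])
        dense = {c for c in candidates if counts.get(c, 0) >= xi}
        if dense:
            dense_2d[(i, j)] = dense
    return dense_1d, dense_2d

# Example usage on random data (illustrative parameters)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (200, 3)), rng.normal(2, 0.2, (200, 3))])
d1, d2 = clique_candidates_2d(X, h=0.5, xi=30)
print({k: len(v) for k, v in d2.items()})
```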

Pros and Cons

👍

  • Automatic detection of subspaces with clusters
  • Automatic detection of clusters
  • No assumptions about the distribution of data
  • Insensitivity to the order of the data
  • Good scalability w.r.t. the number $n$ of data points

👎

  • The accuracy of the result depends on the number of grid partitions (i.e., on the region size $h$)
  • Requires heuristics to avoid constructing all subspaces

SUBCLU

CLIQUE might miss clusters due to an early pruning of regions that cut across clusters.

SUBCLU remedies CLIQUE's deficiency by using DBSCAN to find dense clusters in each subspace. Without entering into the details, the definitions of core objects, density reachability, and density connectivity are easily adapted to subspaces. Like CLIQUE, SUBCLU exploits the monotonicity of density connectivity:

  • If $\mathbb x,\mathbb y$ are density-connected in a $d$-dimensional subspace, they are also density-connected in every $(d-1)$-dimensional projection of that subspace.

The algorithm proceeds as in CLIQUE, iteratively applying DBSCAN at each step to generate the $d$-dimensional clusters from the $(d-1)$-dimensional ones.
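A rough sketch of this bottom-up scheme using scikit-learn's DBSCAN on each candidate subspace, limited to the first two levels (parameters are illustrative and the full SUBCLU bookkeeping is omitted):

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import DBSCAN

def subspace_clusters(X, dims, eps=0.3, min_pts=5):
    # Run DBSCAN on the projection of X onto the dimensions in `dims`
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X[:, dims])
    return labels if (labels != -1).any() else None  # None if no cluster is found

def subclu_two_levels(X, eps=0.3, min_pts=5):
    d = X.shape[1]
    # 1-dimensional subspaces that contain at least one cluster
    s1 = {(j,): subspace_clusters(X, [j], eps, min_pts) for j in range(d)}
    s1 = {k: v for k, v in s1.items() if v is not None}
    # 2-dimensional candidates: both 1-D projections must contain clusters (monotonicity)
    s2 = {}
    for (i,), (j,) in combinations(sorted(s1), 2):
        labels = subspace_clusters(X, [i, j], eps, min_pts)
        if labels is not None:
            s2[(i, j)] = labels
    return s1, s2
```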

Pros and Cons

👍

  • Automatic detection of subspaces containing clusters
  • Automatic detection of clusters
  • No assumptions on data distributions (density-based approach)

👎

  • Parameter setting highly affects clustering results
  • Inherits the drawbacks of DBSCAN, such as the sensitivity to clusters with different densities.
  • At least $O(n^2)$ time; typically worse, depending on the maximum dimensionality of the subspaces containing clusters.

Projected Clustering

It tackles the curse of dimensionality differently from subspace clustering.

Main idea: project each point individually into the single subspace in which the point is clustered.

The goal of projected clustering is to correctly identify the right projections, s.t. each point is assigned to exactly one cluster. Projected clustering avoids rediscovering the same clusters in multiple subspaces but, at the same time, might miss clusters.

PROCLUS

A representative method for projected clustering.

Top-down approach that iteratively refines clusters and projections starting from an initial clustering of the data in the full dimension.

PROCLUS uses the $k$-medoid algorithm to find clusters.

Projected cluster: a set of objects $C$ and a set of dimensions $D_C$, s.t. the objects in $C$ are closely clustered in the subspace defined by $D_C$.

Closely clustered: a cluster $C$ is closely clustered if it is compact under the Manhattan segmental distance

$$d(\mathbb x,\mathbb y)=\frac{1}{|D_C|}\sum_{i\in D_C}|x_i-y_i|$$

which measures the (normalized) Manhattan distance between the objects in the projected dimensions $D_C$ of cluster $C$.
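For concreteness, a one-function sketch of this distance (the dimension indices passed in are illustrative):

```python
import numpy as np

def manhattan_segmental(x, y, dims):
    # Average of the per-dimension absolute differences over the projected dimensions
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.abs(x[dims] - y[dims]).sum() / len(dims)

print(manhattan_segmental([1.0, 5.0, 2.0], [2.0, -3.0, 4.0], dims=[0, 2]))  # -> 1.5
```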

PROCLUS requires the specification of the number $k$ of clusters and the average number $\hat d$ of dimensions. PROCLUS follows a multi-step approach in three phases:

  1. Initialization: selects the initial set of medoids

    Select a sample of representative data points from the entire dataset, where the sample should ideally characterize the entire dataset and should contain at least $k$ medoids. In practice, since finding such an ideal sample is hard, PROCLUS greedily selects $A\cdot k$ well-separated points.

  2. Iterative: refines the medoids and computes the projections.

    The main body of PROCLUS determines the best set of dimensions for a set of candidate medoids:

    • Choose a random set of $k$ medoids from the representative points
    • Determine the optimal set of dimensions for each medoid
    • Assign all objects to the nearest medoid
    • Update the current clustering if it is better than the previous one

    The iteration continues from step 2, replacing bad medoids with random medoids from the sample of representatives, until the clusters no longer change.

    First issue: how to determine an optimal set of dimensions for each medoid?

    PROCLUS proposes a score $Z_{ij}$ that measures the reduction in dispersion of the cluster $C_i$ on dimension $j$ as opposed to the full dimensionality $d$.

    For each medoid $\mathbb m_i$, calculate the average distance $X_{ij}$ along dimension $j$ of the objects in the hyper-sphere $\mathcal L_i$, whose radius is the distance from $\mathbb m_i$ to its nearest medoid. The dispersion of cluster $i$ is the average of $X_{ij}$ over all dimensions, $Y_i=\frac{1}{d}\sum_{j=1}^d X_{ij}$. The deviation $X_{ij}-Y_i$ is negative when the objects are closer along dimension $j$ than on average, i.e. dimension $j$ is a good dimension for cluster $i$.

    The score $Z_{ij}$ is the normalized deviation $X_{ij}-Y_i$: $Z_{ij}=\frac{X_{ij}-Y_i}{\sigma_i}$, where $\sigma_i$ is the empirical standard deviation of the $X_{ij}$ values in cluster $i$:

    $$\sigma_i=\sqrt{\frac{\sum_{j=1}^d (X_{ij}-Y_i)^2}{d-1}}$$
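A small sketch of the score computation for a single medoid, with illustrative $X_{ij}$ values:

```python
import numpy as np

def dimension_scores(X_i):
    # X_i[j] = average distance along dimension j of the points in the
    # hyper-sphere L_i around medoid m_i (illustrative values)
    Y_i = X_i.mean()
    sigma_i = np.sqrt(((X_i - Y_i) ** 2).sum() / (len(X_i) - 1))
    return (X_i - Y_i) / sigma_i   # Z_ij: the most negative scores mark the best dimensions

Z = dimension_scores(np.array([0.2, 1.5, 0.3, 1.4, 1.6]))
print(np.argsort(Z)[:2])  # indices of the two best (most negative) dimensions
```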