Density-based clustering departs from the rigid geometric structure of representative-based clustering and defines clusters as high-density regions separated by low-density regions.

Density is the number of points in a region in proportion to the size of that region. Different definitions of density lead to different density-based approaches.

DBSCAN

A density-based clustering algorithm grounded in the intuition that a point in a cluster should be density-reachable from any other point in that cluster, i.e., there should be "enough" points around each point in a cluster.

This criterion does not impose any constraint on the shape of a cluster, but only constrains the neighborhood of each point. Thus, DBSCAN can handle clusters of arbitrary shape.

Definitions

Neighborhood: For each point $\mathbf x$, we define its neighborhood as the set of points that are at distance at most $\epsilon$ from $\mathbf x$, formally $N_\epsilon(\mathbf x)=\{\mathbf y\in\mathcal D \mid dist(\mathbf x,\mathbf y)\le\epsilon\}$.

Additionally, a neighborhood is dense if it contains at least $MinPts$ points, i.e., $|N_\epsilon(\mathbf x)|\ge MinPts$.

Core object: A point $\mathbf x$ whose neighborhood contains at least $MinPts$ points.

Density-reachable: $\mathbf y$ is directly density-reachable from $\mathbf x$ w.r.t. $\epsilon$ and $MinPts$ if $\mathbf y\in N_\epsilon(\mathbf x)$ and $\mathbf x$ is a core object.

A point $\mathbf p$ is density-reachable from $\mathbf x$ if $\mathbf p$ is directly density-reachable from a point $\mathbf y$ in the neighborhood $N_\epsilon(\mathbf x)$, and so on: density reachability is the transitive closure of direct density reachability. Note that density reachability is not symmetric. In particular, since $\mathbf p$ might not have $MinPts$ points in its neighborhood, it may not be able to "reach" $\mathbf x$ through a sequence of dense neighborhoods.

Density-connected: $\mathbf x$ is density-connected to a point $\mathbf y$ if there is a point $\mathbf p$ such that both $\mathbf x$ and $\mathbf y$ are density-reachable from $\mathbf p$.

$\mathbf x$ and $\mathbf y$ are not necessarily density-reachable from each other in this case.

Density-based cluster: a non-empty subset $S\subseteq\mathcal D$ of the data $\mathcal D$ that satisfies both

  • Maximality: If a point $\mathbf x\in S$ and $\mathbf y$ is density-reachable from $\mathbf x$, then $\mathbf y\in S$.
  • Connectivity: Each object in $S$ is density-connected to all other objects in $S$.

A density-based clustering of a database $\mathcal D$ is a partition $\{C_1,\cdots,C_k;N\}$ s.t. $\bigcup_{i=1}^k C_i\cup N=\mathcal D$, $C_i\cap C_j=\emptyset$ for $j>i$, $i,j\in[1,k]$, and $C_i\cap N=\emptyset$, where $C_1,\cdots,C_k$ are the density-based clusters and $N=\mathcal D\setminus\{C_1,\cdots,C_k\}$ is the noise set of objects that do not belong to any cluster.

DBSCAN Algorithm

Pseudo code

Input: dataset $\mathcal D$, $\epsilon$, $MinPts$

  • for each $\mathbf x\in\mathcal D$ do

    • if $\mathbf x$ is not yet part of any cluster then

      • if $\mathbf x$ is a core object then

        $C\leftarrow$ the set of points density-reachable from $\mathbf x$

        Create a new cluster from $C$.

      • else

        $N\leftarrow N\cup\{\mathbf x\}$

The main theoretical idea behind the DBSCAN algorithm is that each object in a density-based cluster $C$ is density-reachable from any of its core objects, and nothing outside the cluster is density-reachable from those core objects.

DBSCAN is complete and ensures the two properties of maximality and connectivity. The algorithm is a simple iteration over the objects in the database, starting from an arbitrary point and collecting all points reachable from it.

Density-reachable objects are collected by performing successive $\epsilon$-neighborhood queries, each of which finds the points at distance at most $\epsilon$ from a given point.
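As a concrete illustration of the pseudocode above, here is a minimal Python sketch (not a reference implementation): it uses a naive $\mathcal O(n^2)$ neighborhood query, and the names `dbscan`, `eps`, and `min_pts` are illustrative stand-ins for the parameters $\epsilon$ and $MinPts$.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Naive DBSCAN sketch: label -1 marks noise, labels >= 0 mark clusters."""
    n = len(X)
    labels = np.full(n, -1)        # all points start as noise
    visited = np.zeros(n, bool)
    # naive epsilon-neighborhood query: O(n) per call, O(n^2) overall
    neighbors = lambda i: np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]
    cluster_id = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = neighbors(i)
        if len(seeds) < min_pts:   # not a core object: leave as noise (for now)
            continue
        # start a new cluster and collect all density-reachable points
        labels[i] = cluster_id
        seeds = list(seeds)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:    # border or unclaimed point joins the cluster
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                j_neighbors = neighbors(j)
                if len(j_neighbors) >= min_pts:   # j is a core object: expand from it
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

A point initially marked as noise can still be claimed later as a border point of a cluster, which mirrors the behavior of the pseudocode.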

Complexity Analysis

When dimensionality is high, DBSCAN runs in $\mathcal O(n^2)$. By using specialized indexes for spatial queries, e.g., $R^*$-trees, DBSCAN can run faster than CLARANS, which has a runtime complexity of $\mathcal O(k^3+nk)$.

| | $N_\epsilon$-query | DBSCAN |
| --- | --- | --- |
| Without support (worst case) | $\mathcal O(n)$ | $\mathcal O(n^2)$ |
| Tree-based support (e.g. $R^*$-tree) | $\mathcal O(\log n)$ | $\mathcal O(n\log n)$ |
| Direct access to the neighborhood | $\mathcal O(1)$ | $\mathcal O(n)$ |

Tuning $\epsilon$ and $MinPts$

Before using DBSCAN, we need to decide on the parameters $\epsilon$ and $MinPts$. This choice determines the shape and the number of clusters: a low $\epsilon$ or a high $MinPts$ would find too many small clusters, while a high $\epsilon$ might trivially terminate with a single cluster containing the entire dataset.

To find good values for $\epsilon$ and $MinPts$, there are some heuristics. For example, one could use the point density of the least dense cluster to determine the parameters. One common heuristic uses the distance to the $k$-nearest neighbor.

  • $kdist(\mathbf x)$ is the distance from $\mathbf x$ to its $k$-nearest neighbor.
  • The $kdist$ plot is a chart of the $k$-distances of all points, sorted in decreasing order of distance.

The heuristic works as follows

  • Fix a value for $MinPts$. If there is no domain knowledge available, a common choice is $2d-1$, where $d$ is the dimension.
  • Select a "border object" $\mathbf x$ from the $MinPts$-distance plot; $\epsilon$ is set to $MinPtsdist(\mathbf x)$ (see the sketch below).
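A possible sketch of this heuristic in Python, assuming a naive pairwise-distance computation (the helper name `k_dist_plot` is illustrative):

```python
import numpy as np

def k_dist_plot(X, min_pts):
    """Return each point's distance to its MinPts-nearest neighbor, sorted decreasingly."""
    # pairwise distances (naive O(n^2); a k-d tree or R*-tree would be faster)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    D.sort(axis=1)                 # row i: distances from point i, ascending
    k_dist = D[:, min_pts]         # column 0 is the distance of a point to itself (0)
    return np.sort(k_dist)[::-1]   # decreasing order, as in the k-dist plot

# eps is then read off the plot at a visually selected "border object",
# e.g. where the sorted curve shows a sharp bend (the knee).
```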

Pros and Cons

👍 Advantages:

  1. Does not require specifying the number of clusters in advance.
  2. Performs well with clusters of arbitrary shape.
  3. DBSCAN is robust to outliers and is able to handle them.

👎Disadvantages:

  1. It requires domain knowledge to set $MinPts$, and it is not easy to determine $\epsilon$.
  2. If clusters differ widely in their internal densities, DBSCAN is not well suited, since a single parameter setting cannot generalize to clusters with very different densities.

DENCLUE

DENCLUE detects dense regions without explicitly specifying the size of the neighborhood.

Main idea: a generalization of neighborhood density in terms of density distribution.

Density Estimation

Density estimation aims to estimate an unknown density function from a dataset. It is "non-parametric" in the following sense:

  • it determines an unknown probability density function
  • it does not assume a fixed probability model
  • it estimates the probability density at each point

Univariate density estimation

In one dimension, we can model the data as a random variable $X$.

Unidimensional data points can be treated as $n$ samples or observations $\{x_i\}$ from such a random variable.

We can estimate the cumulative distribution function (CDF) by counting the number of points less than or equal to a certain value $v$:

$$\hat F(v)=\frac{1}{n}\sum_{i=1}^n I(x_i\le v)$$

where $I$ is the indicator function, which is $1$ if the condition inside $(\cdot)$ is satisfied and $0$ otherwise.

The density function is then estimated by taking the derivative of the CDF

$$\hat f(x)=\frac{\hat F(x+\frac h2)-\hat F(x-\frac h2)}{h}=\frac{k/n}{h}=\frac{k}{nh}$$

where $k$ is the number of points in a window of width $h$ centered at $x$. The density estimate is the ratio of the fraction of points in the window ($k/n$) to the volume of the window ($h$).

Kernel density estimation: We need to be careful with the window size $h$. We define a kernel function $K$ that is a probability density function: ⚠️ do not confuse this with the kernel in kernel k-means.

  • non-negative: $K(x)\ge0$
  • symmetric: $K(-x)=K(x)$
  • integrates to $1$, i.e., $\int K(x)dx=1$.

The discrete kernel is the simplest kernel; it corresponds to a uniform probability density function:

$$K(z)=\begin{cases}1 & \text{if }|z|\le\frac12\\ 0 & \text{otherwise}\end{cases}$$

When $|z|=\left|\frac{x-x_i}{h}\right|\le\frac12$, the point $x_i$ is inside a window of size $h$ centered at point $x$. We can then rewrite the density estimator as

$$\hat f(x)=\frac{1}{nh}\sum_{i=1}^n K\left(\frac{x-x_i}{h}\right)$$

The discrete kernel is not smooth, meaning that the kernel estimate again looks like a histogram, with the consequence that many useful calculus operations cannot be comfortably performed.

Gaussian kernel: the kernel used in DENCLUE. The Gaussian kernel is one of the most popular kernels and represents a Gaussian distribution centered around a point:

$$K(z)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)$$

As usual, by setting $z=\frac{x-x_i}{h}$ we obtain

$$K\left(\frac{x-x_i}{h}\right)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x-x_i)^2}{2h^2}\right)$$

where $x$, at the center of the window, plays the role of the mean, and the window size $h$ becomes the standard deviation.
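A small Python sketch of the univariate estimator $\hat f(x)=\frac{1}{nh}\sum_i K(\frac{x-x_i}{h})$ with both the discrete and the Gaussian kernel (function names and the toy data are illustrative):

```python
import numpy as np

def discrete_kernel(z):
    # uniform kernel: 1 inside the window, 0 outside
    return (np.abs(z) <= 0.5).astype(float)

def gaussian_kernel(z):
    # standard normal density
    return np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h, kernel):
    """Univariate KDE at x: (1/(n*h)) * sum_i K((x - x_i)/h)."""
    z = (x - samples) / h
    return kernel(z).sum() / (len(samples) * h)

samples = np.array([1.0, 1.2, 1.3, 4.0, 4.1])
print(kde(1.1, samples, h=0.5, kernel=discrete_kernel))   # rough, histogram-like
print(kde(1.1, samples, h=0.5, kernel=gaussian_kernel))   # smooth estimate
```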

Multivariate density estimation

For $d$-dimensional data $\mathbf x=(x_1,\cdots,x_d)$, the window becomes a hypercube centered at $\mathbf x$ with edge length $h$ and volume $vol(H_d(h))=h^d$. Density estimation takes the same form:

$$\hat f(\mathbf x)=\frac{1}{nh^d}\sum_{i=1}^n K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)$$

where the multivariate kernel must preserve non-negativity, symmetry, and integrate to 1. It is then straightforward to redefine the discrete and Gaussian kernels.

Discrete multivariate kernel: similar to its univariate counterpart, the multivariate kernel is

$$K(\mathbf z)=\begin{cases}1 & \text{if }|z_j|\le\frac12\text{ for all dimensions }j\\ 0 & \text{otherwise}\end{cases}$$

The Gaussian kernel uses an equal-variance covariance matrix, which translates into using the identity matrix $\Sigma=\mathbf I$:

$$K(\mathbf z)=\frac{1}{(2\pi)^{\frac d2}}\exp\left(-\frac{\mathbf z^T\mathbf z}{2}\right)$$
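A corresponding sketch for the $d$-dimensional estimate with the Gaussian kernel (the function name is illustrative):

```python
import numpy as np

def multivariate_gaussian_kde(x, samples, h):
    """d-dimensional KDE: (1/(n*h^d)) * sum_i K((x - x_i)/h) with Gaussian K."""
    n, d = samples.shape
    Z = (x - samples) / h                                   # shape (n, d)
    K = np.exp(-0.5 * np.sum(Z**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h**d)
```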

Nearest neighbor estimation

Instead of fixing the size $h$ of the region, the nearest-neighbor estimate fixes the number of points $k$ in the neighborhood of a point $\mathbf x$ and determines the size of the region accordingly.

KNN density estimation proceeds as follows:

  1. Fix $k$, the minimum number of points that defines a dense region.
  2. Compute the volume $vol(S_d(h_{\mathbf x}))$ as a function of $k$.

Given $k$, we estimate the density at $\mathbf x$ as

$$\hat f(\mathbf x)=\frac{k}{n\cdot vol(S_d(h_{\mathbf x}))}$$

where $h_{\mathbf x}$ is the distance from $\mathbf x$ to its $k$-nearest neighbor, and the volume is that of the $d$-dimensional hypersphere of radius $h_{\mathbf x}$ centered at $\mathbf x$.
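A brief sketch of this estimate, assuming Euclidean distance and using the standard formula for the volume of a $d$-dimensional ball (names are illustrative):

```python
import numpy as np
from math import gamma, pi

def knn_density(x, samples, k):
    """k-NN density estimate: k / (n * volume of the d-ball of radius h_x)."""
    n, d = samples.shape
    dists = np.sort(np.linalg.norm(samples - x, axis=1))
    h_x = dists[k - 1]                                     # distance to the k-th nearest sample
    vol = (pi ** (d / 2) / gamma(d / 2 + 1)) * h_x ** d    # volume of S_d(h_x)
    return k / (n * vol)
```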

Density Attractors

How can we find dense clusters?

Look at the density function and set a threshold $\xi$, i.e., a horizontal line in the density plot. However, with such a cut alone, some points in low-density regions might not be assigned to any cluster.

Density attractor: A point $\mathbf x^*$ is a density attractor if it is a local maximum of the probability density function $f$.

In order to discover these peaks in the density function, we need to compute its gradient.

$$\nabla\hat f(\mathbf x)=\frac{\partial}{\partial\mathbf x}\hat f(\mathbf x)=\frac{1}{nh^d}\sum_{i=1}^n\frac{\partial}{\partial\mathbf x}K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)$$

For the Gaussian kernel, the gradient takes a particularly neat form

$$\frac{\partial}{\partial\mathbf x}K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)=K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)\cdot\left(\frac{\mathbf x_i-\mathbf x}{h}\right)\cdot\left(\frac{1}{h}\right)$$

which yields

$$\nabla\hat f(\mathbf x)=\frac{1}{nh^{d+2}}\sum_{i=1}^n K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)\cdot(\mathbf x_i-\mathbf x)$$

Density-based cluster: A set $C$ of data points from a dataset $\mathcal D$ is a density-based cluster w.r.t. some threshold $\xi$ if there exists a set of density attractors $\mathbf x_1^*,\cdots,\mathbf x_k^*$ such that

  • Each point $\mathbf x\in C$ is attracted to some attractor $\mathbf x_i^*$.
  • Each density attractor $\mathbf x_i^*$ exceeds the density threshold $\xi$, i.e., $\hat f(\mathbf x_i^*)\ge\xi$.
  • Any two density attractors $\mathbf x_i^*,\mathbf x_j^*$ are density-reachable, i.e., there exists a path from $\mathbf x_i^*$ to $\mathbf x_j^*$ such that for all points $\mathbf y$ on the path, $\hat f(\mathbf y)\ge\xi$.

Not all points will be part of a cluster. We treat them as noise.

The DENCLUE algorithm: For each point in the data, it finds a candidate attractor $\mathbf x^*$. If the attractor's density is above the threshold $\xi$, it is added to the set $A$ of attractors and the point is added to the set $R$ of points attracted by $\mathbf x^*$. Then all the maximal sets of mutually density-reachable attractors are computed.

To find an attractor for a point $\mathbf x$, we can solve $\nabla\hat f(\mathbf x)=0$, thus

$$\frac{1}{nh^{d+2}}\sum_{i=1}^n K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)\cdot(\mathbf x_i-\mathbf x)=0\ \Rightarrow\ \mathbf x=\frac{\sum_{i=1}^n K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)\mathbf x_i}{\sum_{i=1}^n K\left(\frac{\mathbf x-\mathbf x_i}{h}\right)}$$

We can use this as an iterative update rule until convergence. The algorithm runs in $\mathcal O(n^2T)$, where $T$ is the number of iterations.
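A minimal sketch of this fixed-point iteration for a single point, assuming the Gaussian kernel and an illustrative convergence tolerance:

```python
import numpy as np

def find_attractor(x, samples, h, tol=1e-5, max_iter=100):
    """Iterate x <- sum_i K((x - x_i)/h) x_i / sum_i K((x - x_i)/h) until convergence."""
    for _ in range(max_iter):
        w = np.exp(-np.sum(((x - samples) / h) ** 2, axis=1) / 2)   # Gaussian kernel weights
        x_new = (w[:, None] * samples).sum(axis=0) / w.sum()        # weighted mean update
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Running this from every data point and grouping points that converge to
# (approximately) the same attractor with density above xi yields the clusters.
```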

DENCLUE vs DBSCAN

DBSCAN corresponds to using a discrete kernel: it is more efficient but rougher. Its parameters $\epsilon$ and $MinPts$ correspond to $h$ and $\xi$.

DENCLUE uses Gaussian kernel density and density attractors, which gives a smoother model that is better able to capture the underlying data distribution than DBSCAN.

Pros and Cons

👍 Advantages

  • Clusters can have arbitrary shape and size; they do not need to be convex.
  • Number of clusters is determined automatically.
  • Can separate clusters from surrounding noise.
  • Can be supported by spatial index structures.

👎Disadvantages

  • Input parameters may be difficult to determine
  • Can be sensitive to input parameter setting

Clustering Evaluation

External Measures

External measures take into account criteria that are not part of the data used for clustering, e.g., externally provided ground-truth labels.

Contingency Table

An $r\times k$ matrix $\mathbf N$, where $r$ is the number of computed clusters $C_i$ and $k$ the number of ground-truth clusters $T_j$, such that

$$\mathbf N_{ij}=|C_i\cap T_j|$$

Computing the contingency table takes $\mathcal O(nrk)$.
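A small sketch of building $\mathbf N$ from two integer label vectors (the helper name is illustrative; cluster labels are assumed to be $0,\dots,r-1$ and ground-truth labels $0,\dots,k-1$):

```python
import numpy as np

def contingency_table(cluster_labels, truth_labels, r, k):
    """N[i, j] = number of points assigned to cluster i with ground-truth label j."""
    N = np.zeros((r, k), dtype=int)
    for c, t in zip(cluster_labels, truth_labels):
        N[c, t] += 1
    return N
```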

Purity

Measures to what extent a cluster $C_i$ contains points from only one ground-truth cluster:

$$\text{purity}_i=\frac{1}{|C_i|}\max_{j=1,\cdots,k}\{n_{ij}\}$$

The purity of a clustering is the weighted sum of the purity of each cluster, weighted by the cluster sizes:

$$\text{purity}=\sum_{i=1}^r\frac{|C_i|}{n}\text{purity}_i$$
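Given the contingency table, the weighted purity can be computed directly; a sketch:

```python
import numpy as np

def purity(N):
    """Overall purity from the contingency table N (weighted sum of per-cluster purities)."""
    n = N.sum()
    return N.max(axis=1).sum() / n     # equals sum_i (|C_i|/n) * purity_i
```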

Maximum Matching

Consider a 1-to-1 assignment that maximizes the overall overlap between ground-truth clusters and computed clusters.

F-measure

Based on precision and recall.

Conditional Entropy

Internal Measures

Based only on the clustered data itself, without external ground truth.

Silhouette coefficient

Davies-Bouldin index