Density-based clustering departs from the rigid geometric structure of representative-based clustering and defines a cluster as a high-density region separated from other clusters by low-density regions.
Density is the number of points in a region in proportion to the size of the region. Different definitions of density lead to different density-based approaches.
DBSCAN
A density-based clustering algorithm grounded in the intuition that a point in a cluster should be density-reachable from any other point in that cluster, i.e., there should be "enough" points around each point in a cluster.
This criterion does not constrain the shape of the cluster, only the neighborhood of each point. Thus, DBSCAN can handle clusters of arbitrary shape.
Definitions
Neighborhood: For each point \(\mathbf x\), we define the neighborhood as the set of points that are at distance at most \(\epsilon\) from \(\mathbf x\), formally \(N_\epsilon(\mathbf x)=\{\mathbf y\in\mathcal D\mid dist(\mathbf x,\mathbf y)\le\epsilon\}\).
Additionally, a neighborhood is dense if it contains at least \(MinPts\) points, i.e., \(|N_\epsilon(\mathbf x)|\ge MinPts\).
Core object: A point \(\mathbf x\) whose neighborhood contains at least \(MinPts\) points.
Density-reachable: \(\mathbf y\) is directly density-reachable from \(\mathbf x\) with respect to \(\epsilon\) and \(MinPts\) if \(\mathbf y\in N_\epsilon(\mathbf x)\) and \(\mathbf x\) is a core object.
A point \(\mathbf p\) is density-reachable from \(\mathbf x\) if there is a chain of points \(\mathbf x=\mathbf y_1,\cdots,\mathbf y_m=\mathbf p\) such that each \(\mathbf y_{i+1}\) is directly density-reachable from \(\mathbf y_i\); density-reachability is the transitive closure of direct density-reachability. Note that density-reachability is not symmetric: since \(\mathbf p\) might not have \(MinPts\) points in its neighborhood (i.e., it might not be a core object), it cannot "reach back" \(\mathbf x\) through a chain of dense neighborhoods.
Density-connected: \(\mathbf x\) is density-connected to a point \(\mathbf y\) if there is a point \(\mathbf p\) such that both \(\mathbf x\) and \(\mathbf y\) are density-reachable from \(\mathbf p\).
\(\mathbf x\) and \(\mathbf y\) are not necessarily density-reachable from each other in this case.
Density-based cluster: a non-empty subset \(S\subseteq\mathcal D\) of the data \(\mathcal D\) that satisfies both
- Maximality: If a point \(\mathbf x\in S\) and \(\mathbf y\) is density-reachable from \(\mathbf x\), then \(\mathbf y\in S\).
- Connectivity: Each object in \(S\) is density-connected to all other objects in \(S\).
A density-based clustering of a database \(\mathcal D\) is a partition \(\{C_1,\cdots,C_k;N\}\) s.t. \(\bigcup_{i=1}^kC_i\cup N=\mathcal D\), \(C_i\cap C_j=\emptyset\) for \(i\ne j\), and \(C_i\cap N=\emptyset\), where \(C_1,\cdots,C_k\) are the density-based clusters and \(N=\mathcal D\setminus(C_1\cup\cdots\cup C_k)\) is the noise set of objects that do not belong to any cluster.
DBSCAN Algorithm
Pseudocode
Input: dataset \(\mathcal D\), \(\epsilon\), \(MinPts\)
- for each \(\mathbf x\in\mathcal D\) do
  - if \(\mathbf x\) is not yet part of any cluster then
    - if \(\mathbf x\) is a core object then
      - \(C\leftarrow\) compute the density-reachable points from \(\mathbf x\)
      - create a new cluster from \(C\)
    - else
      - \(N\leftarrow N\cup\{\mathbf x\}\) (tentatively mark \(\mathbf x\) as noise)
The main theoretical idea behind the DBSCAN algorithm is that each object in a density-based cluster \(C\) is density-reachable from any of its core objects and nothing else is reachable from the core objects.
DBSCAN is complete and ensures the two properties of maximality and connectivity. The algorithm is a simple iteration over the objects in the database, starting from an arbitrary point and collecting all points density-reachable from it.
Density-reachable objects are collected by performing successive \(\epsilon\)-neighborhood queries, each of which finds the points at distance at most \(\epsilon\) from a given point.
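A minimal runnable sketch of this procedure, assuming the data is a NumPy array and distances are Euclidean; names such as `dbscan` and `region_query` are illustrative, not part of the original pseudocode:

```python
import numpy as np

def dbscan(D, eps, min_pts):
    """Minimal DBSCAN sketch. Returns one label per point; -1 marks noise."""
    n = len(D)
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # epsilon-neighborhood query: indices of points within eps of D[i]
        return list(np.where(np.linalg.norm(D - D[i], axis=1) <= eps)[0])

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            continue                    # not a core object; stays noise unless reached later
        labels[i] = cluster_id          # start a new cluster from this core object
        seeds = neighbors
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # density-reachable point joins the cluster
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:
                    seeds.extend(j_neighbors)  # j is also a core object: keep expanding
        cluster_id += 1
    return labels
```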
Complexity Analysis
Without index support, DBSCAN runs in \(\mathcal O(n^2)\); this is also the practical behavior in high-dimensional data, where spatial indexes degrade. By using specialized indexes for spatial queries, e.g., \(R^*\)-trees, DBSCAN runs in \(\mathcal O(n\log n)\) and can be faster than CLARANS, which has a runtime complexity of \(\mathcal O(k^3+nk)\).
| Neighborhood support | \(N_\epsilon\)-query | DBSCAN |
|---|---|---|
| Without support (worst case) | \(\mathcal O(n)\) | \(\mathcal O(n^2)\) |
| Tree-based support (e.g., \(R^*\)-tree) | \(\mathcal O(\log n)\) | \(\mathcal O(n\log n)\) |
| Direct access to the neighborhood | \(\mathcal O(1)\) | \(\mathcal O(n)\) |
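To illustrate the index-supported row, a small sketch using a KD-tree from SciPy as an illustrative stand-in for the \(R^*\)-tree (the dataset and radius below are made up):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 2))      # toy 2-d dataset

tree = cKDTree(D)                   # build the spatial index once
eps = 0.3
# epsilon-neighborhood query answered by the index instead of a linear scan
neighborhood = tree.query_ball_point(D[0], r=eps)
print(len(neighborhood), "points within", eps, "of D[0]")
```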
Tuning \(\epsilon\) and \(MinPts\)
Before using DBSCAN, we need to decide on the parameters \(\epsilon\) and \(MinPts\). This choice determines the shape and the number of clusters: a low \(\epsilon\) or a high \(MinPts\) finds too many small clusters, while a high \(\epsilon\) might trivially terminate with a single cluster containing the entire dataset.
To find good values for \(\epsilon\) and \(MinPts\), there are heuristics. For example, one can use the point density of the least dense cluster to determine the parameters. One common heuristic uses the distance to the \(k\)-th nearest neighbor.
- \(kdist(\mathbf x)\) is the distance from \(\mathbf x\) to its \(k\)-th nearest neighbor.
- The \(kdist\) plot is a chart of the \(k\)-distances of all points, sorted in decreasing order.
The heuristic works as follows:
- Fix a value for \(MinPts\). If no domain knowledge is available, a common choice is \(2d-1\), where \(d\) is the dimensionality of the data.
- Select a "border object" \(\mathbf x\) from the \(MinPts\)-distance plot and set \(\epsilon=MinPts\text{-}dist(\mathbf x)\), i.e., the distance from \(\mathbf x\) to its \(MinPts\)-th nearest neighbor (a sketch of the plot follows below).
Pros and Cons
Advantages:
- Does not require specifying the number of clusters.
- Performs well with clusters of arbitrary shape.
- Robust to outliers and able to handle them (as noise).
Disadvantages:
- Requires domain knowledge to choose \(MinPts\), and \(\epsilon\) is not easy to determine.
- If clusters differ widely in their in-cluster densities, DBSCAN is not well suited, since a single \((\epsilon,MinPts)\) setting cannot capture clusters of very different densities.
DENCLUE
DENCLUE detects dense regions without explicitly specifying the size of the neighborhood.
Main idea: generalize the notion of neighborhood density into a continuous density distribution.
Density Estimation
The goal is to estimate an unknown density function from the dataset, in a "non-parametric" way.
Density estimation:
- determines an unknown probability density function,
- is non-parametric, i.e., it does not assume a fixed probability model,
- estimates the probability density at each point.
Univariate density estimation
In one dimension, we can model the data as a random variable \(X\).
Unidimensional data points can be treated as \(n\) samples or observations \(\{x_i\}\) from such a random variable.
We can estimate the cumulative distribution function (CDF) by counting the number of points less than or equal to a value \(v\):
\[\hat F(v)=\frac1n\sum_{i=1}^nI(x_i\le v)\]
where \(I\) is the indicator function, which is \(1\) if the condition inside \((\cdot)\) is satisfied and \(0\) otherwise.
The density function is then estimated by taking the derivative of the CDF:
\[\hat f(x)=\frac{\hat F\!\left(x+\frac h2\right)-\hat F\!\left(x-\frac h2\right)}{h}=\frac{k/n}{h}\]
where \(k\) is the number of points in a window of width \(h\) centered at \(x\). The density estimate is the ratio of the fraction of points in the window \((k/n)\) to the volume of the window \((h)\).
Kernel density estimation: We need to be careful about the window size \(h\) and about the hard window boundaries. We therefore define a kernel function \(K\) that is itself a probability density function (not to be confused with the kernel in kernel \(k\)-means):
- non-negative \(K(x)\ge0\)
- symmetric \(K(-x)=K(x)\)
- integrates to \(1\), i.e., \(\int K(x)dx=1\).
The discrete kernel is the simplest kernel and corresponds to a degenerate uniform probability density function:
\[K(z)=\begin{cases}1&\text{if }|z|\le\frac12\\0&\text{otherwise}\end{cases}\]
When \(|z|=\left|\frac{x-x_i}h\right|\le\frac12\), the point \(x_i\) is inside a window of size \(h\) centered at the point \(x\). We can then rewrite the density estimator as
\[\hat f(x)=\frac1{nh}\sum_{i=1}^nK\!\left(\frac{x-x_i}h\right)\]
The discrete kernel is not smooth: the kernel estimate again looks like a histogram, with the consequence that many useful calculus operations (e.g., differentiation) cannot be performed comfortably.
Gaussian kernel: The kernel used in DENCLUE. The Gaussian kernel is one of the most popular kernels and represents a Gaussian distribution centered around a point:
\[K(z)=\frac1{\sqrt{2\pi}}\exp\!\left(-\frac{z^2}2\right)\]
As usual, with \(z=\frac{x-x_i}h\) we obtain
\[K\!\left(\frac{x-x_i}h\right)=\frac1{\sqrt{2\pi}}\exp\!\left(-\frac{(x-x_i)^2}{2h^2}\right)\]
where \(x\), the center of the window, plays the role of the mean, and the window size \(h\) plays the role of the standard deviation.
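A minimal sketch of the univariate estimator with both kernels; the function names and the toy sample below are illustrative:

```python
import numpy as np

def discrete_kernel(z):
    # degenerate uniform pdf: 1 inside a unit-width window, 0 outside
    return np.where(np.abs(z) <= 0.5, 1.0, 0.0)

def gaussian_kernel(z):
    return np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def kde(x, samples, h, kernel):
    """Estimate f(x) = 1/(n h) * sum_i K((x - x_i) / h)."""
    z = (x - samples) / h
    return kernel(z).sum() / (len(samples) * h)

samples = np.concatenate([np.random.normal(0, 1, 200),
                          np.random.normal(5, 0.5, 100)])
print(kde(0.0, samples, h=0.5, kernel=gaussian_kernel))
print(kde(0.0, samples, h=0.5, kernel=discrete_kernel))
```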
Multivariate density estimation
For \(d\)-dimensional data \(\mathbf x=(x_1,\cdots,x_d)\), the window becomes a hypercube centered at \(\mathbf x\) with edge length \(h\) and volume \(vol(H_d(h))=h^d\). Density estimation has the same form:
\[\hat f(\mathbf x)=\frac1{nh^d}\sum_{i=1}^nK\!\left(\frac{\mathbf x-\mathbf x_i}h\right)\]
where the multivariate kernel must preserve non-negativity and symmetry and must integrate to \(1\). It is then straightforward to redefine the discrete and Gaussian kernels.
Discrete multivariate kernel: Similar to its univariate counterpart, the multivariate discrete kernel is
\[K(\mathbf z)=\begin{cases}1&\text{if }|z_j|\le\frac12\text{ for all }j=1,\dots,d\\0&\text{otherwise}\end{cases}\]
Gaussian multivariate kernel: The Gaussian kernel uses an equi-variance covariance matrix, i.e., the identity matrix \(\Sigma=\mathbf I\):
\[K(\mathbf z)=\frac1{(2\pi)^{d/2}}\exp\!\left(-\frac{\mathbf z^\top\mathbf z}2\right)\]
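A compact sketch of the multivariate estimator with the Gaussian kernel (an illustrative helper, not a reference implementation):

```python
import numpy as np

def multivariate_gaussian_kde(x, samples, h):
    """f(x) = 1/(n h^d) * sum_i K((x - x_i)/h), K the d-variate standard Gaussian."""
    n, d = samples.shape
    z = (x - samples) / h                                   # shape (n, d)
    k = np.exp(-0.5 * np.sum(z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h ** d)
```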
Nearest neighbor estimation
Instead of fixing the size \(h\) of the region, the nearest neighbor estimate fixes the number of points \(k\) in the neighborhood of a point \(\mathbf x\) and finds the size of the region.
The KNN density estimate proceeds as follows:
- Fix \(k\), the minimum number of points that defines a dense region.
- Compute the volume \(vol(S_d(h_\mathbf x))\) as a function of \(k\).
Given \(k\), we estimate the density at \(\mathbf x\) as
\[\hat f(\mathbf x)=\frac k{n\cdot vol(S_d(h_{\mathbf x}))}\]
where \(h_{\mathbf x}\) is the distance from \(\mathbf x\) to its \(k\)-th nearest neighbor and \(vol(S_d(h_{\mathbf x}))\) is the volume of the \(d\)-dimensional hypersphere of radius \(h_{\mathbf x}\) centered at \(\mathbf x\).
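A sketch of this estimate, using a SciPy KD-tree for the \(k\)-nearest-neighbor query and the standard closed form for the volume of a \(d\)-ball (names are illustrative):

```python
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

def knn_density(x, samples, k):
    """f(x) = k / (n * vol(S_d(h_x))), with h_x the distance to the k-th nearest neighbor."""
    n, d = samples.shape
    tree = cKDTree(samples)
    dists, _ = tree.query(x, k=k)
    h_x = np.atleast_1d(dists)[-1]                       # distance to the k-th neighbor
    vol = (pi ** (d / 2) / gamma(d / 2 + 1)) * h_x ** d  # volume of a d-ball of radius h_x
    return k / (n * vol)
```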
Density Attractors
How can we find dense clusters?
One idea is to look at the density function and set a threshold \(\xi\), i.e., a horizontal line in a plot of the density. However, with this approach some points in low-density regions might not be assigned to any cluster.
Density attractor: A point \(\mathbf x^*\) is a density attractor if it is a local maximum of the probability density function \(f\).
In order to discover these peaks in the density function, we need to compute its gradient:
\[\nabla\hat f(\mathbf x)=\frac1{nh^d}\sum_{i=1}^n\frac{\partial}{\partial\mathbf x}K\!\left(\frac{\mathbf x-\mathbf x_i}h\right)\]
For the Gaussian kernel, the gradient takes a particularly neat form, since \(\frac{\partial}{\partial\mathbf x}K\!\left(\frac{\mathbf x-\mathbf x_i}h\right)=K\!\left(\frac{\mathbf x-\mathbf x_i}h\right)\cdot\frac{\mathbf x_i-\mathbf x}{h^2}\), which yields
\[\nabla\hat f(\mathbf x)=\frac1{nh^{d+2}}\sum_{i=1}^nK\!\left(\frac{\mathbf x-\mathbf x_i}h\right)(\mathbf x_i-\mathbf x)\]
Density-based cluster: A set \(C\) of data points from a dataset \(\mathcal D\) is a density-based cluster w.r.t. a threshold \(\xi\) if there exists a set of density attractors \(\mathbf x_1^*,\cdots,\mathbf x_k^*\) such that
- Each point \(\mathbf x\in C\) is attracted to some attractor \(\mathbf x_i^*\).
- Each density attractor \(\mathbf x_i^*\) exceeds some density threshold \(\xi\), i.e., \(\hat f(\mathbf x_i^*)\ge\xi\).
- Any two density attractors \(\mathbf x_i^*,\mathbf x_j^*\) are density reachable. That is, there exists a path from \(\mathbf x_i^*\) to \(\mathbf x_j^*\) such that for all points \(\mathbf y\) on the path, \(\hat f(\mathbf y)\ge\xi\).
Not all points will be part of a cluster. We treat them as noise.
The DENCLUE algorithm: For each point \(\mathbf x\) in the data, it finds a candidate attractor \(\mathbf x^*\). If the attractor's density is above the threshold \(\xi\), it is added to the set \(A\) of attractors and the point is added to the set \(R\) of points attracted by \(\mathbf x^*\). Then all the maximal sets of mutually density-reachable attractors are computed; each such set, together with the points attracted to its attractors, forms a cluster.
To find an attractor for a point \(\mathbf x\), we can solve \(\nabla\hat f(\mathbf x)=0\), which for the Gaussian kernel gives the fixed-point update
\[\mathbf x_{t+1}=\frac{\sum_{i=1}^nK\!\left(\frac{\mathbf x_t-\mathbf x_i}h\right)\mathbf x_i}{\sum_{i=1}^nK\!\left(\frac{\mathbf x_t-\mathbf x_i}h\right)}\]
We iterate this rule until convergence (a sketch follows below). The algorithm runs in \(\mathcal O(n^2T)\), where \(T\) is the number of iterations.
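A minimal sketch of the hill-climbing step for one point under the Gaussian kernel; the tolerance and iteration cap are arbitrary choices:

```python
import numpy as np

def gaussian_kernel(z):
    # d-variate standard Gaussian evaluated at each row of z
    d = z.shape[1]
    return np.exp(-0.5 * np.sum(z ** 2, axis=1)) / (2 * np.pi) ** (d / 2)

def find_attractor(x, samples, h, tol=1e-5, max_iter=100):
    """Iterate x <- sum_i K((x - x_i)/h) x_i / sum_i K((x - x_i)/h) until convergence."""
    for _ in range(max_iter):
        w = gaussian_kernel((x - samples) / h)                 # kernel weight of every sample
        x_new = (w[:, None] * samples).sum(axis=0) / w.sum()   # weighted mean of the samples
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```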
DENCLUE vs DBSCAN
DBSCAN corresponds to using a discrete kernel: more efficient, but rougher. The parameters \(\epsilon\) and \(MinPts\) correspond to \(h\) and \(\xi\), respectively.
DENCLUE uses density attractors based on a Gaussian kernel, a smoother model that is better able to capture the underlying data distribution than DBSCAN.
Pros and Cons
Advantages
- Clusters can have arbitrary shape and size; they need not be convex.
- Number of clusters is determined automatically.
- Can separate clusters from surrounding noise.
- Can be supported by spatial index structures.
Disadvantages
- Input parameters may be difficult to determine
- Can be sensitive to input parameter setting
Clustering Evaluation
External Measures
Take into account criteria that are not part of the clustered data, e.g., ground-truth labels.
Contingency Table
Given a clustering \(\mathcal C=\{C_1,\cdots,C_r\}\) and a ground-truth clustering \(\mathcal T=\{T_1,\cdots,T_k\}\), the contingency table is the \(r\times k\) matrix \(\mathbf N\) s.t. \(N_{ij}=|C_i\cap T_j|\), the number of points of cluster \(C_i\) that belong to ground-truth cluster \(T_j\).
Computing the contingency table takes \(\mathcal O(nrk)\)
Purity
To what extent a cluster \(C_i\) contains points only from one ground-truth cluster: \(purity_i=\frac1{n_i}\max_jN_{ij}\), where \(n_i=|C_i|\).
The purity of a clustering is the weighted sum of the purity of each cluster, \(purity=\sum_{i=1}^r\frac{n_i}n purity_i=\frac1n\sum_{i=1}^r\max_jN_{ij}\) (a small example follows below).
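A tiny sketch of the contingency table and the purity computation; the toy labels are made up:

```python
import numpy as np

def contingency_table(labels, truth, r, k):
    """N[i, j] = number of points placed in cluster i that belong to ground-truth cluster j."""
    N = np.zeros((r, k), dtype=int)
    for c, t in zip(labels, truth):
        N[c, t] += 1
    return N

def purity(N):
    """Weighted purity: (1/n) * sum_i max_j N[i, j]."""
    return N.max(axis=1).sum() / N.sum()

N = contingency_table([0, 0, 1, 1, 1], [0, 1, 1, 1, 1], r=2, k=2)
print(purity(N))   # 0.8 for this toy example
```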
Maximum Matching
Consider a 1-to-1 assignment that maximizes the overall overlap between ground-truth clusters and computed clusters.
F-measure
Combine the precision and recall of each cluster: \(prec_i=\frac1{n_i}\max_jN_{ij}\) and \(recall_i=\frac{N_{ij_i}}{|T_{j_i}|}\), where \(j_i=\arg\max_jN_{ij}\); the F-measure of \(C_i\) is their harmonic mean, and the overall F-measure averages over clusters.
Conditional Entropy
Measures how the ground-truth labels are spread within each cluster, averaged over clusters; lower values indicate a better clustering.
Internal Measures
Based only on the clustered data itself, without ground-truth labels (typically trading off intra-cluster compactness against inter-cluster separation).
Silhouette coefficient: for each point, compares the average distance to points in its own cluster with the average distance to points in the closest other cluster; values near \(1\) indicate compact, well-separated clusters.
Davies-Bouldin index: for each cluster, takes the worst-case ratio of within-cluster scatter to between-cluster separation; lower values are better.
...
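Both measures are available off the shelf, e.g., in scikit-learn (the toy data below is illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)

print(silhouette_score(X, labels))        # close to 1: compact, well-separated clusters
print(davies_bouldin_score(X, labels))    # lower is better
```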