Data Mining 3: Density-based Clustering
Density-based clustering departs from the rigid geometric structure of representative-based clustering and defines clusters as high-density regions separated by low-density regions.
Density is the number of points in a region in proportion to the size of that region. Different definitions of density lead to different density-based approaches.
DBSCAN
A density-based clustering algorithm grounded in the intuition that a point in a cluster should be density-reachable from any other point in that cluster, i.e., there should be “enough” points around each point in a cluster.
This criterion does not impose anything specific on the shape of the cluster; it only constrains the neighborhood of each point. Thus, DBSCAN can handle clusters of arbitrary shape.
Definitions
Neighborhood: For each point $p$, we define the neighborhood $N_\varepsilon(p)$ as the set of points that are at distance at most $\varepsilon$ from $p$, formally $N_\varepsilon(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \varepsilon\}$.
Additionally, a neighborhood is dense if it contains at least $MinPts$ points, i.e., $|N_\varepsilon(p)| \ge MinPts$.
Core object: A point $p$ whose neighborhood $N_\varepsilon(p)$ contains at least $MinPts$ points.
Density-reachable: $q$ is directly density-reachable from $p$ within $D$ if $q \in N_\varepsilon(p)$ and $p$ is a core object.
A point $q$ is density-reachable from $p$ if there is a chain of points $p = p_1, \dots, p_m = q$ such that each $p_{i+1}$ is directly density-reachable from $p_i$. Density reachability is the transitive closure of direct density reachability. Note that density reachability is not symmetric. In particular, since $q$ might not have enough points in its neighborhood (it might not be a core object), it cannot necessarily “reach” $p$ back with a sequence of dense areas.
Density-connected: A point $p$ is density-connected to a point $q$ if there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$.
$p$ and $q$ are not necessarily density-reachable from each other in this case.
Density-based cluster: a non-empty subset $C$ of the data that satisfies both
- Maximality: If a point $p \in C$ and $q$ is density-reachable from $p$, then $q \in C$.
- Connectivity: Each object in $C$ is density-connected to all other objects in $C$.
A density-based clustering of a database $D$ is a partition $\{C_1, \dots, C_k, N\}$ s.t. $D = C_1 \cup \dots \cup C_k \cup N$ and $C_i \cap C_j = \emptyset$ for $i \neq j$, where $C_1, \dots, C_k$ are the density-based clusters. $N$ is the noise set of objects that do not belong to any cluster $C_i$.
DBSCAN Algorithm
Pseudocode
Input: dataset $D$, radius $\varepsilon$, density threshold $MinPts$
for each $p \in D$ do
  if $p$ is not yet part of any cluster then
    if $p$ is a core object then
      Compute the set of points density-reachable from $p$
      Create a new cluster $C$ containing them
    else
      Mark $p$ as noise (it may later become a border point of some cluster)
The main theoretical idea behind DBSCAN is that each object in a density-based cluster is density-reachable from any of its core objects, and nothing else is reachable from those core objects.
DBSCAN is complete and ensures the two properties of maximality and connectivity. The algorithm is a simple iteration over the objects in the database, starting from an arbitrary point and finding all points reachable from it.
Density-reachable objects are collected by performing successive $\varepsilon$-neighborhood queries, which find the points at distance at most $\varepsilon$ from a given point.
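To make the iteration concrete, here is a minimal Python sketch of DBSCAN under these definitions (all names are illustrative, and the range query is brute force, i.e., the $O(n^2)$ worst case discussed below):

```python
from collections import deque
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns labels 0..k-1; -1 marks noise."""
    n = len(X)
    labels = np.full(n, -2)  # -2: unvisited, -1: noise
    # Brute-force eps-neighborhoods (each point is in its own neighborhood).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    cluster_id = 0
    for p in range(n):
        if labels[p] != -2:              # already clustered or marked noise
            continue
        if len(neighbors[p]) < min_pts:  # not a core object
            labels[p] = -1
            continue
        # p is a core object: collect everything density-reachable from p.
        labels[p] = cluster_id
        queue = deque(neighbors[p])
        while queue:
            q = queue.popleft()
            if labels[q] == -1:          # previously noise: now a border point
                labels[q] = cluster_id
            if labels[q] != -2:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts:  # expand only through core objects
                queue.extend(neighbors[q])
        cluster_id += 1
    return labels
```

Note how points first marked as noise can later be absorbed as border points of a cluster, which is exactly the maximality property.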
Complexity Analysis
When dimensionality is high, DBSCAN runs in $O(n^2)$. By using specialized indexes for spatial queries, e.g., R*-trees, DBSCAN can run in $O(n \log n)$ and is faster than CLARANS, whose runtime complexity is at least quadratic in $n$.
| | $\varepsilon$-query | DBSCAN |
|---|---|---|
| Without support (worst case) | $O(n)$ | $O(n^2)$ |
| Tree-based support (e.g., R*-tree) | $O(\log n)$ | $O(n \log n)$ |
| Direct access to the neighborhood | $O(1)$ | $O(n)$ |
Tuning $\varepsilon$ and $MinPts$
Before using DBSCAN, we need to decide on the parameters $\varepsilon$ and $MinPts$. This choice determines the shape and the number of clusters: a small $\varepsilon$ or a large $MinPts$ would find too many small clusters, while a large $\varepsilon$ might trivially terminate with a single cluster containing the entire dataset.
To find good values for $\varepsilon$ and $MinPts$, there are some heuristics. For example, one could use the point density of the least dense cluster to determine the parameters. One common heuristic uses the distance to the $k$-nearest neighbors:
- $k\text{-dist}(p)$ is the distance from $p$ to its $k$-th nearest neighbor.
- The $k$-dist plot is a chart of the $k$-distances of all points, sorted in decreasing order of distance.
The heuristic works as follows (a code sketch follows the list):
- Fix a value for $k$, which sets $MinPts = k$. If there is no domain knowledge available, a common rule of thumb is $MinPts = 2 \cdot d$, where $d$ is the dimension of the data.
- Select a “border object” $o$ from the $k$-dist plot; $\varepsilon$ is set to $k\text{-dist}(o)$.
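A minimal sketch of the $k$-dist computation and plot, assuming brute-force pairwise distances are affordable (function name and plotting details are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k):
    """Plot the k-distances of all points, sorted in decreasing order."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Column 0 of each sorted row is the point's zero self-distance,
    # so column k holds the distance to its k-th nearest neighbor.
    k_dist = np.sort(dist, axis=1)[:, k]
    k_dist = np.sort(k_dist)[::-1]  # decreasing order
    plt.plot(k_dist)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"{k}-distance")
    plt.show()
    return k_dist
```

$\varepsilon$ is then read off the plot at the “knee”, where the sorted $k$-distances drop sharply.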
Pros and Cons
👍 Advantages:
- Does not require specifying the number of clusters in advance.
- Performs well with clusters of arbitrary shape.
- DBSCAN is robust to outliers and is able to handle them (as noise).
👎 Disadvantages:
- It requires domain knowledge to set $\varepsilon$ and $MinPts$, and good values are not easy to determine.
- If clusters differ greatly in their in-cluster densities, DBSCAN is not well suited, since a single $(\varepsilon, MinPts)$ setting cannot capture clusters of very different densities.
DENCLUE
DENCLUE detects dense regions without explicitly specifying the size of the neighborhood.
Main idea: generalize neighborhood density in terms of a density distribution.
Density Estimation
The goal is to estimate an unknown density function from the data, in a “non-parametric” fashion.
Density estimation:
- determines an unknown probability density function,
- is non-parametric, i.e., it does not assume a fixed probability model,
- estimates the probability density at each point.
Univariate density estimation
In one dimension, we can model the data as a random variable $X$.
Unidimensional data points $x_1, \dots, x_n$ can be treated as samples or observations of such a random variable.
We can estimate the cumulative distribution function (CDF) by counting the number of points less than or equal to a value $x$:
$$\hat F(x) = \frac{1}{n}\sum_{i=1}^{n} I(x_i \le x)$$
where $I$ is the indicator function, which is $1$ if the condition inside is satisfied and $0$ otherwise.
The density function is then estimated by taking the derivative of the CDF:
$$\hat f(x) = \frac{\hat F\left(x + \frac{h}{2}\right) - \hat F\left(x - \frac{h}{2}\right)}{h} = \frac{k/n}{h}$$
where $k$ is the number of points in a window of width $h$ centered on $x$. The density estimate is the ratio of the fraction of points in the window, $k/n$, to the volume of the window, $h$.
Kernel density estimation: We need to be careful about the window size $h$. We define a kernel function $K$ that is a probability density function (⚠️ not to be confused with the kernels used in kernel k-means):
- non-negative: $K(z) \ge 0$,
- symmetric: $K(-z) = K(z)$,
- integrates to $1$, i.e., $\int K(z)\,dz = 1$.
The discrete kernel is the simplest kernel and corresponds to a degenerate uniform probability density function:
$$K(z) = \begin{cases}1 & \text{if } |z| \le \frac{1}{2}\\ 0 & \text{otherwise}\end{cases}$$
When $\left|\frac{x - x_i}{h}\right| \le \frac{1}{2}$, the point $x_i$ is inside a window of width $h$ centered at $x$. We can then rewrite the density estimator as
$$\hat f(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
The discrete kernel is not smooth, meaning that the kernel estimate again looks like a histogram, with the consequence that many useful calculus operations cannot be comfortably performed.
Gaussian kernel: the kernel used in DENCLUE. The Gaussian kernel is one of the most popular kernels and represents a Gaussian distribution centered around a point:
$$K(z) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{z^2}{2}\right)$$
As usual, by setting $z = \frac{x - x_i}{h}$ we obtain
$$K\left(\frac{x - x_i}{h}\right) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(x - x_i)^2}{2h^2}\right)$$
where $x_i$ at the center of the window plays the role of the mean, and the window size $h$ becomes the standard deviation.
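A minimal sketch of this univariate Gaussian estimator (function name illustrative):

```python
import numpy as np

def kde_gauss(x, data, h):
    """Univariate Gaussian KDE: f_hat(x) = (1/nh) * sum_i K((x - x_i)/h)."""
    z = (x - data) / h                            # one z_i per sample x_i
    k = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)  # Gaussian kernel values K(z_i)
    return k.sum() / (len(data) * h)
```

For example, `kde_gauss(0.5, np.array([0.1, 0.4, 0.9]), h=0.3)` estimates the density at $x = 0.5$; a larger $h$ gives a smoother but flatter estimate.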
Multivariate density estimation
For $d$-dimensional data $\mathbf{x} = (x_1, \dots, x_d)^T$, the window becomes a hypercube centered at $\mathbf{x}$ with volume $\mathrm{vol}(H_d(h)) = h^d$. Density estimation takes the same form:
$$\hat f(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$
where the multivariate kernel $K$ must preserve non-negativity, symmetry, and integrate to 1. It is then straightforward to redefine the discrete and Gaussian kernels.
Discrete multivariate kernel: similar to its univariate counterpart,
$$K(\mathbf{z}) = \begin{cases}1 & \text{if } |z_j| \le \frac{1}{2} \text{ for all } j = 1, \dots, d\\ 0 & \text{otherwise}\end{cases}$$
The Gaussian multivariate kernel uses an equi-variate covariance matrix, which translates into using the identity matrix $\mathbf{I}$:
$$K(\mathbf{z}) = \frac{1}{(2\pi)^{d/2}}\exp\left(-\frac{\mathbf{z}^T\mathbf{z}}{2}\right)$$
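The multivariate estimator is a direct generalization of the univariate sketch above, under the same assumptions:

```python
import numpy as np

def kde_gauss_md(x, data, h):
    """d-dimensional Gaussian KDE: f_hat(x) = (1/(n h^d)) * sum_i K((x - x_i)/h)."""
    n, d = data.shape
    z = (x - data) / h                                             # shape (n, d)
    k = np.exp(-0.5 * (z**2).sum(axis=1)) / (2 * np.pi) ** (d / 2)
    return k.sum() / (n * h**d)
```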
Nearest neighbor estimation
Instead of fixing the size of the region, the nearest-neighbor estimate fixes the number $k$ of points in the neighborhood of a point and derives the size of the region from it.
KNN density estimation proceeds as follows:
- Fix $k$, the minimum number of points that defines a dense region.
- Compute the volume of the smallest region around $\mathbf{x}$ that contains $k$ points.
Given $k$, we estimate the density at $\mathbf{x}$ as
$$\hat f(\mathbf{x}) = \frac{k}{n \cdot \mathrm{vol}(S_d(h_{\mathbf{x}}))}$$
where $h_{\mathbf{x}}$ is the distance from $\mathbf{x}$ to its $k$-th nearest neighbor, and $\mathrm{vol}(S_d(h_{\mathbf{x}}))$ is the volume of the $d$-dimensional hypersphere of radius $h_{\mathbf{x}}$ centered at $\mathbf{x}$.
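A sketch of the KNN estimate using the closed-form volume of the $d$-dimensional hypersphere, $\mathrm{vol}(S_d(r)) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$ (function name illustrative; if $\mathbf{x}$ is itself a data point, its zero self-distance counts as one of the $k$):

```python
import numpy as np
from math import pi, gamma

def knn_density(x, data, k):
    """KNN density estimate: k / (n * volume of sphere reaching the k-th NN)."""
    n, d = data.shape
    dists = np.linalg.norm(data - x, axis=1)
    h_x = np.sort(dists)[k - 1]                          # k-th nearest neighbor distance
    vol = (pi ** (d / 2) / gamma(d / 2 + 1)) * h_x ** d  # hypersphere volume
    return k / (n * vol)
```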
Density Attractors
How can we find dense clusters?
Look at the density function and set a threshold $\xi$, i.e., a horizontal line in a plot of the density estimate: regions where the density exceeds $\xi$ form clusters. However, with a fixed threshold, some points in low-density regions might not be assigned to any cluster.
Density attractor: A point $\mathbf{x}^*$ is a density attractor if it is a local maximum of the probability density function $\hat f$.
In order to discover these peaks in the density function, we need to compute its gradient.
$$\nabla\hat f(\mathbf{x}) = \frac{\partial}{\partial\mathbf{x}}\hat f(\mathbf{x}) = \frac{1}{nh^d}\sum_{i=1}^{n}\frac{\partial}{\partial\mathbf{x}} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)$$
For the Gaussian kernel, the gradient takes a particularly neat form
$$\frac{\partial}{\partial\mathbf{x}} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right) = K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\cdot\left(\frac{\mathbf{x}_i - \mathbf{x}}{h}\right)\cdot\frac{1}{h}$$
which yields
$$\nabla\hat f(\mathbf{x}) = \frac{1}{nh^{d+2}}\sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)(\mathbf{x}_i - \mathbf{x})$$
Density-based cluster: A set $C$ of data points from a dataset $D$ is a density-based cluster w.r.t. a threshold $\xi$ if there exists a set of density attractors $\mathbf{x}_1^*, \dots, \mathbf{x}_m^*$ such that
- Each point $\mathbf{x} \in C$ is attracted to some attractor $\mathbf{x}_i^*$.
- Each density attractor exceeds the density threshold $\xi$, i.e., $\hat f(\mathbf{x}_i^*) \ge \xi$.
- Any two density attractors $\mathbf{x}_i^*, \mathbf{x}_j^*$ are density-reachable. That is, there exists a path from $\mathbf{x}_i^*$ to $\mathbf{x}_j^*$ such that for all points $\mathbf{y}$ on the path, $\hat f(\mathbf{y}) \ge \xi$.
Not all points will be part of a cluster. We treat them as noise.
The DENCLUE algorithm: For each point $\mathbf{x}$ in the data, it finds a candidate attractor $\mathbf{x}^*$ by hill climbing. If the attractor's density is above the threshold $\xi$, it is added to the set of attractors, and $\mathbf{x}$ is added to the set of points attracted by $\mathbf{x}^*$. Then all the maximal sets of mutually density-reachable attractors are computed; each such set, together with the points attracted to it, forms one cluster.
To find an attractor for a point, we can solve $\nabla\hat f(\mathbf{x}) = \mathbf{0}$; for the Gaussian kernel this gives
$$\mathbf{x} = \frac{\sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{x}_i}{h}\right)}$$
We can use this as an iterative update rule until convergence:
$$\mathbf{x}_{t+1} = \frac{\sum_{i=1}^{n} K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)\mathbf{x}_i}{\sum_{i=1}^{n} K\left(\frac{\mathbf{x}_t - \mathbf{x}_i}{h}\right)}$$
Each update costs $O(n)$, so the algorithm runs in $O(n^2 T)$, where $T$ is the number of hill-climbing iterations.
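A sketch of this hill-climbing rule (names illustrative; the kernel's normalization constant cancels in the ratio, so it is omitted):

```python
import numpy as np

def find_attractor(x, data, h, tol=1e-6, max_iter=100):
    """Climb from x to its density attractor via the fixed-point update."""
    for _ in range(max_iter):
        z = (x - data) / h                     # shape (n, d)
        k = np.exp(-0.5 * (z**2).sum(axis=1))  # unnormalized Gaussian kernel values
        x_new = (k[:, None] * data).sum(axis=0) / k.sum()
        if np.linalg.norm(x_new - x) < tol:    # converged to a local maximum
            return x_new
        x = x_new
    return x
```

In a full DENCLUE implementation, attractors found from different starting points are merged when they land within a small tolerance of each other.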
DENCLUE vs DBSCAN
DBSCAN corresponds to DENCLUE with a discrete kernel: more efficient but rougher. A DBSCAN neighborhood of radius $\varepsilon$ corresponds to a discrete-kernel window of width $h = 2\varepsilon$, and $MinPts$ plays the role of the density threshold $\xi$.
DENCLUE uses Gaussian-kernel density attractors, a smoother model that is better able to capture the underlying data distribution than DBSCAN.
Pros and Cons
👍 Advantages
- Clusters can have arbitrary shape and size; they need not be convex.
- Number of clusters is determined automatically.
- Can separate clusters from surrounding noise.
- Can be supported by spatial index structures.
👎 Disadvantages
- Input parameters ($h$ and $\xi$) may be difficult to determine.
- Can be sensitive to the input parameter setting.
Clustering Evaluation
External Measures
External measures take into account criteria that are not part of the clustered data itself, e.g., a ground-truth clustering provided by external labels.
Contingency Table
A matrix $\mathbf{N}$ s.t. $N_{ij} = |C_i \cap T_j|$, where $\mathcal{C} = \{C_1, \dots, C_r\}$ is the computed clustering and $\mathcal{T} = \{T_1, \dots, T_k\}$ is the ground-truth clustering.
Computing the contingency table takes $O(n)$ time: a single pass over the $n$ points (plus $O(rk)$ to initialize the matrix).
Purity
Purity measures to what extent a cluster $C_i$ contains points only from one ground-truth cluster:
$$\mathrm{purity}_i = \frac{1}{|C_i|}\max_{j} N_{ij}$$
The purity of a clustering is the weighted sum of the purity of each cluster:
$$\mathrm{purity} = \sum_{i=1}^{r}\frac{|C_i|}{n}\,\mathrm{purity}_i = \frac{1}{n}\sum_{i=1}^{r}\max_{j} N_{ij}$$
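A sketch of the contingency table and purity computation (names illustrative; cluster and ground-truth labels are assumed to be integers starting at 0):

```python
import numpy as np

def contingency_table(labels, truth):
    """N[i, j] = number of points in computed cluster i and ground-truth cluster j."""
    labels, truth = np.asarray(labels), np.asarray(truth)
    N = np.zeros((labels.max() + 1, truth.max() + 1), dtype=int)
    for c, t in zip(labels, truth):  # a single pass over the n points
        N[c, t] += 1
    return N

def purity(N):
    """Weighted purity = (1/n) * sum_i max_j N_ij."""
    return N.max(axis=1).sum() / N.sum()
```

For example, `purity(contingency_table([0, 0, 1, 1], [0, 1, 1, 1]))` gives $(1 + 2)/4 = 0.75$.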
Maximum Matching
Consider a 1-to-1 assignment between computed clusters and ground-truth clusters that maximizes the overall overlap; unlike purity, two computed clusters cannot be credited for the same ground-truth cluster.
F-measure
The F-measure combines precision and recall: the precision of $C_i$ is $\mathrm{prec}_i = \frac{1}{|C_i|}\max_j N_{ij}$ (identical to its purity), its recall is $\mathrm{recall}_i = \frac{N_{ij_i}}{|T_{j_i}|}$ with $j_i = \arg\max_j N_{ij}$, and the F-measure of $C_i$ is the harmonic mean of the two.
Conditional Entropy: how much uncertainty about a point's ground-truth cluster remains once its computed cluster is known; a perfect clustering has conditional entropy $0$ (a sketch follows).
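A sketch of conditional entropy $H(\mathcal{T} \mid \mathcal{C}) = \sum_i \frac{|C_i|}{n} H(\mathcal{T} \mid C_i)$ computed from the contingency table above (function name illustrative):

```python
import numpy as np

def conditional_entropy(N):
    """H(T | C): average remaining uncertainty about the ground truth per cluster."""
    n = N.sum()
    h = 0.0
    for row in N:                # one row per computed cluster C_i
        ni = row.sum()
        p = row[row > 0] / ni    # P(T_j | C_i), skipping empty cells
        h += (ni / n) * -(p * np.log2(p)).sum()
    return h
```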
Internal Measures
Based only on the clustered data itself, without any ground truth.
Silhouette coefficient: for each point, compare the mean distance $a_i$ to its own cluster with the mean distance $b_i$ to the nearest other cluster, $s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \in [-1, 1]$; the score of a clustering is the mean over all points (a sketch follows at the end of this section).
Davies-Bouldin index: the average, over clusters, of the worst-case ratio of within-cluster scatter to between-cluster separation; lower is better.
…
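As an illustration of the silhouette coefficient, a brute-force sketch (assumes numpy arrays, integer labels, and at least two clusters; names illustrative):

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: define s_i = 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()             # mean intra-cluster distance a_i
        b = min(dist[i, labels == c].mean()  # mean distance to the nearest other cluster b_i
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))
```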