Data Mining 5 Outlier Detection
Outlier detection is the task of finding anomalous points in data.
Brief Introduction
Outlier: definition depends on the application. “An object that deviates so much from the rest of the data as to arouse suspicion that it was generated by a different mechanism.”
Noise vs. outliers: noise is a random measurement error and should be removed before outlier analysis, while an outlier is a genuine data point that deviates from the rest and may be generated by a different mechanism.
Different types of outliers:
- Global outliers: Points that significantly deviate from the rest of the data set.
- Contextual outliers: Points that deviate from a subset of points (the context). The context is defined by fixing the values of some dimensions.
- Collective outliers: A set of points that is anomalous w.r.t. the rest of the points. A single point in the set might not be an outlier, but the set as a whole is.
Methods for outlier detection: supervised and unsupervised.
- Supervised outlier detection requires labels.
- Unsupervised outlier detection requires no labels. ✅ We are interested in this one.
This lecture:
- Low-dimensional
  - model-based
  - depth-based
  - distance-based
- High-dimensional
  - angle-based
  - local outliers
  - clustering-based
  - isolation forests
Low-dimensional Outlier Detection
Model-based Methods
Model-based methods assume that the data is a sample from a known distribution. After fitting the parameters of the distribution, they return the points with low probability under it, i.e., points that are unlikely to be generated by the distribution.
The simplest model-based method assumes the data follows a normal distribution $\mathcal{N}(\mu, \sigma^2)$.
A rule of thumb: flag as outliers all points $x$ with $|x - \mu| > 3\sigma$ (the three-sigma rule).
The Grubbs' test provides statistical confidence on whether a point is generated by the same normal distribution, but only for univariate data. The test's null hypothesis is the absence of outliers in the data. The test works as follows:
- Compute the Grubbs' test statistic
  $$G = \frac{\max_{i=1,\dots,N} |x_i - \bar{x}|}{s}$$
  where $\bar{x}$ is the sample mean and $s$ the sample standard deviation.
- Reject the null hypothesis at significance level $\alpha$ if
  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N-2+t^2_{\alpha/(2N),\,N-2}}}$$
  where $t_{\alpha/(2N),\,N-2}$ is the critical value of the t-Student distribution with $N-2$ degrees of freedom.
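A minimal sketch of the two-sided Grubbs' test, assuming univariate data in a NumPy array; the function name `grubbs_test` and the default `alpha` are illustrative choices, not from the lecture:

```python
import numpy as np
from scipy import stats

def grubbs_test(x, alpha=0.05):
    """Two-sided Grubbs' test; returns (G, critical value, reject H0?)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Test statistic: largest absolute deviation from the mean, in units of s
    G = np.max(np.abs(x - x.mean())) / x.std(ddof=1)
    # Critical value built from the t-Student quantile with n-2 degrees of freedom
    t = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G, crit, G > crit
```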
Other model-based approaches generalize to arbitrary distributions, such as Gaussian mixtures. We fit the model parameters using methods such as the Expectation-Maximization (EM) algorithm.
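As a sketch of the mixture-based variant, one could fit a Gaussian mixture with scikit-learn (EM under the hood) and flag the lowest-likelihood points; the function name `gmm_outliers` and the quantile threshold are assumptions, not part of the lecture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_outliers(X, n_components=3, quantile=0.01):
    """Fit a Gaussian mixture with EM and flag the lowest-likelihood points."""
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    log_lik = gmm.score_samples(X)              # per-point log-likelihood
    threshold = np.quantile(log_lik, quantile)  # e.g. the bottom 1% of points
    return log_lik < threshold
```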
Pros and Cons
👍
- Strong statistical foundation
- Output labels for outliers and non-outliers, but can also output a score of outlierness
- Efficient to compute and easy to interpret
👎
- Often work only in the 1-dimensional case
- Require fixing a model for the data distribution
- Fitting a Gaussian becomes unreliable as the dimensionality grows
Depth-based Approaches
Depth-based approaches search for outliers at the border of the data. One such approach runs a convex hull algorithm to enclose the points inside a shape. The implicit assumption is that outliers lie on the border of the data space, while normal points lie inside a convex shape.
The algorithm alternates the computation of the convex hull with the removal of the points on the hull. Each removed point is marked with the iteration in which it was removed; this number is the depth of the point.
Finally, the algorithm marks as outliers all points with depth at most a threshold $k$.
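A rough sketch of convex-hull peeling using scipy.spatial.ConvexHull; the function name, the `max_depth` threshold, and the handling of leftover interior points are illustrative assumptions (degenerate point sets would need extra care):

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_depth(points, max_depth=3):
    """Peel convex hulls; a point's depth is the iteration in which it is removed.
    Points with depth <= max_depth are flagged as outliers."""
    points = np.asarray(points, dtype=float)
    depth = np.zeros(len(points), dtype=int)
    remaining = np.arange(len(points))
    d = 1
    # Keep peeling while there are enough points left to build a hull
    while len(remaining) > points.shape[1] + 1:
        hull = ConvexHull(points[remaining])
        on_hull = remaining[hull.vertices]   # indices (in the original array) on the hull
        depth[on_hull] = d
        remaining = np.setdiff1d(remaining, on_hull)
        d += 1
    depth[remaining] = d                     # leftover interior points get the last depth
    return depth, depth <= max_depth
```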
Although simple, depth-based approaches have severe limitations:
- The data is assumed to lie on a convex shape. Not the case for multi-modal distributions in which the data contains many clusters.
- Convex hull computation is only efficient on low-dimensional data.
For these reasons, depth-based methods can only be used in specific settings.
Pros and Cons
👍
- Uses a global reference set for outlier detection
- Originally outputs a label, but can easily be extended to scoring, e.g., by taking the depth as the score
👎
- The data must be uni-modal.
- Convex hull computation is usually only efficient in 2D and 3D spaces.
Distance-based Methods
A direct formalization of the outlier notion by Hawkins.
A point $p$ is an outlier if most points are further than a distance $\varepsilon$ from $p$.
Definition: Given a radius $\varepsilon$ and a percentage $\pi$, a point $p$ is an outlier if at most $\pi$ percent of the points are at distance less than $\varepsilon$ from $p$.
The set of outliers:
$$\text{Outliers} = \left\{ p \;\middle|\; \frac{|\{q \in DB : \mathrm{dist}(p, q) < \varepsilon\}|}{|DB|} \le \pi \right\}$$
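A brute-force sketch of this definition, with $\pi$ taken as a fraction in $[0, 1]$; the function name and the use of Euclidean distance are assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist

def db_outliers(X, eps, pi):
    """Distance-based outliers: at most a fraction pi of the other points
    lie within distance eps of the point."""
    X = np.asarray(X, dtype=float)
    D = cdist(X, X)                    # all pairwise Euclidean distances
    close = (D < eps).sum(axis=1) - 1  # neighbours within eps, excluding the point itself
    return close / (len(X) - 1) <= pi
```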
A convenient notion of outlier score, easy to implement.
However, since $\varepsilon$ and $\pi$ are defined globally, distance-based outliers cannot detect outliers in regions of different density.
Pros and Cons
👍
- Easy to implement
- Highly interpretable
👎
- Effectiveness depends on the choice of distance measure
- Cannot detect collective outliers
- Sensitive to the choice of $\varepsilon$ and $\pi$
- Sensitive to data density
High-dimensional Outlier Detection
High-dimensional data suffers from the curse of dimensionality: distances become less meaningful, outliers hide among the inliers, and the analysis gets harder.
Angle-based Outlier Detection
abbr. ABOD
The curse of dimensionality affects angles less than distances, because angles still capture the relative positions of the points in high dimensions.
Idea: for an outlier, the angles it forms with any pair of other points have small variance.
This is an interesting observation: an outlier lies away from the rest of the points, so the directions from the outlier to all other points are very similar. (The farther the outlier is from the cluster, the closer those angles become.)
ABOD works as follows:
- Consider the angles formed by a point $p$ with every pair of other points $x, y$
- Plot the angles in a chart
- The variance of the angles represents an indication on whether a point is an outlier or an inlier.
ABOD measures the variance of the angles, weighted by the corresponding distances:
$$\mathrm{ABOD}(p) = \mathrm{Var}_{x, y \in DB} \left( \frac{\langle \overrightarrow{px}, \overrightarrow{py} \rangle}{\lVert \overrightarrow{px} \rVert^2 \cdot \lVert \overrightarrow{py} \rVert^2} \right)$$
A point with a small ABOD score is likely an outlier. The naive computation of ABOD takes $O(n^3)$ time due to the number of pairs considered for each point.
One possible solution is to consider only a subset of the pairs, which gives a good approximation of the exact ABOD.
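A naive $O(n^3)$ sketch of the score above; the function name is illustrative, and in practice one would sample pairs as just described:

```python
import numpy as np
from itertools import combinations

def abod_scores(X):
    """Naive ABOD: variance of the distance-weighted angle factor over all
    pairs of other points; small scores indicate outliers. Runs in O(n^3)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        factors = []
        for j, k in combinations([m for m in range(n) if m != i], 2):
            a, b = X[j] - X[i], X[k] - X[i]
            # cosine-like factor, down-weighted by the squared lengths of both vectors
            factors.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
        scores[i] = np.var(factors)
    return scores
```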
Pros and Cons
👍
- Global approach to outlier detection
- Performs well on high-dimensional data
👎
- The naive algorithm runs in $O(n^3)$ time
- Cannot detect collective outliers
Local Methods
A point is an outlier if its local density is lower than that of its own neighbors.
Local outliers are defined in terms of nearest neighbors and only require specifying the parameter $k$, i.e., how many neighboring points to consider.
One of the local methods is the Local Outlier Factor (LOF).
The $k$-NN distance $k\text{-dist}(p)$ is the distance between a point $p$ and its $k$-th nearest neighbor.
The reachability distance of a point $p$ from a point $o$ is a lower-bounded version of their distance: $\mathrm{reach\text{-}dist}_k(p, o) = \max\{k\text{-dist}(o), \mathrm{dist}(p, o)\}$.
The local reachability density (lrd) of a point $p$ is the inverse of the average reachability distance of $p$ from its $k$ nearest neighbors:
$$\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o) \right)^{-1}$$
The higher the lrd, the closer $p$ is to its $k$ nearest points.
Finally, the Local Outlier Factor (LOF) is the average ratio between the local reachability densities of the neighbors of $p$ and that of $p$ itself:
$$\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}$$
The LOF measures the density around the neighbors of a point relative to the density around the point itself.
If the density around $p$ is lower than the density around its neighbors, the LOF will be large.
LOF is less affected by the curse of dimensionality. However, LOF struggles in low-density regions.
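In practice, a minimal sketch with scikit-learn's LocalOutlierFactor; the toy data and the `n_neighbors` value below are illustrative:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),   # a loose cluster
               rng.normal(5, 0.2, size=(50, 2)),  # a much denser cluster
               [[10.0, 10.0]]])                   # an isolated point

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # -1 for outliers, +1 for inliers
scores = -lof.negative_outlier_factor_  # roughly the LOF value; higher = more outlying
```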
Pros and Cons
👍
- Provides a score of outlierness
- Easy interpretability
- Not affected by the curse of dimensionality
👎
- Inefficient due to the computation of the neighbors
- Requires the correct setting of the parameter $k$
- Hard to detect outliers in low-density regions
Clustering-based Methods
e.g. in DBSCAN, every object not contained in a cluster must be an outlier.
FindCBLOF. In order to find outliers using clustering techniques, we can use a general method called FindCBLOF:
- Compute clusters using a clustering algorithm
- Sort the clusters in decreasing order by number of points
- Assign to each point a cluster-based local outlier factor (CBLOF)
- If a point $p$ belongs to a large cluster $C$, then $\mathrm{CBLOF}(p) = |C| \cdot \mathrm{sim}(p, C)$, where $\mathrm{sim}(p, C)$ is the similarity between $p$ and its cluster.
- If $p$ belongs to a small cluster $C$, then $\mathrm{CBLOF}(p) = |C| \cdot \mathrm{sim}(p, C')$, where $C'$ is the closest large cluster (a sketch follows below).
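A simplified sketch of the idea using k-means; here the score uses the distance to the relevant large-cluster centroid instead of a similarity, so higher means more outlying. The `alpha` split and all names are assumptions, not the original FindCBLOF:

```python
import numpy as np
from sklearn.cluster import KMeans

def cblof_scores(X, n_clusters=8, alpha=0.9):
    """Simplified CBLOF-style score: cluster size times the distance to the
    relevant large-cluster centroid; higher means more outlying."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=n_clusters)
    # Large clusters: the biggest ones that together cover a fraction alpha of the data
    order = np.argsort(-sizes)
    n_large = np.searchsorted(np.cumsum(sizes[order]), alpha * len(X)) + 1
    large = set(order[:n_large].tolist())
    scores = np.empty(len(X))
    for i, (x, c) in enumerate(zip(X, labels)):
        if c in large:
            scores[i] = sizes[c] * np.linalg.norm(x - centers[c])
        else:
            # small cluster: distance to the closest large cluster's centroid
            scores[i] = sizes[c] * min(np.linalg.norm(x - centers[l]) for l in large)
    return scores
```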
Pros and Cons
👍
- Detect outliers without requiring labeled data
- Works on several types of data
- Clusters can be regarded as summaries of data
👎
- The effectiveness depends on the chosen clustering method
- High computational cost to compute the clusters
Isolation Forest
abbr. IF. Isolation Forest detects outliers based on the notion of separability.
A point is separable if it is easy to find a random hyper-plane that separates it from the rest; the idea is similar to a decision tree.
The probability of being separated from the others in a random split of the data is higher if the point is an outlier. This means that, with high probability, such a point ends up in a leaf close to the root of a randomly built decision tree.
However, a point can also be separated by chance. Therefore, the algorithm builds a number of trees (a forest), computes the expected length $E[h(x)]$ of the root-to-leaf path of a point $x$, and uses this expectation to compute an outlier score
$$s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}$$
where $c(n)$ is the expected path length of an unsuccessful search in a binary search tree with $n$ data points, used for normalization. An outlier has a score close to 1, i.e., the expected length of its root-to-leaf path is much smaller than $c(n)$.
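A quick usage sketch with scikit-learn's IsolationForest; the toy data and hyperparameter values are illustrative:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(300, 2)),  # normal points
               [[6.0, 6.0], [-7.0, 5.0]]])       # two obvious anomalies

iforest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
labels = iforest.fit_predict(X)     # -1 for outliers, +1 for inliers
scores = -iforest.score_samples(X)  # higher = shorter average path = more anomalous
```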
Pros and Cons
👍
- Easy to construct: trees are built randomly, without difficult decisions about whether individual data points are anomalies
- Can achieve sub-linear time complexity and a small memory footprint by exploiting sub-sampling, since it avoids the distance computations that dominate the cost of distance- and density-based methods
- Can provide anomaly explanations
👎
- Requires hyperparameter tuning: number of trees, sample size, and tree height
- Requires a high percentage of relevant features to identify anomalies