Outlier detection is the task of finding anomalous points in data.

Brief Introduction

Outlier: the definition depends on the application. Hawkins: “An object that deviates so much from the rest of the data as to arouse suspicion that it was generated by a different mechanism.”

Noise vs. outliers: noise is measurement error on points that still belong to the underlying distribution, whereas an outlier is suspected to be generated by a different mechanism.

Different types of outliers:

  • Global outliers: points that significantly deviate from the rest of the data set.
  • Contextual outliers: points that deviate from a subset of the points. The context is defined by fixing the values of some dimensions (e.g., a temperature that is normal in summer may be anomalous in winter).
  • Collective outliers: a set of points may be an outlier with respect to the rest of the points. In this case, a single point might not be an outlier on its own, but the set as a whole is.

Methods for outlier detection: supervised and unsupervised.

  • Supervised outlier detection requires labels.
  • Unsupervised outlier detection does not require labels. ✅ We are interested in this one.

This lecture:

  • Low-dimensional
    • model-based
    • depth-based
    • distance-based
  • High-dimensional
    • angle-based
    • local outliers
    • clustering-based
    • isolation forests

Low-dimensional Outlier Detection

Model-based Methods

Model-based methods assume that the data is a sample from a known distribution $P(X)$. After fitting the parameters $\theta$ of the distribution, they return the points with low probability $P(\mathbb x)$, i.e., points that are unlikely to have been generated by $P(X)$.

The simplest model-based method assumes that the data $\mathcal D$ follows a normal distribution:

$$\mathcal D\sim f(\mathbb x\mid\mu,\Sigma)=\frac{1}{\sqrt{(2\pi)^d|\Sigma|}}\exp\left(-\frac{1}{2}(\mathbb x-\mu)^T\Sigma^{-1}(\mathbb x-\mu)\right)$$

A rule of thumb is the $3\sigma$ rule: flag as outliers the points that lie more than three standard deviations from the mean.
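A minimal sketch of both ideas in Python, assuming the data is a NumPy array `X` of shape `(n, d)`; the quantile threshold and the function names are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def gaussian_outliers(X, quantile=0.01):
    """Flag the points least likely under a fitted multivariate Gaussian."""
    mu = X.mean(axis=0)                       # fitted mean
    Sigma = np.cov(X, rowvar=False)           # fitted covariance
    Sigma_inv = np.linalg.inv(Sigma)
    diff = X - mu
    # squared Mahalanobis distance plays the role of -log density (up to constants)
    m2 = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return m2 > np.quantile(m2, 1 - quantile)

def three_sigma_outliers(x):
    """Univariate 3-sigma rule on a 1-D array x."""
    mu, sigma = x.mean(), x.std()
    return np.abs(x - mu) > 3 * sigma
```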

Grubbs’ test provides statistical confidence on whether a point was generated by the same normal distribution, but only for univariate data. The test’s null hypothesis $H_0$ is the absence of outliers in the data. The test works as follows (a code sketch follows the two steps):

  1. Compute the Grubbs’ test statistic $G=\frac{\max_{\mathbb x\in\mathcal D}|\mathbb x-\mu|}{\sigma}$, where $\mu$ and $\sigma$ are the sample mean and standard deviation.

  2. Reject the null hypothesis $H_0$ if

    $$G>\frac{n-1}{\sqrt n}\sqrt{\frac{t^2_{\alpha/n,\,n-2}}{n-2+t^2_{\alpha/n,\,n-2}}}$$

    where $t_{\alpha/n,\,n-2}$ is the critical value of the Student's $t$-distribution with $n-2$ degrees of freedom at significance level $\alpha/n$.
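A hedged sketch of a single Grubbs' test, assuming a 1-D NumPy array `x`; the significance level and the helper name are illustrative, and the critical value follows the formula above:

```python
import numpy as np
from scipy import stats

def grubbs_rejects_h0(x, alpha=0.05):
    """Return True if H0 (no outliers) is rejected for the univariate sample x."""
    n = len(x)
    mu, sigma = x.mean(), x.std(ddof=1)
    G = np.max(np.abs(x - mu)) / sigma                    # Grubbs' statistic
    # critical value of the Student's t-distribution with n-2 degrees of freedom
    t = stats.t.ppf(1 - alpha / n, n - 2)
    G_crit = (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))
    return G > G_crit
```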

Other model-based approaches generalize to arbitrary distributions, such as Gaussian mixtures, whose parameters are fitted with methods such as the Expectation-Maximization (EM) algorithm.
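For instance, a mixture model can be fitted with scikit-learn's `GaussianMixture` (which runs EM under the hood) and points with low log-likelihood can be flagged; the number of components and the cut-off below are assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.RandomState(0).normal(size=(300, 2))        # placeholder data

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
log_density = gmm.score_samples(X)                         # log p(x) under the mixture
outliers = X[log_density < np.quantile(log_density, 0.01)]
```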

Pros and Cons

👍

  • Strong statistical foundation
  • Output labels for outliers and non-outliers, but can also output a score of outlierness
  • Efficient to compute and easy to interpret

👎

  • Often work only in the 1-dimensional case
  • Require fixing a model for the data distribution
  • Gaussian density estimation degrades as the dimensionality grows (curse of dimensionality)

Depth-based Approaches

Depth-based approaches search for outliers that stand at the border of the data. One such approach runs the convex hull algorithm to enclose the points inside a shape. The implicit assumption is that outliers are always located at the border of the data space, while the normal points lie inside a convex shape.

The algorithm alternates computing the convex hull with removing the points on its border. Each removed point is marked with the number of the iteration in which it was removed; this number represents the depth of the point.

Finally, the algorithm marks as outliers all the points with depth $\le t$, i.e., the points peeled off within the first $t$ iterations. A sketch of this peeling procedure is shown below.
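A rough sketch of the peeling procedure for 2-D data with SciPy's `ConvexHull`; the threshold `t` and the function name are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_depths(X):
    """Assign each 2-D point the iteration in which it is peeled off the hull."""
    depths = np.zeros(len(X), dtype=int)
    remaining = np.arange(len(X))
    depth = 1
    while len(remaining) > 3:                  # stop when too few points remain
        hull = ConvexHull(X[remaining])
        border = remaining[hull.vertices]      # points on the current hull
        depths[border] = depth
        remaining = np.setdiff1d(remaining, border)
        depth += 1
    depths[remaining] = depth                  # innermost points get the last depth
    return depths

# points with depth <= t are reported as outliers:
# outliers = X[hull_depths(X) <= t]
```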

Although simple, depth-based approaches have severe limitations:

  1. The data is assumed to lie within a convex shape. This is not the case for multi-modal distributions, where the data forms several clusters.
  2. Convex hull computation is only efficient on low-dimensional data.

For these reasons, depth-based methods can only be used in specific settings.

Pros and Cons

👍

  • Uses a global reference set for outlier detection
  • Originally outputs a label, but can easily be extended to output a score, e.g., by using the depth as the outlier score.

👎

  • The data must be uni-modal.
  • Convex hull computation is usually only efficient in 2D and 3D spaces.

Distance-based Methods

A direct formalization of the outlier notion by Hawkins.

A point $\mathbb p$ is an outlier if most points are further than $\epsilon$ from $\mathbb p$.

Definition: Given a radius $\epsilon$ and a percentage $\pi$, a point $\mathbb p$ is an outlier if at most $\pi$ percent of the points $\mathbb x\in\mathcal D$ are at distance less than $\epsilon$ from $\mathbb p$.

The set of outliers $\mathcal O_\mathcal D$ is:

$$\mathcal O_\mathcal D=\left\{\mathbb p\ \text{ s.t. }\ \frac{|\{\mathbb x\in\mathcal D\mid dist(\mathbb p,\mathbb x)<\epsilon\}|}{|\mathcal D|}\le\pi\right\}$$

A convenient notion of outlier score that is easy to implement (a naive sketch is shown below).
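A naive $\mathcal O(n^2)$ sketch of this definition, assuming a NumPy array `X`; the parameter values are placeholders:

```python
import numpy as np

def distance_based_outliers(X, eps=1.0, pi=0.05):
    """Flag p as an outlier if at most a fraction pi of the points lie within eps of p."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    frac = ((D < eps).sum(axis=1) - 1) / len(X)                  # exclude the point itself
    return frac <= pi
```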

However, since $\epsilon$ and $\pi$ are defined globally, distance-based methods cannot detect outliers in regions of different density.

Pros and Cons

👍

  • Easy to implement
  • Highly interpretable

👎

  • Effectiveness depends on the choice of the distance measure $dist$
  • Cannot detect collective outliers
  • Sensitive to the choice of $\epsilon$ and $\pi$
  • Sensitive to data density

High-dimensional Outlier Detection

High-dimensional outlier detection suffers from the curse of dimensionality: outliers hide among the inliers, which makes the analysis harder.

Angle-based Outlier Detection

abbr. ABOD

The curse of dimensionality does not affect angles as much as it affects distances: angles between points remain comparatively stable as the dimensionality grows.

Idea: for an outlier, the angles formed with any pair of other points typically have small variance.

This intuition is appealing: an outlier lies away from the other points, so the directions from it to all the other points are very similar. (The farther the outlier is from the cluster, the smaller the spread of the angles between these lines.)

ABOD works as follows:

  • Consider the angles formed by a point $\mathbb p$ with every pair of points $\mathbb x,\mathbb y\in\mathcal D$
  • Plot the angles in a chart
  • The variance of the angles indicates whether the point is an outlier or an inlier.

ABOD measures the variance of the angles, weighted by the corresponding distances:

$$ABOD(\mathbb p)=VAR_{\mathbb x,\mathbb y\in\mathcal D}\left(\frac{\langle\mathbb x-\mathbb p,\mathbb y-\mathbb p\rangle}{\|\mathbb x-\mathbb p\|^2\cdot\|\mathbb y-\mathbb p\|^2}\right)$$

A point with a small ABOD score is an outlier. The naive computation of ABOD takes $\mathcal O(n^3)$ time, due to the number of pairs considered for each point.

One possible solution is to consider only a subset of the pairs, which gives a good approximation of the exact ABOD score. A naive version of the exact computation is sketched below.
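A naive sketch of the score above, looping over all pairs as in the definition (hence $\mathcal O(n^3)$ overall); the function name and data layout are assumptions:

```python
import numpy as np
from itertools import combinations

def abod_score(X, i):
    """Variance of the distance-weighted angle terms that point i forms with all pairs."""
    p = X[i]
    others = [j for j in range(len(X)) if j != i]
    terms = []
    for a_idx, b_idx in combinations(others, 2):
        a, b = X[a_idx] - p, X[b_idx] - p
        terms.append(np.dot(a, b) / (np.dot(a, a) * np.dot(b, b)))
    return np.var(terms)                     # a small value suggests an outlier

# scores = np.array([abod_score(X, i) for i in range(len(X))])
```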

Pros and Cons

👍

  • Global approach to outlier detection
  • Performs well on high-dimensional data

👎

  • The naive algorithm runs in $\mathcal O(n^3)$
  • Cannot detect collective outliers

Local Methods

A point is an outlier if its local density is substantially lower than the density of its own neighbors.

Local outliers are defined in terms of nearest neighbors and only require specifying the parameter $k$, i.e., how many neighboring points to consider.

One such local method is the Local Outlier Factor (LOF).

The $k$-NN distance $dist_k(\mathbb p)$ is the distance between a point $\mathbb p$ and its $k$-th nearest neighbor.

The reachability distance of a point $\mathbb p$ from a point $\mathbb x$ is the distance $dist(\mathbb p,\mathbb x)$, lower-bounded by the $k$-NN distance of $\mathbb x$:

$$rdist_k(\mathbb p,\mathbb x)=\max\{dist_k(\mathbb x),\ dist(\mathbb p,\mathbb x)\}$$

The local reachability density (lrd) of a point $\mathbb p$ is the inverse of the average reachability distance of $\mathbb p$ from its $k$ nearest neighbors. The higher $lrd_k(\mathbb p)$, the closer $\mathbb p$ is to its $k$ nearest points:

$$lrd_k(\mathbb p)=\frac{|N_k(\mathbb p)|}{\sum_{\mathbb x\in N_k(\mathbb p)}rdist_k(\mathbb p,\mathbb x)}$$

Finally, the Local Outlier Factor (LOF) is the average ratio between the local reachability densities of the neighbors of $\mathbb p$ and that of $\mathbb p$ itself. The LOF measures the density around the neighbors of a point relative to the density around the point itself:

$$LOF_k(\mathbb p)=\frac{1}{|N_k(\mathbb p)|}\sum_{\mathbb x\in N_k(\mathbb p)}\frac{lrd_k(\mathbb x)}{lrd_k(\mathbb p)}$$

If the density of points around $\mathbb p$ is less than the density around its neighbors, the LOF will be large.

LOF is less affected by the curse of dimensionality than global distance-based methods. However, LOF struggles in low-density regions.
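In practice LOF is available off the shelf; a minimal usage sketch with scikit-learn's `LocalOutlierFactor` (the synthetic data and the choice $k=20$ are assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.RandomState(0).normal(size=(200, 2))
X = np.vstack([X, [[6.0, 6.0]]])             # inject one obvious outlier

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                  # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_       # larger score -> more outlying
```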

Pros and Cons

👍

  • Provides a score of outlierness
  • Easy interpretability
  • Not affected by the curse of dimensionality

👎

  • Inefficient due to the computation of the neighbors
  • Requires the correct setting of the parameter kk
  • Hard to detect outliers in low-density regions

Clustering-based Methods

For example, in DBSCAN, every object not assigned to any cluster (a noise point) is treated as an outlier.
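A small sketch of this view with scikit-learn's DBSCAN, where noise points receive the label -1; the data and the parameters `eps` and `min_samples` are assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.RandomState(0).normal(size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])             # an isolated point

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
outliers = X[labels == -1]                   # points assigned to no cluster
```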

FindCBLOF. In order to find outliers using clustering techniques, we can use a general method called FindCBLOF (a rough code sketch follows the steps below):

  • Compute clusters using a clustering algorithm
  • Sort the clusters in decreasing order by number of points
  • Assign to each point a cluster-based local outlier factor (CBLOF)
    • If point $\mathbb p$ belongs to a large cluster $C_i$, then $CBLOF(\mathbb p)=|C_i|\cdot S$, where $S$ is the similarity between $\mathbb p$ and its own cluster.
    • If $\mathbb p$ belongs to a small cluster $C_i$, then $CBLOF(\mathbb p)=|C_i|\cdot S$, where $S$ is the similarity between $\mathbb p$ and the closest large cluster.
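A rough sketch of the CBLOF idea, assuming cluster labels from any clustering algorithm (e.g., k-means); the 90% size cut-off for "large" clusters and the inverse-distance similarity are illustrative assumptions, not the original definition:

```python
import numpy as np

def cblof_scores(X, labels, large_frac=0.9):
    """Lower CBLOF -> more outlying, under a similarity of 1/(1+distance) to a centroid."""
    clusters = [np.where(labels == c)[0] for c in np.unique(labels)]
    clusters.sort(key=len, reverse=True)                       # decreasing size
    sizes = np.array([len(c) for c in clusters])
    n_large = int(np.searchsorted(np.cumsum(sizes) / len(X), large_frac)) + 1
    centroids = [X[c].mean(axis=0) for c in clusters]

    scores = np.zeros(len(X))
    for i, idx in enumerate(clusters):
        for p in idx:
            if i < n_large:                                    # large cluster: own centroid
                d = np.linalg.norm(X[p] - centroids[i])
            else:                                              # small cluster: closest large one
                d = min(np.linalg.norm(X[p] - centroids[j]) for j in range(n_large))
            scores[p] = sizes[i] / (1.0 + d)                   # |C_i| * similarity
    return scores
```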

Pros and Cons

👍

  • Detect outliers without requiring labeled data
  • Works on several types of data
  • Clusters can be regarded as summaries of data

👎

  • The effectiveness depends on the chosen clustering method
  • High computational cost to compute the clusters

Isolation Forest

The Isolation Forest (IF) is a method to detect outliers based on the notion of separability.

A point $\mathbb p$ is separable if it is easy to find a random hyper-plane that separates it from the rest of the data. The idea is similar to a decision tree.

The probability that a random split of the data separates a point $\mathbb p$ from the others is higher if the point is an outlier. This means that, with high probability, such a point becomes a leaf close to the root of a randomly-built decision tree.

However, a point $\mathbb p$ can also be separated early by chance. Therefore, the algorithm builds a number of trees (a forest), computes the expected length $\mathbb E(h(\mathbb p))$ of the root-to-leaf path, and uses this expectation to compute an outlier score

$$s(\mathbb p)=2^{-\mathbb E(h(\mathbb p))/c(n)}$$

where $c(n)$ is the expected path length in a binary search tree with $n$ data points, used for normalization. An outlier has a score close to 1, i.e., its expected root-to-leaf path length $\mathbb E(h(\mathbb p))\approx 0$.
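A minimal usage sketch with scikit-learn's `IsolationForest` (the data and hyperparameters are assumptions; `score_samples` returns the negated score, so it is flipped back to match the formula above):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.RandomState(0).normal(size=(256, 3))
X = np.vstack([X, [[7.0, 7.0, 7.0]]])        # an isolated point

iforest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = iforest.predict(X)                  # -1 for outliers, 1 for inliers
scores = -iforest.score_samples(X)           # close to 1 -> likely outlier
```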

Pros and Cons

👍

  • Easy to construct: the trees use random splits, so no difficult decisions on whether individual data points are anomalies are needed.
  • Can achieve sublinear time complexity and a small memory footprint by exploiting sub-sampling, since it eliminates the distance calculations that dominate the cost of distance- and density-based anomaly detection methods.
  • Can provide anomaly explanations

👎

  • Requires hyperparameter tuning, e.g., the number of trees, the sample size, and the height of the trees.
  • Requires a high percentage of relevant features to identify anomalies.