Data Mining 13: Sequence Segmentation and Similarities
How to segment a sequence of points with an efficient algorithm, and how to find similar documents in linear time.
Two algorithms are covered.
The first clusters a sequence into contiguous partitions with dynamic programming.
The second is an approximation that avoids computing all pairwise distances. The idea is to use compact signatures and hash the signatures in bands, so that similar items will likely end up in the same bucket.
Sequence Segmentation
A sequence $T = \{t_1, t_2, \ldots, t_n\}$ is an ordered set of $n$ points $t_i \in \mathbb{R}^d$.
Sequence segmentation: Given a sequence $T$, find $k$ contiguous segments that are as homogeneous as possible.
The sequence segmentation problem requires a notion of homogeneity similar to that of clustering. However, segments must be contiguous subsequences, which renders the problem different from clustering.
A sequence is homogeneous if its points have a small deviation from the mean. Let $T = \{t_1, \ldots, t_n\}$ be a sequence of $n$ $d$-dimensional points.
Definition of $k$-segmentation: A $k$-segmentation $S$ is a partition of $T$ into $k$ contiguous segments $\{s_1, \ldots, s_k\}$. Each segment $s_j$ is represented by a $d$-dimensional mean vector $\mu_j$.
The mean vector has the same role as the centroid in $k$-means clustering. The error of a $k$-segmentation is the error of replacing each individual point with the mean of its segment. As in $k$-means, one of the most common error functions is the sum of squared errors:

$$E(S) = \sum_{j=1}^{k} \sum_{t \in s_j} \lVert t - \mu_j \rVert^2$$
The $k$-segmentation problem
Given a sequence $T$ of length $n$, find a $k$-segmentation $S = \{s_1, \ldots, s_k\}$ of $T$ such that the SSE $E(S)$ is minimized.
A Dynamic Programming Solution
A dynamic programming solution due to Bellman solves the $k$-segmentation problem in polynomial time.
The idea is to build the solution bottom-up, solving progressively larger instances of the problem.
DP requires the optimal-substructure property: an optimal solution to the problem contains optimal solutions to its sub-problems.
Denote the sub-sequence $\{t_a, \ldots, t_b\}$ with $T[a, b]$. Let $A[i, j]$ be the error of the optimal segmentation of the sub-sequence $T[1, i]$ with $j$ segments, $j \le k$. The dynamic programming algorithm follows the recursion

$$A[i, j] = \min_{1 \le l < i} \left\{ A[l, j-1] + E(T[l+1, i]) \right\},$$

where $E(T[l+1, i])$ is the SSE of representing the segment $T[l+1, i]$ by its mean. In practice we can use a 2-dimensional $n \times k$ table that stores in position $(i, j)$ the value $A[i, j]$.
Pseudo-code
Input: Sequence $T = \{t_1, \ldots, t_n\}$, number of segments $k$
- for each $i \in \{1, \ldots, n\}$ do $A[i, 1] \leftarrow E(T[1, i])$
- for each $j \in \{1, \ldots, k\}$ do $A[1, j] \leftarrow 0$
- for each $j \in \{2, \ldots, k\}$ do
  - for each $i \in \{2, \ldots, n\}$ do
    - for each $l \in \{1, \ldots, i-1\}$ do compute $A[l, j-1] + E(T[l+1, i])$, keeping the minimum as $A[i, j]$
To recover the actual segmentation, the matrix should also store the values of $l$ minimizing the error.
Complexity
The algorithm needs to fill an $n \times k$ table; each cell computes a minimum over $O(n)$ candidate cells, and each candidate requires computing the error $E(\cdot)$ in $O(n)$. The total complexity of the naive version is thus $O(n^3 k)$.
The error per cell can be further optimized by noticing that

$$E(T[l+1, i]) = \sum_{m=l+1}^{i} \lVert t_m \rVert^2 - \frac{1}{i - l} \left\lVert \sum_{m=l+1}^{i} t_m \right\rVert^2,$$

which requires the repeated computation of $\sum_{m=1}^{i} \lVert t_m \rVert^2$ and $\sum_{m=1}^{i} t_m$. These two prefix sums can easily be pre-computed for all $i$. With this optimization, the complexity is reduced to $O(n^2 k)$.
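To make the recursion and the prefix-sum optimization concrete, here is a minimal Python sketch (NumPy assumed; the function name and breakpoint bookkeeping are illustrative, not the lecture's reference code):

```python
import numpy as np

def k_segmentation(T, k):
    """O(n^2 k) DP for k-segmentation of an (n, d) array of points T."""
    n, d = T.shape
    # Prefix sums of the points and of their squared norms: the SSE of
    # any segment T[l+1..i] is then available in O(1).
    S = np.vstack([np.zeros(d), np.cumsum(T, axis=0)])
    Q = np.concatenate([[0.0], np.cumsum((T ** 2).sum(axis=1))])

    def sse(l, i):
        """SSE of the segment t_{l+1}, ..., t_i (1-indexed, inclusive)."""
        seg = S[i] - S[l]
        return Q[i] - Q[l] - (seg @ seg) / (i - l)

    A = np.full((n + 1, k + 1), np.inf)      # A[i, j]: optimal error
    B = np.zeros((n + 1, k + 1), dtype=int)  # breakpoints, for recovery
    for i in range(1, n + 1):
        A[i, 1] = sse(0, i)
    A[1, 1:] = 0.0
    for j in range(2, k + 1):
        for i in range(2, n + 1):
            for l in range(1, i):
                cand = A[l, j - 1] + sse(l, i)
                if cand < A[i, j]:
                    A[i, j], B[i, j] = cand, l
    # Walk the breakpoints backwards to recover the segment borders.
    borders, i = [], n
    for j in range(k, 1, -1):
        i = B[i, j]
        borders.append(i)
    return A[n, k], borders[::-1]

# Example: a 1-dimensional sequence with three obvious segments.
T = np.array([[1.0], [1.1], [0.9], [5.0], [5.1], [9.0], [9.2]])
print(k_segmentation(T, 3))  # small error, borders [3, 5]
```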
Heuristic: we can further reduce the computational burden by using greedy approaches or local search that assign the segment borders randomly and then move them so as to reduce the error. With the pre-computed prefix sums, each sweep over the borders takes time linear in $n$; the total time depends on the number of iterations until convergence.
Finding Similar Points Efficiently
Comparing each point in the data with every other point requires $O(n^2)$ comparisons.
Given a dataset $X$ of $n$ $d$-dimensional data points, find all pairs of data points $(x, y)$ such that $d(x, y) \le \varepsilon$ for some distance function $d$ and threshold $\varepsilon$.
With Locality-Sensitive Hashing, we can get rid of the quadratic number of comparisons.
Wasn't LSH covered just this morning in the randomized algorithms course... it all comes full circle.
Definition of Jaccard similarity: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union, $\mathrm{sim}(C_1, C_2) = |C_1 \cap C_2| / |C_1 \cup C_2|$. The Jaccard distance is then $d(C_1, C_2) = 1 - |C_1 \cap C_2| / |C_1 \cup C_2|$.
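As a quick sanity check, a minimal Python sketch of both quantities (the set contents are illustrative):

```python
def jaccard_sim(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b)

c1, c2 = {"ab", "bc", "ca"}, {"ab", "bc", "cd"}
print(jaccard_sim(c1, c2))      # 2 / 4 = 0.5
print(1 - jaccard_sim(c1, c2))  # Jaccard distance: 0.5
```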
A large number of documents does not fit into memory, and we need a representation of each document that preserves the distances while considerably reducing the memory footprint. In order to achieve such a goal, we use three steps:
- Shingling: Convert documents to sets.
- Min-hashing: Convert large sets to short signatures, while preserving similarity.
- LSH: Focus on pairs of signatures likely to be from similar documents.
Shingling
A $k$-shingle is a sequence of $k$ consecutive tokens (characters or words) that appears in the document; a document is represented by the set of its $k$-shingles.
Idea: each document becomes a sparse binary vector in a very high-dimensional space, and similar documents share many shingles.
For instance, with $k = 2$ the set of 2-shingles for the document $D_1 = \texttt{abcab}$ is $\{ab, bc, ca\}$.
Long shingles can be easily compressed by hashing, and documents are then compared through the hash values. Finally, each document is a binary vector in the space of $k$-shingles, in which each shingle is a dimension.
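A minimal Python sketch of shingling with character shingles (the function names and the hash-compression step are illustrative):

```python
def shingles(doc: str, k: int = 5) -> set[str]:
    """Set of character k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc: str, k: int = 5, buckets: int = 2**32) -> set[int]:
    """Compress long shingles into integer hash values.
    Python's built-in hash is used for brevity; it is randomized per
    process, so a stable hash would be needed across runs."""
    return {hash(s) % buckets for s in shingles(doc, k)}

print(shingles("abcab", k=2))  # {'ab', 'bc', 'ca'}
```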
We assume that documents have a large number of common shingles if they are similar. However, $k$ is a critical choice when comparing documents: a small $k$ leads to a large number of spuriously similar documents. As a rule of thumb, $k = 5$ is a reasonable choice for short documents and $k = 10$ for long ones.
Shingling doesn’t solve the problem of the large number of similarity computations.
Min-hash
The document is converted into a set of shingles and consequently into a binary vector, where each dimension is a shingle and a $1$ in position $i$ of document $j$'s column indicates that document $j$ contains shingle $i$. This way we can represent our dataset as a matrix where each row represents a shingle and each column a document.
Computing the Jaccard similarity in such a matrix relies on simple bit-wise operations: AND for the intersection and OR for the union.
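For example, a small NumPy sketch on two toy columns of the shingle-document matrix:

```python
import numpy as np

# Columns of the shingle-document matrix as binary vectors (toy data).
c1 = np.array([1, 0, 1, 1, 0], dtype=bool)
c2 = np.array([1, 1, 0, 1, 0], dtype=bool)

# Jaccard similarity via bit-wise AND (intersection) and OR (union).
sim = (c1 & c2).sum() / (c1 | c2).sum()
print(sim)  # 2 / 4 = 0.5
```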
Min-Hashing is used for faster comparisons. It converts the binary representation into small signatures through hashing. The key idea is to hash each column $C$ to a small signature $h(C)$ s.t.
- $h(C)$ is small enough that the signature fits in RAM
- $\mathrm{sim}(C_1, C_2)$ is approximately the same as the similarity of the signatures $h(C_1)$ and $h(C_2)$
Definition of min-hash function: Given a random permutation $\pi$ of the rows, the min-hash function $h_\pi(C)$ returns the index of the first row, in the order induced by $\pi$, in which column $C$ has value $1$.
The signature formed by several such min-hash values is a compact approximation of the set.
Min-hash preserves the similarity, in the sense that $\Pr[h_\pi(C_1) = h_\pi(C_2)] = \mathrm{sim}(C_1, C_2)$.
A trick
For each column $c$ and hash function $h_i$, keep a slot $M(i, c)$ for the min-hash value. Initialize $M(i, c) = \infty$, then scan the rows looking for $1$s: if row $j$ has a $1$ in column $c$, set $M(i, c) \leftarrow \min(M(i, c), h_i(j))$ for every $h_i$. How to choose the $h_i$? Using universal hashing, e.g., $h_{a,b}(x) = (a \cdot x + b) \bmod p$ for a prime $p$ and random $a, b$.
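A minimal Python sketch of this one-pass trick (the prime, the toy sets, and the function name are illustrative):

```python
import random

def minhash_signatures(columns, num_hashes=100, prime=4294967311):
    """One-pass min-hash; columns[c] is the set of row indices with a 1."""
    # Universal hash functions h(x) = (a*x + b) mod prime, a and b random.
    hashes = [(random.randrange(1, prime), random.randrange(prime))
              for _ in range(num_hashes)]
    M = [[float("inf")] * len(columns) for _ in range(num_hashes)]
    for c, rows in enumerate(columns):
        for j in rows:                      # rows with a 1 in column c
            for i, (a, b) in enumerate(hashes):
                M[i][c] = min(M[i][c], (a * j + b) % prime)
    return M

# Toy data: the first two sets have Jaccard similarity 3/5 = 0.6.
cols = [{0, 2, 5, 7}, {0, 2, 5, 9}, {1, 3, 4, 6}]
M = minhash_signatures(cols)
est = sum(M[i][0] == M[i][1] for i in range(len(M))) / len(M)
print(est)  # close to 0.6 on average
```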
Again something we already learned in the randomized algorithms course.
LSH
LSH tackles the problem of the quadratic number of comparisons by hashing the signatures in an appropriate manner.
Intuition: use a function $f(x, y)$ to determine whether $x$ and $y$ are a candidate pair, i.e., a pair whose similarity must be evaluated. For min-hash matrices, we hash the columns of the signature matrix $M$ to many buckets; if two documents hash into the same bucket, they become a candidate pair.
LSH works as follows:
- Pick a similarity threshold $s$.
- The columns $x$ and $y$ of $M$ are a candidate pair if their signatures agree on at least a fraction $s$ of their rows, i.e., $M(i, x) = M(i, y)$ for at least a fraction $s$ of the rows $i$.
- The idea is to hash the columns of the signature matrix several times, so that similar columns hash to the same bucket with high probability.
LSH divides the signature matrix $M$ into $b$ bands of $r$ rows each. For each band, compute a signature for every column using a hash function with $k$ buckets, where $k$ is arbitrarily large. The candidate column pairs are those that hash to the same bucket for at least one band. Tune $b$ and $r$ to catch most similar pairs but few non-similar pairs.
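A minimal Python sketch of the banding step, reusing the signature matrix `M` from the min-hash sketch above (using each band's raw tuple of values as the bucket key stands in for a hash function with arbitrarily many buckets):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(M, b, r):
    """Banding: M is a (b*r) x n_cols signature matrix (list of rows)."""
    assert len(M) == b * r
    n_cols = len(M[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # The band's chunk of the column acts as its bucket key.
            key = tuple(M[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols_in_bucket in buckets.values():
            # Pairs sharing a bucket in at least one band are candidates.
            candidates.update(combinations(cols_in_bucket, 2))
    return candidates

print(lsh_candidates(M, b=20, r=5))  # likely {(0, 1)} for the toy sets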
How to choose $b$ and $r$?
The choice balances false positives and false negatives: many bands of few rows produce more false positives, while few bands of many rows produce more false negatives.
We need to strike a trade-off between the number of bands $b$ and the number of rows $r$ per band.
As the signature length $b \cdot r$ increases, the method becomes more reliable, but the running time increases as well.
If $t$ is the Jaccard similarity of two columns, the probability that all $r$ rows in a band are equal is $t^r$, and the probability that some row in a band differs is $1 - t^r$.
As such, the probability that no band is identical is $(1 - t^r)^b$, which leads to a probability of $1 - (1 - t^r)^b$ that at least one band is identical, i.e., that the pair becomes a candidate.
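For concreteness, a worked instance with illustrative values $b = 20$ and $r = 5$: a pair with similarity $t = 0.8$ becomes a candidate with probability

$$1 - (1 - 0.8^5)^{20} \approx 1 - 0.672^{20} \approx 0.9996,$$

while a dissimilar pair with $t = 0.3$ does so with probability only $1 - (1 - 0.3^5)^{20} \approx 0.047$.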
Pros and Cons
👍Finds similar items in linear time
👍Easy and fast to implement
👍Generalizes to other similarity measures
👎Performance depends on the choice of $b$ and $r$
👎Requires a fairly large number of buckets