How to segment a sequence of points with an efficient algorithm, and how to find similar documents in linear time.
Two algorithms:
The first clusters a sequence into \(k\) contiguous partitions using dynamic programming.
The second is an approximation that allows us to compute pairwise distances while avoiding \(\mathcal O(n^2)\) comparisons. The idea is to use efficient signatures and hash the signatures in bands, so that similar items will likely end up in the same bucket.
Sequence Segmentation
A sequence is a vector \((x_1,\cdots, x_n)\in\mathbb R^n\).
Sequence segmentation: Given a sequence \((x_1,\cdots,x_n)\), find \(k\) contiguous segments that are as homogeneous as possible.
The sequence segmentation problem requires a notion of homogeneity similar to that of clustering. However, the points in each segment must be contiguous in the sequence, which makes the problem different from clustering.
A sequence is homogeneous if its points have a small deviation from the mean. Let \(\mathcal D\) be a sequence of \(n\) \(d\)-dimensional points.
Definition of \(k\)-segmentation: A \(k\)-segmentation \(S\) is a partition of \(\mathcal D\) into \(k\) contiguous segments \((s_1,\cdots,s_k)\). Each segment \(s_i\) is represented by a \(d\)-dimensional mean vector \(\mu_i=\frac{1}{|s_i|}\sum_{\mathbf x\in s_i}\mathbf x\).
The mean vector plays the same role as the centroid in \(k\)-means clustering. The error \(E(S)\) of a \(k\)-segmentation \(S\) is the error of replacing each individual point with the mean of its segment. As in \(k\)-means, one of the most common errors is the sum of squares error (SSE): $$ E(S)=\sum_{s\in S}\sum_{\mathbf x\in s}\|\mathbf x-\mu_s\|^2 $$
The \(k\)-segmentation problem
Given a sequence \(\mathcal D=(\mathbf x_1,\cdots,\mathbf x_n)\) of length \(n\), find a \(k\)-segmentation \(S\) of \(\mathcal D\) s.t. the SSE \(E(S)\) is minimized.
A Dynamic Programming Solution
A dynamic programming solution due to Bellman solves the \(k\)-segmentation problem in polynomial time.
The idea is to build the solution bottom-up, solving progressively larger instances of the problem.
DP requires the optimal-substructure property: an optimal solution to the full problem contains optimal solutions to its sub-problems, here the optimal segmentations of prefixes of the sequence with fewer segments.
Denote by \(D[1,i]\) the sub-sequence \(\mathbf x_1,\cdots,\mathbf x_i\) with \(i\le n\), and by \(E(i,s)\) the error of the optimal segmentation of \(D[1,i]\) into \(s\) segments, with \(s\le k\). The dynamic programming algorithm follows the recursion $$ E(i,s)=\min_{s-1\le j\le i-1}\left(E(j,s-1)+\sum_{j+1\le t\le i}\|\mathbf x_t-\mu_{[j+1,i]}\|^2\right) $$ where \(\mu_{[j+1,i]}\) is the mean of \(\mathbf x_{j+1},\cdots,\mathbf x_i\).
In practice, we can use a two-dimensional table that stores in position \((s,i)\) the value \(E(i,s)\).
Pseudo-code
Input: Sequence \(\mathcal D,k,E(\cdot)\)
- for each \(i\gets1\cdots n\) do
- \(A[1,i]=E(D[1\cdots i])\)
- for each \(s\gets1\cdots k\) do
- \(A[s,s]=0\)
- for each \(s\gets2\cdots k\) do
- for each \(i\gets s+1,\cdots,n\) do
- \(A[s,i]=\min_{j\lt i}(A[s-1,j]+E(D[j+1\cdots i]))\)
Warning
To recover the actual segmentation, the matrix \(A\) should also store the values \(j\) that minimize the error, so that the segment boundaries can be reconstructed by backtracking.
Complexity
The algorithm fills a table of \(\mathcal O(nk)\) cells; each cell takes a minimum over \(\mathcal O(n)\) candidate split points, and each candidate requires computing a segment error in \(\mathcal O(n)\). The naive version therefore has total complexity \(\mathcal O(n^3k)\).
The error computation per cell can be further optimized by noticing that $$ \sum_{j+1\le t\le i}\|\mathbf x_t-\mu_{[j+1,i]}\|^2=\sum_{j+1\le t\le i}\|\mathbf x_t\|^2-\frac{1}{i-j}\Big\|\sum_{j+1\le t\le i}\mathbf x_t\Big\|^2 $$ which only requires the sums of \(\|\mathbf x_t\|^2\) and of \(\mathbf x_t\). These two terms can be easily pre-computed as prefix sums for all \(i=1\cdots n\), so each segment error costs constant time. With this optimization, the complexity is reduced to \(\mathcal O(n^2k)\).
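As an illustration, here is a minimal Python sketch of the \(\mathcal O(n^2k)\) dynamic program with the prefix-sum trick, assuming one-dimensional points for brevity; the function and variable names are only illustrative.

```python
import numpy as np

def k_segmentation(x, k):
    """O(n^2 k) DP for k-segmentation of a 1-d sequence x under SSE.
    Returns (optimal SSE, list of segments as half-open index pairs)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums of x and x^2 so any segment error costs O(1).
    ps = np.concatenate(([0.0], np.cumsum(x)))
    ps2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

    def seg_err(j, i):
        # SSE of x[j:i]: sum of squares minus (sum)^2 / length
        s, s2, m = ps[i] - ps[j], ps2[i] - ps2[j], i - j
        return s2 - s * s / m

    INF = float("inf")
    E = [[INF] * (n + 1) for _ in range(k + 1)]   # E[s][i]: best error of first i points, s segments
    back = [[0] * (n + 1) for _ in range(k + 1)]  # split point achieving E[s][i]
    E[0][0] = 0.0
    for s in range(1, k + 1):
        for i in range(s, n + 1):
            for j in range(s - 1, i):             # last segment is x[j:i]
                cand = E[s - 1][j] + seg_err(j, i)
                if cand < E[s][i]:
                    E[s][i], back[s][i] = cand, j
    # Backtrack the segment boundaries.
    segs, i = [], n
    for s in range(k, 0, -1):
        segs.append((back[s][i], i))
        i = back[s][i]
    return float(E[k][n]), segs[::-1]

print(k_segmentation([1, 1, 1, 9, 9, 9], k=2))  # -> (0.0, [(0, 3), (3, 6)])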
Heuristics: we can further reduce the computational burden with greedy approaches or local search, which assign the segment borders (e.g., randomly at first) and then move them so as to reduce the error. The total time is \(\mathcal O(nk)\).
Finding Similar Points Efficiently
Comparing each point in the data with every other point requires \(\mathcal O(n^2)\) distance computations.
Given a dataset \(\mathcal D=\{\mathbf x_1,\cdots,\mathbf x_n\}\) of \(d\)-dimensional data points, find all pairs of data points \((\mathbf x_i,\mathbf x_j)\) s.t. \(d(\mathbf x_i,\mathbf x_j)\le s\) for some distance function \(d\).
With Locality-Sensitive Hashing (LSH), we can avoid the \(\mathcal O(n^2)\) pairwise comparisons.
Wasn't LSH just covered in this morning's randomized algorithms class... it all comes full circle now.
Definition of Jaccard similarity: the Jaccard similarity of two sets is the size of their intersection divided by the size of their union, \(sim(C_1,C_2)=\frac{|C_1\cap C_2|}{|C_1\cup C_2|}\). The Jaccard distance is then \(d(C_1,C_2)=1-\frac{|C_1\cap C_2|}{|C_1\cup C_2|}\).
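For instance, with \(C_1=\{a,b,c\}\) and \(C_2=\{b,c,d\}\): \(sim(C_1,C_2)=\frac{|\{b,c\}|}{|\{a,b,c,d\}|}=\frac{2}{4}=0.5\), so \(d(C_1,C_2)=0.5\).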
A large number of documents does not fit into memory, so we need a representation of each document that preserves distances while considerably reducing the memory footprint. To achieve this goal, we use three steps:
- Shingling: Convert documents to sets.
- Min-hashing: Convert large sets to short signatures, while preserving similarity.
- LSH: Focus on pairs of signatures likely to be from similar documents.
Shingling
A \(k\)-shingle is a sequence of \(k\) consecutive tokens that appears in the document.
Idea: represent each document as the set of its \(k\)-shingles; this yields very high-dimensional, sparse vectors, and unrelated documents share few shingles, hence have low similarity.
For instance, the set of \(2\)-shingles for the document \(D_1=abcab\) is \(S(D_1)=\{ab,bc,ca\}\).
Long shingles can be easily compressed by hashing, and documents are then compared through the hash values. Finally, each document is a binary vector in the space of \(k\)-shingles, in which each shingle is a dimension.
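A minimal Python sketch of character-level shingling (with optional hashing of the shingles) and of the Jaccard similarity; all helper names are illustrative.

```python
def shingles(doc: str, k: int = 2) -> set:
    """Set of character-level k-shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hashed_shingles(doc: str, k: int = 9, buckets: int = 2 ** 32) -> set:
    """Compress long shingles by hashing them into a fixed range."""
    return {hash(s) % buckets for s in shingles(doc, k)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |a & b| / |a | b|."""
    return len(a & b) / len(a | b)

print(shingles("abcab", 2))                                  # {'ab', 'bc', 'ca'} (order may vary)
print(jaccard(shingles("abcab", 2), shingles("abcabc", 2)))  # 1.0: same shingle set
```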
Warning
We assume that similar documents have a large number of common shingles. However, \(k\) is a critical choice when comparing documents: a small \(k\) makes many unrelated documents look similar. As a rule of thumb, \(k=5\) for short documents and \(k=10\) for long documents is a reasonable choice.
Shingling alone does not solve the problem of the large number of similarity computations.
Min-hash
The document is converted into a set of shingles and consequently into a binary vector, where each dimension is a shingle and a \(1\) in position \(i\) of document \(j\) indicates that the document contains shingle \(i\). This way we can represent our dataset as a matrix where each row corresponds to a shingle and each column to a document.
Computing the Jaccard similarity of two columns in such a matrix relies on simple bit-wise AND (for the intersection) and OR (for the union).
Min-hashing is used for faster comparison: it converts the binary representation into small signatures through hashing. The key idea is to hash each column \(C\) to a small signature \(h(C)\) s.t.
- \(h(C)\) is small enough that the signature fits in RAM
- \(sim(C_1,C_2)\) is the same as the similarity of signatures \(h(C_1),h(C_2)\).
Definition of min-hash function: given a random permutation \(\pi\) of the rows, the hash function \(h_\pi(C)\) returns the index (under \(\pi\)) of the first row in which column \(C\) has value \(1\): $$ h_\pi(C)=\min_{r:\,C(r)=1}\pi(r) $$ The min-hash value is a one-number approximation of the set.
Min-hash preserves the property that \(\Pr[h_\pi(C_1)=h_\pi(C_2)]=sim(C_1,C_2)\): under a random permutation, the first row (in \(\pi\)-order) with a \(1\) in either column is equally likely to be any row of \(C_1\cup C_2\), and the two min-hashes agree exactly when that row belongs to \(C_1\cap C_2\).
A trick
For each column \(C\) and each hash function \(k_i\), keep a slot for the min-hash value. Initialize \(Sig(C)[i]=\infty\), then scan the rows looking for \(1\)s: when row \(r\) has a \(1\) in column \(C\), set \(Sig(C)[i]=\min(Sig(C)[i],k_i(r))\). How to choose the \(k_i\)? Using universal hashing, which avoids storing explicit permutations.
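A minimal Python sketch of this signature computation, using random hash functions \(k_i(r)=(a_ir+b_i)\bmod p\) in place of explicit row permutations; all parameter names are illustrative.

```python
import random

def minhash_signatures(column_sets, num_hashes=100, prime=4294967311):
    """column_sets: one set of row indices (rows holding a 1) per column/document.
    Returns a num_hashes x num_columns signature matrix."""
    # Universal hash family k_i(r) = (a_i * r + b_i) mod prime (a prime > 2^32)
    # stands in for random permutations of the rows.
    hashes = [(random.randrange(1, prime), random.randrange(0, prime))
              for _ in range(num_hashes)]
    sig = [[float("inf")] * len(column_sets) for _ in range(num_hashes)]
    for c, rows in enumerate(column_sets):
        for r in rows:                          # scan only rows with a 1 in column c
            for i, (a, b) in enumerate(hashes):
                h = (a * r + b) % prime
                if h < sig[i][c]:               # keep the minimum hash value seen
                    sig[i][c] = h
    return sig

def estimated_similarity(sig, c1, c2):
    """Fraction of hash functions on which two columns agree: estimates the Jaccard similarity."""
    return sum(row[c1] == row[c2] for row in sig) / len(sig)
```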
Again something we already learned earlier in the randomized algorithms course.
LSH
LSH tackles the comparison problem by hashing the signatures in an appropriate manner.
Intuition: use a function \(f(x,y)\) to decide whether \(x\) and \(y\) are a candidate pair whose similarity must be evaluated. For min-hash matrices, we hash the columns of the signature matrix \(M\) to different buckets: if two documents hash into the same bucket, they become a candidate pair, which we expect to have similarity \(sim\ge threshold\).
How LSH works:
- Pick a similarity threshold \(s\).
- Columns \(x\) and \(y\) of \(M\) are a candidate pair if their signatures agree on at least a fraction \(s\) of their rows, i.e. \(M(i,x)=M(i,y)\) for at least a fraction \(s\) of the rows \(i\).
- The idea is to hash the columns of the signature matrix \(M\) several times, so that similar columns hash to the same bucket with high probability.
LSH partitions the rows of \(M\) into \(b\) bands of \(r\) rows each. For each band, compute a signature for every column using a hash function with \(k\) buckets, where \(k\) is arbitrarily large. The candidate column pairs are those that hash to the same bucket in at least \(1\) band. Tune \(b\) and \(r\) to catch most similar pairs and few non-similar pairs.
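A minimal Python sketch of this banding step on top of a signature matrix such as the one computed above; using the band's tuple of values directly as the bucket key corresponds to an arbitrarily large number of buckets (no accidental collisions), and the names are illustrative.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(sig, b, r):
    """Split the rows of the signature matrix into b bands of r rows; columns
    that share a bucket in at least one band become candidate pairs."""
    assert len(sig) == b * r, "number of signature rows must equal b * r"
    n_cols = len(sig[0])
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for c in range(n_cols):
            # Band signature: the column's r values in this band.
            key = tuple(sig[band * r + i][c] for i in range(r))
            buckets[key].append(c)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates
```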
How to choose \(b\) and \(r\)?
Balancing false positives against false negatives.
We need to strike a trade-off between the number of bands and the number of rows.
As \(b\) increases, the method becomes more reliable (fewer false negatives), but the running time increases as well.
If \(t=sim(C_1,C_2)\), the probability that all rows in a band are equal is \(t^r\), and the probability that some row in a band differs is \(1-t^r\).
As such, the probability that no band is identical is \((1-t^r)^b\), which gives \(1-(1-t^r)^b\) as the probability that at least \(1\) band is identical, i.e., that the pair becomes a candidate.
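For example, with \(b=20\) bands of \(r=5\) rows each, a pair with \(t=0.8\) becomes a candidate with probability \(1-(1-0.8^5)^{20}\approx 0.9996\), while a pair with \(t=0.3\) only becomes a (false-positive) candidate with probability \(1-(1-0.3^5)^{20}\approx 0.047\).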
Pros and Cons
Success
Finds similar items in linear time
Easy and fast to implement
Generalizes to other similarity measures
Failure
Performance depends on the choice of \(b,r\)
Requires a fairly large number of buckets