This lecture: different approaches used to speed up the JL transform.
Recall: the classic JL transform works by drawing a random matrix $A$ with coordinates i.i.d. either $N(0,1)$ or uniform among $\{-1,+1\}$, and then embedding a vector $x$ as $m^{-1/2}Ax$. One needs $m = O(\epsilon^{-2}\log n)$ to preserve all pairwise distances to within $(1\pm\epsilon)$, and the embedding time is $O(dm)$. The target dimensionality is known to be optimal by a result of Larsen and Nelson.
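As a reference point, here is a minimal numpy sketch of this classic construction; the function name `classic_jl`, the constant $8$ in the choice of $m$, and the use of random signs in place of Gaussians are illustrative choices.

```python
import numpy as np

def classic_jl(X, eps, seed=0):
    """Dense JL sketch: embed each row x of X (n points in R^d) as m^{-1/2} A x.

    A has i.i.d. uniform +/-1 entries (i.i.d. N(0,1) entries work equally well).
    The constant 8 in m = O(eps^-2 log n) is illustrative, not the tight one.
    """
    n, d = X.shape
    m = int(np.ceil(8 * np.log(n) / eps**2))   # target dimension m = O(eps^-2 log n)
    rng = np.random.default_rng(seed)
    A = rng.choice([-1.0, 1.0], size=(m, d))   # dense random sign matrix
    return (X @ A.T) / np.sqrt(m)              # each row x is mapped to m^{-1/2} A x
```

Embedding all $n$ points this way costs $O(ndm)$ time, which is what the approaches in this lecture improve on.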
Sparse Embeddings
Using an embedding matrix A that is sparse.
Sparsity
A sparse matrix is one with few non-zeroes.
If each column of A has only t non-zeroes, then the embedding time improves to O(dt).
Example of sparse data: the bag-of-words representation, where a document is embedded as a vector with one coordinate per word.
With this vector representation, documents with similar words map to points that are close. Such vectors are typically very sparse.
The sparsity of a vector $x$: $\|x\|_0 =$ the number of non-zeroes in $x$.
The embedding time of classic JL is $O(m\|x\|_0)$, and with a matrix having $t$-sparse columns it is $O(t\|x\|_0)$.
Kane and Nelson showed that the following construction with high probability preserves all pairwise distances for a set of $n$ points:
For each column of the matrix $A \in \mathbb{R}^{m \times d}$, pick a uniform random set of $t = O(\epsilon^{-1}\log n)$ rows and assign the corresponding entries either $-1$ or $+1$ uniformly at random and independently.
Then embed $x$ as $t^{-1/2}Ax$. Hence the embedding time improves from $O(\|x\|_0\epsilon^{-2}\log n)$ for standard JL to $O(\|x\|_0\epsilon^{-1}\log n)$, an $\epsilon^{-1}$-factor improvement.
The target dimensionality $m$ remains the optimal $O(\epsilon^{-2}\log n)$.
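A minimal numpy sketch of this sparse construction; the function names, the explicit row/sign tables, and the choice to draw the matrix up front (so it can be reused for all $n$ points) are our own illustrative choices, and the constants behind $m$ and $t$ are not tuned here.

```python
import numpy as np

def build_sparse_jl(d, m, t, seed=0):
    """Draw the sparse sign matrix once, so it can be reused for all points.

    For each of the d columns, pick a uniformly random set of t rows and
    independent uniform +/-1 signs for the corresponding entries.
    """
    rng = np.random.default_rng(seed)
    rows = np.stack([rng.choice(m, size=t, replace=False) for _ in range(d)])  # (d, t)
    signs = rng.choice([-1.0, 1.0], size=(d, t))                               # (d, t)
    return rows, signs

def sparse_jl_embed(x, m, rows, signs):
    """Embed x as t^{-1/2} A x; only nonzeros of x are touched, i.e. O(t ||x||_0) time."""
    t = rows.shape[1]
    y = np.zeros(m)
    for j in np.flatnonzero(x):        # skip zero coordinates entirely
        y[rows[j]] += signs[j] * x[j]  # add the j-th (t-sparse) column, scaled by x_j
    return y / np.sqrt(t)
```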
The expected squared length of the embedded vector equals that of the original vector:
\[
\mathbb{E}\big[\|t^{-1/2}Ax\|_2^2\big]
= \frac{1}{t}\sum_{i=1}^{m}\mathbb{E}\Big[\Big(\sum_{j=1}^{d} a_{i,j}x_j\Big)^2\Big]
= \frac{1}{t}\sum_{i=1}^{m}\sum_{j=1}^{d}\sum_{h=1}^{d}\mathbb{E}[a_{i,j}a_{i,h}]\,x_j x_h
= \frac{1}{t}\sum_{i=1}^{m}\sum_{j=1}^{d}\mathbb{E}[a_{i,j}^2]\,x_j^2
= \frac{1}{t}\sum_{i=1}^{m}\sum_{j=1}^{d}\frac{t}{m}\,x_j^2
= \|x\|_2^2.
\]
In the above, we use $\mathbb{E}[a_{i,j}a_{i,h}] = \mathbb{E}[a_{i,j}]\mathbb{E}[a_{i,h}] = 0$ for $j \neq h$, which holds since distinct columns are independent and $a_{i,j}$ is symmetric around $0$.
Moreover, we use $\mathbb{E}[a_{i,j}^2] = t/m$, since $a_{i,j}$ is non-zero (taking the value $-1$ or $+1$) with probability $t/m$, and in both cases $a_{i,j}^2 = 1$.
Nelson and Nguyen proved a lower bound: any sparse embedding matrix $A$ must have $t = \Omega(\epsilon^{-1}\log n/\log(1/\epsilon))$, i.e., the upper bound can be improved by at most a $\log(1/\epsilon)$ factor.
Settling the largest possible sparsity remains a fundamental open problem in dimensionality reduction.
Feature Hashing
Another line of work tries to improve the embedding time further by making assumptions on the input data.
Take bag of words as an example. If one removes very frequently occurring words, then it seems reasonable to assume that the vector representing a document has no large coordinates. More formally, the ratio $\|x\|_\infty/\|x\|_2$ is small.
Under such assumptions, it’s possible to speed up the sparse transforms from above.
Feature hashing by Weinberger et al. takes this to the extreme by letting $A$ have exactly one non-zero per column, placed in a uniformly random row and with a value that is uniform among $-1$ and $+1$. So feature hashing is really the construction of Kane and Nelson with $t = 1$. It therefore follows from the calculations above that the expected squared length of an embedded vector is correct. Feature hashing clearly has the fastest possible embedding time of just $O(\|x\|_0)$.
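A minimal numpy sketch; it uses explicit random tables in place of the two hash functions a streaming implementation would use, and the function name and parameters are illustrative.

```python
import numpy as np

def feature_hash(x, m, seed=0):
    """Feature hashing: each column of A has exactly one nonzero, a random +/-1
    placed in a uniformly random row, so t = 1 and no rescaling is needed.

    Embedding time is O(||x||_0).
    """
    d = x.shape[0]
    rng = np.random.default_rng(seed)
    row = rng.integers(0, m, size=d)        # the single nonzero row of each column
    sign = rng.choice([-1.0, 1.0], size=d)  # its random sign
    y = np.zeros(m)
    for j in np.flatnonzero(x):
        y[row[j]] += sign[j] * x[j]
    return y
```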
Freksen, Kamma and Larsen showed that Feature Hashing into the optimal $m = O(\epsilon^{-2}\log n)$ dimensions has the JL guarantee (preserving all distances to within $(1\pm\epsilon)$) exactly when the ratio $\|x\|_\infty/\|x\|_2$ of the input vectors is sufficiently small; their result determines the exact threshold as a function of $\epsilon$ and $m$.
Instead of using sparse matrices for embedding, another approach is to use matrices for which there are efficient algorithms to compute the product Ax.
The Fast JL transform by Ailon and Chazelle.
The first observation is similar to what is used in Feature Hashing: if data vectors have only small coordinates ($\|x\|_\infty/\|x\|_2$ is small), then one can use very sparse matrices for the embedding. The main idea is to first multiply $x$ with a matrix which ensures that its coordinates become small, and then apply a sparse matrix afterwards.
Walsh-Hadamard Matrix
Assume without loss of generality that $d$ is a power of $2$. Let $\bar{H}_d$ be the $d \times d$ Walsh-Hadamard matrix.
$\bar{H}_d$ can be defined recursively as follows:
The $2 \times 2$ Walsh-Hadamard matrix $\bar{H}_2$ has $\bar{h}_{1,1} = \bar{h}_{1,2} = \bar{h}_{2,1} = 1$ and $\bar{h}_{2,2} = -1$. Clearly its two rows are orthogonal.
The $2d \times 2d$ Walsh-Hadamard matrix $\bar{H}_{2d}$ consists of four copies of the $d \times d$ Walsh-Hadamard matrix $\bar{H}_d$, one in each of the four quadrants, with the copy in the bottom-right quadrant negated.
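In block-matrix form, the recursion reads:
\[
\bar{H}_2 = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
\bar{H}_{2d} = \begin{pmatrix} \bar{H}_d & \bar{H}_d \\ \bar{H}_d & -\bar{H}_d \end{pmatrix}.
\]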
For two distinct rows in the top half of the matrix, the orthogonality follows by induction.
Similarly for two rows in the bottom half. For one row from the top half of the matrix and one from the bottom half, orthogonality follows since the top row has the form $(v \circ v)$ and the bottom row the form $(w \circ -w)$, where $\circ$ denotes concatenation.
Hence the inner product is $\langle v, w\rangle - \langle v, w\rangle = 0$.
The crucial property of WH matrices is that one can compute the matrix-vector product $\bar{H}_d x$ efficiently.
Write $x$ as $\binom{x_1}{x_2}$ where $x_1$ is the first $d/2$ coordinates of $x$ and $x_2$ is the last $d/2$ coordinates. Then
\[
\bar{H}_d x = \begin{pmatrix} \bar{H}_{d/2}\,x_1 + \bar{H}_{d/2}\,x_2 \\ \bar{H}_{d/2}\,x_1 - \bar{H}_{d/2}\,x_2 \end{pmatrix}.
\]
Thus if we compute $\bar{H}_{d/2}x_1$ and $\bar{H}_{d/2}x_2$ recursively, we can compute $\bar{H}_d x$ in $O(d)$ additional time. The total time to compute $\bar{H}_d x$ thus satisfies the recurrence $T(d) = 2T(d/2) + O(d)$, which by the Master Theorem resolves to $O(d\log d)$.
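A direct recursive implementation of this recurrence (a sketch; the function name `fwht` is ours, and the length of the input is assumed to be a power of $2$):

```python
import numpy as np

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform: returns H_bar_d x in O(d log d) time.

    Implements the recurrence
        H_bar_d x = (H_bar_{d/2} x1 + H_bar_{d/2} x2, H_bar_{d/2} x1 - H_bar_{d/2} x2),
    assuming len(x) is a power of 2.  Dividing the result by sqrt(d) gives H x.
    """
    d = x.shape[0]
    if d == 1:
        return x.astype(float)
    x1, x2 = x[: d // 2], x[d // 2 :]
    y1, y2 = fwht(x1), fwht(x2)
    return np.concatenate([y1 + y2, y1 - y2])
```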
Let $H$ be the normalized WH matrix, i.e., $H = d^{-1/2}\bar{H}_d$. Then all rows are still orthogonal and every row has norm $1$. Hence $H$ is an orthogonal matrix and $\|Hx\|_2^2 = \|x\|_2^2$ for all vectors $x \in \mathbb{R}^d$.
That is, $H$ preserves all norms. The construction of Ailon and Chazelle is then as follows:
Draw a random diagonal matrix $D \in \mathbb{R}^{d \times d}$ where each diagonal entry is uniform among $-1$ and $+1$.
Draw a matrix $P \in \mathbb{R}^{m \times d}$ with $m = O(\epsilon^{-2}\log n)$ such that each entry independently is set to $0$ with probability $1-q$ and otherwise is $N(0,(mq)^{-1})$ distributed. The expected number of non-zeroes of $P$ is $qmd$.
The embedding of $x$ is then $PHDx$.
Computing $Dx$ takes $O(d)$ time. Multiplying the result with $H$ takes $O(d\log d)$ time, and multiplying with $P$ takes expected $O(qmd)$ time.
Ailon and Chazelle showed that the construction achieves the JL guarantee if one sets $q = O(\log^2 n/d)$, resulting in an embedding time of $O(d\log d + m\log^2 n) = O(d\log d + \epsilon^{-2}\log^3 n)$.
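A minimal numpy sketch of one draw of the map $x \mapsto PHDx$, assuming $d$ is a power of $2$ and leaving $m$ and $q$ as caller-chosen parameters; the function name `fjlt` and the iterative butterfly used for $H$ are our own choices.

```python
import numpy as np

def fjlt(x, m, q, seed=0):
    """One draw of the embedding P H D x (a sketch, not a tuned implementation)."""
    x = np.asarray(x, dtype=float)
    d = x.shape[0]
    rng = np.random.default_rng(seed)

    # D: multiply each coordinate by an independent uniform +/-1 sign.
    y = rng.choice([-1.0, 1.0], size=d) * x

    # H: normalized Walsh-Hadamard transform in O(d log d) time via the
    # standard iterative butterfly (equivalent to the recursion above).
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[i : i + h].copy()
            b = y[i + h : i + 2 * h].copy()
            y[i : i + h] = a + b
            y[i + h : i + 2 * h] = a - b
        h *= 2
    y /= np.sqrt(d)

    # P: m x d matrix, each entry 0 with probability 1 - q and N(0, 1/(mq)) otherwise.
    mask = rng.random((m, d)) < q
    P = np.where(mask, rng.normal(0.0, 1.0 / np.sqrt(m * q), size=(m, d)), 0.0)
    return P @ y
```

For simplicity the sketch materializes $P$ densely; a real implementation would only generate and store the roughly $qmd$ non-zero entries.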
Proving that the construction satisfies the JL guarantee consists of two steps.
In the first step, we need to show that $\|HDx\|_\infty = O(\sqrt{\log(nd)/d})$ with probability $1 - 1/n^3$ for any unit vector $x \in \mathbb{R}^d$. That is, $HD$ ensures that all coordinates of the resulting vector are small.
In the second step, one shows that, assuming $\|HDx\|_\infty = O(\sqrt{\log(nd)/d})$, the application of $P$ preserves norms to within $(1\pm\epsilon)$ with good probability. Here we remark that $HD$ preserves the norm of all vectors perfectly, but the dimension is still $d$. (We only do the first step here.)
Consider the $i$-th coordinate of $HDx$. Each entry of $H$ is either $-d^{-1/2}$ or $d^{-1/2}$, and $D$ multiplies a random sign onto each coordinate of $x$. Hence the $i$-th coordinate is distributed as $\sum_j \sigma_j d^{-1/2}x_j$ where the $\sigma_j$'s are independent and uniform among $-1$ and $1$. Clearly $\mathbb{E}[(HDx)_i] = 0$ by linearity of expectation. We first recall Hoeffding's inequality.
Hoeffding's Inequality. Let $X_1,\dots,X_d$ be independent random variables where $X_i$ takes values in $[a_i, b_i]$. Let $X = \sum_i X_i$. Then
\[
\Pr\big[|X - \mathbb{E}[X]| > t\big] < 2\exp\left(-\frac{2t^2}{\sum_{i=1}^{d}(b_i - a_i)^2}\right).
\]
For the sum $\sum_j \sigma_j d^{-1/2}x_j$, the random variable $X_j = \sigma_j d^{-1/2}x_j$ takes values in the interval $[-d^{-1/2}|x_j|, d^{-1/2}|x_j|]$, and thus $(b_j - a_j)^2 = (2d^{-1/2}|x_j|)^2 = 4d^{-1}x_j^2$. Since $x$ is a unit vector, $\sum_j 4d^{-1}x_j^2 = 4/d$, and Hoeffding's inequality gives
\[
\Pr\big[|(HDx)_i| > t\big] < 2\exp\left(-\frac{2t^2}{4/d}\right) = 2\exp\left(-\frac{t^2 d}{2}\right).
\]
For $t = C\sqrt{\ln(nd)/d}$ for a big enough constant $C$, this probability is less than $1/(nd)^3$. A union bound over all $d$ coordinates shows that $\|HDx\|_\infty = O(\sqrt{\ln(nd)/d})$ with probability at least $1 - 1/n^3$.
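A quick numerical illustration of this flattening effect, with illustrative parameters ($d = 1024$ and a spiky unit vector of our choosing); the Walsh-Hadamard matrix is built explicitly via Kronecker products, so this is only meant for small $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 10
d = 2 ** k                                       # d = 1024
H = np.array([[1.0]])
for _ in range(k):                               # build the WH matrix via its recursion
    H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
H /= np.sqrt(d)                                  # normalize so that H is orthogonal

x = np.zeros(d)
x[:4] = 0.5                                      # unit vector with ||x||_inf = 0.5
D = rng.choice([-1.0, 1.0], size=d)              # random diagonal signs
y = H @ (D * x)

print(np.abs(y).max(), np.sqrt(np.log(d) / d))   # max coordinate of HDx vs. the scale sqrt(log d / d)
```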
We conclude by showing $\mathbb{E}[\|Px\|_2^2] = \|x\|_2^2$ for all vectors $x$, i.e., $P$ preserves norms in expectation. Each entry of $P$ is distributed as the product of a Bernoulli variable $b_{i,j}$ taking value $1$ with probability $q$ and $0$ otherwise, and a variable $n_{i,j} \sim N(0,(mq)^{-1})$. We thus get:
\[
\mathbb{E}\big[\|Px\|_2^2\big]
= \sum_{i=1}^{m}\mathbb{E}\Big[\Big(\sum_{j=1}^{d} b_{i,j}n_{i,j}x_j\Big)^2\Big]
= \sum_{i=1}^{m}\sum_{j=1}^{d}\sum_{h=1}^{d}\mathbb{E}[b_{i,j}n_{i,j}b_{i,h}n_{i,h}]\,x_j x_h
= \sum_{i=1}^{m}\sum_{j=1}^{d}\mathbb{E}[b_{i,j}^2]\,\mathbb{E}[n_{i,j}^2]\,x_j^2
= \sum_{i=1}^{m}\sum_{j=1}^{d} q\cdot\frac{1}{mq}\,x_j^2
= \|x\|_2^2.
\]
In the above, we use $\mathbb{E}[n_{i,j}] = 0$ and $\mathbb{E}[b_{i,j}^2] = \mathbb{E}[b_{i,j}] = q$. We also use that for $n_{i,j} \sim N(0,\sigma^2)$, we have $\mathbb{E}[n_{i,j}^2] = \sigma^2$. Finally, we use independence of the entries of $P$ to split the expectation of a product into the product of expectations.
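A small Monte Carlo sanity check of this identity, with illustrative parameters $m$, $d$, $q$ and trial count:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, q, trials = 64, 256, 0.1, 2000
x = rng.normal(size=d)                            # an arbitrary test vector

est = 0.0
for _ in range(trials):
    mask = rng.random((m, d)) < q                 # Bernoulli(q) sparsity pattern
    P = np.where(mask, rng.normal(0.0, 1.0 / np.sqrt(m * q), size=(m, d)), 0.0)
    est += np.sum((P @ x) ** 2) / trials

print(est, np.sum(x ** 2))                        # the two should agree up to Monte Carlo error
```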