Lecture notes by Kasper.

This lecture: different approaches used to speed up the JL transform.

Recall: the classic JL transform works by drawing a random matrix $A$ with coordinates i.i.d. either $\mathcal N(0,1)$ or uniform random among $\{-1,+1\}$ and then embedding a vector $x$ to $m^{-1/2}Ax$. Here $m=O(\epsilon^{-2}\log n)$ suffices if one wants to preserve all pairwise distances to within $(1\pm\epsilon)$, and the embedding time is $O(dm)$. The target dimensionality is known to be optimal via a result by Larsen and Nelson.
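
As a point of reference, here is a minimal NumPy sketch of the classic construction; the function name and parameters are illustrative, not from the notes.

```python
import numpy as np

def classic_jl(x, m, seed=0):
    """Embed x in R^d into R^m as m^{-1/2} A x, with A having i.i.d. N(0,1) entries."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    A = rng.standard_normal((m, d))    # i.i.d. N(0,1); uniform +-1 entries also work
    return (A @ x) / np.sqrt(m)        # costs O(dm) time

# The squared norm is preserved up to a (1 +- eps) factor with high probability.
x = np.random.default_rng(1).standard_normal(1000)
y = classic_jl(x, m=500)
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # typically close to 1
```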

Sparse Embeddings

Using an embedding matrix $A$ that is sparse.

Sparsity

A sparse matrix is one with few non-zeroes.

If each column of $A$ has only $t$ non-zeroes, then the embedding time improves to $O(dt)$.

Example of sparse data: bag of words, where a document is embedded as a vector with one coordinate per word.

With this vector representation, documents with similar words map to points that are close. These vectors are typically very sparse.

The sparsity of a vector $x$: $||x||_0=$ the number of non-zeroes in $x$.

The embedding time of classic JL is $O(m||x||_0)$ and, with a matrix having $t$-sparse columns, it is $O(t||x||_0)$.

Kane and Nelson showed that the following construction with high probability preserves all pairwise distances for a set of $n$ points (a code sketch follows the list):

  • For each column of the matrix $A\in\mathbb R^{m\times d}$, pick a uniform random set of $t=O(\epsilon^{-1}\log n)$ rows and assign the corresponding entries either $-1$ or $+1$ uniformly at random and independently.
  • Then embed $x$ as $t^{-1/2}Ax$. Hence the embedding time compared to standard JL went from $O(||x||_0\epsilon^{-2}\log n)$ to $O(||x||_0\epsilon^{-1}\log n)$, an $\epsilon^{-1}$ improvement.
  • The target dimensionality $m$ remains the optimal $O(\epsilon^{-2}\log n)$.
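
A minimal NumPy sketch of this sparse construction; the function names and the way the $t$ rows are sampled per column are illustrative choices, not prescribed by the notes.

```python
import numpy as np

def sparse_jl_matrix(d, m, t, seed=0):
    """Build A in R^{m x d} with exactly t non-zeroes per column, each uniform in {-1,+1}."""
    rng = np.random.default_rng(seed)
    A = np.zeros((m, d))
    for j in range(d):
        rows = rng.choice(m, size=t, replace=False)    # uniform random set of t rows
        A[rows, j] = rng.choice([-1.0, 1.0], size=t)   # independent random signs
    return A

def sparse_jl_embed(A, x, t):
    """Embed x as t^{-1/2} A x. (Dense matmul here for simplicity; with a sparse
    representation of A the cost is O(t * ||x||_0).)"""
    return (A @ x) / np.sqrt(t)

# Example usage: the squared norm is preserved in expectation (see the calculation below).
d, m, t = 2000, 400, 20
x = np.random.default_rng(1).standard_normal(d)
A = sparse_jl_matrix(d, m, t)
y = sparse_jl_embed(A, x, t)
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # typically close to 1
```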

The expected squared length of the embedded vector is the same as that of the original vector:

$$
\begin{aligned}
\mathbb E[||t^{-1/2}Ax||_2^2]&=t^{-1}\mathbb E[||Ax||_2^2]\\
&=t^{-1}\sum_{i=1}^m\mathbb E\Big[\Big(\sum_{j=1}^da_{i,j}x_j\Big)^2\Big]\\
&=t^{-1}\sum_{i=1}^m\Big(\sum_{j=1}^d\mathbb E[a_{i,j}^2x_j^2]+\sum_{j=1}^d\sum_{h\neq j}\mathbb E[a_{i,j}a_{i,h}x_jx_h]\Big)\\
&=t^{-1}\sum_{i=1}^m\sum_{j=1}^d\mathbb E[a_{i,j}^2]x_j^2\\
&=t^{-1}\sum_{i=1}^m\sum_{j=1}^d(t/m)x_j^2\\
&=\sum_{j=1}^dx_j^2\\
&=||x||_2^2.
\end{aligned}
$$

In the above, we use $\mathbb E[a_{i,j}a_{i,h}]=\mathbb E[a_{i,j}]\mathbb E[a_{i,h}]=0$, which holds by independence of distinct columns and the fact that $a_{i,j}$ is symmetric around $0$.

Moreover, we use $\mathbb E[a_{i,j}^2]=t/m$ since $a_{i,j}$ takes either the value $-1$ or $+1$ with probability $t/m$ and in both cases, $a_{i,j}^2$ takes the value $1$.

Nelson and Nguyen proved a lower bound: any sparse embedding matrix $A$ must have $t=\Omega(\epsilon^{-1}\log n/\log(1/\epsilon))$, i.e., the upper bound can be improved by at most a $\log(1/\epsilon)$ factor.

Settling the largest possible sparsity remains a fundamental open problem in dimensionality reduction.

Feature Hashing

Another line of work tries to improve the embedding time further by making assumptions on the input data.

Take bag of words as an example. If one removes very frequently occurring words, then it seems reasonable to assume that the vector representing a document has no large coordinates. More formally, the ratio between $||x||_\infty$ and $||x||_2$ is small.

Under such assumptions, it’s possible to speed up the sparse transforms from above.

Feature hashing by Weinberger et al. takes this to the extreme by letting $A$ have exactly one non-zero per column, chosen at a uniform random row and with a value that is uniform among $-1$ and $+1$. So feature hashing is really the construction of Kane and Nelson with $t=1$. Therefore it follows from the above calculations that the expected squared length of an embedded vector is correct. Feature hashing clearly has the fastest possible embedding time of just $O(||x||_0)$.
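
A minimal sketch of feature hashing that never materializes the matrix: each coordinate is sent to a uniform random row with a uniform random sign (the function name and parameters are illustrative).

```python
import numpy as np

def feature_hash(x, m, seed=0):
    """Feature hashing (t = 1): bucket each coordinate of x into a uniform random
    row with a uniform random sign. Takes time proportional to the number of
    coordinates processed, i.e. O(||x||_0) if only the non-zeroes are touched."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    rows = rng.integers(0, m, size=d)          # the single non-zero row of each column
    signs = rng.choice([-1.0, 1.0], size=d)    # the +-1 value of that non-zero
    y = np.zeros(m)
    np.add.at(y, rows, signs * x)              # y[rows[j]] += signs[j] * x[j]
    return y

x = np.random.default_rng(1).standard_normal(5000)
y = feature_hash(x, m=1000)
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # close to 1 when ||x||_inf/||x||_2 is small
```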

Freksen, Kamma and Larsen showed that Feature Hashing into the optimal $m=O(\epsilon^{-2}\log n)$ dimensions has the JL guarantee (preserving all distances to within $(1\pm\epsilon)$) exactly when

$$\frac{||x||_\infty}{||x||_2}=O\Big(\epsilon^{1/2}\cdot\min\Big\{\frac{\log(1/\epsilon)}{\log n},\sqrt{\frac{1}{\log n}}\Big\}\Big).$$

And more generally, if one embeds into $m\ge\epsilon^{-2}\log n$ dimensions, then distances are preserved when

$$\frac{||x||_\infty}{||x||_2}=O\Big(\epsilon^{1/2}\cdot\min\Big\{\frac{\log(\epsilon m/\log n)}{\log n},\sqrt{\frac{\log(\epsilon^2 m/\log n)}{\log n}}\Big\}\Big).$$

Fast JL Transform

Instead of using sparse matrices for embedding, another approach is to use matrices for which there are efficient algorithms to compute the product $Ax$.

This is the idea in the Fast JL transform by Ailon and Chazelle.

The first observation is similar to what is used in Feature Hashing: if data vectors have only small coordinates ($\frac{||x||_\infty}{||x||_2}$ is small), then one can use very sparse matrices for the embedding. The main idea is to first multiply $x$ with a matrix which ensures that the coordinates become small, and then use a sparse matrix afterwards.

Walsh-Hadamard Matrix

Assume without loss of generality that $d$ is a power of $2$. Let $\bar H_d$ be the $d\times d$ Walsh-Hadamard matrix.

$\bar H_d$ can be defined recursively as follows:

  • The $2\times2$ Walsh-Hadamard matrix $\bar H_2$ has $\bar h_{1,1}=\bar h_{1,2}=\bar h_{2,1}=1$ and $\bar h_{2,2}=-1$. Clearly all rows are orthogonal.
  • The $2d\times2d$ Walsh-Hadamard matrix $\bar H_{2d}$ consists of four $d\times d$ Walsh-Hadamard matrices $\bar H_d$, one in each of the four quadrants, with the bottom-right copy negated:

$$\bar H_2=\begin{pmatrix}1&1\\1&-1\end{pmatrix},\qquad \bar H_{2d}=\begin{pmatrix}\bar H_d&\bar H_d\\\bar H_d&-\bar H_d\end{pmatrix},\qquad \bar H_4=\begin{pmatrix}1&1&1&1\\1&-1&1&-1\\1&1&-1&-1\\1&-1&-1&1\end{pmatrix}$$
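
A small illustrative sketch that builds $\bar H_d$ directly from this recursion and checks that the rows are orthogonal:

```python
import numpy as np

def walsh_hadamard(d):
    """Build the (unnormalized) d x d Walsh-Hadamard matrix, for d a power of 2."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])   # the four quadrants of the recursion
    return H

H = walsh_hadamard(8)
print(np.allclose(H @ H.T, 8 * np.eye(8)))   # rows are orthogonal, each of squared norm d
```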

Again, all rows of $\bar H_{2d}$ are orthogonal:

  • For two distinct rows in the top half of the matrix, orthogonality follows by induction.
  • Similarly for two distinct rows in the bottom half.
  • For one row from the top half of the matrix and one from the bottom half, orthogonality follows since the top row vector has the form $v\circ v$ and the bottom one the form $w\circ -w$, where $\circ$ denotes concatenation. Hence the inner product is $\langle v,w\rangle-\langle v,w\rangle=0$.

The crucial part about WH matrices is that one can compute the matrix-vector product $\bar H_dx$ efficiently.

Write $x$ as $\begin{pmatrix}x_1\\x_2\end{pmatrix}$ where $x_1$ is the first $d/2$ coordinates of $x$ and $x_2$ is the last $d/2$ coordinates.

Then $\bar H_dx=\begin{pmatrix}\bar H_{d/2}x_1+\bar H_{d/2}x_2\\\bar H_{d/2}x_1-\bar H_{d/2}x_2\end{pmatrix}$. Thus if we compute $\bar H_{d/2}x_1$ and $\bar H_{d/2}x_2$ recursively, we can compute $\bar H_dx$ with $O(d)$ additional work. The total time to compute $\bar H_dx$ thus satisfies the recurrence $T(d)=2T(d/2)+O(d)$, which by the Master Theorem resolves to $O(d\log d)$.
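
The recursion translates directly into code; below is an illustrative, unoptimized sketch of the $O(d\log d)$ matrix-vector product, checked against an explicit matrix built via Kronecker products.

```python
import numpy as np

def fwht(x):
    """Compute bar{H}_d x in O(d log d) time via the recursion above (d a power of 2)."""
    d = x.shape[0]
    if d == 1:
        return x.copy()
    top = fwht(x[: d // 2])       # bar{H}_{d/2} x_1
    bottom = fwht(x[d // 2:])     # bar{H}_{d/2} x_2
    return np.concatenate([top + bottom, top - bottom])

# Sanity check: bar{H}_8 can also be written as a triple Kronecker product of bar{H}_2.
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H8 = np.kron(H2, np.kron(H2, H2))
x = np.random.default_rng(1).standard_normal(8)
print(np.allclose(fwht(x), H8 @ x))
```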

Let $H$ be the normalized WH matrix, i.e., $H=d^{-1/2}\bar H_d$. Then all rows are still orthogonal and the norm of any row is $1$. Hence $H$ is an orthogonal matrix and $||Hx||_2^2=||x||_2^2$ for all vectors $x\in\mathbb R^d$.

That is, $H$ preserves all norms. The construction of Ailon and Chazelle is then as follows: Draw a random diagonal matrix $D\in\mathbb R^{d\times d}$ where each diagonal entry is uniform among $\{-1,+1\}$. Also draw a random matrix $P\in\mathbb R^{m\times d}$ where, with probability $1-q$, we set an entry to $0$, and otherwise we let the entry be $\mathcal N(0,(mq)^{-1})$ distributed. The expected number of non-zeroes of $P$ is $qmd$. The embedding of $x$ is then $PHDx$. Computing $Dx$ takes $O(d)$ time, multiplying the result with $H$ takes $O(d\log d)$ time, and multiplying with $P$ takes expected $O(qmd)$ time. Ailon and Chazelle showed that the construction achieves the JL guarantee if one sets $q=O(\log^2n/d)$, resulting in an embedding time of $O(d\log d+m\log^2n)=O(d\log d+\epsilon^{-2}\log^3n)$.
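
Putting the pieces together, here is an illustrative sketch of the full $PHDx$ embedding, reusing the `fwht` routine from the sketch above; the parameter choice for $q$ (using $\log^2 d$ in place of $\log^2 n$) and all names are assumptions made for the example, not the authors' reference implementation.

```python
import numpy as np

def fjlt(x, m, q, seed=0):
    """Sketch of the Ailon-Chazelle style embedding P H D x (fwht as defined above)."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]                               # assumes d is a power of 2
    Dx = rng.choice([-1.0, 1.0], size=d) * x     # D: random diagonal of signs, O(d) time
    HDx = fwht(Dx) / np.sqrt(d)                  # normalized Walsh-Hadamard, O(d log d) time
    # P: each entry is 0 with probability 1-q, otherwise N(0, 1/(mq)); expected qmd non-zeroes.
    # (Dense matmul below for simplicity; a sparse P gives expected O(qmd) time.)
    P = (rng.random((m, d)) < q) * rng.normal(0.0, np.sqrt(1.0 / (m * q)), size=(m, d))
    return P @ HDx

d, m = 1024, 256
x = np.random.default_rng(1).standard_normal(d)
y = fjlt(x, m, q=np.log(d) ** 2 / d)
print(np.linalg.norm(y) ** 2 / np.linalg.norm(x) ** 2)   # typically close to 1
```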

Proving that the construction satisfies the JL guarantee consists of two steps.

In the first step, we need to show that $||HDx||_\infty=O(\sqrt{\log(nd)/d})$ with probability $1-1/n^3$ for any unit vector $x\in\mathbb R^d$. That is, $HD$ ensures that all coordinates of $HDx$ are small.

In the second step, one shows that, assuming $||HDx||_\infty=O(\sqrt{\log(nd)/d})$, the application of $P$ preserves norms to within $(1\pm\epsilon)$ with good probability. Here we remark that $HD$ preserves the norm of all vectors perfectly, but the dimension is still $d$. (We only do the first step here.)

Consider the $i$-th coordinate of $HDx$. Each entry of $H$ is either $-d^{-1/2}$ or $d^{-1/2}$ and $D$ multiplies a random sign onto each coordinate of $x$. Hence the $i$-th coordinate is distributed as $\sum_j\sigma_jd^{-1/2}x_j$ where the $\sigma_j$'s are independent and uniform among $-1$ and $1$. Clearly $\mathbb E[(HDx)_i]=0$ by linearity of expectation. We will need Hoeffding's inequality.

Hoeffding's Inequality. Let $X_1,\dots,X_d$ be independent random variables where $X_i$ takes values in $[a_i,b_i]$. Let $X=\sum_iX_i$; then

$$\Pr[|X-\mathbb E[X]|>t]<2\exp\Big(-\frac{2t^2}{\sum_{i=1}^d(b_i-a_i)^2}\Big).$$

For the sum $\sum_j\sigma_jd^{-1/2}x_j$, the random variable $X_j=\sigma_jd^{-1/2}x_j$ takes values in the interval $[-d^{-1/2}|x_j|,d^{-1/2}|x_j|]$ and thus $(b_j-a_j)^2=(2d^{-1/2}|x_j|)^2=4d^{-1}x_j^2$. Thus (recalling that $x$ is a unit vector),

$$\Pr[|X-\mathbb E[X]|>t]<2\exp\Big(-\frac{2t^2}{\sum_{j=1}^d4d^{-1}x_j^2}\Big)=2\exp\Big(-\frac{2t^2}{4d^{-1}||x||_2^2}\Big)=2\exp(-t^2d/2).$$

For $t=C\sqrt{\ln(nd)/d}$ with a big enough constant $C$, this probability is less than $1/(dn)^3$. A union bound over all $d$ coordinates shows that $||HDx||_\infty=O(\sqrt{\ln(nd)/d})$ with probability at least $1-1/n^3$.

We conclude by showing that $\mathbb E[||Px||_2^2]=||x||_2^2$ for all vectors $x$, i.e., $P$ preserves norms in expectation. Each entry of $P$ is distributed as the product of a Bernoulli variable $b_{i,j}$ taking value $1$ with probability $q$ and $0$ otherwise, and a variable $n_{i,j}\sim\mathcal N(0,(mq)^{-1})$. We thus get:

$$
\begin{aligned}
\mathbb E[||Px||_2^2]&=\sum_{i=1}^m\mathbb E\Big[\Big(\sum_{j=1}^db_{i,j}n_{i,j}x_j\Big)^2\Big]\\
&=\sum_{i=1}^m\Big(\sum_{j=1}^d\mathbb E[b_{i,j}^2n_{i,j}^2x_j^2]+\sum_{j=1}^d\sum_{h\neq j}\mathbb E[b_{i,j}b_{i,h}n_{i,j}n_{i,h}x_jx_h]\Big)\\
&=\sum_{i=1}^m\Big(\sum_{j=1}^d\mathbb E[b_{i,j}^2]\mathbb E[n_{i,j}^2]x_j^2+\sum_{j=1}^d\sum_{h\neq j}\mathbb E[b_{i,j}]\mathbb E[b_{i,h}]\mathbb E[n_{i,j}]\mathbb E[n_{i,h}]x_jx_h\Big)\\
&=\sum_{i=1}^m\sum_{j=1}^dq(mq)^{-1}x_j^2\\
&=||x||_2^2.
\end{aligned}
$$

In the above, we use $\mathbb E[n_{i,j}]=0$ and $\mathbb E[b_{i,j}^2]=\mathbb E[b_{i,j}]=q$. We also use that for $n_{i,j}\sim\mathcal N(0,\sigma^2)$, we have $\mathbb E[n_{i,j}^2]=\sigma^2$. Finally, we use independence of the entries in $P$ to split the expectation of a product into the product of expectations.
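
As a quick illustration of this identity, here is a small Monte Carlo check (illustrative parameters only) that the empirical average of $||Px||_2^2$ matches $||x||_2^2$:

```python
import numpy as np

# Monte Carlo check that E[||P x||^2] = ||x||^2 for the sparse Gaussian matrix P.
rng = np.random.default_rng(0)
d, m, q, trials = 256, 64, 0.1, 2000
x = rng.standard_normal(d)
total = 0.0
for _ in range(trials):
    # Each entry of P is 0 with probability 1-q and N(0, 1/(mq)) otherwise.
    P = (rng.random((m, d)) < q) * rng.normal(0.0, np.sqrt(1.0 / (m * q)), size=(m, d))
    total += np.linalg.norm(P @ x) ** 2
print(total / trials, np.linalg.norm(x) ** 2)   # the two numbers should be close
```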