Goal of dimensionality reduction: represent a high dimensional data set in a lower dimensional space while preserving much of the important structure.
PCA (principal component analysis): maintain as much variance in the data as possible while reducing the dimensionality.
The mapping of data from a high dimensional space to a low dimensional space is called an embedding.
Johnson-Lindenstrauss Dimensionality Reduction
Lecture notes by Allan Grønlund
Main result of JL: any $m$-point subset of Euclidean space can be linearly embedded into $k = O(\log m / \epsilon^2)$ dimensions without distorting the distance between any pair of points by more than a factor of $(1 \pm \epsilon)$, for any $0 < \epsilon < \frac{1}{2}$.
For PCA to be relevant, the original data points $x_1, \dots, x_m$ must be inherently low dimensional.
The JL theorem, in contrast, requires no assumption on the original data, and the target dimension is even independent of the input dimension.
Further result: the JL lemma is tight even for non-linear embeddings.
Simple JL Lemma
Theorem 1. For any $0 < \epsilon < \frac{1}{2}$ and any integer $m$, for integer
$$k = O\!\left(\frac{\log m}{\epsilon^2}\right)$$
large enough and any $m$ points $x_1, \dots, x_m \in \mathbb{R}^d$, there exists a linear map (matrix) $L : \mathbb{R}^d \to \mathbb{R}^k$ such that for all $1 \le i, j \le m$:
$$(1 - \epsilon)\,\|x_i - x_j\|^2 \;\le\; \|Lx_i - Lx_j\|^2 \;\le\; (1 + \epsilon)\,\|x_i - x_j\|^2.$$
The linear transformation $L$ in Theorem 1 is simply multiplication by a (scaled) matrix whose entries are sampled independently from a standard Gaussian. Define a random $k \times d$ matrix $A$, where each entry $A_{i,j} \sim N(0,1)$ independently. The final embedding matrix is a sample of the random variable $L = \frac{1}{\sqrt{k}} A$.
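As a concrete illustration, here is a minimal NumPy sketch of this construction; the dimensions $d$, $k$ and the test vector are illustrative choices, not values from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 1_000, 200                  # original and target dimension (illustrative)
A = rng.standard_normal((k, d))    # entries A_ij ~ N(0, 1), independent
L = A / np.sqrt(k)                 # the embedding matrix L = (1/sqrt(k)) * A

x = rng.standard_normal(d)
x /= np.linalg.norm(x)             # make x a unit vector

# In expectation ||Lx||^2 = 1; a single sample should be close to 1.
print(np.linalg.norm(L @ x) ** 2)
```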
First we prove a simpler result for such random matrices.
Lemma 1. Fix any unit vector $x$ ($x$ is a unit vector if $\sum_i x_i^2 = 1$). For any $0 < \epsilon, \delta \le \frac{1}{2}$ and $k = O\!\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ large enough, let $L = \frac{1}{\sqrt{k}} A$ be a random variable, where $A$ is a $k \times d$ random matrix whose entries are independent $N(0,1)$. Then:
$$\Pr_L\Bigl(\bigl|\,\|Lx\|^2 - 1\,\bigr| > \epsilon\Bigr) \le \delta.$$
In other words, if we pick a random matrix as described, then with probability at least $1 - \delta$ the squared norm of $x$ is distorted by at most a factor $(1 \pm \epsilon)$ by $L$. Note that this generalizes to any non-zero vector, since a vector $v$ can be written as $v = \|v\| \cdot \frac{v}{\|v\|}$. Since the embedding is linear, $Lv = \|v\| \cdot L\frac{v}{\|v\|}$, and thus with probability at least $1 - \delta$,
$$\bigl|\,\|Lv\|^2 - \|v\|^2\,\bigr| \le \epsilon\,\|v\|^2.$$
Fix a point set $x_1, \dots, x_m$. Set $\delta = \frac{1}{m^3}$ and $k = O\bigl(\epsilon^{-2}\log\frac{1}{\delta}\bigr)$ large enough, and consider the random matrix $L = \frac{1}{\sqrt{k}} A$ where $A_{i,j} \sim N(0,1)$.
Observation: From linearity of $L$, it follows that $\|Lx_i - Lx_j\| = \|L(x_i - x_j)\|$. This means that the distance between $x_i$ and $x_j$ is distorted by more than $(1\pm\epsilon)$ iff the norm of the vector $x_i - x_j$ is distorted by more than $(1\pm\epsilon)$.
Assuming Lemma 1, for any pair of indices $i, j$, the probability that the matrix $L$ distorts the norm of the vector $x_i - x_j$ by more than $(1\pm\epsilon)$ is at most $\delta$.
Since we need all pairwise distances to be approximately maintained, we need to bound the probability that there is at least one index pair $i, j$ such that the norm of $x_i - x_j$ is distorted by more than $(1\pm\epsilon)$. We call an index pair $i, j$ bad if $L$ does not maintain the norm of $x_i - x_j$.
Applying the union bound, the probability over $L$ that there is a bad index pair is at most the sum over all pairs $i, j$ of the probability that $L$ distorts the norm of $x_i - x_j$ by more than $(1\pm\epsilon)$. We get:
$$\Pr_L\bigl(\exists\,(i,j)\ \text{bad}\bigr) \;\le\; \sum_{i=1}^{m-1}\sum_{j=i+1}^{m} \Pr\bigl((i,j)\ \text{bad}\bigr) \;=\; \binom{m}{2}\frac{1}{m^3} \;\le\; \frac{1}{m}.$$
Thus with probability at least $1 - \frac{1}{m} > 0$, the random matrix $L$ maintains all pairwise distances in the data set within a factor $(1\pm\epsilon)$. In particular, this means that a matrix embedding the points while maintaining all pairwise distances exists; to find one, just sample a matrix, and it works with probability at least $1 - \frac{1}{m}$.
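A small experiment matching this argument, under assumed parameters $m$, $d$ and $\epsilon$ (and with the explicit constant $\frac{16}{3}$ taken from the proof of Lemma 1 below): sample one matrix and measure the worst pairwise distortion of squared distances.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

m, d, eps = 100, 10_000, 0.5                          # illustrative parameters
k = int(np.ceil(16 / 3 * np.log(m ** 3) / eps ** 2))  # from delta = 1/m^3

X = rng.standard_normal((m, d))                       # m arbitrary points in R^d
L = rng.standard_normal((k, d)) / np.sqrt(k)          # one sampled embedding matrix
Y = X @ L.T                                           # the embedded points L x_i

worst = max(
    abs(np.sum((Y[i] - Y[j]) ** 2) / np.sum((X[i] - X[j]) ** 2) - 1)
    for i, j in combinations(range(m), 2)
)
print(f"worst distortion of squared distances: {worst:.3f} (target: {eps})")
```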
It remains to prove Lemma 1.
Fact 1. For any constants $a, b$, if $X \sim N(0, \sigma_X^2)$ and $Y \sim N(0, \sigma_Y^2)$ are independent, then $aX + bY \sim N(0,\, a^2\sigma_X^2 + b^2\sigma_Y^2)$.
Chi-square distribution: the sum of squares of $k$ independent $N(0,1)$ random variables is $\chi^2_k$ distributed (chi-square with $k$ degrees of freedom).
Lemma 2. Let $Y \sim \chi^2_k$. Then for any $x \in [0, \frac{1}{2})$:
$$\Pr_Y\left(\left|\frac{Y}{k} - 1\right| \ge x\right) \le \exp\!\left(-\frac{3}{16}\,k x^2\right).$$
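A quick Monte Carlo sanity check of the tail bound in Lemma 2; the values of $k$, $x$ and the number of samples are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
k, x, trials = 100, 0.3, 200_000

Y = rng.chisquare(k, size=trials)            # samples of chi^2_k
empirical_tail = np.mean(np.abs(Y / k - 1) >= x)
bound = np.exp(-3 / 16 * k * x ** 2)         # exp(-3 k x^2 / 16)

print(empirical_tail, bound)                 # the empirical tail stays below the bound
```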
First we prove that the random variable $\|Ax\|^2$ is $\chi^2_k$ distributed. Then we apply Lemma 2 to prove Lemma 1.
Lemma 3. Given any unit vector $x \in \mathbb{R}^d$, let $A$ be the random $k \times d$ matrix with each $A_{i,j} \sim N(0,1)$ independently of the other entries. Then the squared norm of $Ax$ is $\chi^2_k$ distributed:
$$\|Ax\|^2 \sim \chi^2_k.$$
Let $a_i$ denote the $i$-th row of $A$:
$$\|Ax\|^2 = \sum_{i=1}^{k} \langle a_i, x\rangle^2 = \sum_{i=1}^{k}\left(\sum_{j=1}^{d} a_{i,j} x_j\right)^{\!2}.$$
Since $a_{i,j} \sim N(0,1)$, we know $a_{i,j} x_j \sim N(0, x_j^2)$, and we are computing a sum of these. By Fact 1, $\sum_{j=1}^{d} a_{i,j} x_j \sim N\bigl(0, \sum_{j=1}^{d} x_j^2\bigr) = N(0,1)$, since $\|x\|^2 = 1$. In total, we get that $\|Ax\|^2$ is distributed as a sum of $k$ squared standard Gaussians, i.e.,
$$\|Ax\|^2 \sim \sum_{i=1}^{k} N(0,1)^2 \sim \chi^2_k.$$
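A Monte Carlo check of Lemma 3, comparing the empirical mean and variance of $\|Ax\|^2$ with those of $\chi^2_k$ (mean $k$, variance $2k$); the dimensions and sample count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, trials = 100, 20, 10_000

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # unit vector, ||x||^2 = 1

# Sample ||Ax||^2 repeatedly with fresh Gaussian matrices A.
samples = np.array([
    np.sum((rng.standard_normal((k, d)) @ x) ** 2) for _ in range(trials)
])

print(samples.mean(), k)                     # empirical mean vs. k
print(samples.var(), 2 * k)                  # empirical variance vs. 2k
```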
Proof of Lemma 1
Fix a unit vector $x$ and consider the random matrix $L = \frac{1}{\sqrt{k}} A$ defined as above.
First note that the expected value $\mathbb{E}[\|Ax\|^2] = k$, which means that $\mathbb{E}[\|Lx\|^2] = 1$, i.e., in expectation the squared norm of $x$ is preserved by $L$. This is a relevant property, but it is not enough on its own.
We need to argue that
$$\Pr_L\Bigl(\bigl|\,\|Lx\|^2 - 1\,\bigr| > \epsilon\Bigr) \le \delta.$$
Luckily, this is exactly what Lemma 2 captures. Since $\|Ax\|^2 \sim \chi^2_k$,
$$\Pr_A\left(\left|\frac{\|Ax\|^2}{k} - 1\right| \ge \epsilon\right) \le \exp\!\left(-\frac{3}{16}\,k\epsilon^2\right),$$
and since $\frac{\|Ax\|^2}{k} = \bigl\|\tfrac{1}{\sqrt{k}}Ax\bigr\|^2 = \|Lx\|^2$, this means that
$$\Pr_L\Bigl(\bigl|\,\|Lx\|^2 - 1\,\bigr| > \epsilon\Bigr) \le \exp\!\left(-\frac{3}{16}\,k\epsilon^2\right).$$
To finalize the proof, we need a $k$ such that $\exp\!\left(-\frac{3}{16}k\epsilon^2\right) \le \delta$, which is easily achieved by setting $k \ge \frac{16}{3}\,\epsilon^{-2}\log\frac{1}{\delta}$.
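A small helper turning this bound into a concrete target dimension; note that the constant $\frac{16}{3}$ comes from Lemma 2 as stated here, and other sources use different constants:

```python
import math

def jl_dimension(eps: float, delta: float) -> int:
    """Smallest integer k with k >= (16/3) * eps^-2 * log(1/delta)."""
    return math.ceil(16 / 3 * math.log(1 / delta) / eps ** 2)

# Example: preserve one squared norm within (1 +/- 0.1) except with probability 1e-3.
print(jl_dimension(0.1, 1e-3))   # roughly 3700
```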
Remarks
The JL Lemma generalizes to the case where the entries of $A$ are instead uniform random signs $\pm 1$. It can also be shown that the matrix can be made sparse and the amount of randomness reduced, all of which makes the embedding more efficient to apply.
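A hedged sketch of the sign-matrix variant mentioned above: the same $\frac{1}{\sqrt{k}}$ scaling, but entries drawn uniformly from $\{-1, +1\}$ (the sparse and derandomized constructions are not shown):

```python
import numpy as np

rng = np.random.default_rng(4)
d, k = 1_000, 200                            # illustrative dimensions

A = rng.choice([-1.0, 1.0], size=(k, d))     # uniform random signs
L = A / np.sqrt(k)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)
print(np.linalg.norm(L @ x) ** 2)            # again concentrated around 1
```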
We can reduce the dimensionality of any dataset by using a simple randomized linear transformation.
But what if we allow a non-linear embedding? Surprisingly, we cannot do better!
The JL embedding lemma is tight even for non-linear embeddings: there exists a data set that cannot be embedded into a space of lower dimension than the JL Lemma promises without violating some of the pairwise distances.
Application to K-Means
Lecture notes by Kasper
K-Means
Compute $k$ centers minimizing the sum of squared distances from each point to its nearest cluster center.
Computing the optimal k-means clustering is NP-hard, so we use heuristic algorithms such as Lloyd's algorithm.
Lloyd's algorithm starts by partitioning the input points into $k$ clusters $X_1, \dots, X_k$ in some way. Then repeat: compute the optimal center of each cluster, reassign each point to its nearest cluster center, and stop when no cluster changes.
This algorithm may get stuck in a local minimum, and its result depends on the initialization; a minimal sketch is given below.
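A minimal sketch of Lloyd's algorithm as just described, with a random initial partition; the initialization and the handling of empty clusters are simplifications made for this sketch:

```python
import numpy as np

def lloyd(X: np.ndarray, k: int, max_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Return a cluster label in {0, ..., k-1} for each row of X."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    labels = rng.integers(k, size=n)                  # arbitrary initial partition
    for _ in range(max_iter):
        # The optimal center of a cluster is its mean; re-seed empty clusters.
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else X[rng.integers(n)]
            for j in range(k)
        ])
        # Reassign every point to its nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):        # no cluster changed: stop
            break
        labels = new_labels
    return labels
```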
Let $t$ denote the number of iterations; the total running time is $O(tndk)$:
$t$ iterations,
computing the new cluster centers takes $O(nd)$ time per iteration,
reassigning points takes $O(ndk)$ time per iteration.
Applying Johnson-Lindenstrauss
To speed up Lloyd’s algorithm
Before running Lloyd's algorithm, perform a JL transform such that all distances between points are preserved to within a factor $(1\pm\epsilon)$.
Let $A \in \mathbb{R}^{m \times d}$ be the embedding matrix, so that each original point $x_i$ is mapped to $Ax_i$. The new dimensionality is $m = O(\epsilon^{-2}\log n)$. Performing the transform takes $O(nd\epsilon^{-2}\log n)$ time, which may be much less than $O(tndk)$. Running Lloyd's in the lower dimensional space then takes $O(tnk\epsilon^{-2}\log n)$ time, for a total running time of $O\bigl(n\epsilon^{-2}(d + kt)\log n\bigr)$.
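A sketch of this pipeline, reusing the `lloyd` sketch above; the sizes and the explicit constant in the target dimension are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

n, d, k_clusters, eps = 1_000, 5_000, 10, 0.5         # illustrative sizes
m = int(np.ceil(16 / 3 * np.log(n ** 3) / eps ** 2))  # m = O(eps^-2 log n)

X = rng.standard_normal((n, d))                       # original data, n x d
A = rng.standard_normal((m, d)) / np.sqrt(m)          # JL embedding matrix
Y = X @ A.T                                           # reduced data, n x m

labels = lloyd(Y, k_clusters)                         # cluster in the reduced space
# The resulting labels are then used as a clustering of the original points.
```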
Lemma 4. For any set of points $x_1, \dots, x_n$ and any partitioning of the points into clusters $X_1, \dots, X_k$ with centroids $\mu_j = \frac{1}{|X_j|}\sum_{x_i \in X_j} x_i$, the cost of the clustering satisfies
$$\sum_{j=1}^{k} \sum_{x_i \in X_j} \|x_i - \mu_j\|_2^2 \;=\; \sum_{j=1}^{k} \frac{1}{2|X_j|} \sum_{x_i, x_h \in X_j} \|x_i - x_h\|_2^2.$$
The interesting thing about Lemma 4 is that it shows that the cost of a clustering can be written solely in terms of pairwise distances. This means that the cost of all clusterings will be roughly preserved if all distances are roughly preserved.
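Lemma 4 is the standard identity relating the centroid cost to the pairwise squared distances inside each cluster; here is a quick numerical check of the two forms on random data (sizes and the clustering are arbitrary):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 8))               # 40 random points in R^8
labels = rng.integers(3, size=40)              # an arbitrary partition into 3 clusters

centroid_cost = sum(
    np.sum((X[labels == j] - X[labels == j].mean(axis=0)) ** 2)
    for j in range(3)
)
# Unordered pairs i < h, so the factor 1/2 from the lemma is already absorbed.
pairwise_cost = sum(
    sum(np.sum((X[i] - X[h]) ** 2)
        for i, h in combinations(np.flatnonzero(labels == j), 2))
    / np.count_nonzero(labels == j)
    for j in range(3)
)
print(centroid_cost, pairwise_cost)            # the two costs agree
```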
Consider any clustering $X_1, \dots, X_k$ of the dimensionality-reduced data set $Ax_1, \dots, Ax_n$ and assume its cost is $\mu$. By Lemma 4, the cost of the clustering equals
$$\mu = \sum_{j=1}^{k} \frac{1}{2|X_j|} \sum_{x_i, x_h \in X_j} \|Ax_i - Ax_h\|_2^2.$$
The JL transform guarantees that each term $\|Ax_i - Ax_h\|_2^2$ is within $(1\pm\epsilon)\,\|x_i - x_h\|_2^2$, thus the cost of the same clustering on the original data points lies in the interval
$$\left[\frac{\mu}{1+\epsilon},\ \frac{\mu}{1-\epsilon}\right].$$
Since this holds for all clusterings, we can in particular conclude that if one finds a good clustering $X$ on the low dimensional data set (say one with a cost of at most $(1+\gamma)\mu^*$, where $\mu^*$ is the cost of the optimal clustering in the low dimensional space), then that clustering has a cost of at most $\frac{1+\gamma}{1-\epsilon}\mu^*$ on the original data set.
Moreover, if we let $\psi^*$ be the cost of the optimal clustering $X^*$ on the original data set, then that same clustering has a cost of at most $(1+\epsilon)\psi^*$ in the low dimensional space, and hence $\mu^* \le (1+\epsilon)\psi^*$. It follows that the clustering $X$ has a cost of no more than $(1+\epsilon)(1+\gamma)\frac{1}{1-\epsilon}\psi^*$ on the original data. For $\epsilon < \frac{1}{2}$ we have $\frac{1}{1-\epsilon} = 1 + \frac{\epsilon}{1-\epsilon} \le 1 + 2\epsilon$ and $(1+\epsilon)(1+2\epsilon) = 1 + 3\epsilon + 2\epsilon^2 \le 1 + 4\epsilon$, and thus the cost is no more than $(1+4\epsilon)(1+\gamma)\psi^*$. That is, if we can find an approximately optimal clustering for the dimensionality reduced points, we have found an approximately optimal clustering for the original points.
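A tiny numeric check of the arithmetic above, for assumed values $\epsilon = 0.1$ and $\gamma = 0.05$:

```python
eps, gamma = 0.1, 0.05                        # illustrative accuracy parameters

exact = (1 + eps) * (1 + gamma) / (1 - eps)   # the factor (1+eps)(1+gamma)/(1-eps)
bound = (1 + 4 * eps) * (1 + gamma)           # the simplified bound (1+4eps)(1+gamma)
print(exact, bound)                           # ~1.283 <= 1.47
```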