Parallel Sorting
Sequential sorting lower bound (comparison-based): \(\Omega(n\log n)\)
Parallel sorting
Tournament Algorithm
Idea: with \(n^2\) processors, build an \(n\times n\) comparison matrix in \(O(1)\) time. The sum of row \(i\) is then the rank of \(v[i]\), i.e. its (1-based) position in the sorted array.
Pseudocode
func tournament(arr v)
    n = v.len
    m = new matrix(n, n)                        // initialization
    for all 0 <= i, j <= n-1 do in parallel     // matrix construction
        if v[i] > v[j] or (v[i] == v[j] and i >= j)  // break ties by index
            m[i, j] = 1
        else
            m[i, j] = 0
    p = new arr(n)
    for all 0 <= i <= n-1 do in parallel        // parallel sum
        p[i] = parallel_sum(m[i])
    r = new arr(n)
    for all 0 <= i <= n-1 do in parallel        // scatter to sorted position
        r[p[i] - 1] = v[i]
    return r
Analysis:
- Initialization: time \(O(1)\), work \(O(n^2)\)
- Matrix construction: time \(O(1)\), work \(O(n^2)\)
- Parallel sum: time \(O(\log n)\), work \(O(n^2)\)
- Modify array: time \(O(1)\), work \(O(n^2)\)
Thus, tournament runs in time \(O(\log n)\) with work \(O(n^2)\).
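As an illustration, here is a minimal sequential Python sketch of the tournament (rank) sort above; each comprehension stands in for one parallel step, and ties are broken by index so equal elements get distinct ranks:

```python
def tournament(v):
    """Rank sort: simulate the n^2 parallel comparisons sequentially."""
    n = len(v)
    # Matrix construction: m[i][j] = 1 iff v[i] "wins" against v[j];
    # ties are broken by index so equal elements get distinct ranks.
    m = [[1 if (v[i], i) >= (v[j], j) else 0 for j in range(n)]
         for i in range(n)]
    # Row sums give 1-based ranks (m[i][i] = 1 always).
    p = [sum(row) for row in m]
    # Scatter each element to its sorted position.
    out = [None] * n
    for i in range(n):
        out[p[i] - 1] = v[i]
    return out
```

On a PRAM, the row sums would be computed by a parallel reduction in \(O(\log n)\) time; here `sum` plays that role sequentially.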
CREW to EREW
Is tournament EREW? No! It is CREW: during matrix construction, each \(v[i]\) is read concurrently by \(n\) processors.
To transform a CREW algorithm into an EREW one, we introduce make_n_copies, which gives each processor its own copy of the value it needs.
func make_n_copies(x, v, n)
    v[0] = x
    for i = 0 to ceil(log n) - 1                // sequential doubling rounds
        for all 0 <= j <= 2^i - 1 do in parallel
            if j + 2^i <= n - 1
                v[j + 2^i] = v[j]
    return v
Time \(O(\log n)\), work \(O(n)\).
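A small Python sketch of the doubling broadcast (sequential simulation; each round of the outer loop corresponds to one \(O(1)\) parallel step):

```python
import math

def make_n_copies(x, n):
    """Broadcast x into an array of length n by doubling:
    after round i, the first min(2^(i+1), n) cells hold x."""
    v = [None] * n
    v[0] = x
    for i in range(math.ceil(math.log2(n)) if n > 1 else 0):
        # Cells 0 .. 2^i - 1 are filled; copy them to 2^i .. 2^(i+1) - 1.
        for j in range(2 ** i):
            if j + 2 ** i < n:
                v[j + 2 ** i] = v[j]
    return v
```

The copies double each round, so \(\lceil\log_2 n\rceil\) rounds suffice, which is where the \(O(\log n)\) time bound comes from.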
Better Algorithms
Parallel sorting is possible in
- Time \(O(\log^2 n)\), work \(O(n\log^2 n)\), via bitonic merge sort
- Time \(O(\log n)\), work \(O(n\log n)\), via the AKS algorithm (complicated)
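For concreteness, a minimal sequential sketch of bitonic merge sort (assuming the input length is a power of two). On a PRAM, each layer of compare-exchanges would run as one \(O(1)\) parallel step:

```python
def bitonic_merge(v, ascending):
    """Merge a bitonic sequence into sorted order."""
    n = len(v)
    if n <= 1:
        return list(v)
    v = list(v)
    half = n // 2
    # One compare-exchange layer: all n/2 swaps are independent,
    # so on a PRAM they would happen in a single parallel step.
    for i in range(half):
        if (v[i] > v[i + half]) == ascending:
            v[i], v[i + half] = v[i + half], v[i]
    return (bitonic_merge(v[:half], ascending)
            + bitonic_merge(v[half:], ascending))

def bitonic_sort(v, ascending=True):
    """Bitonic merge sort; len(v) must be a power of two."""
    n = len(v)
    if n <= 1:
        return list(v)
    half = n // 2
    # Sort the halves in opposite directions to form a bitonic sequence.
    first = bitonic_sort(v[:half], True)
    second = bitonic_sort(v[half:], False)
    return bitonic_merge(first + second, ascending)
```

The recursion has \(O(\log^2 n)\) compare-exchange layers in total, matching the stated time bound.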
Pointer Jumping
Prefix sum on lists.
Given an array \(v\) of elements, an array \(p\) of pointers.
\(v:[10,15,20,30,25,2,8,7]\)
\(p:[2,5,6,1,/,7,3,4]\)
They represent the "linked list": \(10\to20\to8\to30\to15\to2\to7\to25\)
We want to get: \(10\to30\to38\to68\to83\to85\to92\to117\)
Algorithm Prefix Sum on List
Repeat \(O(\log n)\) times:
- Each element adds its value to its successor.
- Pointer jump: each successor pointer is replaced by the successor's successor.
First round,
the list after addition:
\(10\to30\to28\to38\to45\to17\to9\to32\);
pointer jumping produces two lists:
\(10\to28\to45\to9\) and
\(30\to38\to17\to32\)
Second round, after addition
\(10\to38\to73\to54\) and
\(30\to68\to55\to49\)
after pointer jumping
\(10\to73\) and \(38\to54\) and
\(30\to55\) and \(68\to49\)
Third round, after addition
\(10\to83\)
\(38\to92\)
\(30\to85\)
\(68\to117\)
after pointer jumping, we get the desired prefix sum on list.
Pseudocode
func prefix_sum_list(v, p)
    n = p.len                                   // (= v.len)
    succ = new arr(n)
    sum = new arr(n)
    for all 0 <= i <= n-1 do in parallel        // initialization
        succ[i] = p[i]
        sum[i] = v[i]
    for i = 1 to ceil(log n)
        for all 0 <= j <= n-1 do in parallel
            if succ[j] != None
                sum[succ[j]] += sum[j]
                succ[j] = succ[succ[j]]         // pointer jump
    return sum
Analysis of algorithm:
- Initialization: Creating a copy of \(v\) and \(p\). Time \(O(1)\), work \(O(n)\).
- Pointer jumping: Time \(O(\log n)\), work \(O(n\log n)\)
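A sequential Python sketch of the pointer-jumping loop above. Each round first gathers all reads and then applies the writes, mimicking the PRAM convention that a parallel step reads everything before writing anything:

```python
import math

def prefix_sum_list(v, p):
    """Prefix sum on a linked list given as arrays:
    p[i] is the index of i's successor, None at the tail."""
    n = len(v)
    succ = list(p)
    total = list(v)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # One parallel step: collect all (target, value) pairs first ...
        adds = [(succ[j], total[j]) for j in range(n) if succ[j] is not None]
        # ... then apply the writes and jump all pointers.
        for tgt, val in adds:
            total[tgt] += val
        succ = [succ[s] if s is not None else None for s in succ]
    return total
```

Running it on the example from the text reproduces the prefix sums \(10, 30, 38, 68, 83, 85, 92, 117\) (stored at each element's own index).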
Exercise: Depth on Binary Trees
Given a binary tree \(T\) with \(n\) nodes, array of parent \(P\), left child \(L\), right child \(R\).
Compute the depth array \(D\) for every node. CRCW-PRAM: \(O(\log n)\) time.
- Depth array \(D\): set \(D[root] = 0\) and all other entries to \(1\).
- Next array \(N\): \(N[i] = P[i]\).
- Repeat \(O(\log n)\) times, in parallel on every node \(i\): if \(N[i]\neq NULL\), then \(D[i] \mathrel{+}= D[N[i]]\) and \(N[i] = N[N[i]]\).
Time \(O(\log n)\), work \(O(n\log n)\), with \(O(n)\) processors.
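The exercise can be sketched in Python the same way: only the parent array \(P\) is needed, since the depth of a node is the number of edges on its path to the root, summed up by pointer jumping. (The function name `tree_depth` is mine, not from the exercise.)

```python
import math

def tree_depth(parent):
    """Depth of every node via pointer jumping over parent pointers.
    parent[i] is i's parent index, None for the root."""
    n = len(parent)
    # D[root] = 0, all other entries start at 1.
    depth = [0 if parent[i] is None else 1 for i in range(n)]
    nxt = list(parent)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # One parallel step: compute all new values before overwriting.
        new_depth = [depth[i] + (depth[nxt[i]] if nxt[i] is not None else 0)
                     for i in range(n)]
        nxt = [nxt[p] if p is not None else None for p in nxt]
        depth = new_depth
    return depth
```

After round \(k\), each node has accumulated the ones along \(2^k\) ancestors, so \(\lceil\log_2 n\rceil\) rounds compute every depth.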
Can We Parallelize Everything?
Not really.
The complexity class \(\mathsf{NC}\): problems solvable in \(O(\mathsf{poly}(\log n))\) time using \(O(\mathsf{poly}(n))\) processors.
\(\mathsf{P}\): problems solvable in \(O(\mathsf{poly}(n))\) time using one processor.
Open problem: \(\mathsf{NC} = \mathsf{P}\)? Conjecture: \(\mathsf{NC}\subsetneq\mathsf{P}\).