Parallel Sorting

Sequential sorting lower bound (comparison-based): \(O(n\log n)\)

Parallel sorting

Tournament Algorithm

Idea: Generate \(n\times n\) comparison matrix in \(O(1)\) time with \(n^2\) processors. Then sum up each row, the sum is the new index in the sorted array.

Pseudocode

func tournament(arr v)
    n = v.len
    m = new matrix(n,n) // Initialization
    for all 1 <= i,j <= n-1 do in parallel // matrix construction
        if v[i] >= v[j]
            m[i, j] = 1
        else m[i, j] = 0
    p = new arr(n)
    for all 0 <= i < = n-1 do in parallel // parallel sum
        parallel_sum(m[i])
    for all 0 <= i <= n-1 do in parallel // modify array
        v[p[i]-1] = v[i]
    return v

Analysis:

Initialization: time \(O(1)\), work \(O(n^2)\)
Matrix construction: time \(O(1)\), work \(O(n^2)\)
Parallel sum: time \(O(\log n)\), work \(O(n^2)\)
Modify array: time \(O(1)\), work \(O(n^2)\)

Thus, tournament has time \(O(\log n)\) and work \(O(n^2)\)

CREW to EREW

Is tournament EREW? No! It's CREW (during matrix construction).

To transform EREW to CREW, we introduce make_n_copies.

func make_n_copies(x,v,n)
    v[0] = x
    for 0 <= i <= log(n)
        for 0 <= j <= 2^i-1 do in parallel
            v[j+2^i] = v[j]
    return v

Time \(O(\log n)\), work \(O(n)\).

Better Algorithms

Parallel sorting is possible in

Time \(O(\log^2n)\), work \(O(n\log^2n)\) by Bitonic merge sort
Time \(O(\log n)\), work \(O(n\log n)\) by AKS algorithm (complicated)

Pointer Jumping

Prefix sum on lists.

Given an array \(v\) of elements, an array \(p\) of pointers.

\(v:[10,15,20,30,25,2,8,7]\)

\(p:[2,5,6,1,/,7,3,4]\)

They represent the "linked list": \(10\to20\to8\to30\to15\to2\to7\to25\)

We want to get: \(10\to30\to38\to68\to83\to85\to92\to117\)

Algorithm Prefix Sum on List

Repeat \(O(\log n)\) times:

Every element add to the successor.
Pointer jump.

First round,

the list after addition:

\(10\to30\to28\to38\to45\to17\to9\to32\);

pointer jumping produces two lists:

\(10\to28\to45\to9\) and

\(30\to38\to17\to32\)

Second round, after addition

\(10\to38\to73\to54\) and

\(30\to68\to55\to49\)

after pointer jumping

\(10\to73\) and \(38\to54\) and

\(30\to55\) and \(68\to49\)

Third round, after addition

\(10\to83\)

\(38\to92\)

\(30\to85\)

\(68\to117\)

after pointer jumping, we get the desired prefix sum on list.

Pseudocode

func prefix_sum_list(v,p)
    n = p.len //(=v.len)
    succ = new arr(n)
    sum = new arr(n)
    for all 0 <= i <= n-1 do in parallel
        succ[i] = p[i]
        sum[i] = v[i] // Initialization
    for i = 1 to n
        for j = 0 to n-1 do in parallel
            if succ[i] != None
                sum[succ[j]] += sum[j]
                succ[j] = succ[succ[j]] // Pointer jump
    return sum, p

Analysis of algorithm:

Initialization: Creating a copy of \(v\) and \(p\). Time \(O(1)\), work \(O(n)\).
Pointer jumping: Time \(O(\log n)\), work \(O(n\log n)\)

Exercise: Depth on Binary Trees

Given a binary tree \(T\) with \(n\) nodes, array of parent \(P\), left child \(L\), right child \(R\).

Compute the depth array \(D\) for every node. CRCW-PRAM: \(O(\log n)\) time.

Depth array \(D\), set \(D[root]=0\), others \(1\).
Next array \(N\), \(N[i]=P[i]\).
Do in parallel on node \(i\): if \(N[i]\neq NULL\), \(D[i]+=D[N[i]]\), \(N[i]=N[N[i]]\).

Time \(O(\log n)\), work \(O(n\log n)\), processor \(O(n)\).

Can We Parallelize Everything?

Not really.

The complexity class \(\mathsf{NC}\): \(O(\mathsf{poly}(\log n))\) time using \(O(\mathsf{poly}(n))\) processors.

\(\mathsf{P}\): \(O(\mathsf{poly}(n))\) time using \(1\) processor.

Open problem: \(\mathsf{NC}=\mathsf{P}\)? Conjecture: \(\mathsf{NC}\subsetneq\mathsf{P}\)

See more detail here.