阅读材料

prefix sums and their application

by Guy E. Blelloch, CMU

parallel tasks: building blocks, and merge.

dynamic programming & divide and conquer

extended from sequential algorithms

all-prefix-sums operation on PRAM

intro

prefix sum operation has many extension: quick-sort, lexical analysis, tree operations…

primitive instruction in some machine（原子操作）

define: scan operation is an array all-prefix-sums operation.

implementation

array length n, EREW PRAM, time complexity: O(n/p+log p)

binary tree, for each depth: each father node first passes its left child value to right child.

if n>p, 采用分块的方式, time complexity: O(n/p+log p)

up sweep: sum[v]=sum[L[v]]+sum[R[v]]

down sweep: prescan[L[v]]=prescan[v], prescan[R[v]]=sum[L[v]]+prescan[v]

after completing down sweep, each vertex of the tree contains the sum of all the leaf value that precede it.

applications

line-of-sight and radix-sort

recurrence equations 递归方程的计算

segmented scans

allocation processors

课堂笔记

More application of prefix sums

minimum
broadcast
partition

Sorting

Quicksort

O(n) work, O(log n) depth, EREW PRAM

randomized, sub-optimal depth, optimal work

Merge Sort

procedure BASICMS(I,S)
	split I into 2 equal part I_l and I_r of size n/2
	split S into 2 equal part S_l and S_r of size n/2
	for each I_l and I_r in parallel do
		(I_l,S_l)=BASICMS(I_l,S_l)
		(I_r,S_r)=BASICMS(I_r,S_r)
	sequentially merge S_l and S_r into I
	return (S,I)

O(n) depth
optimal work O(n log n)
Theoretically bad
probably lower constants compared to fully parallel
with p processor we get $\frac{W}{p}+T=\frac{n\log n}{p}+n$

对 merge 操作并行化

假设这里有 2 个排好序的数组 A B，合并为一个数组

假设数组 B 中的一个数字 bi，它在合并后的数组的位置为 i+rank(B[i],A)，rank 用二分查找找到。

procedure SEGMERGE(A,B,M,p)
	allocate array R[0,...,p]
	R[0]=0
	for each i=1...p in parallel do
		R[i]=rank(A,B[i*n/p])
	for each i=1..p in parallel do
		asymmerge(B[(i-1)*n/p+1],...,A[R[i-1],R[i]])

asymmetric merge

break large trunk into small piece

so in the parallel merge sort we replace sequentially merge into SEGMERGE.

Searching

input: A sorted array with n elements

query: a value v

output: the predecessor of v

The brutal force:

procedure BF(A[1...n],v)
	r=0
	for i=0...n in parallel do
		if A[i]<=v<A[i+1] then
			r=i
	return r

how to do parallel?

In CREW

parallel based on brutal force method:

procedure PARALLEL-BF(A[1...n],l,h,v)
	if h-l<=p then
		return BF(A[1...n],v)
	for i=1...p in parallel do
		B[i]=A[l+(i-1)*(h-l)/p,l+r*(h-l)/p]
	r=BF(B,v)
	return PARALLEL-BF(A,(r-1)*(h-l)/p,l+r*(h-l)/p,v)

$Q(n)=O(1)$ $n\le p$

$Q(n)=Q(n/p)+O(1)$ $n>p$

$Q(n)=O(\log_pn)$

In EREW

divide into p subsets of size n/p

Binary search in parallel

$T(n)=\log(n/p)=\log(n)-\log(p)$

both results optimal
CREW provably stronger than EREW
for a natural problem

Lower Bounds of PRAM

basic of lower bounds

goal:

showing that a problem is difficult
worst-case lower bound
- existence of at least one difficult input

CREW Lower bounds

OR Problem

n boolean values 0 or 1

output: boolean or of the values

O(1) solution in common CRCW

a lower bound will also separate the weakest CRCW from CREW

CREW provably weaker than CRCW on a natural problem

$O(\log n)$ seems impossible, even if there is at most one value of 1

Building lower bounds

assuming

processors have infinite registers
processors have infinite computational power
processors can compute arbitrary functions of their register

a computational model for lower bounds

from PRAM to circuits

we have a depth i circuit
magic CPU gates
arbitrary fan-out
2 fan-in

Separations

deep connections between complexity theory, circuits and PRAM algorithms

switching lemma from complexity theory

XOR problem: compute the XOR of n bits

$O(1)$ in sum-CRCW PRAM
$\Omega(\frac{\log n}{\log\log n})$ Lower bound in priority/common-CRCW PRAM

OR problem

$O(1)$ time in common-CRCW
$\Omega(\log n)$ in CREW

Broadcast: send v to all processors

$O(1)$ time in CREW
$\Omega(\log n)$ in EREW

Searching: given n numbers in increasing order find predecessor of v

$O(\log_pn)$ in CREW
$\Omega(\log\frac{n}{p})$ in EREW

A Searching Lower bound in EREW

searching: given n numbers in increasing order, find predecessor of 0 in 000…000111…111

proof idea:

every change should change the output
every change should change some memory locations
every change should affect some processor
at least one processor affected by $\ge\frac{n}{p}$ changes
$\log\frac{n}{p}$ lower bound

tight!

divide into disjoint $\frac{n}{p}$ sub-arrays
Binary search in each

PRAM models have close ties to Complexity theory.

CRCW PRAM simulated using arbitrary fan out circuits.
CRCW can compute arbitrary boolean functions in $O(1)$ time (potentially exponentially number of processors)