From: W. Trevor King Date: Tue, 14 Sep 2010 16:21:28 +0000 (-0400) Subject: Convert logistic_cuda assignment PDFs to XHTML. X-Git-Url: http://git.tremily.us/?a=commitdiff_plain;h=3d1dd7e494ae1a29bf6f69b88cfaa15370bee762;p=parallel_computing.git Convert logistic_cuda assignment PDFs to XHTML. --- diff --git a/assignments/archive/logistic_cuda/assigment_a.pdf b/assignments/archive/logistic_cuda/assigment_a.pdf deleted file mode 100644 index 29d4a6b..0000000 Binary files a/assignments/archive/logistic_cuda/assigment_a.pdf and /dev/null differ diff --git a/assignments/archive/logistic_cuda/assigment_b.pdf b/assignments/archive/logistic_cuda/assigment_b.pdf deleted file mode 100644 index 1094ef0..0000000 Binary files a/assignments/archive/logistic_cuda/assigment_b.pdf and /dev/null differ diff --git a/assignments/archive/logistic_cuda/index.shtml b/assignments/archive/logistic_cuda/index.shtml deleted file mode 100644 index adad5b1..0000000 --- a/assignments/archive/logistic_cuda/index.shtml +++ /dev/null @@ -1,17 +0,0 @@ - - -

Assignment 9

- - - - diff --git a/assignments/archive/logistic_cuda/index.shtml.itex2MML b/assignments/archive/logistic_cuda/index.shtml.itex2MML new file mode 100644 index 0000000..03951bb --- /dev/null +++ b/assignments/archive/logistic_cuda/index.shtml.itex2MML @@ -0,0 +1,179 @@ + + +

Assignment 9

+

Due Friday, December 11

+ +

Purpose

+ +

Learn the CUDA language.

+

Note: Please identify all your work.

+ + + + + +

Part A — Matrix Multiplication

+ +

This assignment consists of multiplying two square matrices of identical size.

+ +

\[ + P = M \times N +\]

+ +

The matrices $M$ and $N$, of size MatrixSize, can be filled with random numbers.

+ +
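For example, the host side might allocate the flat arrays and fill them with random values roughly as follows. This is a sketch only; the function name and the value range are illustrative, not part of the assignment.

#include <stdlib.h>

/* Allocate a MatrixSize x MatrixSize matrix as a flat, row-major 1D array
 * and fill it with random values in [0, 1). */
float *alloc_random_matrix(int matrix_size)
{
  float *m = (float *)malloc(matrix_size * matrix_size * sizeof(float));
  for (int i = 0; i < matrix_size * matrix_size; i++)
    m[i] = rand() / (float)RAND_MAX;
  return m;
}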

Step 1

+ +

Write a program to do the matrix multiplication assuming that the +matrices $M$ and $N$ are small and fit in a CUDA block. Input the +matrix size via a command line argument. Do the matrix multiplication +on the GPU and on the CPU and compare the resulting matrices. Make +sure that your code works for arbitrary block size (up to 512 threads) +and (small) matrix size. Use one-dimensional arrays to store the +matrices $M$, $N$ and $P$ for efficiency.

+ +
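One possible shape for the Step 1 kernel, with the matrices stored row-major in 1D arrays and one thread per element of $P$. This is a sketch under those assumptions, not the required implementation; the names and launch parameters are illustrative.

/* Each thread computes one element of P: P[row][col] = sum_k M[row][k] * N[k][col].
 * Assumes the whole product fits in a single block (size * size <= 512 threads). */
__global__ void matmul_single_block(const float *M, const float *N, float *P,
                                    int size)
{
  int row = threadIdx.y;
  int col = threadIdx.x;
  if (row < size && col < size) {
    float sum = 0.0f;
    for (int k = 0; k < size; k++)
      sum += M[row * size + k] * N[k * size + col];
    P[row * size + col] = sum;
  }
}

/* Host-side launch (illustrative):
 *   dim3 block(size, size);
 *   matmul_single_block<<<1, block>>>(d_M, d_N, d_P, size);
 */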

Step 2

+ +

Modify the previous program to multiply matrices of arbitrary size. Make sure that your code works for arbitrary block size (up to 512 threads) and matrix size (up to the memory limit). Instrument your program with calls to gettimeofday() to time the matrix multiplication on the CPU and GPU. Plot these times as a function of matrix size (up to large matrices, e.g. 4096) and estimate how the timing scales with matrix size.

+ +
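A minimal way to wrap the multiplications with gettimeofday(), sketched here with placeholder names:

#include <sys/time.h>

/* Wall-clock seconds between two gettimeofday() samples. */
double elapsed_seconds(struct timeval start, struct timeval end)
{
  return (end.tv_sec - start.tv_sec) + 1e-6 * (end.tv_usec - start.tv_usec);
}

/* Usage sketch:
 *   struct timeval t0, t1;
 *   gettimeofday(&t0, NULL);
 *   ... CPU multiplication, or kernel launch ...
 *   cudaThreadSynchronize();   // for the GPU: wait until the kernel finishes
 *   gettimeofday(&t1, NULL);
 *   printf("%d %g\n", matrix_size, elapsed_seconds(t0, t1));
 */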

Step 3

+ +

Optimize the previous code to take advantage of the very fast shared memory. To do this you must tile the matrix via a 2D CUDA grid of blocks (as above). All matrix elements of $P$ within a block will be computed at once. The scalar product of each row of $M$ and each column of $N$ within the block can be calculated by scanning over the matrices in block-sized tiles. The content of $M$ and $N$ within the tiles can then be transferred into shared memory for speed.

+ +
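A sketch of the tiling idea follows. TILE is an illustrative tile width chosen so that TILE*TILE stays within the 512-thread block limit, and for brevity the sketch assumes the matrix size is a multiple of TILE.

#define TILE 16

/* Each block computes a TILE x TILE patch of P.  The loop walks across M and N
 * in TILE-wide strips, staging each pair of tiles in shared memory before the
 * partial scalar products are accumulated. */
__global__ void matmul_tiled(const float *M, const float *N, float *P, int size)
{
  __shared__ float Ms[TILE][TILE];
  __shared__ float Ns[TILE][TILE];
  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float sum = 0.0f;
  for (int t = 0; t < size / TILE; t++) {
    /* Each thread copies one element of each tile into shared memory. */
    Ms[threadIdx.y][threadIdx.x] = M[row * size + t * TILE + threadIdx.x];
    Ns[threadIdx.y][threadIdx.x] = N[(t * TILE + threadIdx.y) * size + col];
    __syncthreads();
    for (int k = 0; k < TILE; k++)
      sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
    __syncthreads();
  }
  P[row * size + col] = sum;
}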

See the Learning CUDA section of the course notes (/content/GPUs/#learn) and the skeleton code matmult_skeleton.cu. See also the in-class exercise on array reversal.

+ +

Part B — Logistic Map

+ +

Background

+ +

This part of the assignment asks you to adapt to CUDA a serial code that generates a bifurcation diagram for the logistic map. The logistic map is a map of the real line to itself given by

+ +

\[
  x_{i+1} = a - x_i^2.
\]

+ +

This mapping is ubiquitous in many problems of practical interest +and is arguably the simplest example of a (discrete) complex dynamical +system (indeed, you’ll note its similarity to the equation generating +the complex Mandelbrot set).

+ +

The variable $a$ is a parameter that is held constant while $x$ is iterated from some initial condition $x_0$. We are interested in the long-term or asymptotic behavior as $x_0$ is iterated for various values of $a$. A plot of the asymptotic values of $x$ versus $a$ is called a bifurcation diagram.

+ +

The reason for this terminology is as follows. The asymptotic behavior often varies smoothly with $a$. For example, for some $a$, $x_0$ may tend to a fixed point $x^*$, with the value of $x^*$ varying smoothly with $a$. However, for another $a$, $x_0$ could end up in a period-two orbit, oscillating between two values $x_1^*$ and $x_2^*$. The values of these two points may also vary smoothly with $a$, but there is some transition value $\tilde{a}$ where we jump from the fixed point to the period-two orbit. This non-smooth change is called a bifurcation. The bifurcation diagram shows all of these bifurcations on a single plot since we scan over all values of $a$.

+ +

The serial code loops over $a$ and iterates a random initial condition THRESH times. This lets transients “die out” so the iterates approach the asymptotic behavior. If an iterate leaves the interval $[-2, 2]$ during this time it will eventually escape to $\infty$, so the trajectory is thrown out and another random initial condition is tried. It is known that attracting sets of positive measure exist for the $a$ values in the program, so this loop will eventually terminate.

+ +
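The transient loop might look roughly like the following in serial C. This is a sketch only, not the actual logistic.c; the THRESH value here merely stands in for the #define in the source.

#include <stdlib.h>

#define THRESH 1000  /* illustrative value; the source defines its own */

/* Iterate a random initial condition THRESH times.  If it leaves [-2, 2] it
 * will escape to infinity, so discard it and try another initial condition. */
double settle(double a)
{
  while (1) {
    double x = 4.0 * rand() / (double)RAND_MAX - 2.0;  /* x0 in [-2, 2] */
    int escaped = 0;
    for (int i = 0; i < THRESH; i++) {
      x = a - x * x;
      if (x < -2.0 || x > 2.0) { escaped = 1; break; }
    }
    if (!escaped)
      return x;  /* x is now near the attracting set for this a */
  }
}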

If a trajectory stays bounded after THRESH iterates, the next MAXITER iterates are tracked. The $x$-axis is divided into xRES bins, and the binit routine is called to find which bin the current point of the trajectory falls in. This repeats until xRES initial conditions have been iterated and binned. The bins are then normalized to a maximum value of one and output to the screen. The values in the bins are essentially the density of iterates around various points, and plotting them shows the bifurcation structure of the map.
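For uniformly spaced bins, the binning and normalization steps can be illustrated as below. This is only a sketch with made-up names; the provided binit locates the bin by a search rather than the direct index computation used here.

/* Count a bounded iterate x (in [-2, 2]) into one of xres uniform bins. */
void bin_point(double x, int *bins, int xres)
{
  int i = (int)((x + 2.0) / 4.0 * xres);
  if (i < 0) i = 0;
  if (i >= xres) i = xres - 1;
  bins[i]++;
}

/* Rescale the counts so the largest bin has the value one. */
void normalize_bins(const int *bins, double *out, int xres)
{
  int max = 1;
  for (int i = 0; i < xres; i++)
    if (bins[i] > max) max = bins[i];
  for (int i = 0; i < xres; i++)
    out[i] = bins[i] / (double)max;
}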

Assignment

+ +

The starting source for this assignment is packaged in logistic_cuda.tar.gz. First run the serial code and gnuplot script so you can see what you’re supposed to produce.

+ +
+gcc -o logistic logistic.c -lm
+./logistic > log.dat
+gnuplot -persist log.p
+
+ +

Then adapt the serial code to run on CUDA using the skeleton file log_skel.cu. Note the differences from the serial code. Functions called from a kernel are prefixed with __device__, and host functions cannot be called from device functions. The random number generator rand() is a host function, so I added my own random number generator for the kernel to use. Finally, the original binit sorting algorithm was recursive, but device functions do not support recursion, so it has been rewritten without recursion (the while loop takes the place of the recursive step).

+ +
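The generator in log_skel.cu may differ, but a per-thread generator for the kernel can be as small as a linear congruential generator. A sketch with illustrative names:

/* Minimal per-thread linear congruential generator (glibc-style constants).
 * Each thread keeps its own state, seeded e.g. from its global thread index. */
__device__ unsigned int lcg_next(unsigned int *state)
{
  *state = 1103515245u * (*state) + 12345u;
  return *state;
}

/* Pseudo-random float in [-2, 2], the interval used for initial conditions. */
__device__ float lcg_uniform(unsigned int *state)
{
  return 4.0f * (lcg_next(state) / 4294967295.0f) - 2.0f;
}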

Parallelize over $a$, so that each thread computes the future orbit for a single value of $a$. Thus the block and grid need only be one-dimensional (note that this allows a maximum of $2^9 \times 2^{16} = 2^{25} \sim 3 \times 10^7$ values of $a$, which should be sufficient). The kernel function should replace the entire main loop of the serial code. This includes iterating for a value of $a$, binning the trajectory, and normalizing the bins. The normalized bins should be returned to the main program for output. Finally, time the CUDA code.

+ +
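The thread-to-parameter mapping could look like the following. This is a sketch: aRES, the $a$ range, and the bin layout are placeholders rather than names from the skeleton, and the body of the iteration is elided.

/* One thread per value of a; block and grid are one-dimensional. */
__global__ void logistic_kernel(float *bins, int ares, float a_min, float a_max)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;  /* which value of a */
  if (j >= ares)
    return;
  float a = a_min + (a_max - a_min) * j / (float)(ares - 1);
  /* ... iterate the transients, bin the next MAXITER iterates for this a,
     and normalize this thread's bins before writing them out ... */
  (void)a;  /* body elided in this sketch */
}

/* Host-side launch (illustrative):
 *   int threads = 256;
 *   int blocks = (ares + threads - 1) / threads;
 *   logistic_kernel<<<blocks, threads>>>(d_bins, ares, a_min, a_max);
 */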

Note that you may keep the various #define statements +intact so that these parameters need not be explicitly passed to +functions.

+ +

Extra Credit

+ +

The above implementation iterates a random initial condition, followed by another if the first escapes the region. For the parameter range given, every initial condition either escapes to $\infty$ or tends to a unique stable bounded attractor (a fixed point, periodic orbit, or “chaotic” Cantor set). In principle, a map $x_{i+1} = f(x_i)$ could have more than one coexisting attracting set, so that different initial conditions tend to distinct bounded asymptotic behaviors, or (Lebesgue almost) every initial condition may escape to $\infty$.

+ +

Modify the CUDA program, using an extra dimension of blocks/threads, to assign initial conditions distributed throughout the interval $[-2, 2]$ amongst these threads. Have the various threads bin the bounded trajectories together. Solutions that escape the interval should not be binned.

+ +
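One way to lay out the extra dimension, again only as a sketch with illustrative names and the iteration body elided:

/* x indexes the value of a, y indexes which of nic evenly spaced initial
 * conditions in [-2, 2] this thread iterates.  Threads sharing the same a
 * also share its bins, so counts must be accumulated with atomicAdd to
 * avoid races. */
__global__ void logistic_multi_ic(unsigned int *bins, int ares, int nic,
                                  float a_min, float a_max)
{
  int j = blockIdx.x * blockDim.x + threadIdx.x;  /* which value of a */
  int k = blockIdx.y * blockDim.y + threadIdx.y;  /* which initial condition */
  if (j >= ares || k >= nic)
    return;
  float a = a_min + (a_max - a_min) * j / (float)(ares - 1);
  float x = -2.0f + 4.0f * k / (float)(nic - 1);
  /* ... iterate x; if it ever leaves [-2, 2], return without binning;
     otherwise count the post-transient iterates with atomicAdd(&bins[...], 1) ... */
  (void)a; (void)x;  /* body elided in this sketch */
}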

Test this code on the map $x_{i+1} = a - (a - x_i^2)^2$ and compare against the original code. Are the results different? Note that this example is the second iterate of the logistic map, so period-two orbits of the original map become distinct period-one orbits of the second-iterate map.

+ + diff --git a/assignments/archive/logistic_cuda/src/logistic_cuda/.make_tar b/assignments/archive/logistic_cuda/src/logistic_cuda/.make_tar new file mode 100644 index 0000000..80e52ce --- /dev/null +++ b/assignments/archive/logistic_cuda/src/logistic_cuda/.make_tar @@ -0,0 +1 @@ +./ diff --git a/assignments/archive/logistic_cuda/log.p b/assignments/archive/logistic_cuda/src/logistic_cuda/log.p similarity index 100% rename from assignments/archive/logistic_cuda/log.p rename to assignments/archive/logistic_cuda/src/logistic_cuda/log.p diff --git a/assignments/archive/logistic_cuda/log_skel.cu b/assignments/archive/logistic_cuda/src/logistic_cuda/log_skel.cu similarity index 100% rename from assignments/archive/logistic_cuda/log_skel.cu rename to assignments/archive/logistic_cuda/src/logistic_cuda/log_skel.cu diff --git a/assignments/archive/logistic_cuda/logistic.c b/assignments/archive/logistic_cuda/src/logistic_cuda/logistic.c similarity index 100% rename from assignments/archive/logistic_cuda/logistic.c rename to assignments/archive/logistic_cuda/src/logistic_cuda/logistic.c diff --git a/assignments/archive/logistic_cuda/matmult_skeleton.cu b/assignments/archive/logistic_cuda/src/matmult_skeleton.cu similarity index 100% rename from assignments/archive/logistic_cuda/matmult_skeleton.cu rename to assignments/archive/logistic_cuda/src/matmult_skeleton.cu