--- /dev/null
+<!--#set var="root_directory" value="../../.." --><!--#include virtual="$root_directory/shared/header.shtml"-->
+
+<h1>Assignment 9</h1>
+<p><em>Due Friday, December 11</em></p>
+
+<h2>Purpose</h2>
+
+<p>Learn the CUDA language.</p>
+<p>Note: Please identify all your work.</p>
+
+<!--TableOfContents:Begin-->
+<!--TableOfContents:End-->
+
+
+<h2 id="A">Part A — Matrix Multiplication</h2>
+
+<p>This assignment consists of multiplying two square matrices of
+identical size.</p>
+
+<p class="equation">\[
+ P = M \times N
+\]</p>
+
+<p>The matrices $M$ and $N$, of size <code>MatrixSize</code>, could be
+filled with random numbers.</p>
+
+<h2 id="A1">Step 1</h2>
+
+<p>Write a program to do the matrix multiplication assuming that the
+matrices $M$ and $N$ are small and fit in a CUDA block. Input the
+matrix size via a command line argument. Do the matrix multiplication
+on the GPU and on the CPU and compare the resulting matrices. Make
+sure that your code works for arbitrary block size (up to 512 threads)
+and (small) matrix size. Use one-dimensional arrays to store the
+matrices $M$, $N$ and $P$ for efficiency.</p>
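+
+<p>For orientation, a kernel for this single-block case might look as
+follows (a minimal sketch; the kernel and variable names are mine, not
+part of the hand-out). Each thread computes one element of $P$ from
+the row-major one-dimensional arrays:</p>
+
+<pre>
+__global__ void matMulSingleBlock(const float *M, const float *N,
+                                  float *P, int size)
+{
+    // One thread per element of P; the whole matrix lives in one block.
+    int row = threadIdx.y;
+    int col = threadIdx.x;
+    if (row &lt; size &amp;&amp; col &lt; size) {
+        float sum = 0.0f;
+        for (int k = 0; k &lt; size; ++k)
+            sum += M[row * size + k] * N[k * size + col];
+        P[row * size + col] = sum;
+    }
+}
+
+/* A possible launch, valid while size*size &lt;= 512:
+       dim3 threads(size, size);
+       matMulSingleBlock&lt;&lt;&lt;1, threads>>>(d_M, d_N, d_P, size);  */
+</pre>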
+
+<h2 id="A2">Step 2</h2>
+
+<p>Modify the previous program to multiply arbitrary size
+matrices. Make sure that your code works for arbitrary block size (up
+to 512 threads) and matrix size (up to memory limitation). Instrument
+your program with calls to <code>gettimeofday()</code> to time the
+matrix multiplication on the CPU and GPU. Plot these times as a
+function of matrix size (up to large matrices, 4096) and guess the
+matrix size dependence of the timing.</p>
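+
+<p>For the arbitrary-size case the row and column indices are built
+from both the block and thread indices, and the wall-clock time can be
+read before and after the multiplication with
+<code>gettimeofday()</code>. A sketch (the helper name
+<code>wallTime</code> is mine, not from the hand-out); since kernel
+launches are asynchronous, synchronize before the second reading:</p>
+
+<pre>
+#include &lt;stddef.h>
+#include &lt;sys/time.h>
+
+__global__ void matMul(const float *M, const float *N, float *P, int size)
+{
+    // 2D grid of 2D blocks: each thread still owns one element of P.
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    if (row &lt; size &amp;&amp; col &lt; size) {
+        float sum = 0.0f;
+        for (int k = 0; k &lt; size; ++k)
+            sum += M[row * size + k] * N[k * size + col];
+        P[row * size + col] = sum;
+    }
+}
+
+// Seconds as a double, built from gettimeofday().
+double wallTime(void)
+{
+    struct timeval tv;
+    gettimeofday(&amp;tv, NULL);
+    return tv.tv_sec + 1.0e-6 * tv.tv_usec;
+}
+
+/* Timing pattern:
+       double t0 = wallTime();
+       matMul&lt;&lt;&lt;grid, block>>>(d_M, d_N, d_P, size);
+       cudaDeviceSynchronize();
+       double t1 = wallTime();                         */
+</pre>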
+
+<h2 id="A3">Step 3</h2>
+
+<p>Optimize the previous code to take advantage of the very fast shared
+memory. To do this you must tile the matrices via a 2D CUDA grid of blocks
+(as above). All matrix elements of $P$ within a block will be computed
+at once. The scalar product of each row of $M$ and each column of $N$
+within the block can be calculated by scanning over the matrices in
+block-size tiles. The content of $M$ and $N$ within the tiles can then
+be transferred into shared memory for speed.</p>
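+
+<p>In outline, the tiled kernel might look as follows (an illustrative
+sketch only, not the provided skeleton; <code>TILE</code> and the
+variable names are mine). It is launched with <code>TILE</code> &times;
+<code>TILE</code> thread blocks covering the whole of $P$:</p>
+
+<pre>
+#define TILE 16   // tile width = block width; TILE*TILE threads per block
+
+__global__ void matMulTiled(const float *M, const float *N, float *P, int size)
+{
+    __shared__ float Ms[TILE][TILE];
+    __shared__ float Ns[TILE][TILE];
+
+    int row = blockIdx.y * TILE + threadIdx.y;
+    int col = blockIdx.x * TILE + threadIdx.x;
+    float sum = 0.0f;
+
+    // Scan across M's row of tiles and N's column of tiles.
+    for (int t = 0; t &lt; (size + TILE - 1) / TILE; ++t) {
+        int mCol = t * TILE + threadIdx.x;
+        int nRow = t * TILE + threadIdx.y;
+        Ms[threadIdx.y][threadIdx.x] =
+            (row &lt; size &amp;&amp; mCol &lt; size) ? M[row * size + mCol] : 0.0f;
+        Ns[threadIdx.y][threadIdx.x] =
+            (nRow &lt; size &amp;&amp; col &lt; size) ? N[nRow * size + col] : 0.0f;
+        __syncthreads();                  // whole tile loaded before use
+
+        for (int k = 0; k &lt; TILE; ++k)
+            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
+        __syncthreads();                  // done with the tile before reloading
+    }
+
+    if (row &lt; size &amp;&amp; col &lt; size)
+        P[row * size + col] = sum;
+}
+</pre>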
+
+<p>See the <a href="<!--#echo var="root_directory"
+-->/content/GPUs/#learn">Learning CUDA</a> section of the course notes
+and the skeleton code <a
+href="src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See also the
+in-class exercise on array reversal.</p>
+
+<h2 id="B">Part B — Logistic Map</h2>
+
+<h3 id="B-background">Background</h3>
+
+<p>This part of the assignment asks you to adapt to CUDA a serial code
+that generates a bifurcation diagram for the logistic map. The
+logistic map is a map of the real line to itself given by</p>
+
+<p class="equation">\[
+  x_{i+1} = a - x_i^2.
+\]</p>
+
+<p>This mapping is ubiquitous in many problems of practical interest
+and is arguably the simplest example of a (discrete) complex dynamical
+system (indeed, you’ll note its similarity to the equation generating
+the complex Mandelbrot set).</p>
+
+<p>The variable $a$ is a parameter that is held constant while $x$ is
+iterated from some initial condition $x_0$. We are interested in the
+long term or asymptotic behavior as $x_0$ is iterated for various
+values of $a$. A plot of the asymptotic values of $x$ versus $a$ is
+called a <em>bifurcation diagram</em>.</p>
+
+<p>The reason for this terminology is as follows. The asymptotic
+behavior often varies smoothly with $a$. For example, for some $a$,
+$x_0$ may tend to some fixed point $x^*$, with the value of $x^*$
+varying smoothly with $a$. However, for another $a$, $x_0$ could end up
+in a period two orbit, oscillating between two values $x_1^*$ and
+$x_2^*$. The values of these two points may also vary smoothly with
+$a$, but there is some transition value $\tilde{a}$ where we jump from
+the fixed point to the period two orbit. This non-smooth process is
+called a <em>bifurcation</em>. The bifurcation diagram then shows all
+of these bifurcations on a single plot since we scan over all values
+of $a$.</p>
+
+<p>The serial code loops over $a$ and iterates a random initial
+condition <code>THRESH</code> times. This is to let
+transients “die out” and approach the asymptotic behavior. If an
+iterate leaves the interval $[-2, 2]$ during this time, it will
+eventually escape to $\infty$, so the trajectory is thrown out and
+another random initial condition is tried. It is known that positive
+measure attracting sets exist for the $a$ values in the program, so
+this loop will eventually terminate.</p>
+
+<p>If a trajectory stays bounded after <code>THRESH</code> iterates
+the next <code>MAXITER</code> iterates are tracked. The $x$-axis is
+binned into <code>xRES</code> bins and the <code>binit</code> routine is
+called to find which bin the current point in the trajectory is
+in. This repeats until <code>xRES</code> initial conditions
+have been iterated and binned. The bins are then normalized to a
+maximum value of one and are then output to the screen. The values in
+the bins are essentially the density of iterates around various points
+and plotting them shows the bifurcation structure of the map.</p>
+
+<h3 id="B-assignment">Assignment</h3>
+
+<p>The starting source for this assignment is packaged in <a
+href="src/logistic_cuda/logistic_cuda.tar.gz">logistic_cuda.tar.gz</a>. First
+run the serial code and gnuplot script so you can see what it is
+you’re supposed to produce.</p>
+
+<pre>
+gcc -o logistic logistic.c -lm
+./logistic > log.dat
+gnuplot -persist log.p
+</pre>
+
+<p>Then adapt the serial code to run on CUDA using the skeleton file
+<code>log_skel.cu</code>. Note the differences from the serial
+code. Functions called from a kernel are prefixed
+with <code>__device__</code> and host functions cannot be called from
+device functions. The random number generator <code>rand()</code> is a
+host function, so I added my own random number generator for the
+kernel to use. Finally, the original <code>binit</code> sorting
+algorithm was recursive, but device functions do not support
+recursion, so it has been rewritten without recursion (the while loop
+functions as the recursion step).</p>
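+
+<p>For concreteness, a per-thread generator can be as simple as a
+linear congruential generator (an illustration only; the generator
+actually supplied in <code>log_skel.cu</code> may differ):</p>
+
+<pre>
+// 32-bit LCG (Numerical Recipes constants), callable from device code.
+__device__ float myRand(unsigned int *state)
+{
+    *state = 1664525u * (*state) + 1013904223u;
+    return (*state) / 4294967296.0f;      // uniform in [0, 1)
+}
+
+/* Inside the kernel each thread seeds its own state, e.g.
+       unsigned int state = 1234u + blockIdx.x * blockDim.x + threadIdx.x;
+       float x = 4.0f * myRand(&amp;state) - 2.0f;   // random start in [-2, 2]  */
+</pre>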
+
+<p>Parallelize over $a$, so that each thread computes the future orbit
+for a single value of $a$. Thus the block and grid need only be
+one-dimensional (note that this allows a maximum of $2^9 \times 2^{16}
+= 2^{25} \sim 3 \times 10^7$ values of $a$, which should be
+sufficient). The
+kernel function should replace the entire main loop of the serial
+code. This includes iterating for a value of $a$, binning the
+trajectory, and normalizing the bin. The normalized bins should be
+returned to the main program for output. Finally, time the CUDA
+code.</p>
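+
+<p>In other words, each thread maps its global index to one value of
+$a$ and fills one row of the bin array. A sketch of that mapping and of
+the launch (the names, the $a$-range arguments and the bin layout are
+illustrative, not prescribed):</p>
+
+<pre>
+__global__ void logisticKernel(float *bins, float aMin, float aMax, int nA)
+{
+    int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per a value
+    if (j >= nA) return;
+    float a = aMin + (aMax - aMin) * j / (float)(nA - 1);
+
+    // ... iterate THRESH transient steps, then MAXITER binned steps,
+    //     writing the normalized counts into bins[j * xRES + k] ...
+}
+
+/* Host side, e.g. 256 threads per block:
+       int blocks = (nA + 255) / 256;
+       logisticKernel&lt;&lt;&lt;blocks, 256>>>(d_bins, aMin, aMax, nA);  */
+</pre>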
+
+<p>Note that you may keep the various <code>#define</code> statements
+intact so that these parameters need not be explicitly passed to
+functions.</p>
+
+<h3 id="B-extra">Extra Credit</h3>
+
+<p>The above implementation iterates a random initial condition
+followed by another if the first escapes the region. For the parameter
+range given, every initial condition either escapes to $\infty$ or
+tends to a unique stable bounded attractor (a fixed point, periodic
+orbit, or “chaotic” Cantor set). In principle a map $x_{i+1} = f(x_i)$
+could have more than one coexisting attracting set, so that different
+initial conditions can tend to distinct bounded asymptotic behaviors,
+or (Lebesgue almost) every initial condition may escape to
+$\infty$.</p>
+
+<p>Modify the CUDA program using an extra dimension of block/threads
+to assign initial conditions distributed throughout the interval $[-2,
+2]$ amongst these threads. Have the various threads bin the bounded
+trajectories together. Solutions that escape the interval should not
+be binned.</p>
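+
+<p>One possible layout (an illustration, not the only choice) keeps the
+$x$ block/thread dimension for $a$ and uses the $y$ thread dimension
+for the initial condition; threads sharing the same $a$ then update the
+same bins, so the increments need <code>atomicAdd()</code>:</p>
+
+<pre>
+__global__ void logisticKernel2D(float *bins, float aMin, float aMax, int nA)
+{
+    int j = blockIdx.x * blockDim.x + threadIdx.x;   // which value of a
+    int m = threadIdx.y;                             // which initial condition
+    if (j >= nA) return;
+
+    float a = aMin + (aMax - aMin) * j / (float)(nA - 1);
+    float x = -2.0f + 4.0f * m / (float)(blockDim.y - 1);  // x0 spread over [-2, 2]
+
+    // ... iterate as before; if the orbit never leaves [-2, 2], bin it with
+    //     atomicAdd(&amp;bins[j * xRES + k], 1.0f); escaping orbits are skipped ...
+}
+</pre>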
+
+<p>Test this code on the map $x_{i+1} = a - (a - x_i^2)^2$ and compare
+against the original code. Are the results different? Note, this
+example is the second iterate of the logistic map, so period two
+orbits of the original become distinct period one orbits of the second
+iterate map.</p>
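+
+<p>The only change to the iteration itself is the map being applied; as
+a device helper this might read (illustrative name):</p>
+
+<pre>
+// Second-iterate map: apply the logistic map x -> a - x*x twice.
+__device__ float iterateSecond(float a, float x)
+{
+    float y = a - x * x;
+    return a - y * y;
+}
+</pre>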
+
+<!--#include virtual="$root_directory/shared/footer.shtml"-->