assignments/archive/logistic_cuda/index.shtml.itex2MML

   1 <!--#set var="root_directory" value="../../.." --><!--#include virtual="$root_directory/shared/header.shtml"-->
   2
   3 <h1>Assignment 9</h1>
   4 <p><em>Due Friday, December 11</em></p>
   5
   6 <h2>Purpose</h2>
   7
   8 <p>Learn the CUDA language.</p>
   9
  10 <p>Note: Please identify all your work.</p>
  11
  12 <!--TableOfContents:Begin-->
  13 <!--TableOfContents:End-->
  14
  15
  16 <h2 id="A">Part A — Matrix Multiplication</h2>
  17
  18 <p>This assignment consists in multiplying two square matrices of
  19 identical size.</p>
  20
  21 <p class="equation">\[
  22   P = M \times N
  23 \]</p>
  24
  25 <p>The matrices $M$ and $N$, of size <code>MatrixSize</code>, could be
  26 filled with random numbers.</p>
  27
  28 <h3 id="A1">Step 1</h3>
  29
  30 <p>Write a program to do the matrix multiplication assuming that the
  31 matrices $M$ and $N$ are small and fit in a CUDA block. Input the
  32 matrix size via a command line argument. Do the matrix multiplication
  33 on the GPU and on the CPU and compare the resulting matrices. Make
  34 sure that your code works for arbitrary block size (up to 512 threads)
  35 and (small) matrix size. Use one-dimensional arrays to store the
  36 matrices $M$, $N$ and $P$ for efficiency.</p>
  37
  38 <h3 id="A2">Step 2</h3>
  39
  40 <p>Modify the previous program to multiply arbitrary size
  41 matrices. Make sure that your code works for arbitrary block size (up
  42 to 512 threads) and matrix size (up to memory limitation). Instrument
  43 your program with calls to <code>gettimeofday()</code> to time the
  44 matrix multiplication on the CPU and GPU. Plot these times as a
  45 function of matrix size (up to large matrices, 4096) and guess the
  46 matrix size dependence of the timing.</p>
  47
  48 <h3 id="A2">Step 3</h3>
  49
  50 <p>Optimize the previous code to take advantage of the very fast share
  51 memory. To do this you must tile the matrix via 2D CUDA grid of blocks
  52 (as above). All matrix elements in a block within $P$ will be computed
  53 at once. The scalar product of each row of $M$ and each column of $N$
  54 within the block can be calculated by scanning over the matrices in
  55 block size tiles. The content of $M$ and $N$ within the tiles can then
  56 be transfered into the share memory for speed.</p>
  57
  58 <p>See the <a href="../../../content/GPUs/#learn">Learning CUDA</a>
  59 section of the course notes and the skeleton
  60 code <a href="../../../src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See
  61 also the in-class exercise on <a href="../../../content/GPUs/#example">array
  62 reversal</a>.</p>
  63
  64 <h2 id="B">Part B — Logistic Map</h2>
  65
  66 <h3 id="B-background">Background</h3>
  67
  68 <p>This part of the assignment asks you to adapt to CUDA a serial code
  69 that generates a bifurcation diagram for the logistic map. The
  70 logistic map is a map of real line to itself given by</p>
  71
  72 <p class="equation">\[
  73   x_{i+1} = a - x_i^2.
  74 \]</p>
  75
  76 <p>This mapping is ubiquitous in many problems of practical interest
  77 and is arguably the simplest example of a (discrete) complex dynamical
  78 system (indeed, you’ll note its similarity to the equation generating
  79 the complex Mandelbrot set).</p>
  80
  81 <p>The variable $a$ is a parameter that is held constant while $x$ is
  82 iterated from some initial condition $x_0$. We are interested in the
  83 long term or asymptotic behavior as $x_0$ is iterated for various
  84 values of $a$. A plot of the asymptotic values of $x$ verses $a$ is
  85 called a <em>bifurcation diagram</em>.</p>
  86
  87 <p>The reason for this terminology is as follows. The asymptotic
  88 behavior often varies smoothly with $a$. For for some $a$, $x_0$ may
  89 tend to some fixed point $x^{(1)}$ with the value of $x^{(1)}$ varying
  90 smoothly with $a$. For another $a$, $x_0$ could end up in a period two
  91 orbit, oscillating between two values $x_1^{(2)}$ and $x_2^{(2)}$. The
  92 values of these two points may also vary smoothly with $a$, but there
  93 is some transition value $\tilde{a}$ where we jump from the fixed
  94 point to the period two orbit. This non-smooth process is called a
  95 <em>bifurcation</em>. The bifurcation diagram then shows all of these
  96 bifurcations on a single plot since we scan over all values of
  97 $a$.</p>
  98
  99 <p>The serial code loops over a and iterates a random initial
 100 condition <code>THRESH</code> number of times. This is to let
 101 transients “die out” and approach the asymptotic behavior.  If an
 102 iterate leaves the interval $[-2, 2]$ during this time it will
 103 eventually escape to $\infty$, so the trajectory is thrown out and
 104 another random initial condition is tried. It is known that positive
 105 measure attracting sets exist for the $a$ values in the program so
 106 this loop will eventually terminate.</p>
 107
 108 <p>If a trajectory stays bounded after <code>THRESH</code> iterates
 109 the next <code>MAXITER</code> iterates are tracked. The $x$-axis is
 110 binned into <code>xRES</code> number of bins and the binit routine is
 111 called to find which bin the current point in the trajectory is
 112 in. This repeats until <code>xRES</code> number of initial conditions
 113 have been iterated and binned. The bins are then normalized to a
 114 maximum value of one and are then output to the screen. The values in
 115 the bins are essentially the density of iterates around various points
 116 and plotting them shows the bifurcation structure of the map.</p>
 117
 118 <h3 id="B-assigment">Assignment</h3>
 119
 120 <p>The starting source for this assigment is packaged in <a
 121 href="src/logistic_cuda/logistic_cuda.tar.gz">logistic_cuda.tar.gz</a>. First
 122 run the serial code and gnuplot script so you can see what it is
 123 you’re supposed to produce.</p>
 124
 125 <pre>
 126 gcc -o logistic logistic.c -lm
 127 ./logistic &gt; log.dat
 128 gnuplot -persist log.p
 129 </pre>
 130
 131 <p>Then adapt the serial code to run on CUDA using the skeleton file
 132 <code>log_skel.cu</code>. Note the differences from the serial
 133 code. Functions called from a kernel are prefixed
 134 with <code>__device__</code> and host functions cannot be called from
 135 device functions. The random number generator <code>rand()</code> is a
 136 host function, so I added my own random number generator for the
 137 kernel to use. Finally, the original <code>binit</code> sorting
 138 algorithm was recursive, but device functions do not support
 139 recursion, so it has been rewritten without recursion (the while loop
 140 functions as the recursion step).</p>
 141
 142 <p>Parallelize over $a$, so that each thread computes the future orbit
 143 for a single value of $a$. Thus the block and grid need only be one
 144 dimensional (note that this allows a maximum of $29\times 216 = 225
 145 \sim 3 \times 107$ values of $a$, which should be sufficient. The
 146 kernel function should replace the entire main loop of the serial
 147 code. This includes iterating for a value of $a$, binning the
 148 trajectory, and normalizing the bin. The normalized bins should be
 149 returned to the main program for output.  Finally, time the CUDA
 150 code.</p>
 151
 152 <p>Note that you may keep the various <code>#define</code> statements
 153 intact so that these parameters need not be explicitly passed to
 154 functions.</p>
 155
 156 <h3 id="B-extra">Extra Credit</h3>
 157
 158 <p>The above implementation iterates a random initial condition
 159 followed by another if the first escapes the region. For the parameter
 160 range given every initial condition either escapes to $\infty$ or
 161 tends to a unique stable bounded attractor (a fixed point, periodic
 162 orbit, or “chaotic” Cantor set). In principle a map $x_{i+1} = f(x_i)$
 163 could have more than one coexisting attracting set, so that different
 164 initial conditions can tend to distinct bounded asymptotic behaviors,
 165 or (Lebesgue almost) every initial condition may excape to
 166 $\infty$.</p>
 167
 168 <p>Modify the CUDA program using an extra dimension of block/threads
 169 to assign initial condtions distributed throughout the interval $[-2,
 170 2]$ amongst these threads. Have the various threads bin the bounded
 171 trajectories together. Solutions that escape the interval should not
 172 be binned.</p>
 173
 174 <p>Test this code on the map $x_{i+1} = a-(a-x_i^2)^2$ and compare
 175 against the original code. Are the results different? Note, this
 176 example is the second iterate of the logistic map, so period two
 177 orbits of the original become distinct period one orbits of the second
 178 iterate map.</p>
 179
 180 <!--#include virtual="$root_directory/shared/footer.shtml"-->