be transferred into the shared memory for speed.</p>
<p>See the <a href="../../../content/GPUs/#learn">Learning CUDA</a>
-section of the course notes and the skeleton code <a
-href="src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See also the
-in-class exercise on array reversal.</p>
+section of the course notes and the skeleton
+code <a href="../../../src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See
+also the in-class exercise on <a href="../../../content/GPUs/#example">array
+reversal</a>.</p>
<h2 id="B">Part B — Logistic Map</h2>
--- /dev/null
+<!--#set var="root_directory" value="../../.." --><!--#include virtual="$root_directory/shared/header.shtml"-->
+
+<h1>Assignment 7</h1>
+<p><em>Due Friday, December 3</em></p>
+
+<h2>Purpose</h2>
+
+<p>Learn the CUDA language.</p>
+
+<p>Note: Please identify all your work.</p>
+
+<!--TableOfContents:Begin-->
+<!--TableOfContents:End-->
+
+<p>This assignment consists of multiplying two square matrices of
+identical size.</p>
+
+<p class="equation">\[
+ P = M \times N
+\]</p>
+
+<p>The matrices $M$ and $N$, of size <code>MatrixSize</code>, can be
+filled with random numbers.</p>
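+
+<p>For instance, each matrix can be stored as a flat array and filled
+on the host with something like the following sketch (the helper name
+is illustrative):</p>
+
+<pre>
+#include &lt;stdlib.h&gt;
+
+/* Allocate a matrixSize x matrixSize matrix as a flat, row-major
+   array and fill it with uniform random numbers in [0, 1]. */
+float *randomMatrix(int matrixSize)
+{
+    int n = matrixSize * matrixSize;
+    float *A = (float *) malloc(n * sizeof(float));
+    for (int i = 0; i &lt; n; ++i)
+        A[i] = rand() / (float) RAND_MAX;
+    return A;
+}
+</pre>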
+
+<h2 id="1">Step 1</h2>
+
+<p>Write a program to do the matrix multiplication assuming that the
+matrices $M$ and $N$ are small and fit in a single CUDA block. Input
+the matrix size via a command-line argument. Do the matrix
+multiplication on the GPU and on the CPU and compare the resulting
+matrices. Make sure that your code works for arbitrary block size (up
+to 512 threads) and (small) matrix size.</p>
+
+<p>Use one-dimensional arrays to store the matrices $M$, $N$ and $P$
+for efficiency.</p>
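+
+<p>A sketch of what this kernel might look like, assuming row-major
+one-dimensional storage (the function names are illustrative, not
+taken from the skeleton code):</p>
+
+<pre>
+/* Each thread computes one element of P. The whole matrix fits in a
+   single block, so threadIdx alone indexes the matrices. */
+__global__ void matMult(const float *M, const float *N, float *P,
+                        int size)
+{
+    int row = threadIdx.y;
+    int col = threadIdx.x;
+    if (row &lt; size &amp;&amp; col &lt; size) {
+        float sum = 0.0f;
+        for (int k = 0; k &lt; size; ++k)
+            sum += M[row * size + k] * N[k * size + col];
+        P[row * size + col] = sum;
+    }
+}
+
+/* Host side: one block covers the whole (small) matrix, so
+   size * size must not exceed 512 threads. */
+void launch(const float *d_M, const float *d_N, float *d_P, int size)
+{
+    dim3 block(size, size);
+    matMult&lt;&lt;&lt;1, block&gt;&gt;&gt;(d_M, d_N, d_P, size);
+    cudaDeviceSynchronize();
+}
+</pre>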
+
+<h2 id="2">Step 2</h2>
+
+<p>Modify the previous program to multiply matrices of arbitrary
+size. Make sure that your code works for arbitrary block size (up to
+512 threads) and matrix size (up to memory limitations). Instrument
+your program with calls to <code>gettimeofday()</code> to time the
+matrix multiplication on the CPU and GPU. Plot these times as a
+function of matrix size (up to large matrices, 4096) and estimate the
+matrix-size dependence of the timing.</p>
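+
+<p>A sketch of how the general kernel and the timing might look (the
+function names and <code>blockSize</code> parameter are illustrative;
+note that kernel launches are asynchronous, so
+<code>cudaDeviceSynchronize()</code> is needed before stopping the
+clock):</p>
+
+<pre>
+#include &lt;sys/time.h&gt;
+
+/* Each thread computes one element of P from its global row/column. */
+__global__ void matMult(const float *M, const float *N, float *P,
+                        int size)
+{
+    int row = blockIdx.y * blockDim.y + threadIdx.y;
+    int col = blockIdx.x * blockDim.x + threadIdx.x;
+    if (row &lt; size &amp;&amp; col &lt; size) {
+        float sum = 0.0f;
+        for (int k = 0; k &lt; size; ++k)
+            sum += M[row * size + k] * N[k * size + col];
+        P[row * size + col] = sum;
+    }
+}
+
+/* Launch on a grid rounded up so every element is covered, and
+   return the elapsed wall-clock time in seconds. */
+double timedMatMult(const float *d_M, const float *d_N, float *d_P,
+                    int size, int blockSize)
+{
+    dim3 block(blockSize, blockSize);
+    dim3 grid((size + blockSize - 1) / blockSize,
+              (size + blockSize - 1) / blockSize);
+
+    struct timeval t0, t1;
+    gettimeofday(&amp;t0, NULL);
+    matMult&lt;&lt;&lt;grid, block&gt;&gt;&gt;(d_M, d_N, d_P, size);
+    cudaDeviceSynchronize();
+    gettimeofday(&amp;t1, NULL);
+    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1.0e-6;
+}
+</pre>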
+
+<h2 id="2">Step 3</h2>
+
+<p>Optimize the previous code to take advantage of the very fast
+shared memory. To do this you must tile the matrix via a 2D CUDA grid
+of blocks (as above). All matrix elements in a block within $P$ will
+be computed at once. The scalar product of each row of $M$ and each
+column of $N$ within the block can be calculated by scanning over the
+matrices in block-size tiles. The contents of $M$ and $N$ within the
+tiles can then be transferred into shared memory for speed.</p>
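+
+<p>A sketch of the tiled kernel (the kernel name and the
+<code>TILE</code> constant are illustrative; <code>TILE</code> must
+match the block dimensions):</p>
+
+<pre>
+#define TILE 16    /* must equal the block dimensions */
+
+__global__ void matMultTiled(const float *M, const float *N, float *P,
+                             int size)
+{
+    __shared__ float Ms[TILE][TILE];
+    __shared__ float Ns[TILE][TILE];
+
+    int row = blockIdx.y * TILE + threadIdx.y;
+    int col = blockIdx.x * TILE + threadIdx.x;
+    float sum = 0.0f;
+
+    /* Scan over M and N one block-size tile at a time. */
+    for (int t = 0; t &lt; (size + TILE - 1) / TILE; ++t) {
+        int mCol = t * TILE + threadIdx.x;
+        int nRow = t * TILE + threadIdx.y;
+
+        /* Each thread loads one element of each tile; pad with zeros
+           past the matrix edge. */
+        Ms[threadIdx.y][threadIdx.x] =
+            (row &lt; size &amp;&amp; mCol &lt; size) ? M[row * size + mCol] : 0.0f;
+        Ns[threadIdx.y][threadIdx.x] =
+            (nRow &lt; size &amp;&amp; col &lt; size) ? N[nRow * size + col] : 0.0f;
+        __syncthreads();    /* wait until the tile is fully loaded */
+
+        for (int k = 0; k &lt; TILE; ++k)
+            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
+        __syncthreads();    /* finish with the tile before reloading */
+    }
+
+    if (row &lt; size &amp;&amp; col &lt; size)
+        P[row * size + col] = sum;
+}
+</pre>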
+
+<p>Time this code and compare the results to the code
+in <a href="#2">Step 2</a>.</p>
+
+<p>See the <a href="../../../content/GPUs/#learn">Learning CUDA</a>
+section of the course notes and the skeleton
+code <a href="../../../src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See
+also the in-class exercise on <a href="../../../content/GPUs/#example">array
+reversal</a>.</p>
+
+<!--#include virtual="$root_directory/shared/footer.shtml"-->