<p>The matrices $M$ and $N$, of size <code>MatrixSize</code>, could be
filled with random numbers.</p>
-<h2 id="A1">Step 1</h2>
+<h3 id="A1">Step 1</h3>
<p>Write a program to do the matrix multiplication assuming that the
matrices $M$ and $N$ are small and fit in a CUDA block. Input the
(small) matrix size. Use one-dimensional arrays to store the
matrices $M$, $N$ and $P$ for efficiency.</p>
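<p>A minimal kernel along these lines might look as follows. This is a
sketch, not the required solution: the kernel and variable names are
illustrative, and host-side allocation, copies, and error checking are
omitted.</p>

```cuda
// Naive P = M * N for a small square matrix that fits in ONE block.
// The matrices are stored in row-major one-dimensional arrays, as the
// exercise asks; element (row, col) lives at index row * size + col.
__global__ void matmult_small(const float *M, const float *N,
                              float *P, int size)
{
    int row = threadIdx.y;   // one thread per output element
    int col = threadIdx.x;

    if (row < size && col < size) {
        float sum = 0.0f;
        for (int k = 0; k < size; ++k)
            sum += M[row * size + k] * N[k * size + col];
        P[row * size + col] = sum;
    }
}

// Launched with a single block covering the whole matrix, e.g.:
//   dim3 block(MatrixSize, MatrixSize);
//   matmult_small<<<1, block>>>(d_M, d_N, d_P, MatrixSize);
```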
-<h2 id="A2">Step 2</h2>
+<h3 id="A2">Step 2</h3>
<p>Modify the previous program to multiply matrices of arbitrary
size. Make sure that your code works for arbitrary block size. Time
your code as a function of matrix size (up to large matrices, 4096)
and guess the matrix size dependence of the timing.</p>
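<p>One possible generalization is sketched below: each thread computes
one element of $P$, the grid covers the whole matrix, and a bounds
check handles sizes that are not a multiple of the block size. The
names are again illustrative.</p>

```cuda
// P = M * N for an arbitrary square size; the 2D grid tiles the
// output matrix, so the kernel works for any block size BS.
__global__ void matmult(const float *M, const float *N,
                        float *P, int size)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if (row < size && col < size) {       // guard the ragged edge
        float sum = 0.0f;
        for (int k = 0; k < size; ++k)
            sum += M[row * size + k] * N[k * size + col];
        P[row * size + col] = sum;
    }
}

// Launch: round the matrix size up to a whole number of blocks.
//   dim3 block(BS, BS);
//   dim3 grid((size + BS - 1) / BS, (size + BS - 1) / BS);
//   matmult<<<grid, block>>>(d_M, d_N, d_P, size);
```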
-<h2 id="A2">Step 3</h2>
+<h3 id="A3">Step 3</h3>
<p>Optimize the previous code to take advantage of the very fast shared
memory. To do this you must tile the matrices via a 2D CUDA grid of
blocks of block-size tiles. The content of $M$ and $N$ within the tiles
can then be transferred into the shared memory for speed.</p>
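<p>The tiling idea can be sketched as follows, assuming (for
illustration) a fixed tile width equal to the block size. Each block
sweeps across the tiles of a row of $M$ and a column of $N$, staging
each tile in shared memory before using it; zero-padding and the
bounds check handle sizes that are not a multiple of the tile width.</p>

```cuda
#define TILE 16   // assumed tile/block width; a tunable parameter

__global__ void matmult_tiled(const float *M, const float *N,
                              float *P, int size)
{
    __shared__ float Ms[TILE][TILE];   // tile of M staged in shared memory
    __shared__ float Ns[TILE][TILE];   // tile of N staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    // Sweep the tiles of M (along a row) and of N (along a column).
    for (int t = 0; t < (size + TILE - 1) / TILE; ++t) {
        int mcol = t * TILE + threadIdx.x;
        int nrow = t * TILE + threadIdx.y;

        // Each thread loads one element per tile, zero-padding the edge.
        Ms[threadIdx.y][threadIdx.x] =
            (row < size && mcol < size) ? M[row * size + mcol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nrow < size && col < size) ? N[nrow * size + col] : 0.0f;
        __syncthreads();               // wait until the tiles are loaded

        for (int k = 0; k < TILE; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();               // done with this pair of tiles
    }

    if (row < size && col < size)
        P[row * size + col] = sum;
}
```

<p>Each element of $M$ and $N$ is now read from global memory only
once per tile sweep instead of once per output element, which is the
source of the speedup.</p>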
-<p>See the <a href="<!--#echo var="root_directory"
--->/content/GPUs/#learn">Learning CUDA</a> section of the course notes
-and the skeleton code <a
+<p>See the <a href="../../../content/GPUs/#learn">Learning CUDA</a>
+section of the course notes and the skeleton code <a
href="src/matmult_skeleton.cu">matmult_skeleton.cu</a>. See also the
in-class exercise on array reversal.</p>