From fe580010391a33ec152315ce61d22cd68f7c3624 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
The matrices $M$ and $N$, of size MatrixSize, could be filled with random numbers.
Write a program to do the matrix multiplication assuming that the matrices $M$ and $N$ are small and fit in a CUDA block. Input the matrix size. Make sure that your code works for arbitrary block size (up to 512 threads) and (small) matrix size. Use one-dimensional arrays to store the matrices $M$, $N$, and $P$ for efficiency.
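One possible shape for the single-block kernel, shown as a sketch rather than the required solution: it simplifies the exercise by assigning one thread per element of $P$ (so it assumes $n^2 \le 512$ rather than handling arbitrary block sizes), unfolding a 1D thread index into a (row, column) pair.

```cuda
// Sketch: one CUDA block computes the whole product P = M * N for a
// small n-by-n matrix stored in one-dimensional row-major arrays.
// Each thread computes one element of P; assumes n*n <= 512.
__global__ void matmul_block(const float *M, const float *N,
                             float *P, int n)
{
    int idx = threadIdx.x;          // 1D thread index within the block
    if (idx < n * n) {
        int row = idx / n;
        int col = idx % n;
        float sum = 0.0f;
        for (int k = 0; k < n; k++)
            sum += M[row * n + k] * N[k * n + col];
        P[row * n + col] = sum;
    }
}

// Launch with a single block of n*n threads:
//   matmul_block<<<1, n * n>>>(dM, dN, dP, n);
```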
Modify the previous program to multiply arbitrary-size matrices. Make sure that your code works for arbitrary block size (up to 512 threads) and matrix size. Time the matrix multiplication on the CPU and GPU. Plot these times as a function of matrix size (up to large matrices, 4096) and guess the matrix-size dependence of the timing.
Optimize the previous code to take advantage of the very fast shared memory. To do this you must tile the matrix via a 2D CUDA grid of blocks. The content of $P$ within each block can be calculated by scanning over the matrices in block-size tiles. The content of $M$ and $N$ within the tiles can then be transferred into shared memory for speed.
See the Learning CUDA section of the course notes (/content/GPUs/#learn) and the skeleton code matmult_skeleton.cu. See also the in-class exercise on array reversal.
-- 
2.26.2