From fe580010391a33ec152315ce61d22cd68f7c3624 Mon Sep 17 00:00:00 2001
From: "W. Trevor King"
Date: Tue, 14 Sep 2010 13:39:34 -0400
Subject: [PATCH] Fix html_toc.py issues in logistic_cuda/index.shtml.itex2MML.

---
 .../archive/logistic_cuda/index.shtml.itex2MML | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/assignments/archive/logistic_cuda/index.shtml.itex2MML b/assignments/archive/logistic_cuda/index.shtml.itex2MML
index 80445bf..705eeb7 100644
--- a/assignments/archive/logistic_cuda/index.shtml.itex2MML
+++ b/assignments/archive/logistic_cuda/index.shtml.itex2MML
@@ -25,7 +25,7 @@ identical size.
 The matrices $M$ and $N$, of size MatrixSize, could be filled with
 random numbers.
 
-<br><h2>Step 1</h2>
+<h2>Step 1</h2>
 
 Write a program to do the matrix multiplication assuming that the
 matrices $M$ and $N$ are small and fit in a CUDA block. Input the
@@ -35,7 +35,7 @@ sure that your code works for arbitrary block size (up to 512 threads)
 and (small) matrix size. Use one-dimensional arrays to store the
 matrices $M$, $N$ and $P$ for efficiency.

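For concreteness, here is a minimal sketch of the kind of program Step 1 asks for. It is illustrative only: it is not part of this patch or of the course's matmult_skeleton.cu, the kernel name and the hard-coded width w are assumptions of the sketch, and one block computes all of $P$ for square $w \times w$ matrices in row-major 1D arrays.

/* Illustrative Step 1 sketch: a single CUDA block computes P = M N for a
 * small square matrix.  Names and the hard-coded size are assumptions. */
#include <stdio.h>
#include <stdlib.h>

__global__ void matmult_block(const float *M, const float *N, float *P, int w)
{
    int row = threadIdx.y;
    int col = threadIdx.x;
    float sum = 0.0f;
    for (int k = 0; k < w; k++)
        sum += M[row * w + k] * N[k * w + col];
    P[row * w + col] = sum;   /* one thread per element of P */
}

int main(void)
{
    const int w = 16;         /* 16 * 16 = 256 threads, under the 512 limit */
    const size_t bytes = w * w * sizeof(float);
    float *hM = (float *)malloc(bytes);
    float *hN = (float *)malloc(bytes);
    float *hP = (float *)malloc(bytes);
    for (int i = 0; i < w * w; i++) {   /* fill M and N with random numbers */
        hM[i] = rand() / (float)RAND_MAX;
        hN[i] = rand() / (float)RAND_MAX;
    }
    float *dM, *dN, *dP;
    cudaMalloc(&dM, bytes);
    cudaMalloc(&dN, bytes);
    cudaMalloc(&dP, bytes);
    cudaMemcpy(dM, hM, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN, bytes, cudaMemcpyHostToDevice);
    dim3 threads(w, w);       /* the whole matrix fits in a single block */
    matmult_block<<<1, threads>>>(dM, dN, dP, w);
    cudaMemcpy(hP, dP, bytes, cudaMemcpyDeviceToHost);
    printf("P[0][0] = %g\n", hP[0]);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    free(hM); free(hN); free(hP);
    return 0;
}

Compile with nvcc; the final cudaMemcpy back to the host implicitly waits for the kernel to finish.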
-<br><h2>Step 2</h2>
+<h2>Step 2</h2>
 
 Modify the previous program to multiply arbitrary size
 matrices. Make sure that your code works for arbitrary block size (up
@@ -45,7 +45,7 @@ matrix multiplication on the CPU and GPU. Plot these times as a
 function of matrix size (up to large matrices, 4096) and guess the
 matrix size dependence of the timing.

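Again a sketch rather than the assignment's answer: the kernel name, the 16 x 16 block shape, and taking the matrix size from argv are assumptions. Step 2's usual pattern is a grid of blocks plus a bounds guard, with CUDA events timing the GPU and clock() timing a CPU reference for the requested comparison.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Each thread computes one element of P; the guard lets the grid
 * overhang matrices whose size is not a multiple of the block width. */
__global__ void matmult(const float *M, const float *N, float *P, int size)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < size && col < size) {
        float sum = 0.0f;
        for (int k = 0; k < size; k++)
            sum += M[row * size + k] * N[k * size + col];
        P[row * size + col] = sum;
    }
}

/* Reference CPU product, for the CPU-vs-GPU timing plot. */
void matmult_cpu(const float *M, const float *N, float *P, int size)
{
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++) {
            float sum = 0.0f;
            for (int k = 0; k < size; k++)
                sum += M[i * size + k] * N[k * size + j];
            P[i * size + j] = sum;
        }
}

int main(int argc, char **argv)
{
    int size = (argc > 1) ? atoi(argv[1]) : 256;  /* matrix size as input */
    size_t bytes = (size_t)size * size * sizeof(float);
    float *hM = (float *)malloc(bytes);
    float *hN = (float *)malloc(bytes);
    float *hP = (float *)malloc(bytes);
    for (int i = 0; i < size * size; i++) {
        hM[i] = rand() / (float)RAND_MAX;
        hN[i] = rand() / (float)RAND_MAX;
    }
    float *dM, *dN, *dP;
    cudaMalloc(&dM, bytes);
    cudaMalloc(&dN, bytes);
    cudaMalloc(&dP, bytes);
    cudaMemcpy(dM, hM, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dN, hN, bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);        /* 256 threads per block */
    dim3 blocks((size + threads.x - 1) / threads.x,
                (size + threads.y - 1) / threads.y);

    cudaEvent_t start, stop;     /* GPU timing via CUDA events */
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    matmult<<<blocks, threads>>>(dM, dN, dP, size);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float gpu_ms;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    clock_t t0 = clock();        /* CPU timing */
    matmult_cpu(hM, hN, hP, size);
    double cpu_ms = 1e3 * (clock() - t0) / CLOCKS_PER_SEC;

    printf("%d\t%g\t%g\n", size, cpu_ms, gpu_ms);  /* size, CPU ms, GPU ms */
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    free(hM); free(hN); free(hP);
    return 0;
}

Running this over a range of sizes produces the size/CPU/GPU columns to plot; the cubic growth of the CPU curve and the much flatter GPU curve are the "matrix size dependence" the step asks you to guess.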
-<br><h2>Step 3</h2>
+<h2>Step 3</h2>
 
 Optimize the previous code to take advantage of the very fast shared
 memory. To do this you must tile the matrix via a 2D CUDA grid of blocks
@@ -55,9 +55,8 @@ within the block can be calculated by scanning over the matrices in
 block size tiles. The content of $M$ and $N$ within the tiles can
 then be transferred into the shared memory for speed.

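A sketch of the usual shared-memory tiling for Step 3, again illustrative rather than the course's matmult_skeleton.cu: each block computes one TILE x TILE patch of $P$, staging matching tiles of $M$ and $N$ in shared memory and zero-padding past the matrix edge. The host setup and launch from the Step 2 sketch carry over with a TILE x TILE block shape and the same rounded-up grid.

#define TILE 16

/* Tiled kernel: each block computes a TILE x TILE patch of P by marching
 * across M and N in TILE-wide steps, staging each tile in shared memory. */
__global__ void matmult_tiled(const float *M, const float *N, float *P, int size)
{
    __shared__ float Ms[TILE][TILE];
    __shared__ float Ns[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (size + TILE - 1) / TILE; t++) {
        int mcol = t * TILE + threadIdx.x;   /* column of M in this tile */
        int nrow = t * TILE + threadIdx.y;   /* row of N in this tile */
        Ms[threadIdx.y][threadIdx.x] =
            (row < size && mcol < size) ? M[row * size + mcol] : 0.0f;
        Ns[threadIdx.y][threadIdx.x] =
            (nrow < size && col < size) ? N[nrow * size + col] : 0.0f;
        __syncthreads();                     /* tile fully loaded */
        for (int k = 0; k < TILE; k++)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();                     /* done with this tile */
    }
    if (row < size && col < size)
        P[row * size + col] = sum;
}

The two __syncthreads() calls are what make the tiling safe: every element of a tile must be loaded before any thread reads it, and fully consumed before the next pass overwrites it.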
-<br>
-See the <a href="/content/GPUs/#learn">Learning CUDA</a> section of the course notes
-and the skeleton code
+See the <a href="/content/GPUs/#learn">Learning CUDA</a>
+section of the course notes and the skeleton code
 matmult_skeleton.cu. See also the in-class exercise on array
 reversal.
 
-- 
2.26.2