General Matrix Multiply

UCSD, California | Apr - Jun 2022

General Matrix Multiplication

Matrix multiplication is one of the most frequently performed operations in computing, underpinning applications from machine learning to graphics processing. The naive algorithm runs in O(n^3) time and makes poor use of both compute and memory resources. This project implements matrix multiplication modeled after NVIDIA's CUTLASS library and achieves 60.86% of the performance of cuBLAS.
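To make the O(n^3) cost concrete, here is the textbook triple-loop algorithm (a host-side sketch, not part of this project's GPU code). Every output element requires an n-term dot product, so the total work is n·n·n multiply-adds, and the inner loop streams a full column of B with no data reuse:

```cuda
// Naive O(n^3) matrix multiply over row-major n x n matrices.
// C = A * B. Each element of C costs n multiply-adds, so the
// whole product costs n^3 multiply-adds.
void naive_gemm(int n, const float* A, const float* B, float* C) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];  // column walk of B: cache-hostile
            C[i * n + j] = acc;
        }
    }
}
```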

Dissecting GEMM

My GEMM program is written in CUDA and runs on an NVIDIA Tesla K80 (Kepler architecture). It exploits the hardware advantages of the GPU: fast on-chip shared memory, massive thread-level parallelism (TLP), and a large number of cores.
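The shared-memory idea can be sketched with a classic tiled kernel (an illustrative sketch, not this project's exact code; the tile size of 16 and the assumption that n is a multiple of TILE are mine). Each thread block stages one tile of A and one tile of B in shared memory, so every global-memory element is reused TILE times instead of being re-fetched for each dot product:

```cuda
#define TILE 16

// Tiled GEMM sketch: C = A * B for row-major n x n matrices,
// launched with a (n/TILE, n/TILE) grid of (TILE, TILE) blocks.
__global__ void tiled_gemm(int n, const float* A, const float* B, float* C) {
    __shared__ float As[TILE][TILE];  // tile of A staged in fast shared memory
    __shared__ float Bs[TILE][TILE];  // tile of B staged in fast shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: each thread fetches one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();  // wait until both tiles are fully staged

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // don't overwrite the tiles while still in use
    }
    C[row * n + col] = acc;
}
```

CUTLASS-style GEMMs extend this same idea into a hierarchy of tiles (thread block, warp, and thread level) rather than the single level shown here.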
On top of thread-level parallelism, the program also uses instruction-level parallelism (ILP) and vectorized memory accesses to boost throughput. This yielded a performance of 621.7 GFLOPS.
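A minimal sketch of how ILP and vectorization combine (again illustrative, not the project's actual kernel; the 4-columns-per-thread layout is an assumption): each thread keeps four independent accumulators and loads B through `float4`, so the hardware can issue one 128-bit load and four non-dependent FMA chains per iteration. Assumes row-major storage and n a multiple of 4 so the `float4` accesses stay 16-byte aligned:

```cuda
// ILP + vectorization sketch: each thread computes 4 adjacent
// columns of one row of C, accumulating into a float4.
__global__ void gemm_ilp4(int n, const float* A, const float* B, float* C) {
    int row  = blockIdx.y * blockDim.y + threadIdx.y;
    int col4 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;  // 4 columns per thread

    float4 acc = make_float4(0.f, 0.f, 0.f, 0.f);  // 4 independent FMA chains
    for (int k = 0; k < n; ++k) {
        float  a = A[row * n + k];
        float4 b = *reinterpret_cast<const float4*>(&B[k * n + col4]);  // 128-bit load
        acc.x += a * b.x;  // these four updates have no dependence on
        acc.y += a * b.y;  // each other, so the scheduler can overlap
        acc.z += a * b.z;  // them: instruction-level parallelism
        acc.w += a * b.w;
    }
    *reinterpret_cast<float4*>(&C[row * n + col4]) = acc;  // 128-bit store
}
```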

Check it out