such approaches can only achieve practical speedup for relatively large, “squarish” matrices due to the extra memory overhead, and their use is limited by the considerable workspace they require. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms.
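For orientation, the classical one-level Strassen scheme that such algorithms build on can be sketched on the CPU as follows. This is a minimal illustration in plain Python, not the GPU primitives or kernel described in this paper. The helper names (`matmul`, `add`, `split`, `strassen_one_level`) are hypothetical, and the sketch assumes that all matrix dimensions are even. It forms the seven block products explicitly and stores them as temporaries, which is exactly the kind of workspace that conventional implementations incur.

```python
def matmul(A, B):
    """Reference triple-loop product of two dense matrices (lists of rows)."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def add(X, Y, sign=1):
    """Elementwise X + sign * Y for two equally sized matrices."""
    return [[x + sign * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M):
    """Split M into four quadrants (assumes even row and column counts)."""
    h, w = len(M) // 2, len(M[0]) // 2
    return ([r[:w] for r in M[:h]], [r[w:] for r in M[:h]],
            [r[:w] for r in M[h:]], [r[w:] for r in M[h:]])


def strassen_one_level(A, B):
    """One level of classical Strassen: 7 block products instead of 8."""
    A00, A01, A10, A11 = split(A)
    B00, B01, B10, B11 = split(B)

    # Seven products of sums/differences of blocks (stored as temporaries).
    M0 = matmul(add(A00, A11), add(B00, B11))
    M1 = matmul(add(A10, A11), B00)
    M2 = matmul(A00, add(B01, B11, -1))
    M3 = matmul(A11, add(B10, B00, -1))
    M4 = matmul(add(A00, A01), B11)
    M5 = matmul(add(A10, A00, -1), add(B00, B01))
    M6 = matmul(add(A01, A11, -1), add(B10, B11))

    # Recombine the products into the four quadrants of C.
    C00 = add(add(add(M0, M3), M4, -1), M6)
    C01 = add(M2, M4)
    C10 = add(M1, M3)
    C11 = add(add(add(M0, M1, -1), M2), M5)

    # Stitch the quadrants back into a single matrix.
    top = [r0 + r1 for r0, r1 in zip(C00, C01)]
    bottom = [r0 + r1 for r0, r1 in zip(C10, C11)]
    return top + bottom
```

With integer inputs the result can be compared exactly against `matmul(A, B)`. Applying the same function in place of `matmul` for the seven block products would give a multi-level variant, again under the even-dimension assumption at every level.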
We have presented a practical implementation of Strassen’s algorithm on GPUs that outperforms the state-of-the-art implementation on small problem sizes and requires no additional memory compared to gemm. By developing a specialized kernel, we exploit the memory and thread hierarchies of GPUs.