# How to Implement Fast Matrix Multiplication

This is a short post that explains how to write a high-performance matrix
multiplication program on modern processors. In this tutorial I will use a
single core of a Skylake-client CPU with AVX2, but the principles in this post
also apply to other processors with different instruction sets (such as AVX512).

## Intro

Matrix multiplication is a mathematical operation that defines the product of
two matrices. It’s defined as:

```
C(m, n) = A(m, k) * B(k, n)
```

It is implemented as a dot product between a row of matrix A and a column of
matrix B. In other words, each element of the result is a sum over the
element-wise products of one row from A and one column from B. And this is a
naïve implementation in C:

```c++
// A naïve implementation. A, B and C are accessors into row-major matrices,
// and C is assumed to start out zero-initialized.
for (int i = 0; i < m; i++) {
  for (int j = 0; j < n; j++) {
    for (int p = 0; p < k; p++) {
      C(i, j) += A(i, p) * B(p, j);
    }
  }
}
```

The picture below visualizes the computation of a single element in the result
matrix C. Each element in the result matrix C is the sum of the element-wise
products of a row from A and a column from B.

 |
|
|
|
|
|
## Performance analysis

The naïve implementation is pretty inefficient. In the next paragraph we’ll
estimate the capabilities of the machine in terms of flops/sec (floating point
operations per second). When you read the C code above, try to guess how
efficient the naïve algorithm is. Can we make this program twice as fast? Maybe
more?

My laptop has an Intel Skylake processor with AVX2. This processor has two
execution ports that can each run an 8-wide FMA (fused multiply-add)
instruction every cycle. The machine runs at a frequency of about 4GHz (let’s
ignore Turbo in this discussion). To calculate the peak throughput of the
machine we need to multiply these numbers together: 8 floating point numbers,
times two ports, times two (because an FMA performs a multiply and an add
together), times 4 gigahertz equals 128 Giga-flops per second (Giga is 1
followed by 9 zeros).

Now, let’s estimate how much work we need to perform. Let’s assume that we are
trying to multiply two matrices of size `1024 x 1024`. The algorithm computes a
dot product of length 1024 for each of the `1024 x 1024` output elements, which
is about one billion multiply-add operations, or roughly two billion floating
point operations. At 128 Giga-flops per second, a fully utilized machine would
finish the job in about 17 milliseconds. The naïve implementation is nowhere
near that: it reaches only about 7% utilization. In other words, the processor
multipliers are idle 93% of the time.

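To make the estimate concrete, here is the same arithmetic written out as a
small, purely illustrative C++ snippet (the constant names are invented for
this post):

```c++
// Peak throughput of the machine and the ideal running time of a
// 1024 x 1024 x 1024 matrix multiplication, using the numbers above.
constexpr double kFrequency = 4e9;                    // ~4GHz clock
constexpr double kPeakFlops = 2    /* FMA ports  */
                            * 8    /* SIMD lanes */
                            * 2    /* mul + add  */
                            * kFrequency;             // = 128e9 flops/sec
constexpr double kWork = 2.0 * 1024 * 1024 * 1024;    // ~2.15e9 flops
constexpr double kIdealSeconds = kWork / kPeakFlops;  // ~0.017 sec, i.e. ~17ms
```
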
Our program is memory bound, which means that the multipliers are not active
most of the time because they are waiting for memory. One way to verify this
claim is to use performance counters. The two relevant counters are the number
of executed AVX instructions and the total number of cycles. When the code is
efficient, the number of AVX arithmetic operations is close to the total number
of cycles. If there’s a big gap between the two numbers, then we need to
investigate the cause. There are performance counters that record different
kinds of stalls, including memory stalls. This is a picture from Xcode’s
Instruments profiler, which can record CPU performance counters.

 |
|
|
|
|
|
The CPU multipliers are not fully utilized because the naïve implementation is
memory bound. The processor is capable of loading wide SIMD vectors. On my
machine, using AVX, it’s 8 floating point numbers at once. In the case of matrix
A, the processor loads a cache line from main memory into the L1 cache, and then
reads the scalars from the cache line one at a time. This is not ideal, but the
situation of matrix B is much, much worse. In matrix B, we scan the matrix down a
column and use a single scalar from every cache line that we load. By the time
we finish scanning the column and start the next column, the processor has
already flushed the cache. Here I assume that the length of the column times
the size of the cache line is greater than our L1 cache. On my machine the L1
cache is 32K and the size of a cache line is 64 bytes, which means that a naïve
column scan of a matrix with 512 rows (512 x 64 bytes = 32K) would invalidate
the L1 cache.

 |
|
|
|
|
|
@@ -97,34 +99,37 @@ memory utilization of matrix B. |
|
|
|
|
|
One possible implementation, which is kind of intuitive when you look at the
first diagram, is to use multiple scalar values from each cache line in matrix B
at once. We need to multiply the value that we load from matrix A with 8
consecutive values from matrix B. We can load a scalar from matrix A and
broadcast it 8 times to fill the SIMD register. Then, we can load 8 consecutive
values from a row of matrix B (one value from each of 8 adjacent columns) and
perform the vectorized element-wise computation. This means that we process and
calculate 8 results in matrix C at once. The picture below depicts the
vectorized matrix multiplication. Notice that instead of accumulating into a
single scalar we’ll be accumulating into a short vector of 8 scalars, so the
vectorized method calculates 8 values in matrix C together.

 |
|
|
|
|
|
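Here is a minimal sketch of this vectorized inner loop, written with Clang’s
extended vector syntax. The `float8` typedef, the raw pointer casts, and the
`A`/`B`/`C` accessor macros are illustrative assumptions; the real kernel shown
later in the post goes through small helper functions instead:

```c++
// A vector of 8 floats, using Clang's extended vector syntax.
typedef float float8 __attribute__((ext_vector_type(8)));

// Compute eight adjacent results C(i, j..j+7) in a single pass over p.
// For simplicity, assume that the rows of B and C are 32-byte aligned so that
// the raw casts below are safe.
float8 csum = {0.0};
for (int p = 0; p < k; p++) {
  float8 bb = *(const float8 *)&B(p, j); // 8 consecutive scalars from row p of B
  csum += A(i, p) * bb;                  // the scalar from A is broadcast to all lanes
}
*(float8 *)&C(i, j) += csum;             // write 8 results into C at once
```
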
This transformation reduces the memory traffic of matrix B by a factor of 8
(the SIMD width). Instead of performing `N*N` reads (that most likely miss the
cache) we perform `N*N/8` reads.

Now that we’ve reduced the memory traffic of matrix B, let’s look at our CPU
architecture again. Our machine has two FMA ports and two memory (load) ports.
This means that we can load two elements from memory each cycle and perform two
arithmetic operations each cycle. However, in our algorithm each FMA consumes
one value loaded from matrix A and one value loaded from matrix B, so we have
two memory operations for every arithmetic operation. Half of the time our
multipliers are waiting for data to be read from memory. In other words, our
current implementation can achieve at most 50% utilization.

## Register Blocking

To get from 50% utilization to 100% utilization we need to perform at most one
memory operation for every arithmetic operation. The technique that people use
is called “register blocking”. It sounds scary, but it’s actually very simple.
After loading a value from memory, we need to use it more than once when
performing arithmetic calculations. Every value that we load from matrix A
should be multiplied with several values from matrix B, and every value that we
load from matrix B should be multiplied with several values from matrix A. Take
a look at the diagram below: instead of calculating a single element in matrix
C, we calculate a small patch. We accumulate into a number of registers at once.

 |
|
|
|
|
|
The templated code below implements the innermost loops that calculate a patch
of size `regA x regB` in matrix C. The code loads `regA` scalars from matrix A
and `regB` SIMD-width vectors from matrix B. The program uses Clang’s extended
vector syntax; a sketch of the `float8` type and its helper functions appears
right after the kernel.

```c++
/// Compute a RAxRB block of C using a vectorized dot product, where RA is the
/// number of registers to load from matrix A, and RB is the number of registers
/// to load from matrix B.
template <unsigned regsA, unsigned regsB>
void matmul_dot_inner(int k, const float *a, int lda, const float *b, int ldb,
                      float *c, int ldc) {
  float8 csum[regsA][regsB] = {{0.0}};
  for (int p = 0; p < k; p++) {

    // Perform the DOT product.
    for (int bi = 0; bi < regsB; bi++) {
      float8 bb = LoadFloat8(&B(p, bi * 8));
      for (int ai = 0; ai < regsA; ai++) {
        float8 aa = BroadcastFloat8(A(ai, p));
        csum[ai][bi] += aa * bb;
      }
    }
  }

  // Accumulate the csum values into matrix C. (AddFloat8 is assumed to be a
  // small helper, like LoadFloat8, that adds a float8 to 8 consecutive floats
  // in memory; see the sketch below.)
  for (int ai = 0; ai < regsA; ai++) {
    for (int bi = 0; bi < regsB; bi++) {
      AddFloat8(&C(ai, bi * 8), csum[ai][bi]);
    }
  }
}
```

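The kernel relies on a `float8` type and a few small helpers that the post does
not spell out. Here is one plausible way to define them with Clang’s extended
vector syntax; the macros and helper functions below (including `AddFloat8`) are
assumptions based on how they are used, and Glow’s actual definitions differ in
the details:

```c++
#include <cstring>

// A vector of 8 floats, using Clang's extended vector syntax.
typedef float float8 __attribute__((ext_vector_type(8)));

// Accessors for row-major matrices with leading dimensions lda/ldb/ldc.
#define A(i, j) a[(i) * lda + (j)]
#define B(i, j) b[(i) * ldb + (j)]
#define C(i, j) c[(i) * ldc + (j)]

// Unaligned load of 8 consecutive floats.
inline float8 LoadFloat8(const float *ptr) {
  float8 v;
  std::memcpy(&v, ptr, sizeof(v));
  return v;
}

// Broadcast a single scalar into all 8 lanes.
inline float8 BroadcastFloat8(float x) {
  float8 v = {x, x, x, x, x, x, x, x};
  return v;
}

// Add a float8 to 8 consecutive floats in memory (used to write csum into C).
inline void AddFloat8(float *ptr, float8 v) {
  float8 cur = LoadFloat8(ptr);
  cur += v;
  std::memcpy(ptr, &cur, sizeof(cur));
}
```
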
When the C++ code above is compiled with `RegA = 3` and `RegB = 4`, all of the
arithmetic operations (vfmadd231ps) in the generated assembly operate directly
on registers. This is a good thing, because the values that are loaded from
matrix A and matrix B are reused by multiple arithmetic operations.

So, how many registers should we load from matrix A and how many from matrix B?
One thing that we need to notice is that the number of registers is limited.
AVX2 can only encode 16 registers, which means that the size of the patch in
matrix C that we can calculate can be at most `16 * SIMD-width`.

Now, we need to decide how many registers to load from matrix A and how many
registers to load from matrix B, under the constraint that the product of the
two has to be less than 16.

The configuration that we pick needs to minimize the ratio between the number of
memory operations and the number of arithmetic operations. The table below
lists a few possible register blocking values and their properties. The memory
operations column is calculated as `RegA + RegB`, because we need to load all of
the registers from memory. The arithmetic operations column is calculated as
`RegA * RegB`, because we multiply each value from A with each value from B.
The total number of registers used is calculated as the number of accumulation
registers (`RegA * RegB`) plus the number of registers that we broadcast values
from matrix A into (`RegA`).

Regs A | Regs B | Registers Used | Memory Ops | Arithmetic Ops
------ | ------ | -------------- | ---------- | --------------
1      | 4      | 5              | 5          | 4
2      | 4      | 10             | 6          | 8
2      | 5      | 12             | 7          | 10
4      | 3      | 16             | 7          | 12
1      | 8      | 9              | 9          | 8
3      | 4      | 15             | 7          | 12

As you can see in the table, when we select `RegA = 3` and `RegB = 4` we get the
best compute-to-memory ratio while still using fewer than 16 registers.

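With this choice, the inner kernel from earlier would be instantiated roughly
like this (a hypothetical call site; the real driver loop lives in Glow):

```c++
// Compute a 3-row by 32-column (4 vectors of 8 floats) patch of C at C(i, j).
matmul_dot_inner<3, 4>(k, &A(i, 0), lda, &B(0, j), ldb, &C(i, j), ldc);
```
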
On Skylake the FMA execution ports are pipelined and have a latency of 5 cycles.
This means that to keep two FMA ports busy we need 10 independent chains of
computation. If we were to execute the code `C += A * B` in a loop then we would
need to wait 5 cycles between iterations, because each update of C depends on
the result of the previous one, so we would complete at most one FMA every 5
cycles (20% of the throughput of a single port).

When we block our registers with the parameters `3x4` we ensure that we have 12
independent chains of calculation, which keeps our FMA ports busy. If we had
decided to use register blocking of `2x4` then we would only have 8 independent
chains of calculation, which would result in only 80% utilization, because only
8 values would be processed at once instead of the 10 needed to keep both ports
busy.

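The difference is easy to see in code. The schematic below reuses the `float8`
helpers from above; the only point here is the number of independent
accumulators, not the exact loop structure:

```c++
// One accumulator: every FMA must wait ~5 cycles for the previous result.
float8 csum = {0.0};
for (int p = 0; p < k; p++)
  csum += BroadcastFloat8(A(0, p)) * LoadFloat8(&B(p, 0));

// Twelve accumulators (the 3x4 blocking): the FMAs are independent of each
// other, so both FMA ports can start a new operation every cycle while the
// earlier ones are still in flight.
float8 acc[3][4] = {{{0.0}}};
for (int p = 0; p < k; p++)
  for (int bi = 0; bi < 4; bi++)
    for (int ai = 0; ai < 3; ai++)
      acc[ai][bi] += BroadcastFloat8(A(ai, p)) * LoadFloat8(&B(p, bi * 8));
```
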
The high-performance convolution implementation in
[Glow](https://arxiv.org/abs/1805.00907) uses `2x5` register blocking because
these dimensions happen to work better for the shapes of the matrices used.

## Tiling

Register blocking significantly reduces the number of times we load elements
from matrix B. However, our memory access pattern is still inefficient. We
access matrix B many times and multiply the values with different parts of
matrix A. As a result, we load a part of matrix B into our cache, and then
invalidate the cache by loading a different part of matrix B.

We now know how to calculate a patch in the output matrix C. Notice that the
order in which we calculate patches in matrix C affects the order in which we
access memory in matrices A and B. An ideal memory access pattern would be one
where we load a chunk of matrix B into our L1 cache and keep it there for a long
time. One way to make this happen is to tile the calculation of matrix C. This
is pretty easy to do: just divide matrix C into small tiles (say, `128 x 256`)
and calculate the results in C one patch at a time.

 |
|
|
|
|
|
|
|
|
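A sketch of what such a tiled driver could look like, reusing the
register-blocked kernel (the traversal below is illustrative, assumes that the
matrix dimensions divide evenly into tiles, and ignores the leftover rows and
columns that a real implementation, like Glow’s, has to handle):

```c++
// Walk over matrix C in tiles of 128 x 256 and compute each tile with the
// 3x4 register-blocked kernel (3 rows by 4 * 8 = 32 columns per call).
constexpr int kTileRows = 128, kTileCols = 256;
for (int it = 0; it < m; it += kTileRows)
  for (int jt = 0; jt < n; jt += kTileCols)
    for (int j = jt; j + 4 * 8 <= jt + kTileCols; j += 4 * 8)
      for (int i = it; i + 3 <= it + kTileRows; i += 3)
        matmul_dot_inner<3, 4>(k, &A(i, 0), lda, &B(0, j), ldb, &C(i, j), ldc);
```
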
## Conclusion

The register-blocked matrix multiplication is implemented in the
[Glow compiler](https://github.com/pytorch/glow/blob/master/lib/Backends/CPU/libjit/libjit_matmul.cpp).
The code is very similar to the code in this post, except that it also handles
matrices with dimensions that are not a multiple of the SIMD width. On my
machine the code runs at around 100 Gflops, which is not far from the
theoretical peak and similar to Intel’s optimized implementation. To see what
else can be done, I recommend reading this
[paper](https://www.cs.utexas.edu/users/pingali/CS378/2008sp/papers/gotoPaper.pdf)
by Kazushige Goto.

The high-performance implementation of matrix multiplication is actually kind
of strange: load 3 scalars from the left-hand-side matrix and broadcast them
into full SIMD registers, then load 4 vector values from the right-hand-side
matrix, and multiply all of them into 12 accumulation registers. This odd
implementation is not something that the compiler vectorizer can do for you
automatically. However, this does not mean that you need to write the code in
assembly either. Clang will do an excellent job compiling C++ code that uses
vector types, such as float8, into high-performance code.
