## SIMD through AVX/AVX2: Getting Started / Tutorial / Introduction

``````#include <iostream>
#include <immintrin.h>

__m256 multiply_and_add(__m256 a, __m256 b, __m256 c) {
    __m256 multiplied = _mm256_mul_ps(a, b);
    return _mm256_add_ps(multiplied, c);
}

int main() {
    __m256 a = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
    __m256 b = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
    __m256 c = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
    __m256 d = multiply_and_add(a, b, c);

    for (int i = 0; i < 8; i++)
        std::cout << d[i] << std::endl;
}``````

### In this snippet

• The traditional loop version was converted into the SIMD snippet you're looking at right now.
• 88% of the computational time was eliminated in this example on my CPU.

### The thing we're doing

• `a` - Array of 8 32-bit floats
• `b` - Array of 8 32-bit floats
• `c` - Array of 8 32-bit floats

We write code that produces `d`, which is:

• `d` - `a` multiplied element-wise with `b`, then added to `c`

### Quick introduction

Normal computation works like this:

``````// create some empty `d` array
// iterate i from 0 to 8
d[i] = (a[i] * b[i]) + c[i];``````

It sounds fairly simple, and it works. But if you're doing this a few trillion times, you start to care about performance.
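The scalar version sketched above, written out concretely (the function name is mine, for illustration):

```cpp
#include <cstddef>

// Plain scalar version: one multiply and one add per element, per iteration.
void multiply_and_add_scalar(const float* a, const float* b,
                             const float* c, float* d, std::size_t n) {
    for (std::size_t i = 0; i < n; i++)
        d[i] = (a[i] * b[i]) + c[i];
}
```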

Enter SIMD (Single Instruction, Multiple Data): SIMD is basically an API you can use to interact with the CPU in a special way (as implemented by Intel/AMD/whoever makes your CPU) to do certain tasks faster.

SIMD works like this:

``````// create 256-bit register (think variable) called `d`
// Multiply `a` with `b` AT THE SAME TIME
// Add `result` to `c` AT THE SAME TIME (somehow)``````

It's basically passing entire arrays to the CPU for computation using one instruction. Without SIMD, we were iterating and asking the CPU to add things and multiply things ONE BY ONE.

In this example:

• SIMD: 2 instructions (one for the multiplication, one for the addition)
• Not SIMD: `a bunch` of instructions (a multiply and an add for each of the 8 elements, plus loop overhead)

tl;dr:

• SIMD is parallel data processing. Think of it like CUDA, but narrower and less powerful: a handful of elements per instruction on the CPU, rather than thousands of threads on a GPU.
• SIMD is the general concept/API, whereas AVX, SSE, etc. are [hardware] instruction-set extensions that implement it.

### Benchmarking old code

The benchmark is done with the timing method shown at the end of this post.

I benchmarked this same task without SIMD, using the traditional loop approach. Here is the time it took to perform that operation 1 million times on my CPU:

``>> Operation took: 55743µs``

### AVX/AVX2 naming conventions and API

AVX code can look scary. I'll make it simpler:

• Methods start with `_mm`
• Data types start with `__m`

Interesting.

Data type names include two bits of information: size, and type:

• `__m128`: 128 bits as 4 single-precision (32-bit) `float` elements
• `__m128d`: 128 bits as 2 double-precision (64-bit) `double` elements
• `__m128i`: 128 bits of integer elements (8-, 16-, 32-, or 64-bit; the element width depends on the intrinsic you use)
• `__m256`: 256 bits as 8 single-precision `float` elements
• `__m256d`: 256 bits as 4 double-precision `double` elements
• `__m256i`: 256 bits of integer elements (same deal as `__m128i`)

For methods, it's basically `_mm<size>_<operation>_<type>` (the size is omitted for 128-bit), so you get things like:

• `_mm256_set_ps` - Set 256 bits of packed single-precision (`ps`) data

Here is a reference you can crawl through.

### Compiler setup

I don't use `g++`, but for `clang` you just need `-march=skylake` or whatever (or `-march=native` to target your own CPU); `g++` accepts the same flags.

### Optimizing old code

I'll walk you through converting old code to the new code you can see in this snippet. You can copy the example from there, have a look at how it works, and get back here.

First, we want to re-define our data using AVX data types. Instead of a plain `float` array, we're going to use `__m256`, a data type that stores 8 `float`s. Remember, we also fill it using a special AVX function called `_mm256_set_ps`:

``````__m256 a = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 b = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 c = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 d;``````

We are no longer initializing `d` because our function will now return the computed value instead of changing `d` directly.

We'll modify our `multiply_and_add` function declaration to take and return AVX data types instead of traditional types as well (we'll also remove `d` and use `__m256` instead of `void` because again, we're going to return data instead):

``__m256 multiply_and_add(__m256 a, __m256 b, __m256 c) {``

Now, instead of looping over things, we're going to ask the CPU to process this in one instruction, using an AVX function again:

``````__m256 multiplied = _mm256_mul_ps(a, b);
return _mm256_add_ps(multiplied, c); // returns a __m256``````

Update our method call appropriately:

``d = multiply_and_add(a, b, c);``

Pretty straightforward, isn't it?

that's it :o

### Benchmarking new code

``>> Operation took: 6790µs``

Interesting. That's 88% less computational time.

P.S. benchmarking is done like this:

``````toggleBench();

for (int i = 0; i < 1e6; i++)
    d = multiply_and_add(a, b, c);

toggleBench();``````

`toggleBench` is, again, the timing method mentioned earlier.