SIMD through AVX/AVX2: Getting Started / Tutorial / Introduction

#include <iostream>
#include <immintrin.h>

__m256 multiply_and_add(__m256 a, __m256 b, __m256 c) {
	__m256 multiplied = _mm256_mul_ps(a, b);

	return _mm256_add_ps(multiplied, c);
}

int main() {
	__m256 a = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
	__m256 b = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
	__m256 c = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
	__m256 d;

	d = multiply_and_add(a, b, c);

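	// Copy the 8 results into a plain array; d[i] on a __m256 only works as a GCC/clang extension
	float out[8];
	_mm256_storeu_ps(out, d);
	for (int i = 0; i < 8; i++)
		std::cout << out[i] << std::endl;
}

In this snippet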

  • The loop-based version of this code, converted into the AVX snippet you're looking at right now.
  • About 88% of the computation time eliminated in this example on my CPU.

The thing we're doing

  • a - array of 8 32-bit floats
  • b - array of 8 32-bit floats
  • c - array of 8 32-bit floats

We write code that gives us d, which is:

  • d - a multiplied by b, element by element, then added to c

Quick introduction

Normal computation works like this:

float d[8]; // create some empty `d` array
for (int i = 0; i < 8; i++) // iterate over indices 0 to 7
	d[i] = (a[i] * b[i]) + c[i];

It sounds fairly simple, and it does work. But if you're doing this something like a few trillion times, you may care about performance.
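For reference, the loop above can be written out as a complete function. This is only a sketch of what a non-SIMD version might look like, assuming plain float arrays; the name multiply_and_add_scalar is made up here and is not taken from the code that was actually benchmarked.

```cpp
#include <cstddef>

// Hypothetical name: the plain-loop version, handling one element at a time.
void multiply_and_add_scalar(const float* a, const float* b, const float* c,
                             float* d, std::size_t n) {
	for (std::size_t i = 0; i < n; i++)
		d[i] = (a[i] * b[i]) + c[i]; // one multiply and one add per element
}
```

Called with three arrays holding 1 to 8, it fills d with 2, 6, 12, 20, 30, 42, 56, 72.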

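Enter SIMD (Single Instruction, Multiple Data): a kind of CPU instruction that applies one operation to several values at once. Intel and AMD CPUs provide it through instruction set extensions such as SSE and AVX, and C/C++ code reaches those instructions through the intrinsic functions in <immintrin.h>, which lets you do certain tasks faster.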

SIMD works like this:

// create 256-bit register (think variable) called `d`
// Multiply `a` with `b` AT THE SAME TIME
// Add `result` to `c` AT THE SAME TIME (somehow)

It's basically passing entire (small) arrays to the CPU for computation using one instruction. Without SIMD, we were iterating and asking the CPU to add things, multiply things, ONE BY ONE.

In this example:

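  • SIMD: 2 instructions (one for multiplication, one for addition)
  • Not SIMD: a bunch of instructions (8 multiplications and 8 additions, plus the loop overhead, if the compiler doesn't vectorize the loop by itself)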


  • SIMD is parallel data processing. Think of it like CUDA but a bit different and less powerful.
  • SIMD is the general concept, whereas AVX, SSE, etc. are the [hardware] instruction sets that implement it. The _mm... functions are the C/C++ API for those instructions.

Benchmarking old code

The code is benchmarked using this timing method.

I benchmarked this same task without SIMD, using the traditional loop approach. Here is the time it took to perform that operation 1 million times on my CPU core:

>> Operation took: 55743µs

AVX/AVX2 naming conventions and API

AVX code can look scary. I'll make it simpler:

  • Functions (intrinsics) start with _mm
  • Data types start with __m

Data type names include two bits of information: size and element type:

  • __m128: 128 bits, 4 single-precision (32-bit float) elements
  • __m128d: 128 bits, 2 double-precision (64-bit double) elements
  • __m128i: 128 bits of integer elements; the count depends on the width you treat them as (16 × 8-bit, 8 × 16-bit, 4 × 32-bit or 2 × 64-bit)
  • __m256: 256 bits, 8 single-precision (float) elements
  • __m256d: 256 bits, 4 double-precision (double) elements
  • __m256i: 256 bits of integer elements (32 × 8-bit, 16 × 16-bit, 8 × 32-bit or 4 × 64-bit)
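The sizes in this list can be checked at compile time. A small sketch, assuming an x86 compiler that ships <immintrin.h>; lanes_in is a helper name invented for this example, not part of the AVX API.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: how many elements of type T fit into vector type V.
template <typename V, typename T>
constexpr std::size_t lanes_in() { return sizeof(V) / sizeof(T); }

static_assert(sizeof(__m128) == 16, "__m128 is 128 bits");
static_assert(sizeof(__m256) == 32, "__m256 is 256 bits");
static_assert(lanes_in<__m128, float>() == 4, "4 floats per __m128");
static_assert(lanes_in<__m256, float>() == 8, "8 floats per __m256");
static_assert(lanes_in<__m256d, double>() == 4, "4 doubles per __m256d");
// Integer vectors carry no element type; the count depends on how you use them.
static_assert(lanes_in<__m256i, std::int16_t>() == 16, "16 x 16-bit integers");
static_assert(lanes_in<__m256i, std::int32_t>() == 8, "8 x 32-bit integers");
```

If any of these assumptions were wrong for your compiler, the file would simply fail to build.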

For functions, it's basically _mm<size>_<operation>_<type> (the 128-bit ones leave the size out and start with plain _mm_), so you get things like:

  • _mm256_set_ps: set 256-bit packed single-precision (ps) data

Here is a reference you can crawl through.
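To see the naming pattern at work, here is the same "add" operation at two sizes and two element types. This is a sketch with made-up wrapper names (add_4_floats, add_4_doubles). The target attribute is a GCC/clang extension that I'm assuming here only so the 256-bit function compiles even without an AVX compiler flag; you can drop it if you already pass one.

```cpp
#include <immintrin.h>

// _mm_add_ps: 128-bit (no size in the prefix), add, packed single = 4 floats
void add_4_floats(const float* a, const float* b, float* out) {
	_mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// _mm256_add_pd: 256-bit, add, packed double = 4 doubles
// (the attribute is an assumption for GCC/clang, see above)
__attribute__((target("avx")))
void add_4_doubles(const double* a, const double* b, double* out) {
	_mm256_storeu_pd(out, _mm256_add_pd(_mm256_loadu_pd(a), _mm256_loadu_pd(b)));
}
```

The loadu/storeu functions follow the same pattern: they copy elements between memory and a register, and the "u" means the pointer does not have to be aligned.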

Compiler setup

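I don't use g++, but for clang you just need to pass -march=skylake, or any other -march value for a CPU that has AVX. -mavx or -mavx2 enable just those instruction sets, and g++ accepts the same flags.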
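If you're not sure whether your flags took effect, the compiler can tell you. A minimal sketch: __AVX__ and __AVX2__ are macros the compiler predefines when the matching instruction set is enabled, while enabled_avx_level is just a name I picked for the example.

```cpp
#include <string>

// Reports which AVX level the compiler was allowed to use for this file.
std::string enabled_avx_level() {
#if defined(__AVX2__)
	return "AVX2";
#elif defined(__AVX__)
	return "AVX";
#else
	return "none";
#endif
}
```

Building with something like clang++ -O2 -mavx2 main.cpp should make it return "AVX2", and building without any such flag usually gives "none".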

Optimizing old code

I'll walk you through converting old code to the new code you can see in this snippet. You can copy the example from there, have a look at how it works, and get back here.

First, we want to redefine our data using AVX data types. Instead of float a[8], we're going to use __m256, a data type that stores 8 floats. We also fill it using a special AVX function called _mm256_set_ps (note that it takes the elements from the highest index down to the lowest):

__m256 a = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 b = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 c = _mm256_set_ps(1.0f,2.0f,3.0f,4.0f,5.0f,6.0f,7.0f,8.0f);
__m256 d;

We are no longer initializing d because our function will now return the computed value instead of changing d directly.
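One detail about _mm256_set_ps that is easy to trip over: its first argument ends up in the highest element, so the values land in the opposite order to a normal array initializer. The sketch below compares it with _mm256_setr_ps and _mm256_loadu_ps. The function name fill_three_ways is invented for the example, and the target attribute is a GCC/clang extension assumed here only so it compiles without an AVX flag.

```cpp
#include <immintrin.h>

// Each output pointer must have room for 8 floats.
__attribute__((target("avx")))
void fill_three_ways(float* from_set, float* from_setr, float* from_load) {
	const float values[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f};

	// _mm256_set_ps: the first argument goes to the HIGHEST element (index 7)
	_mm256_storeu_ps(from_set,
		_mm256_set_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f));

	// _mm256_setr_ps: "reversed" variant, the first argument goes to index 0
	_mm256_storeu_ps(from_setr,
		_mm256_setr_ps(1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f));

	// _mm256_loadu_ps: copies 8 floats from memory, keeping array order
	_mm256_storeu_ps(from_load, _mm256_loadu_ps(values));
}
```

After the call, from_set holds 8, 7, ..., 1, while from_setr and from_load both hold 1, 2, ..., 8. In this tutorial a, b and c are all filled the same way, so the math still lines up; it only matters once you care which element is which.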

We'll also modify our multiply_and_add function declaration to take and return AVX data types instead of traditional types (we drop the d parameter and return __m256 instead of void because, again, we're going to return the data):

__m256 multiply_and_add(__m256 a, __m256 b, __m256 c) {

Now, instead of looping over things, we're going to ask the CPU to process all 8 elements with one instruction per operation, again using AVX functions:

__m256 multiplied = _mm256_mul_ps(a, b);
return _mm256_add_ps(multiplied, c); // returns a __m256

Update our function call appropriately:

d = multiply_and_add(a, b, c);
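Real data usually lives in ordinary arrays that are longer than 8 elements, so here is a sketch of how the same two AVX calls could be applied to arrays of any length. Everything beyond _mm256_mul_ps and _mm256_add_ps is my assumption: the function name multiply_and_add_arrays, the unaligned load/store calls, and the target attribute (a GCC/clang extension, used only so it compiles without an AVX flag).

```cpp
#include <immintrin.h>
#include <cstddef>

__attribute__((target("avx")))
void multiply_and_add_arrays(const float* a, const float* b, const float* c,
                             float* d, std::size_t n) {
	std::size_t i = 0;

	// Main loop: 8 floats per iteration
	for (; i + 8 <= n; i += 8) {
		__m256 va = _mm256_loadu_ps(a + i); // load 8 floats from each input
		__m256 vb = _mm256_loadu_ps(b + i);
		__m256 vc = _mm256_loadu_ps(c + i);
		__m256 result = _mm256_add_ps(_mm256_mul_ps(va, vb), vc);
		_mm256_storeu_ps(d + i, result);    // write 8 results back
	}

	// Tail: fewer than 8 elements left, handle them one by one
	for (; i < n; i++)
		d[i] = (a[i] * b[i]) + c[i];
}
```

With n = 11, for example, the first 8 elements go through the AVX loop and the last 3 through the plain loop.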

Pretty straightforward, isn't it?

that's it :o

Benchmarking new code

>> Operation took: 6790µs

Interesting. That's 88% less computational time, or roughly 8 times faster.

P.S. benchmarking is done like this:


for (int i = 0; i < 1e6; i++)
    d = multiply_and_add(a, b, c);


toggleBench is, again, this timing method.
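The timing helper itself isn't shown on this page, so here is a stand-in sketch of how such a measurement could be done with std::chrono. It is not the toggleBench implementation; time_microseconds is a name I made up.

```cpp
#include <chrono>

// Hypothetical helper: runs `work` `iterations` times and returns the
// elapsed wall-clock time in microseconds.
template <typename Work>
long long time_microseconds(Work work, int iterations) {
	auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < iterations; i++)
		work();
	auto stop = std::chrono::steady_clock::now();
	return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

You would pass it a lambda that calls multiply_and_add and an iteration count of 1000000. Make sure the result is actually used afterwards (printing it is enough), otherwise an optimizing compiler may remove the work you're trying to measure.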