Speed difference of iterating forwards and backwards on a vector

Preamble

An interesting interview question came up. Is it quicker to iterate a vector forwards or backwards, or is it the same both way?

Spoiler

——————————————————-
Benchmark Time CPU Iterations
——————————————————-
vectorbackwards 237 ns 237 ns 2715315
vectorforwards  96 ns 96 ns 7125770

Investigation

Firstly wrote a simple program with;


for(auto it = a.begin(); it != a.end(); ++it)
{
sum += (*it);
}

for(auto it = a.rbegin(); it != a.rend(); ++it)
{
sum += (*it);
}

And picking out the interesting code from the godbolt compiler code with -O2


.L2:
add esi, DWORD PTR [rax]
add rax, 4
cmp rax, rdx
jne .L2
...
.L3:
add esi, DWORD PTR [rax]
sub rax, 4
cmp rdx, rax
jne .L3

It is obvious from this code the compile only does a ‘sub’ rather than an ‘add’ when going over the loop.
So job done? Iterating both forward and backward is almost identical code? So same speed? Well the compiler thinks so. But wait on! What about cache!

I decided to run the test using google benchmark.

2018-05-08 11:59:48
Running /Users/brombo/git/interviews/benchtest/Debug/benchtest
Run on (8 X 2500 MHz CPU s)
CPU Caches:
L1 Data 32K (x4)
L1 Instruction 32K (x4)
L2 Unified 262K (x4)
L3 Unified 6291K (x1)
——————————————————-
Benchmark Time CPU Iterations
——————————————————-
vectorbackwards 237 ns 237 ns 2715315
vectorforwards  96 ns 96 ns 7125770

Benchmark code:

#include <benchmark/benchmark.h>
#include
#include

using namespace std;

static int sum = 0;

static void vectorbackwards(benchmark::State& state)
{
vector a{1,2,3,4,5,6,7,8,9,10};

for (auto _ : state)
{
for(auto it = a.rbegin(); it != a.rend(); ++it)
{
sum += (*it);
}
}
}

BENCHMARK(vectorbackwards);

static void vectorforwards(benchmark::State& state)
{
vector a{1,2,3,4,5,6,7,8,9,10};

for (auto _ : state)
{
for(auto it = a.begin(); it != a.end(); ++it)
{
sum += (*it);
}
}
}

BENCHMARK(vectorforwards);

BENCHMARK_MAIN();

Conclusion

Reverse iterating a vector is 2.5x slower than iterating forward, albeit the compiler generated code is almost identical. The effect has to be cache/cpu related.

Caveat 

I ran my test on a 2014 MacBook Pro, and it would be good to run the test on a later architecture. It does strike me that this area of CPU/Cache may have improved, but needs testing.

One Reply to “Speed difference of iterating forwards and backwards on a vector”

  1. Keen observers may have noticed that the assembler pertains to GCC but as I’m doing this on a Mac I’m using clang. Using Godbolt Clang seems better.
    If calculated sum is predicable clang optimises out loop and vector, but adding rand() to the sum fixes that.
    In the case of -O2, the assembler looks similar to GCC.
    In the case of -O3 the assembler is loop unrolled.
    The conclusions however still stands.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.