How I have beaten Boost.Pool #3: Branches are bad

Yes, exaggerated title.

Branches and conditional jumps are essential for every program; you cannot write anything but the most trivial code without them. Yet they sometimes have a certain overhead and can lead to problems in performance-critical code paths.

Code is often faster if they aren't there at all. But how can you get rid of them?

In this series, I’ll explain my changes and share some lessons about optimization I’ve learned in the process of beating Boost.Pool. This time it’s all about branches and more detailed information about detail::small_free_memory_list.

What’s the problem with branches?

But first let me talk about the problems with branches.

Conditional jumps as used in if, for, etc. have one problem: they’re slow.

Ok, this is only partly true: the instruction itself is not inherently slower than other instructions, but its execution can be.

The problem… Eh, a really good thing is that CPUs execute instructions in a pipeline. This allows them to start working on the next instruction while the current one is still being processed. Pipelining works fine as long as you can predict what the next instruction is going to be.

But if you have a conditional jump the next instruction depends on the branch taken!

So in theory a CPU cannot do pipelining with branches: it has to wait until it knows which branch is being taken. That is not feasible, however; it is too slow.

In one of my favourite Stack Overflow answers a solution is described using a great analogy.

The answer also has almost 20k upvotes…

The analogy uses a train junction:

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

But this is slow, because trains need time to stop and accelerate again. Like the pipeline of a CPU.

So the CPU tries to predict which branch it will take. This technique is called Branch Prediction.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.

  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

Branch prediction works the same way. The CPU guesses which branch will be taken and starts executing its instructions. If it guesses right, there is no penalty. But if it guesses wrong, it must flush the pipeline and execute the other instructions instead.

That is slow.

Thankfully, CPUs’ branch predictors are good at these things. For example, if you have an error path, the CPU will learn that you usually do not enter it. So in the regular code path there is not much overhead from the branch.

But if there is an error and you have to enter the error handling path, the branch prediction will usually fail - after all, this is an abnormal case - and you get the slow pipeline flush. Luckily, this is not a problem because after all it’s an error path! It is not going to affect your performance.

And if it does, you have way too many errors anyway.

On the other hand, you have branches in the regular flow. They still have a normal and an abnormal case, but the abnormal case occurs more often.

Then branches can negatively affect your performance.
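To make this concrete, here is a condensed version of the example behind that Stack Overflow question - my own sketch, not code from the allocator. The branch in the loop is cheap when the data is sorted, because the predictor quickly learns its pattern, and much more expensive on random data.

#include <algorithm>
#include <cstdlib>
#include <vector>

long long sum_big_values(const std::vector<int>& data)
{
    long long sum = 0;
    for (auto x : data)
        if (x >= 128) // on random data this is taken ~50% of the time => many mispredictions
            sum += x;
    return sum;
}

int main()
{
    std::vector<int> data(1u << 20);
    for (auto& x : data)
        x = std::rand() % 256;
    // std::sort(data.begin(), data.end()); // uncommenting this makes the branch predictable
    return sum_big_values(data) != 0 ? 0 : 1;
}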

There is also another, more trivial, cost regarding branches. Consider the following code:

if (alignment > max_alignment())
    throw bad_alignment(...);

There is an if, so you have to pay the cost of the branch instruction. It should be small because the CPU will have detected that one of the cases is rarely executed, so branch prediction will do the right thing. But there is also the cost of evaluating the expression.

And this cost leads me directly to the first guideline.

Guideline I: Optionally disable precondition checks

After I had done all the optimizations, after my code had been inlined, after I had removed other branches - this post - and after I had optimized the algorithms - next post - my memory_pool was still slower.

Well, that is not entirely true. It was faster, then I changed the profiling code. After that it was slower.

memory_pool is a class. It has a certain interface specific for a memory_pool. For example, it has an allocate_node() function with the following signature:

void* allocate_node();

This function returns a node from the pool. You do not need to pass the size of the node because it’s a pool: the size is implicitly given!

But the interface of memory_pool is specific to pools. Other allocators need the size to give to allocate_node() because they have no implicit node size.

So in generic code you’d have a problem if you call the functions directly.

And allocators are meant to be used in generic code!

I’ve solved this problem through the allocator_traits. They can be specialized to adapt for specialized interfaces.

Generic code then calls the traits’ allocate_node(); there you need to pass a size (and alignment):

static void* allocate_node(allocator_type &state, std::size_t size, std::size_t alignment);

In the profiling code I then accessed the allocator through the traits.

Previously, I’d used a hand-written wrapper to easily switch between memory_pool and a free list (which does not have specialized traits).

This was the only change! The compiler did inline everything, didn’t it? If so, how can it lead to a significant performance change?

The title gave it away. Titles really remove all the suspense.

The answer is: precondition checks.

The general allocate_node() from the traits has custom size and alignment parameters. Obviously, a pool can only accept sizes less than or equal to its node size. Otherwise bad things will happen™.

So to prevent those, there are checks for size and alignment. Those checks are branches.

But the problem wasn’t the branching code itself. As I’ve said, branch prediction would have guessed right.

The problem was the alignment check. The maximum supported alignment of a pool is determined through the free list, which forwards to detail::alignment_for(), which computes a logarithm for small sizes. This is slow.

In fact, the logarithm isn’t really that bad; it uses CLZ and template magic.

So if you need full speed no matter what, consider an option to disable expensive precondition checks. They can slow you down.
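A minimal sketch of what such a compile-time opt-out could look like - the macro and function names here are placeholders, not the library’s actual configuration:

#include <cstddef>
#include <stdexcept>

std::size_t node_size();      // the pool's node size
std::size_t max_alignment();  // expensive: involves detail::alignment_for()

#if MY_POOL_CHECK_PRECONDITIONS
    #define MY_POOL_CHECK(Expr, Msg) \
        ((Expr) ? (void)0 : throw std::invalid_argument(Msg))
#else
    // compiles away completely: no branch and no evaluation of Expr
    #define MY_POOL_CHECK(Expr, Msg) (void)0
#endif

void* allocate_node(std::size_t size, std::size_t alignment)
{
    MY_POOL_CHECK(size <= node_size(), "node size too big");
    MY_POOL_CHECK(alignment <= max_alignment(), "alignment not supported");
    // ... actual pool allocation would follow here ...
    return nullptr;
}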

Of course, only disable them where really necessary, because safety first.

Like when you want to beat a popular library.

Guideline II: Mark unreachable code as unreachable

Talking about expressions that are unnecessarily evaluated, I have also written my own assert() macro. It looked like so:

#if FOONATHAN_MEMORY_DEBUG_ASSERT && !defined(FOONATHAN_MEMORY_ASSERT)
    #define FOONATHAN_MEMORY_ASSERT(Expr) \
        static_cast<void>((Expr) || (detail::handle_failed_assert("Assertion \"" #Expr "\" failed",__FILE__, __LINE__, __func__), true))
#else
    #define FOONATHAN_MEMORY_ASSERT(Expr) static_cast<void>(Expr)
#endif

Spotted the error?

In release mode, the assert casts the evaluation to void. This still evaluates the expression, however!

Removing that gave me an easy speed up.

The discussion about the difference between precondition checks and assertions is a topic for a future blog post.

It was a good thing I made the mistake though.

While I was there I was also forced to take a look at my “unreachable” macro.

#if FOONATHAN_MEMORY_DEBUG_ASSERT && !defined(FOONATHAN_MEMORY_ASSERT)
    #define FOONATHAN_MEMORY_UNREACHABLE(Msg) \
                    detail::handle_failed_assert("Unreachable code reached: " Msg, __FILE__,  __LINE__, __func__)
#else
    #define FOONATHAN_MEMORY_UNREACHABLE(Msg)
#endif

Here I did the exact opposite! In release mode it did nothing.

This is also bad. An unreachable code path is, well, unreachable. The compiler should generate code so that unreachable branches are eliminated. This can lead to fewer branches and shorter assembler code.

But in release mode, the macro evaluates to nothing, so the compiler does not have the information that a code path is unreachable. To give it back, I simply inserted a call to std::abort().

“Why did I disable the unreachable macro in release mode anyways?”, you ask, “It is unreachable, after all!” The function detail::handle_failed_assert() calls std::fprintf() to output the message. This is a call to the standard library. In release mode there should be no dependency on it so you can use the library on a freestanding implementation.

This is just a minor thing but it improved code generation. I didn’t really profile it, so it might be completely meaningless.

A better way would be to insert something like __builtin_unreachable() or __assume(0). Those are the proper, but implementation-dependent, ways of telling the compiler that a code path is unreachable. But thanks to the [[noreturn]] attribute on std::abort(), the compiler should be able to tell anyway.
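As a sketch, the release-mode variant could look like this (the debug variant stays as before); this is my illustration, not necessarily the exact code the library ended up with:

#include <cstdlib>

#if defined(__GNUC__) || defined(__clang__)
    #define FOONATHAN_MEMORY_UNREACHABLE(Msg) __builtin_unreachable()
#elif defined(_MSC_VER)
    #define FOONATHAN_MEMORY_UNREACHABLE(Msg) __assume(0)
#else
    // std::abort() is [[noreturn]], so the compiler still knows control
    // flow cannot continue past this point
    #define FOONATHAN_MEMORY_UNREACHABLE(Msg) std::abort()
#endif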

Guideline III: Consider keeping things sorted for faster lookup

A certain form of branch that is always slow is the loop. Keep the number of loop iterations low and you’ll have faster code.

Of course, loops aren’t really slow because of the branch itself. And yes, this guideline is just “keep the big-O complexity down”. But I give an interesting example and refer to it in the next guidelines, so do read on.

A free list stores the link to the next node inside the unused memory. This is awesome but only works if all nodes are at least as big as sizeof(void*). detail::small_free_memory_list - inspired by the allocator from Modern C++ Design - works around that by storing only an unsigned char as the link. It allows all object sizes but needs to split the memory into chunks of (usually) 255 nodes each.
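To illustrate the trick from Modern C++ Design, here is a simplified sketch of the index-based free list inside a single chunk - not the library’s actual layout, just the idea: each free node stores the index of the next free node in its first byte, so even one-byte nodes can be linked.

#include <cstddef>

struct chunk
{
    unsigned char* memory;     // node_size * no_nodes bytes of storage
    unsigned char  first_free; // index of the first free node
    unsigned char  capacity;   // number of free nodes left (at most 255)

    // precondition: capacity > 0
    void* allocate(std::size_t node_size)
    {
        auto node  = memory + first_free * node_size;
        first_free = *node; // the next free index was stored in the node itself
        --capacity;
        return node;
    }

    // precondition: ptr was allocated from this chunk
    void deallocate(void* ptr, std::size_t node_size)
    {
        auto node  = static_cast<unsigned char*>(ptr);
        *node      = first_free; // store the old head index inside the node
        first_free = static_cast<unsigned char>((node - memory) / node_size);
        ++capacity;
    }
};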

Allocation first needs to find a chunk with free memory and deallocation needs to find the chunk that owns the memory. To speed things up, pointers to the chunks last used for allocation and deallocation are stored. First those pointers are checked, then the list of all chunks is searched.

For allocation this isn’t so bad. Only every 255 allocations does a new chunk need to be found. And this chunk is usually near the last allocated chunk, so the search is fast.

For certain deallocation scenarios - butterfly! - deallocation is bad, however, because then the list of chunks possibly needs to be searched for each node.

Making things worse, as I’ve explained in part 1, depending on the sorting order you have either fast bulk or fast reversed bulk, not both, because a singly linked list can only be traversed in one direction.

But wait!

For the chunk list I don’t need to limit myself to a singly linked list. I can use a doubly linked list. There is a space overhead of 4/8 bytes but compared to the 255 bytes it can store at a minimum, this isn’t much.

And a doubly linked list allows traversal in both directions, so the search for the right chunk can go into both directions at once as well. This makes both bulks fast.

But what about butterfly?

It can be sped up if the chunks are always kept sorted. Because then you can, in the best case, split the list in half.

You can’t do a binary search because linked lists don’t allow random access.

Consider you want to find the chunk for ptr. There are three cases:

  • ptr belongs to the last deallocation chunk: then you’re finished.

  • ptr is in memory after the last deallocation chunk: then it is somewhere in the part of the list behind it.

  • ptr is in memory before the last deallocation chunk: then it is somewhere in the part of the list before it.

There is actually a fourth case checked before those: ptr belongs to the last allocation chunk. Then you’re also finished.

After that you only need to search in the corresponding half of the list. There you can search from the beginning and the end at the same time until you’ve found the appropriate chunk.
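Roughly, the lookup could be sketched like this - the types and helper are hypothetical simplifications, not the library’s actual code, and the check for the last allocation chunk is omitted:

#include <cstddef>

struct chunk
{
    chunk*         prev;
    chunk*         next;
    unsigned char* memory; // start of the memory this chunk manages
    std::size_t    size;   // number of bytes it manages

    bool contains(unsigned char* ptr) const
    {
        return memory <= ptr && ptr < memory + size;
    }
};

// searches [first, last] from both ends at the same time (assumed helper)
chunk* search_bidirectional(chunk* first, chunk* last, unsigned char* ptr);

chunk* find_chunk_for(chunk* begin, chunk* last_dealloc, chunk* end,
                      unsigned char* ptr)
{
    if (last_dealloc->contains(ptr))
        return last_dealloc;                                    // cache hit, done
    else if (ptr < last_dealloc->memory)
        return search_bidirectional(begin, last_dealloc, ptr);  // lower half
    else
        return search_bidirectional(last_dealloc, end, ptr);    // upper half
}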

This was a worthwhile optimization, but it came with a cost: now, when inserting memory into the small free list, the appropriate position for the new chunk needs to be found so that everything stays ordered. insert() thus requires a traversal over (a part of) the list.

But as I’ve argued in the previous post, insert() is always slow because it needs to actually allocate memory. Also it shouldn’t be called very often, because then you are using more memory than predicted.

So the extra cost there doesn’t matter that much. But keep everything in mind when deciding to keep things sorted.

Guideline IV: Minimize branches in data structures

The other search in the detail::small_free_memory_list needs to start at the last allocation chunk. The next chunk with capacity is then likely nearby.

Or at a totally different position, depending on the BlockAllocator used. But you have to guess something, might as well be that.

So the search starts there and goes in both directions. Now you run into a problem: in most cases you reach the end in one direction before the other. Then you have to stop that direction and continue only with the other one.

This will complicate the code and - more importantly for the purpose of this post - contain branches.

Or take another example: a doubly linked list itself.

To insert a node into the front of a doubly linked list you do something like this:

node->prev = nullptr;
node->next = first;

first = node;

if (!last) // insert into empty list
    last = node;

And erasing the first node looks like this:

first = node->next;

if (node->next) // not the last node
    node->next->prev = nullptr;
else // last node
    last = nullptr;

Both functions have - you guessed it/saw it - branches.

And if you can see that these branches actually have a negative performance impact, what do you do?

In the first example the problem is that one iterator runs to the end of the list. It would be better if it could just keep on iterating. This can be achieved by making the list circular, where the next pointer of the last chunk points to the first one and the prev pointer of the first one points back to the last one. Now you can freely iterate in both directions through the list without worrying about running off the edge.

And in the doubly linked list example the problem is that the list can be empty before the insert or after the erase. This can be avoided by ensuring that the list is never empty. Just use a proxy node that is always the last element of the list. Now last will always point to it, no matter what, and thus never needs to be updated.

It can be optimized even further by making the last pointer this proxy node, i.e. embedding it as a member. Then you can directly access the last real list object. And erase doesn’t need the branch because the “last pointer”, i.e. the proxy, still has a prev pointer that can be accessed and set.
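Here is a minimal sketch of both ideas at once - an embedded proxy node that makes the list circular, so insert and erase no longer need any branch. It is simplified compared with the real chunk list:

struct node
{
    node* prev;
    node* next;
};

class list
{
public:
    list() { proxy_.prev = proxy_.next = &proxy_; } // "empty" list: proxy points to itself

    void push_front(node* n)
    {
        // no branch: there is always a neighbour, in the worst case the proxy itself
        n->prev           = &proxy_;
        n->next           = proxy_.next;
        proxy_.next->prev = n;
        proxy_.next       = n;
    }

    void erase(node* n)
    {
        // no branch: prev and next always exist, possibly being the proxy
        n->prev->next = n->next;
        n->next->prev = n->prev;
    }

private:
    node proxy_; // embedded proxy ("last") node, never removed
};

push_front() and erase() now execute the same straight-line code regardless of whether the list was empty before or becomes empty afterwards.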

Of course those optimizations are not without cost.

In the circular list example you have a more expensive insertion into the list of chunks, i.e. more branches. But as I’ve said: insertion is slow anyway.

And if you store proxy objects as member variables, copy/move gets slower. This is because you now need to update the pointers to the proxy objects; the list nodes cannot refer to the proxies of a different list object! But if you have a list with many inserts/erases and few copies/moves, the trade-off might be worthwhile.

Guideline V: Be aware of hidden branches in && and ||

When talking about branches there are certain conditional jumps that hide behind syntax sugar. For example, the && operator has short circuit evaluation; the second operand isn’t evaluated if the first one is false.

This is useful, but how is it achieved?

I should either stop giving the answer away implicitly in the title or stop asking these rhetorical questions. The two just don’t work together.

There is a conditional jump at the assembly level.
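A tiny illustration - a(), b() and do_something() are just placeholders - showing that the short circuit turns one condition into two conditional jumps:

bool a() { return true; }
bool b() { return false; }
void do_something() {}

void with_and()
{
    if (a() && b()) // b() is only evaluated - and only jumped to - if a() returned true
        do_something();
}

void desugared() // essentially what the compiler generates
{
    if (a())        // first conditional jump
        if (b())    // second conditional jump
            do_something();
}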

Let me give you a real example, again with detail::small_free_memory_list. The circular list is implemented by storing a proxy node like in the doubly linked list example as a member. It looked like so:

struct chunk_base
{
    chunk_base *prev;
    chunk_base *next;
};

class small_free_memory_list
{
public:
    ...
    
private:
    chunk_base base_;        
};

// in the source file
struct chunk : chunk_base
{
    ...
};

chunk_base only has the two pointers needed for the chunk list, whereas chunk contains the actual code and members needed for the free list management. It is convenient to convert a chunk_base* to a chunk*. This is of course only possible if the address isn’t equal to &base_. So I wrote a little helper:

// needs access to base_, e.g. as a member of small_free_memory_list
chunk* make_chunk(chunk_base *ptr)
{
    return ptr == &base_ ? nullptr : static_cast<chunk*>(ptr);
}

It can now be used like so:

if (auto c = make_chunk(ptr))
{
    // do sth with c
}

I love the feature where you can initialize variables directly inside the if. It was designed for these kinds of conditional downcasts.

But sometimes just a pointer to a chunk isn’t all you need, you also need additional checks. Like in the search for a chunk with capacity, you also need to check whether a chunk has capacity:

auto c = make_chunk(ptr);
if (c && c->capacity > 0u)
{
    // do sth with c
}

capacity is a member variable of chunk. And now you have an extra conditional jump hidden in the &&.

How can it be avoided?

Just put the capacity member down into chunk_base. Then you can access it while having a chunk_base* only - at the cost of a bigger free list object.

In fact, I’ve put all members down into chunk_base because they’d fit into the alignment buffer. chunk just contains helper member functions now.
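As a sketch, the layout then looks roughly like this - assuming the embedded proxy node simply keeps its capacity at zero, so the extra comparison against &base_ disappears from the search:

struct chunk_base
{
    chunk_base*   prev;
    chunk_base*   next;
    unsigned char capacity = 0u; // moved down; the proxy node keeps it at 0
    // ... the other members that fit into the alignment buffer ...
};

// in the source file: chunk now only adds helper member functions
struct chunk : chunk_base
{
    // ...
};

void example(chunk_base* ptr)
{
    if (ptr->capacity > 0u) // a single condition, no hidden short-circuit branch
    {
        auto c = static_cast<chunk*>(ptr); // fine under the assumption above: the proxy never passes the check
        // do sth with c
        (void)c;
    }
}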

Conclusion

Branches can sometimes slow your application down. They can be removed but at the cost of more work in other operations.

Here it is especially important that you profile each optimization you do. Do not prematurely decide to introduce additional costs elsewhere in order to remove branches. This is only a benefit in a few special cases.

I repeat it again: profile before and after every optimization. If it has a visible positive effect and you are sure that the extra cost elsewhere doesn’t hurt - and only then - keep the optimization. Otherwise revert it.

I have tried so many different things, I don’t remember them all. Many of them had no effect and some even negatively affected performance. Optimization is hard.

At this point in the series I’ve shown a lot about the optimization in the different allocators. In the next (and most likely final) part of the series I’ll finish by showing the changes in detail::fixed_memory_stack and finally explain how I managed such a fast detail::ordered_free_memory_list. There it is all about abstraction costs and algorithms.

So keep reading!

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.