How I have beaten Boost.Pool #1: Introduction and profiling results

When I released memory 0.5, someone on Reddit asked how my library compared to Boost.Pool. I provided a feature comparison and also quickly profiled both Boost’s and my implementation. Sadly, Boost.Pool beat my library - in most cases.

So over the last weeks, I’ve taken care of my performance problems and rewritten my implementations. In version 0.5-1 they basically still use the same algorithms, but now my library is equal to or faster than Boost.Pool.

In this series, I’ll explain my changes and share some lessons about optimization I’ve learned while making them. The first part is an introduction to the different allocation algorithms used here and gives an overview of the profiling results.

Yes, the title is deliberately provocative. My apologies. Also, just because my library is now faster doesn’t mean it’s better than Boost’s implementation: Boost’s is definitely better tested and probably has fewer bugs. After all, I’m still on a 0.x release.

The Allocators

My library includes a simple profiling target that runs some performance comparisons on my allocators. Those are:

- Heap and New: general purpose allocators that simply forward to the system heap and to new/delete, respectively.
- Stack: a memory stack whose allocation is essentially a pointer increment.
- Node: a memory pool that manages its nodes in a simple free list.
- Array: a memory pool that keeps its free list ordered, so it can also serve array allocations.
- Small: a memory pool optimized for small node sizes, which stores its free list in chunks.

For this comparison there are also two variants of Boost’s pools: one using the “normal” allocations and one using the ordered_ versions. The first is similar to my Node pool, the second to my Array pool.

I will refer to my Node and the unordered Boost.Pool as the normal/node pools, and to my Array and the ordered Boost.Pool as the ordered/array pools, since both have similar characteristics and algorithms.

The profiling structure

The profiling code runs each allocation strategy described below 1,024 times, taking the minimum time needed in nanoseconds. All (debug) checks of my library are disabled and all optimizations, including link-time optimizations, are enabled.

It is compiled using GCC 5.3.0 and executed on a 3.5GHz processor running Arch Linux.

The tested node sizes are 1, 2, 4, 8 and 256, repeated 256, 512 and 1024 times. For arrays, it allocates {1, 4, 8} * {1, 4, 8} elements with the same number of repetitions. Only the allocators that support array allocations are tested there, which is every allocator except Small and the normal Boost.Pool.
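
The measurement itself boils down to timing a workload repeatedly and keeping the best run. As a rough sketch - the helper below is hypothetical and not the library’s actual profiling code:

```cpp
#include <chrono>
#include <cstddef>

// Hypothetical sketch of the measurement scheme: run the workload many times
// and keep the minimum, which filters out scheduler and cache noise.
template <typename Workload>
std::chrono::nanoseconds measure_min(Workload workload, std::size_t runs = 1024)
{
    auto best = std::chrono::nanoseconds::max();
    for (std::size_t i = 0u; i != runs; ++i)
    {
        auto start = std::chrono::high_resolution_clock::now();
        workload();
        auto end = std::chrono::high_resolution_clock::now();

        auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```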

The Strategies

The allocation strategies represent different ways of allocating elements. Of course, over the lifetime of an allocator it will get a mix of many different allocation strategies, so those are not fully realistic conditions.

Actually, the strategies are deallocation strategies, not allocation strategies, since the deallocation order matters for most allocators.

The strategies are:

- Single: allocates one node, deallocates it, and repeats.
- Bulk: allocates a whole batch of nodes and then deallocates them in the same order they were allocated.
- Bulk (reversed): like Bulk, but deallocates the nodes in reverse order of allocation.
- Butterfly: allocates and deallocates the nodes in a mixed, random-like order.

In fact, I’d guess that most library implementations deallocate in reverse order. This is more natural: local variables on the stack are destroyed in that order as well.

In reality, there will not be a single strategy but a mix. For example, all strategies start with an allocator without any previous allocations, which is most likely not the case in practice.

The expected results

Heap/New are general purpose allocators that need to handle any allocation size/scheme. So they cannot specialize for certain schemes the way the other allocators can. Thus they should be - in general - slower than the other allocators.

Stack should be significantly faster than everything else, since its allocation is basically a pointer increment and the deallocation is non-existent in the profiling code.
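
For illustration, here is a minimal sketch of such a stack (bump) allocator - the names and the interface are mine, not the library’s, and alignment handling as well as growing are left out:

```cpp
#include <cstddef>

// Minimal sketch of a stack/bump allocator: allocation just advances a pointer.
class stack_allocator_sketch
{
public:
    stack_allocator_sketch(char* memory, std::size_t size)
    : cur_(memory), end_(memory + size) {}

    void* allocate(std::size_t size)
    {
        if (static_cast<std::size_t>(end_ - cur_) < size)
            return nullptr; // out of memory; a real stack would grow here

        auto mem = cur_;
        cur_ += size; // the entire allocation is a pointer increment
        return mem;
    }

    // there is no per-allocation deallocation: memory is released by
    // unwinding to a previously saved marker
    char* top() const { return cur_; }
    void  unwind(char* marker) { cur_ = marker; }

private:
    char* cur_;
    char* end_;
};
```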

The allocation of a normal pool just pops a node and the deallocation just pushes it back in. This isn’t dependent on the allocation strategy, so there should be constant results over all strategies for both my and Boost’s implementation.
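
To make that concrete, here is a minimal sketch of such a free list. It is illustrative only and assumes each node is at least as big as a pointer, since the list is embedded into the free nodes themselves:

```cpp
// Sketch of the free list of a normal (unordered) node pool: the first bytes
// of every free node store the pointer to the next free node.
class free_list_sketch
{
public:
    // allocation: pop the first node - O(1), independent of deallocation order
    void* allocate()
    {
        auto node = head_;
        if (node)
            head_ = next(node);
        return node;
    }

    // deallocation: push the node back at the front - also O(1)
    void deallocate(void* node)
    {
        next(node) = head_;
        head_      = node;
    }

private:
    // interpret the first bytes of a free node as the pointer to the next one
    static void*& next(void* node)
    {
        return *static_cast<void**>(node);
    }

    void* head_ = nullptr;
};
```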

Constant performance across the strategies is also expected for the small node pool. It will be slower, however, since it keeps a separate free list per chunk and first needs to find the appropriate chunk.
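
A rough sketch of that extra lookup - the types and names here are illustrative, not the library’s actual ones:

```cpp
#include <cstddef>
#include <vector>

// Each chunk owns a memory range and its own embedded free list.
struct chunk_sketch
{
    char*       memory;    // begin of the chunk's memory
    std::size_t size;      // size of the chunk in bytes
    void*       free_list; // head of this chunk's free list

    bool contains(void* node) const
    {
        auto ptr = static_cast<char*>(node);
        return memory <= ptr && ptr < memory + size;
    }
};

// Deallocation first has to find the chunk the node belongs to before it can
// push the node onto that chunk's free list.
chunk_sketch* find_chunk(std::vector<chunk_sketch>& chunks, void* node)
{
    // naive linear scan; a real implementation can do better, e.g. by caching
    // the chunk used for the previous (de)allocation
    for (auto& chunk : chunks)
        if (chunk.contains(node))
            return &chunk;
    return nullptr;
}
```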

The ordered pools are different, though. Allocation still just pops a node, but deallocation needs to insert the node at the right position in order to keep the list ordered. Since we are only dealing with a singly linked list (one pointer per node), it needs to traverse the list from the head, comparing each node one by one. For one of the two bulk strategies, this is just an insert at the front. For the other, it needs to insert at the back, so it has to traverse the entire list. Which of Bulk and Bulk (reversed) gets the awful performance depends on the sorting order. Butterfly is in-between: for some nodes it has to go through much of the list, for others it can stop very early.

This should be the same for both arrays and node allocations. There shouldn’t be much difference between my and Boost’s pool implementation since they are using the same underlying algorithm.
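
For illustration, here is a minimal sketch of that ordered deallocation - again illustrative code, not the actual implementation of either library:

```cpp
#include <functional>

// Deallocation in an ordered pool: the free list is kept sorted by address,
// so the node has to be inserted at the right position. With a singly linked
// list this means walking from the head - O(n) in the worst case.
void deallocate_ordered(void*& head, void* node)
{
    auto next = [](void* n) -> void*& { return *static_cast<void**>(n); };
    std::less<void*> before; // total order over pointers

    if (!head || before(node, head))
    {
        // insertion at the front is cheap - one of the two bulk strategies
        // always hits this case
        next(node) = head;
        head       = node;
        return;
    }

    // otherwise walk the list until the predecessor of the position is found -
    // the other bulk strategy always ends up at the very end of the list
    auto cur = head;
    while (next(cur) && before(next(cur), node))
        cur = next(cur);

    next(node) = next(cur);
    next(cur)  = node;
}
```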

The actual results (version 0.5)

So here are the actual results I’ve gotten: https://gist.github.com/foonathan/3aa3114284863bf3141a

Note that it measures the time in microseconds, not nanoseconds there. It also does not have fancy Markdown output, my apologies for that.

The general purpose allocators are slower, Stack is the fastest, and Small and Node have a similar, constant performance, Small being slightly slower. And the ordered Boost.Pool shows the expected behavior for an ordered pool. It is obviously optimized for Bulk (reversed).

So far, so expected.

But…

Boost.Pool beats all of my allocators significantly, even the Stack! Also, my array pool manages a constant performance for both bulks, with only a regression in Butterfly, where it has similar performance to Boost.

Clearly, this isn’t as I’d like it.

By that I mean the “my normal pools and Stack are slower than Boost.Pool” part, not the “my array pool beats the hell out of the ordered Boost.Pool in Bulk” part. I’m perfectly fine with the latter.

The actual results (version 0.5-1)

So after a bunch of optimizations I’ve got the following results: https://gist.github.com/foonathan/904ed04f57aeecd132e3

This time, nanoseconds and Markdown output, ‘cause I didn’t just want to beat Boost; I want to beat them in style.

Now, Stack is significantly faster and the two normal pools have a similar performance (mine is slightly faster in the two bulks and Butterfly).

The Small node pool is also faster, but still slower than the normal pools. It uses free lists as well, but multiple ones, one per chunk. Allocation and especially deallocation first need to find the proper chunk.

My ordered pool still shows the same characteristics; it is just much faster. It is now only slightly slower in Single and Bulk (reversed) but significantly faster in the other Bulk and in Butterfly, albeit still bad in Butterfly.

But not as bad as Boost’s implementation!

This is the same for array allocations. The only thing I should point out is that my normal pool also supports array allocations and that those are faster than the ordered pool. This doesn’t mean that you should choose the normal pool for array allocations.

In fact, I wouldn’t ever choose a memory pool for array allocations.

Array allocations on a free list require scanning the list for enough adjacent free nodes to fulfill the allocation. If the nodes are kept ordered, adjacent nodes will always end up adjacent in the free list as well, so the chance that the list cannot serve the allocation - which forces the allocator to grow - is minimized. But if the nodes aren’t kept ordered, as in the normal pool, such a failure is more likely to occur. Also, the search can take longer.
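
A rough sketch of such a search, assuming the free list is kept sorted by address as in the ordered pool - illustrative code only; in the unordered pool, contiguous nodes need not even be neighbors in the list, which makes the search harder still:

```cpp
#include <cstddef>

// Walk the (address-sorted) free list and look for a run of n nodes that are
// contiguous in memory; returns the first node of the run or nullptr.
void* allocate_array(void* head, std::size_t n, std::size_t node_size)
{
    auto next = [](void* node) { return *static_cast<void**>(node); };

    auto        begin = head; // first node of the current run
    auto        cur   = head;
    std::size_t count = begin ? 1 : 0;

    while (cur && count < n)
    {
        auto nxt = next(cur);
        if (nxt == static_cast<char*>(cur) + node_size)
        {
            ++count;           // still contiguous, extend the run
        }
        else
        {
            begin = nxt;       // gap: restart the run at the next node
            count = nxt ? 1 : 0;
        }
        cur = nxt;
    }

    // on success the caller still has to unlink the run from the list
    return count == n ? begin : nullptr;
}
```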

This behavior doesn’t become obvious here because there is only a single allocation strategy with a sorted deallocation scheme (except in Butterfly) and the pool’s capacity is big enough. But in reality, the node pool will be worse for array allocations and may lead to more growing of the allocator.

So what is happening here?

How did I manage to have a great ordered pool in the two bulk cases?

And a pretty decent one in Butterfly?

And how the hell did I screw up my memory stack and pools in 0.5?

I’ll answer those questions in this series. It will cover exactly what happens and give some general advice I’ve learned during the great optimization™.

In the meantime, you could also look at the diff, of course. Overall, it has slightly more additions than deletions, yet the actual code has gotten much shorter. That is only because more of it is extracted into separate functions. So the code is not just faster: it is also simpler.

So stay tuned!

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.