How I have beaten Boost.Pool #4: About abstractions and algorithms

And apparently also about alliterations.

The last posts showed low-level techniques like ensuring inlining or removing branches.

But those techniques alone were not sufficient.

In this series, I’ll explain my changes and share some lessons about optimization I’ve learned in the process of beating Boost.Pool. This final post shows how to apply those techniques when designing your abstractions, and the importance of smart algorithms.

About abstractions

The second post showcased the 0.5 implementation of memory_stack::allocate():

void* allocate(std::size_t size, std::size_t alignment)
{
	detail::check_allocation_size(size, next_capacity(), info());
	auto mem = stack_.allocate(block_end(), size, alignment);
	if (!mem)
	{
		allocate_block();
		mem = stack_.allocate(block_end(), size, alignment);
		FOONATHAN_MEMORY_ASSERT(mem);
	}
	return mem;
}

It just forwards to detail::fixed_memory_stack::allocate(). That function looked like so (the real code has additional debugging code I haven’t shown here and lacks the comments):

void* fixed_memory_stack::allocate(const char *end, std::size_t size, std::size_t alignment) FOONATHAN_NOEXCEPT
{
    if (cur_ == nullptr) // stack is empty
        return nullptr;

    auto remaining = std::size_t(end - cur_);
    auto offset = align_offset(cur_, alignment); // calculate offset necessary for alignment

    if (offset + size > remaining)
        return nullptr; // not enough memory available
    cur_ += offset; // properly align cur

    auto memory = cur_; // cur_ now points to the memory needed
    cur_ += size; // bump cur_ past the memory

    return memory;
}

detail::fixed_memory_stack is a small class that only maintains the current pointer inside a memory block. Allocation simply bumps this pointer. Note that the class does not maintain end, as explained in part 2, so the end of the block has to be passed to the function to calculate the number of remaining bytes.

This class follows the classic OOP paradigm. The data of the stack - the cur_ pointer - is encapsulated and only modified through member functions. Those member functions model the general kinds of things you want to do with a simple stack like that: allocate(), unwind() to a previously queried location, and top() to query the current location.
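To make the discussion concrete, here is a rough sketch of that interface. The allocate() signature is the one shown above; the unwind() and top() signatures are my reconstruction, not the exact library code:

class fixed_memory_stack
{
public:
    // tries to allocate memory inside the block ending at end,
    // returns nullptr if there is not enough space
    void* allocate(const char* end, std::size_t size,
                   std::size_t alignment) FOONATHAN_NOEXCEPT;

    // unwinds the stack back to a previously queried location
    void unwind(char* top) FOONATHAN_NOEXCEPT;

    // queries the current top location
    char* top() const FOONATHAN_NOEXCEPT;

private:
    char* cur_; // the current top of the stack
};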

With this interface, memory_stack - which needs to be able to operate on multiple blocks - uses it as shown above. First it tries to allocate in the current block. If that fails, it allocates a new block and tries again.

The problem with this abstraction

But the code above is slow. Like, really slow. It got better after inlining, but it was still slow.

Why?

Let’s do the compiler’s job and manually inline the two calls:

void* allocate(std::size_t size, std::size_t alignment)
{
    detail::check_allocation_size(size, next_capacity(), info());

    // auto mem = stack_.allocate(block_end(), size, alignment);
    void *mem;
    if (stack_.cur_ == nullptr)
        mem = nullptr;
    else
    {
        auto remaining = std::size_t(block_end() - stack_.cur_);
        auto offset = detail::align_offset(stack_.cur_, alignment);

        if (offset + size > remaining)
            mem = nullptr;
        else
        {
            stack_.cur_ += offset;

            mem = stack_.cur_;
            stack_.cur_ += size;
        }
    }

    if (!mem)
    {
        allocate_block();
        //mem = stack_.allocate(block_end(), size, alignment);
        if (stack_.cur_ == nullptr)
            mem = nullptr;
        else
        {
            auto remaining = std::size_t(block_end() - stack_.cur_);
            auto offset = detail::align_offset(stack_.cur_, alignment);

            if (offset + size > remaining)
                mem = nullptr;
            else
            {
                stack_.cur_ += offset;

                mem = stack_.cur_;
                stack_.cur_ += size;
            }
        }
        FOONATHAN_MEMORY_ASSERT(mem);
    }
    return mem;
}

Notice how I needed to write if-else to simulate the early returns.

This is a lot of code, some of it duplicated. And other parts of it are unnecessary given the postconditions of allocate_block(). The compiler isn’t able to optimize it as well - for starters, it doesn’t know those postconditions.

Making it better

So let’s optimize it manually.

At the end of the if (!mem) branch there is an assertion requiring that mem is non-null. This is justified because the postcondition of allocate_block() is that it has allocated a new memory block of the size of next_capacity(), and the precondition of memory_stack::allocate() is that the requested memory is smaller than next_capacity().

So the only way that mem is nullptr at the end of that branch is due to a pre- or postcondition violation. We can thus safely remove the branches that would result in mem being nullptr:

void* allocate(std::size_t size, std::size_t alignment)
{
    detail::check_allocation_size(size, next_capacity(), info());

    void *mem;
    if (stack_.cur_ == nullptr)
        mem = nullptr;
    else
    {
        auto remaining = std::size_t(block_end() - stack_.cur_);
        auto offset = detail::align_offset(stack_.cur_, alignment);

        if (offset + size > remaining)
            mem = nullptr;
        else
        {
            stack_.cur_ += offset;

            mem = stack_.cur_;
            stack_.cur_ += size;
        }
    }

    if (!mem)
    {
        allocate_block();

        auto offset = detail::align_offset(stack_.cur_, alignment);

        stack_.cur_ += offset;

        mem = stack_.cur_;
        stack_.cur_ += size;
    }
    return mem;
}

With those branches gone, we also no longer need to calculate the remaining memory in that branch.

If we look at the first branch, we now have two nested if-else cases. Because align_offset() works on nullptr, its call can be hoisted outside the first one. The calculation of remaining cannot, but if we eliminate the variable and compute it in the second operand of a short-circuiting condition, we can merge both cases:

void* allocate(std::size_t size, std::size_t alignment)
{
    detail::check_allocation_size(size, next_capacity(), info());

    void *mem;

    auto offset = detail::align_offset(stack_.cur_, alignment);
    if (stack_.cur_ && offset + size <= std::size_t(block_end() - stack_.cur_))
    {
        stack_.cur_ += offset;

        mem = stack_.cur_;
        stack_.cur_ += size;
    }
    else
        mem = nullptr;

    if (!mem)
    {
        allocate_block();

        auto offset = detail::align_offset(stack_.cur_, alignment);
        stack_.cur_ += offset;

        mem = stack_.cur_;
        stack_.cur_ += size;

        FOONATHAN_MEMORY_ASSERT(mem);
    }
    return mem;
}

Now we clearly see that the second if (!mem) is just the else of the first one. Furthermore, the value of mem and the following bump of stack_.cur_ are computed in exactly the same way in both branches. So we can move the duplicated code to the end of the function and do it only once:

void* allocate(std::size_t size, std::size_t alignment)
{
    detail::check_allocation_size(size, next_capacity(), info());

    auto offset = detail::align_offset(stack_.cur_, alignment);
    if (stack_.cur_ && offset + size <= std::size_t(block_end() - stack_.cur_))
    {
        stack_.cur_ += offset;
    }
    else
    {
        allocate_block();

        auto offset = detail::align_offset(stack_.cur_, alignment);
        stack_.cur_ += offset;
    }

    auto mem = stack_.cur_;
    stack_.cur_ += size;

    return mem;
}

There is still a bit of duplication: aligning the stack is done in both branches. Here this is not a big deal, but the actual code also needs to take care of filling the alignment buffer and adding a debug fence, which makes for a significant amount of duplication.

So the alignment can be moved to the end. Then the body of the first if is completely empty, so the branch can be eliminated by reversing the condition and keeping only the former else block:

void* allocate(std::size_t size, std::size_t alignment)
{
    auto offset = detail::align_offset(stack_.cur_, alignment);
    if (!stack_.cur_ || offset + size > std::size_t(block_end() - stack_.cur_))
    {
        allocate_block();

        // recalculate alignment offset
        offset = detail::align_offset(stack_.cur_, alignment);

        detail::check_allocation_size(offset + size, next_capacity(), info());
    }

    stack_.cur_ += offset;
    auto mem = stack_.cur_;
    stack_.cur_ += size;

    return mem;
}

I’ve also moved the allocation size check into this branch because then the exact required size is known.

This is the final piece of code. Compare it with the initial version and you can clearly see that this code is much shorter - and thus faster.

It incorporates a few of the bugfixes I’ve made in the latest patch. Apparently, writing about your design helps you spot errors.

Reflecting the abstraction actually needed

The code above performs direct operations on detail::fixed_memory_stack’s only member. If it were exactly that, I’d probably keep it that way. In fact, I’d most likely remove the struct altogether, because it would then be just a pointer.

But the actual production code is slightly more complicated: each time stack_.cur_ is increased by an offset, the memory range is filled. So it is not just a pointer increment but also a call to detail::debug_fill(). Those two tasks must always be done together, so it makes sense to have an abstraction here.

So what kind of operations do we actually need here? Judging from the usage below: read access to the top pointer (top()), an operation that bumps it by an offset (bump()), and one that bumps it by a size and returns the previous location (bump_return()).

With this abstraction, the code now looks like so:

void* allocate(std::size_t size, std::size_t alignment)
{
    auto offset = detail::align_offset(stack_.top(), alignment);
    if (!stack_.top() || offset + size > std::size_t(block_end() - stack_.top()))
    {
        allocate_block();

        // recalculate alignment offset
        offset = detail::align_offset(stack_.top(), alignment);

        detail::check_allocation_size(offset + size, next_capacity(), info());
    }

    stack_.bump(offset);
    return stack_.bump_return(size);
}

The implementation of those functions is straightforward and simple.
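For illustration, here is a minimal sketch of how they could look - my reconstruction based on the usage above; the production versions additionally do the debug filling described earlier:

class fixed_memory_stack
{
public:
    // returns the current top pointer
    char* top() const FOONATHAN_NOEXCEPT
    {
        return cur_;
    }

    // moves the top pointer forward by offset
    // (the real code also fills the skipped bytes for debugging)
    void bump(std::size_t offset) FOONATHAN_NOEXCEPT
    {
        cur_ += offset;
    }

    // moves the top pointer forward by size
    // and returns the memory it pointed to before
    void* bump_return(std::size_t size) FOONATHAN_NOEXCEPT
    {
        auto mem = cur_;
        cur_ += size;
        return mem;
    }

private:
    char* cur_;
};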

Now that is how efficient code looks!

Guideline: Choose the right level of abstraction

Abstraction is a good thing.

It prevents developers from always worrying about all the small and complicated details and allows them to create easy-to-use building blocks for higher-level tasks. Abstraction also prevents code duplication and decreases the likelihood of errors by enabling focus on the current functionality.

Abstractions are nested: a low-level function is called by a mid-level function, which in turn is called by a high-level function. And obviously, the design of a high-level abstraction is fundamentally different from that of a low-level abstraction.

A low-level abstraction solves only a really small problem. But it solves it fast and well. It also solves it generically. By using low-level abstractions you can solve any problem you want, at the cost of more verbosity.

A high-level abstraction removes this verbosity by coupling multiple lower-level abstractions. Clients of a high-level abstraction need to write less code to fulfill the same task, but also have less control over the details and only solve a, well, more abstract problem.

After all, that is the entire point of abstractions.

The problem in the original code was that I made detail::fixed_memory_stack a high-level abstraction. It solved the problem of “allocating memory from a stack”. It did so reasonably well and it was easy to use.

The problem was that using it to implement another high-level abstraction, memory_stack, was less effective. memory_stack didn’t actually need an abstraction that solves “allocating memory from a stack” - that’s what it does itself!

It needed an abstraction that solves “managing the top pointer in a memory block”. This was the more efficient choice of abstraction, and the right choice.

I ran into a pitfall of object-oriented design. When writing detail::fixed_memory_stack I had the users of a stack allocator in mind, so naturally I gave it the operations you want to perform on a memory stack. This made it a high-level abstraction.

Its use was then straightforward and allowed a simple implementation, but it was inefficient because the level of abstraction wasn’t appropriate. Switching to a more low-level abstraction increased the performance.

So when designing your classes always keep the actual usage and required level of abstraction in mind. Especially classes that are in your detail namespace shouldn’t have high-level abstractions.

Always think: How is this class actually going to be used? And which operations does that usage require?

This helps you write abstractions that are not only easy to use but also efficient.

About algorithms

Even back in the “slow” 0.5, before the optimization, memory_pool<array_pool> was significantly faster than the ordered pool in bulk, without losing performance in the reversed bulk.

As I’ve explained in the first post, deallocation in an ordered free list requires traversing the list to find the right position to insert the node into. Linked lists aren’t random access: to get to node N, you have to visit nodes 0 through N - 1 first, so they can only be traversed linearly. Searching for the position thus cannot use the fast binary search possible on contiguous memory (like in std::vector) but has to go from one node to the next.

Yes, binary search algorithms like std::lower_bound() have a complexity of O(log n) even for the bidirectional iterators you have in a std::list. But this “complexity” is defined only in terms of applications of the search predicate, not in terms of actual iterator operations. They just cheated in the specification to allow that.
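To see what that means in practice, here is a simplified sketch of how a typical std::lower_bound() implementation works - O(log n) applications of the comparison, but for bidirectional iterators the std::advance() calls still add up to O(n) single-node hops:

#include <iterator>

template <typename BidirIt, typename T>
BidirIt lower_bound_sketch(BidirIt first, BidirIt last, const T& value)
{
    auto count = std::distance(first, last); // already O(n) for a list
    while (count > 0)
    {
        auto step = count / 2;
        auto mid  = first;
        std::advance(mid, step); // O(step) increments, no random access
        if (*mid < value)
        {
            first = ++mid;
            count -= step + 1;
        }
        else
            count = step;
    }
    return first;
}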

And since the free list is singly linked, the only thing you can choose is the sort order. Depending on it, either bulk or bulk reversed is fast, because there the node needs to be inserted directly at the beginning. In the other case, the search has to go over the entire list before finding an appropriate position. For the same reason, butterfly in the ordered Boost.Pool sits in the middle: some nodes require only a short traversal, some a long one; it averages out.

So how to make it faster? I obviously managed it. How?

a) Use contiguous storage

To do a proper binary search you need contiguous storage. Then deallocation easily has logarithmic complexity.

Except that you cannot use contiguous storage in a free list. It would involve allocating extra memory just for a contiguous sequence of pointers to the actual nodes, or something similar.

An allocator that actually requires a lot of additional bookkeeping memory to the point where it could get its own allocator is kind of a pointless allocator.

b) Remember the last node in the list

If you remember not only the first node of the free list but also the last one, you can at least get rid of the worst case: inserting at the end. Before traversing, you simply check the end first.
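A sketch of that strategy - the names here are hypothetical, not the library’s, and the helpers are only declared:

struct free_list_with_end
{
    char* first = nullptr;
    char* last  = nullptr;

    void push_back(char* node);                  // appends after last, O(1)
    void insert_sorted(char* begin, char* node); // linear search from begin

    void deallocate(char* node)
    {
        if (last == nullptr || node > last)
            push_back(node);            // biggest address so far: no traversal
        else
            insert_sorted(first, node); // linear search as before
    }
};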

This would actually make both bulks fast.

But this alone is even more of a cheat than the standard’s specification trick. It also won’t help with butterfly, where my list already had equal performance - without manual optimizations!

c) Remember the last deallocated node in the list

So let’s take that idea one step further. Instead of (or in addition to) remembering the end of the list, remember the last node deallocated. Then check there.

If the address of the node being deallocated is smaller than the address of the last deallocated node, search from the beginning. Otherwise search from the last deallocated node.
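Sketched out, again with hypothetical names:

struct free_list_with_hint
{
    char* first        = nullptr;
    char* last_dealloc = nullptr; // the node deallocated last

    void insert_sorted(char* begin, char* node); // linear search from begin

    void deallocate(char* node)
    {
        // start at the hint if the node belongs after it,
        // otherwise fall back to the beginning
        auto begin = (last_dealloc != nullptr && node > last_dealloc)
                         ? last_dealloc
                         : first;
        insert_sorted(begin, node);
        last_dealloc = node;
    }
};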

In the given sort order, this is very fast if the node being deallocated has a bigger address than the last one, i.e. on deallocation in the same order as allocation. But in the reverse order it is still slow, because then the node has to be put before the last one. That means traversing the list from the front, because you cannot just go back one node in a singly linked list.

d) Use a doubly linked list

“Hey”, you might say, “that’s the same problem you had with the chunks of detail::small_free_memory_list back in part 3. I know what to do: use a doubly linked list.”

You’re right. That’s the exact same problem: there, too, I needed to find a position in a sorted list starting from a marker. A doubly linked list allowed me to traverse the list in both directions and thus go backwards very easily.

But a doubly linked list has a disadvantage: it needs two pointers, not just one. In the small free list this overhead wasn’t so bad, because only the chunks had them, not every node.

But in the ordered free list the pointers are embedded directly into the nodes. You need space for them, so each node must be big enough. A normal free list is singly linked because that only requires a minimum node size of sizeof(void*). But with a doubly linked list, this size doubles!

If you use the list for ints you normally have an overhead of 4 bytes on a 64-bit system. But with two pointers you have an overhead of 12 bytes! This is wasted space!
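To spell out the arithmetic (assuming 8-byte pointers):

struct singly_linked_node { char* next; };             // sizeof == 8
struct doubly_linked_node { char* prev; char* next; }; // sizeof == 16

// every allocation must be able to hold a node,
// so when allocating ints (4 bytes) the per-allocation overhead is:
//   singly linked:  8 - 4 =  4 bytes
//   doubly linked: 16 - 4 = 12 bytes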

So using a doubly linked list is not possible.

e) Use an XOR linked list

What is possible though is using an XOR linked list.

An XOR linked list allows traversal in both directions but only requires a single pointer. The pointer does not store the next or prev pointer directly but next ^ prev - hence the name.
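The trick in isolation - given the stored value and the address of one neighbor, XOR recovers the other neighbor:

#include <cassert>
#include <cstdint>

int main()
{
    std::uintptr_t prev = 0x1000, next = 0x2000;
    auto stored = prev ^ next; // what the node actually stores

    assert((stored ^ next) == prev); // knowing next yields prev
    assert((stored ^ prev) == next); // knowing prev yields next
}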

Bitwise XOR has the property that you can get one original value back if you know the other: (next ^ prev) ^ next gives prev, for example. And when doing list operations, you always have one of the two nodes, so you can recover the other. For example, when traversing in one direction you need to remember the current node and the one before it, and you can use the address of the previous node to get the next one:

// advances a pointer pair forward/backward
void xor_list_iter_next(char *&cur, char *&prev)
{
    auto next = xor_list_get_other(cur, prev);
    prev = cur;
    cur = next;
}

Where xor_list_get_other() is:

char *xor_list_get_other(void *address, char *prev_or_next)
{
    return from_int(get_int(address) ^ to_int(prev_or_next));
}

get_int() obtains the std::uintptr_t stored at address, while to_int() casts prev_or_next to std::uintptr_t (it is already the address of the adjacent node). from_int() simply turns the integer back into a pointer.
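Plausible implementations of those helpers could look like this - my sketch, not necessarily the library’s exact code:

#include <cstdint>
#include <cstring>

std::uintptr_t to_int(char* ptr)
{
    return reinterpret_cast<std::uintptr_t>(ptr);
}

char* from_int(std::uintptr_t i)
{
    return reinterpret_cast<char*>(i);
}

std::uintptr_t get_int(void* address)
{
    // reads the integer stored in the node's memory;
    // std::memcpy avoids alignment and aliasing problems
    std::uintptr_t result;
    std::memcpy(&result, address, sizeof(result));
    return result;
}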

Inserting after or before a node isn’t directly supported, only inserting between two nodes: for the previous node you need to change its next pointer, and for the next node its prev pointer. And changing a pointer is only possible if you know the old value:

void xor_list_change(void *address, char *old_ptr, char *new_ptr)
{
    auto other = xor_list_get_other(address, old_ptr);
    xor_list_set(address, other, new_ptr);
}

Because then you’ll get the other pointer value and can set the XOR again:

void xor_list_set(void *address, char *prev, char *next)
{
    set_int(address, to_int(prev) ^ to_int(next));
}

set_int() will write the std::uintptr_t at address.
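Under the same assumptions as the other helpers, a one-line sketch:

void set_int(void* address, std::uintptr_t i)
{
    std::memcpy(address, &i, sizeof(i));
}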

Using an XOR linked list allows me to go backwards from the remembered deallocation position if needed. Furthermore, the same technique as in the chunk list can be used: determine the interval where the node must be inserted and go from both ends towards the middle.
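For example, once the two neighbors prev and next are found, inserting a node between them is just a composition of the operations above - my sketch, not the library’s exact code:

void xor_list_insert_between(char* prev, char* node, char* next)
{
    xor_list_change(prev, next, node); // prev now links to node instead of next
    xor_list_change(next, prev, node); // next now links to node instead of prev
    xor_list_set(node, prev, next);    // node links to both neighbors
}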

XOR linked lists aren’t perfect, though. For starters, due to the XOR manipulation on each access, they are certainly slower than regular doubly linked lists. Their implementation is also way more complicated and much more error-prone. As a bonus, debugging is a nightmare, because you cannot just inspect a node and see its prev and next pointers.

So only use them if they are justified. But as the benchmark has shown the programming overhead was definitely worth it.

In the previous post I showed how you can eliminate branches in a doubly linked list by storing proxy nodes. That was in fact the technique I used in my XOR linked list; it applies there as well. I just had a circular dependency between the two posts and couldn’t reveal that it was an XOR linked list at that point.

Guideline: Choosing a fast algorithm is the most important optimization possible

Algorithms are essential.

They determine how efficient your program is.

All the tricks I’ve shown you in this series are just micro-optimizations to squeeze out the last microseconds. Things like branch elimination and better inlining are only relevant when you scale things up.

I did get a speedup in memory_stack of up to 1500ns, which looks like a lot but really, really isn’t. That was the time needed for 256 allocations, so it amounts to a speedup of less than 6ns - six nanoseconds! - per allocation. And 6ns aren’t that important in the grand scheme of things.

The only optimization that actually matters is choosing a better algorithm with a smaller big-O complexity. So the last piece of advice you’ll get in this series is this:

When your code is slow, look for faster algorithms and fancier data structures. Only if that isn’t sufficient should you consider micro-optimizing your exact assembler output.

Conclusion

When designing classes or functions, choose the right (level of) abstraction. Interfaces that aren’t designed properly can easily slow your code down through redundant work.

But beyond all micro-optimizations, always remember that most of them don’t even matter. Always profile your code to see which functions need optimizing, and first try a smarter algorithm before anything else.

Optimization is a very broad topic and there are many more things you can do, but this was all I have to share about the optimizations done for the memory 0.5-1 update. While writing, I’ve discovered multiple bugs and released two patches in the last week - update to 0.5-3 as soon as possible.

If you are using my library, please contact me - I really appreciate your feedback. And I have many awesome things planned for 0.6, which will come in summer, so look forward to it.

But first get hyped for my next project I’m starting this week.

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.