Implementation Challenge: A count leading zeroes function

11 Feb 2016 by Jonathan

When doing arithmetic in a programming language there is the arcane art of optimizing with the help of bit-wise operations. Of course I’m talking about Bit Hacks.

On a readability and maintainability ranking from 1 to awk Bit Hacks reach a level of Brainfuck. Yet they can be an incredibly low-level optimization useful to tweak the last bit of performance out of an operation but are difficult to get right and 100% portable.

In this post we’ll take a look at a rather easy function - clz(x) that will return the number of leading zero bits in an unsigned integer type x. In particular I’ll show you how to properly wrap GCC’s __builtin_clz().

Motivation

People usually tend to use base 10 in calculations they perform in their head.

Expect those who design Bit Hacks I assume.

In base 10 operations like multiply or divide by 10, 100,… are trivial: just add or remove the appropriate numbers of zeroes. To be precise: Shift the decimal point by a certain amount. Likewise, calculating the integer logarithm for base 10 (i.e. the number of decimal places) is just that: counting the digits the number has.

Computers - usually - tend to use base 2, so all those operations are trivial for powers of 2 or calculating the logarithm for base 2. Multiplication/Division by a power of 2 is just a bit shift, for example.

But please do not write it as a bit shift, unless you actually mean a bit shift.

And the ilog2(), the base 2 logarithm for integers, is just counting the number of binary digits a certain integer value needs. For counting those, you can use clz(): Just take the width of the integer - i.e. the number of bits - subtract the number of leading zeroes and add/subtract one depending on whether or not it is a power of two and whether or not you want a ceiling or flooring implementation (i.e. whether ilog2(3) should be 1 or 2; log2(3) would be 1.xxx).

Yes, there are other - possible better - implementations of ilog2(). But I wanted to use clz() and couldn’t come up with an allocator example. Okay, that’s not true: there is an allocator that’s actually using clz(), but just as ilog2() as well.

The number of bits of an integer x is just sizeof(x) * CHAR_BIT. sizeof(x) returns the number of “bytes” in x. CHAR_BIT is a macro from <climits> providing the number of bits in a char.

I put “bytes” in quotation marks because sizeof() doesn’t return the real number of bytes. sizeof() returns the number of C++ bytes with sizeof(char) always being 1 even if it is actual 4 bytes in memory or something strange like that.

And detecting whether or not a number is a power of two can easily done by another bit hack, so what’s left is clz().

The Challenge

The function clz() takes any unsigned integer type and returns the number of leading zero-bits in the binary representation of its value.

We limit ourselves to unsigned integer types because signed types have a sign-bit that will mess thinks up. Also bit operations on them are not a good idea™.

As an example, consider clz(4). 4 in binary is 100. But how many 0s are in front? 0? 13? 29? 1334?

It depends.

If 4 is stored in a 16bit integer, the result is 13 because there are 13 unused zeroes in front of 100.

If 4 is stored in a 32bit integer, the result is 29 because there are 16 more zeroes.

And if 4 were stored in a 1337bit integer, the result would be 1334. Note to reader: If you know an architecture with 1337bit integers, please let me know!

clz() can only be properly defined for integers of given size, i.e. for given number of bits. To get portable and consistent result, we need integers of a fixed size - the std::uintX_t types from <cstdint>.

Strictly speaking, this is still not completely portable. In the C++ standard, the synopsis of <cstdint> marks those fixed-sized integers as “optional”, because they may not be supported on some architectures.

With this in mind, we can declare our clz() function as follows:

unsigned clz(std::uint8_t  x);
unsigned clz(std::uint16_t x);
unsigned clz(std::uint32_t x);
unsigned clz(std::uint64_t x);

It is overloaded for each integer size and returns the number of leading zeroes for that size.

To be fair: This is unnecessary for the use case in ilog2() since the width is being subtracted anyway.

The manual implementation

I’m not going to go into much detail, because writing it manual is just boring.

We could do a loop over all bits but this is too slow. Instead, I’ve used a binary search. The integer is split into two halves, the upper and lower half. If the upper half is non-zero, the first 1 is in the upper half, so return clz() of the upper half. Otherwise the first 1 is in the lower half - the upper half is all zero, so the result is the width of the upper half plus the clz() on the lower half.

This maps very well to the four clz() overloads. We split the integer into the two smaller integer types and call clz() on the smaller type, overload resolution will automatically select the different implementation:

unsigned clz(std::uint32_t x)
{
	// shift upper half down, rest is filled up with 0s
	auto upper = std::uint16_t(x >> 16); 
	// mask upper half away
	auto lower = std::uint16_t(x & 0xFFFF);
	// their type is std::uint16_t so a smaller overload is chosen
	return upper ? clz(upper) : 16 + clz(lower);
}

// similar for std::uint64_t and std::uint16_t

The final overload for std::uint8_t splits it into 4bit-halves and uses a lookup-table:

unsigned clz(std::uint8_t x)
{
    static constexpr std::uint8_t clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
    auto upper = x >> 4;
    auto lower = x & 0x0F;
    return upper ? clz_lookup[upper] : 4 + clz_lookup[lower];
}

So far, so slow.

__builtin_clz()

Most architectures have special instructions for doing those calculations. But writing assembler isn’t exactly portable. Luckily, many compilers wrap those into intrinsic functions that will be translated into the optimal assembler.

Under GCC and compatible compilers like clang it is called __builtin_clz(). It comes in the following variants.

int __builtin_clz(unsigned int x);
int __builtin_clzl(unsigned long x);
int __builtin_clzll(unsigned long long x);

So if those built-ins are available we can use them in the implementation of our clz() function.

Note that they have undefined results for the value 0. I’ll just ignore that here and mark my clz() as undefined for 0 as well.

But e.g. the first version returns the clz() for unsigned int. Its size may change from platform to platform and with it the result of clz()!

This isn’t good.

We need to portably map each fixed-size integers to the appropriate built-in. The argument type of the built-in needs to be at least the size of the fixed-size integers, so we don’t run into an overflow. But we cannot just use the biggest - long long - version: It may not be very effective.

I cannot portably do this mapping manually. Instead I trick the compiler into doing it for me.

I do that with my favorite technique: (ab)using overload resolution.

Wrapping the builtins

The first step in order to use overload resolution is to create a set of overloaded functions. Thus I wrap the builtins simply into a function that just takes unsigned int/long/long long and forwards:

// real code would put those into a namespace
unsigned clz_impl(unsigned int x)
{
	return __builtin_clz(x);
}

unsigned clz_impl(unsigned long x)
{
	return __builtin_clzl(x);
}

unsigned clz_impl(unsigned long long x)
{
	return __builtin_clzll(x);
}

Okay, so now they’ve all got the same name, they’re overloaded.

But the default resolution the compiler does is not good enough, e.g. calling clz_impl() inside the std::uint8_t version gives an ambiguity error: none of the candidates takes std::uint8_t and all promotions are equally good.

The compiler needs more baby-sitting until it has figured out what we want from him.

SFINAE to the rescue

In order to get an exact match we need to template the implementation functions. But they must not get any integer type, only integer types whose size is not bigger than the argument size to the built-in.

Conditionally disabling certain templates sounds a lot like SFINAE, so that’s what I’m going to use:

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(T x)
{
	return __builtin_clz(x);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(T x)
{
	return __builtin_clzl(x);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(T x)
{
	return __builtin_clzll(x);
}

The second template parameter is the result of a std::enable_if<> construct. If the condition given to it is false, substiution failure will prevent that overload from participating in overload resolution. In this case, this means that an overload is only chosen if the size of the type is less than or equal to the size of the argument its __builtin_clz() variant expects.

Now nothing works, the compiler complains about a redefinition. The conditions are not mutually exclusive, everything matches the last overload. What should the poor compiler do!

Tag dispatching to rescue the rescue

A built-in should only take the types that are smaller or equal to its argument type. We’ve already expressed that with the enable_if construct.

But we want the smallest argument type that works, in order to be the most effective. There is thus a priority in the overloads: At first, everything should use unsigned int version. Only if the type is bigger, the unsigned long version should be considered. And only if the type is even bigger, the unsigned long long version should be used as a last resort.

This priority can be expressed through tag dispatching. The tag is a type of a class hierachy like so:

struct clzll_tag {};
struct clzl_tag : clzll_tag {};
struct clz_tag : clzl_tag {};

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(clz_tag, T x)
{
	return __builtin_clz(x);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(clzl_tag, T x)
{
	return __builtin_clzl(x);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(clzll_tag, T x)
{
	return __builtin_clzll(x);
}

I’ve just now realized - as I’m typing this - that just like in the last implementation challenge the final solution is SFINAE combined with tag dispatching. But that just shows how powerful it is.

Each overload now takes a corresponding tag type as unnamed first argument. It’s only purpose is to help the compiler choose the right overload. The key here is the hierachy of the tag types. It is exactly inversed, the tag with the lowest priority is the base and the tag with the highest priority the most derived class.

Now we can finally use the wrappers in our clz() function:

unsigned clz(std::uint8_t x)
{
	return clz_impl(clz_tag{}, x);
}

unsigned clz(std::uint16_t x)
{
	return clz_impl(clz_tag{}, x);
}

// exactly the same for the other two overloads

We pass an instance of the tag with the highest priority as first argument. This means that the unsigned int version will be the best match - it is an exact match on the tag type. If it cannot be used, because the type of the template parameter is bigger than unsigned int, SFINAE kicks in and disables it. Now - and only now - the compiler will select one of the other overloads which require a derived-to-base conversions and are thus worse than the exact match. The unsigned long version is the second-best because it does only need to convert the tag one base deeper, not two for the remaining version. This unsigned long long gets only chosen if SFINAE disables the unsigned long one as well.

Bug fixing

The compiler will now select the right built-in. But the results are not always correct.

For example, the call to clz(std::uint16_t(1)) will return 31.

Either the compiler can fit 31 zeroes into 16bits or we have a bug.

I’ll guess my compiler is just very good.

Remember what I’ve said at the beginning? The result of clz() depends on the width of the type?

Yeah, we may select the right built-in, but then we just return the clz() for the built-in’s argument type! The above call will select the unsigned int version because that’s the smallest type that is big enough. But then it will just return the clz() for the - here! - 32bit integer.

We need to adjust the result.

To be precise, we need to subtract the width difference between the implementation’s argument type and the calling argument type:

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(clz_tag, T x)
{
	return __builtin_clz(x) - (sizeof(unsigned int) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(clzl_tag, T x)
{
	return __builtin_clzl(x) - (sizeof(unsigned long) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}

template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(clzll_tag, T x)
{
	return __builtin_clzll(x) - (sizeof(unsigned long long) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}

sizeof(unsigned XXX) * CHAR_BIT is the width of the argument type, sizeof(T) * CHAR_BIT the width of the argument type. Since SFINAE guarantees that the first is always greater than or equal to the second, we can simply subtract those two widths to get the difference which needs to be subtracted from the result.

For the 16bit integer the width difference to the 32bit integer is 16, so we subtract that from the resulting 31 and get the right answer: 15 zeroes for the first 1.

Conclusion

We have created a rather portable clz() implementation.

The GCC builtins are wrapped with the help of SFINAE and prioritized tag dispatching. This will thus choose always the perfect version for a given integer type and will dynamically adapt to the unsigned int/long/long long sizes on each platform.

The full code of the GCC version can be found here. What’s missing is the check for the support of the built-in. This is a completely different challenge. I’ve built one solution for that in the form of my compatibility library. It uses CMake to check for feature support and provides automated workarounds based on the result. Its clz() implementation can be found here - it is wrapped into CMake boilerplate though.

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.