# Implementation Challenge: A count leading zeroes function

When doing arithmetic in a programming language there is the arcane art of optimizing with the help of bit-wise operations. Of course I'm talking about Bit Hacks.

On a readability and maintainability ranking from 1 to `awk`

Bit Hacks reach a level of Brainfuck.
Yet they can be an incredibly low-level optimization useful to tweak the last bit of performance out of an operation but are difficult to get right and 100% portable.

In this post we'll take a look at a rather easy function - `clz(x)`

that will return the number of leading zero bits in an `unsigned`

integer type `x`

.
In particular I'll show you how to properly wrap GCC's `__builtin_clz()`

.

## Motivation

People usually tend to use base 10 in calculations they perform in their head.

Expect those who design Bit Hacks I assume.

In base 10 operations like multiply or divide by 10, 100,… are trivial: just add or remove the appropriate numbers of zeroes. To be precise: Shift the decimal point by a certain amount. Likewise, calculating the integer logarithm for base 10 (i.e. the number of decimal places) is just that: counting the digits the number has.

Computers - usually - tend to use base 2, so all those operations are trivial for powers of 2 or calculating the logarithm for base 2. Multiplication/Division by a power of 2 is just a bit shift, for example.

But

pleasedo not write it as a bit shift, unless you actuallymeana bit shift.

And the `ilog2()`

, the base 2 logarithm for integers, is just counting the number of binary digits a certain integer value needs.
For counting those, you can use `clz()`

: Just take the width of the integer - i.e. the number of bits - subtract the number of leading zeroes and add/subtract one depending on whether or not it is a power of two and whether or not you want a ceiling or flooring implementation (i.e. whether `ilog2(3)`

should be `1`

or `2`

; `log2(3)`

would be `1.xxx`

).

Yes, there are other - possible better - implementations of

`ilog2()`

. But I wanted to use`clz()`

and couldn't come up with an allocator example. Okay, that's not true: thereisan allocator that's actually using`clz()`

, but just as`ilog2()`

as well.

The number of bits of an integer `x`

is just `sizeof(x) * CHAR_BIT`

. `sizeof(x)`

returns the number of “bytes” in `x`

. `CHAR_BIT`

is a macro from `<climits>`

providing the number of bits in a `char`

.

I put “bytes” in quotation marks because

`sizeof()`

doesn't return therealnumber of bytes.`sizeof()`

returns the number of C++ bytes with`sizeof(char)`

alwaysbeing`1`

even if it is actual 4 bytes in memory or something strange like that.

And detecting whether or not a number is a power of two can easily done by another bit hack, so what's left is `clz()`

.

## The Challenge

The function `clz()`

takes any `unsigned`

integer type and returns the number of leading zero-bits in the binary representation of its value.

We limit ourselves to

`unsigned`

integer types because`signed`

types have a sign-bit that will mess thinks up. Also bit operations on them are not a good idea™.

As an example, consider `clz(4)`

. `4`

in binary is `100`

.
But how many 0s are in front? 0? 13? 29? 1334?

It depends.

If `4`

is stored in a 16bit integer, the result is 13 because there are 13 unused zeroes in front of `100`

.

If `4`

is stored in a 32bit integer, the result is 29 because there are 16 more zeroes.

And if

`4`

were stored in a 1337bit integer, the result would be 1334. Note to reader: If you know an architecture with 1337bit integers,pleaselet me know!

`clz()`

can only be properly defined for integers of given size, i.e. for given number of bits.
To get *portable* and *consistent* result, we need integers of a fixed size - the `std::uintX_t`

types from `<cstdint>`

.

Strictly speaking, this is still not completely portable. In the C++ standard, the synopsis of

`<cstdint>`

marks those fixed-sized integers as “optional”, because they may not be supported on some architectures.

With this in mind, we can declare our `clz()`

function as follows:

```
unsigned clz(std::uint8_t x);
unsigned clz(std::uint16_t x);
unsigned clz(std::uint32_t x);
unsigned clz(std::uint64_t x);
```

It is overloaded for each integer size and returns the number of leading zeroes for that size.

To be fair: This is unnecessary for the use case in

`ilog2()`

since the width is being subtracted anyway.

## The manual implementation

I'm not going to go into much detail, because writing it manual is just *boring*.

We could do a loop over all bits but this is too slow.
Instead, I've used a binary search.
The integer is split into two halves, the upper and lower half.
If the upper half is non-zero, the first `1`

is in the upper half, so return `clz()`

of the upper half.
Otherwise the first `1`

is in the lower half - the upper half is all zero, so the result is the width of the upper half plus the `clz()`

on the lower half.

This maps very well to the four `clz()`

overloads.
We split the integer into the two smaller integer types and call `clz()`

on the smaller type,
overload resolution will automatically select the different implementation:

```
unsigned clz(std::uint32_t x)
{
// shift upper half down, rest is filled up with 0s
auto upper = std::uint16_t(x >> 16);
// mask upper half away
auto lower = std::uint16_t(x & 0xFFFF);
// their type is std::uint16_t so a smaller overload is chosen
return upper ? clz(upper) : 16 + clz(lower);
}
// similar for std::uint64_t and std::uint16_t
```

The final overload for `std::uint8_t`

splits it into 4bit-halves and uses a lookup-table:

```
unsigned clz(std::uint8_t x)
{
static constexpr std::uint8_t clz_lookup[16] = { 4, 3, 2, 2, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0 };
auto upper = x >> 4;
auto lower = x & 0x0F;
return upper ? clz_lookup[upper] : 4 + clz_lookup[lower];
}
```

So far, so slow.

## __builtin_clz()

Most architectures have special instructions for doing those calculations. But writing assembler isn't exactly portable. Luckily, many compilers wrap those into intrinsic functions that will be translated into the optimal assembler.

Under GCC and compatible compilers like clang it is called `__builtin_clz()`

. It comes in the following variants.

```
int __builtin_clz(unsigned int x);
int __builtin_clzl(unsigned long x);
int __builtin_clzll(unsigned long long x);
```

So if those built-ins are available we can use them in the implementation of our `clz()`

function.

Note that they have undefined results for the value

`0`

. I'll just ignore that here and mark my`clz()`

as undefined for`0`

as well.

But e.g. the first version returns the `clz()`

for `unsigned int`

.
Its size may change from platform to platform and with it the result of `clz()`

!

This isn't good.

We need to portably map each fixed-size integers to the appropriate built-in.
The argument type of the built-in needs to be at least the size of the fixed-size integers,
so we don't run into an overflow.
But we cannot just use the biggest - `long long`

- version: It may not be very effective.

I cannot portably do this mapping manually. Instead I trick the compiler into doing it for me.

I do that with my favorite technique: (ab)using overload resolution.

## Wrapping the builtins

The first step in order to use overload resolution is to create a set of overloaded functions.
Thus I wrap the builtins simply into a function that just takes `unsigned int/long/long long`

and forwards:

```
// real code would put those into a namespace
unsigned clz_impl(unsigned int x)
{
return __builtin_clz(x);
}
unsigned clz_impl(unsigned long x)
{
return __builtin_clzl(x);
}
unsigned clz_impl(unsigned long long x)
{
return __builtin_clzll(x);
}
```

Okay, so now they've all got the same name, they're overloaded.

But the default resolution the compiler does is not good enough, e.g. calling `clz_impl()`

inside the `std::uint8_t`

version gives an ambiguity error: none of the candidates takes `std::uint8_t`

and all promotions are equally good.

The compiler needs more baby-sitting until it has figured out what we want from him.

## SFINAE to the rescue

In order to get an exact match we need to template the implementation functions.
But they must not get *any* integer type, only integer types whose size is not bigger than the argument size to the built-in.

Conditionally disabling certain templates sounds a lot like SFINAE, so that's what I'm going to use:

```
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(T x)
{
return __builtin_clz(x);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(T x)
{
return __builtin_clzl(x);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(T x)
{
return __builtin_clzll(x);
}
```

The second template parameter is the result of a

`std::enable_if<>`

construct. If the condition given to it is`false`

, substiution failure will prevent that overload from participating in overload resolution. In this case, this means that an overload is only chosen if the size of the type is less than or equal to the size of the argument its`__builtin_clz()`

variant expects.

Now nothing works, the compiler complains about a redefinition. The conditions are not mutually exclusive, everything matches the last overload. What should the poor compiler do!

## Tag dispatching to rescue the rescue

A built-in should only take the types that are smaller or equal to its argument type.
We've already expressed that with the `enable_if`

construct.

But we want the *smallest* argument type that works, in order to be the most effective.
There is thus a *priority* in the overloads:
At first, everything should use `unsigned int`

version.
Only if the type is bigger, the `unsigned long`

version should be considered.
And only if the type is even bigger, the `unsigned long long`

version should be used as a last resort.

This *priority* can be expressed through tag dispatching.
The tag is a type of a class hierachy like so:

```
struct clzll_tag {};
struct clzl_tag : clzll_tag {};
struct clz_tag : clzl_tag {};
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(clz_tag, T x)
{
return __builtin_clz(x);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(clzl_tag, T x)
{
return __builtin_clzl(x);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(clzll_tag, T x)
{
return __builtin_clzll(x);
}
```

I've just now realized - as I'm typing this - that just like in the last implementation challenge the final solution is SFINAE combined with tag dispatching. But that just shows how powerful it is.

Each overload now takes a corresponding tag type as unnamed first argument.
It's only purpose is to help the compiler choose the right overload.
The key here is the *hierachy* of the tag types.
It is exactly inversed, the tag with the lowest priority is the base and the tag with the highest priority the most derived class.

Now we can finally use the wrappers in our `clz()`

function:

```
unsigned clz(std::uint8_t x)
{
return clz_impl(clz_tag{}, x);
}
unsigned clz(std::uint16_t x)
{
return clz_impl(clz_tag{}, x);
}
// exactly the same for the other two overloads
```

We pass an instance of the tag with the highest priority as first argument.
This means that the `unsigned int`

version will be the best match - it is an exact match on the tag type.
If it cannot be used, because the type of the template parameter is bigger than `unsigned int`

, SFINAE kicks in and disables it.
Now - and only now - the compiler will select one of the other overloads which require a derived-to-base conversions and are thus worse than the exact match.
The `unsigned long`

version is the second-best because it does only need to convert the tag one base deeper, not two for the remaining version. This `unsigned long long`

gets only chosen if SFINAE disables the `unsigned long`

one as well.

## Bug fixing

The compiler will now select the right built-in. But the results are not always correct.

For example, the call to `clz(std::uint16_t(1))`

will return `31`

.

Either the compiler can fit 31 zeroes into 16bits or we have a bug.

I'll guess my compiler is just very good.

Remember what I've said at the beginning?
The result of `clz()`

depends on the width of the type?

Yeah, we may select the right built-in, but then we just return the `clz()`

for the built-in's argument type!
The above call will select the `unsigned int`

version because that's the smallest type that is big enough. But then it will just return the `clz()`

for the - here! - 32bit integer.

We need to adjust the result.

To be precise, we need to subtract the width difference between the implementation's argument type and the calling argument type:

```
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned int)>::type>
unsigned clz_impl(clz_tag, T x)
{
return __builtin_clz(x) - (sizeof(unsigned int) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long)>::type>
unsigned clz_impl(clzl_tag, T x)
{
return __builtin_clzl(x) - (sizeof(unsigned long) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}
template <typename T, typename = typename std::enable_if<sizeof(T) <= sizeof(unsigned long long)>::type>
unsigned clz_impl(clzll_tag, T x)
{
return __builtin_clzll(x) - (sizeof(unsigned long long) * CHAR_BIT - sizeof(T) * CHAR_BIT);
}
```

`sizeof(unsigned XXX) * CHAR_BIT`

is the width of the argument type, `sizeof(T) * CHAR_BIT`

the width of the argument type.
Since SFINAE guarantees that the first is always greater than or equal to the second,
we can simply subtract those two widths to get the difference which needs to be subtracted from the result.

For the 16bit integer the width difference to the 32bit integer is `16`

, so we subtract that from the resulting `31`

and get the right answer: `15`

zeroes for the first `1`

.

## Conclusion

We have created a rather portable `clz()`

implementation.

The GCC builtins are wrapped with the help of SFINAE and prioritized tag dispatching.
This will thus choose always the perfect version for a given integer type and will dynamically adapt to the `unsigned int/long/long long`

sizes on each platform.

The full code of the GCC version can be found here.
What's missing is the check for the support of the built-in.
This is a completely different challenge.
I've built one solution for that in the form of my compatibility library.
It uses CMake to check for feature support and provides automated workarounds based on the result.
Its `clz()`

implementation can be found here - it is wrapped into CMake boilerplate though.

If you've liked this blog post, consider supporting me - a dollar a month can really help me out.

This blog post was written for my old blog design and ported over. If there are any issues, please let me know.