The Fastest Scalar Utf8 Validation
In my previous post, I’ve ported a fast DFA implementation to Zig and shown how it could be used to quickly validate UTF-8 using only scalar instructions. This implementation accepts one byte each CPU cycle achieving throughput of 3.2 GB/s on my laptop. At the end of that post I hinted at doubling the performance by imposing additional limitations on the DFA, so let’s see how that works.
In order to go faster, we must first understand why we’re not faster already. Let’s take a look at the hottest assembly out there.
movzx ecx, byte ptr [rdi + offset] shrx rax, qword ptr [8*rcx + .table], rax
There are 2 limiting factors in the code above. Both instructions issue a load and most of our
target CPUs have 2 load ports. However there is a harder limit. We have a carried dependency.
shrx instruction depends on the result of the previous one and it takes one cycle for the
result to be available. Hence 1 byte/cycle is the limit.
Harder, Better, Faster, Stronger
So we want to go faster, and we know why we’re not going faster, but what can we do about that? We are not going to get more than one transition per cycle due to the carried dependency, but if we can process more data with the same amount of transitions, we improve the throughput. If we process 2 bytes at a time the states remain the same, for transitions we add a transition for all possible pairs of consecutive edges in the original DFA. This is equivalent to making one transition on the first 8 bits, then making a second transition on the last 8 bits.
However, there is one big problem with this change. The size. Our original table was 8 * 2^8 = 2048 bytes large. That’s OK and fits comfortably along other data into an L1 cache. Our new table has exploded exponentially and we’re now at 8 * 2^16 = 52488 bytes, or about 16x the size of your average L1 cache. Well that’s not going to fly, it will take ages to look up transitions and there is no space left to prefetch the input.
Less is More
If we could tame the transition table to be much smaller, we could keep our cake and eat it too. Let’s start with the easy one. We don’t need 10 states. The UTF validation DFA has 5 states and we never enter the other ones. Switching to 32bit states means also going to 5 bits-per-state to keep with the shift semantics, resulting in 6 states. Good enough. But that’s only 50% saving on the transition table. Not good enough. We need to cut down on the exponential expansion.
An observant reader might have noticed that our DFA doesn’t actually use all the bits. Instead only the top bits have an effect on which transition is taken, with the start of a 4-byte-sequence requiring the most bits to figure out the correct transition. Unfortunately we don’t know how many bits we need until we read the data, so we have to go with the conservative 5 bits for the 4-byte-sequence. At 10 bits of relevant data and 4-byte-states, we would need 4 * 2^10 = 4095 bytes, which is only double the original size and would solve our problem.
|First code point||Last code point||Byte 1||Byte 2||Byte 3||Byte 4||Code points|
Unicode codepoint to UTF-8 mapping.1
Nothing’s ever that easy. We have 16 bits, want only 10 of them and we want them fast. The standard approach would be to mask, shift and bitwise-or to extract the bits we want. Except that’s 5 additional instructions and we can only dispatch 4. A more advanced approach would be to replace the shifts and bitwise-or with an upper multiply. Except that after masking the two groups of bits are too close to each other and we would have to mask again after multiply, overshooting our budget again.
But there’s a special instruction in the x86-64 toolbox.
pext. Parallel bit extract is a single
instruction which extracts arbitrary bits from an integer according to a mask and packs them to the
lower bits of the output. This is exactly what we want. We give it 2 input bytes, a mask with ones
in the upper 5 bits of each byte and out we get the 10 bits the transition depends on. The only
downside is that Zen 1 and 2 implement
pext in microcode and its incredibely slow. I don’t have
a Zen 1/2 machine to test this code on, but I’m sure it will be unusably slow.
pext: 0xxxxxxx0yyyyyyy 1111100011111000 -> 0xxxx0yyyy 110xxxxx10yyyyyy 1111100011111000 -> 110xx10yyy
I ran the code through the simdutf8 bench suite as before. For context the single-cycle DFA ran at 3.2 GB/s, simdjson’s implementation runs at 9.2 GB/s and simdutf8 at 12.3 GB/s. The double DFA approach introduced here reached throughput of 6.6GB/s. As far as I am aware, this is the fastest scalar UTF-8 validation implementation. Furthermore this approach has basically no warmup which makes it the fastest implementation when validating strings under 100 bytes. You can check the code out in Compiler Explorer.
|testcase size||shift-based DFA||2 byte DFA||simdutf8||simdjson||rust std|
Throughput in simdutf8’s bench suite in GB/s on my i5 8350u laptop.
If you wish to use this code, feel free to do so under the terms of the MIT licence (C) Max Hollmann 2023. I would recommend exploring adding a fast path for ASCII and early error detection though.