The Fastest Scalar Utf8 Validation

In order to go faster, we must first understand why we're not faster already. Let's take a look at the hottest assembly out there.

movzx   ecx, byte ptr [rdi + offset]
shrx    rax, qword ptr [8*rcx + .table], rax

There are 2 limiting factors in the code above. Both instructions issue a load and most of our target CPUs have 2 load ports. However there is a harder limit. We have a carried dependency. The shrx instruction depends on the result of the previous one and it takes one cycle for the result to be available. Hence 1 byte/cycle is the limit.

Harder, Better, Faster, Stronger

So we want to go faster, and we know why we're not going faster, but what can we do about that? We are not going to get more than one transition per cycle due to the carried dependency, but if we can process more data with the same amount of transitions, we improve the throughput. If we process 2 bytes at a time the states remain the same, for transitions we add a transition for all possible pairs of consecutive edges in the original DFA. This is equivalent to making one transition on the first 8 bits, then making a second transition on the last 8 bits.

However, there is one big problem with this change. The size. Our original table was 8 * 2^8 = 2048 bytes large. That's OK and fits comfortably along other data into an L1 cache. Our new table has exploded exponentially and we're now at 8 * 2^16 = 52488 bytes, or about 16x the size of your average L1 cache. Well that's not going to fly, it will take ages to look up transitions and there is no space left to prefetch the input.

Less is More

If we could tame the transition table to be much smaller, we could keep our cake and eat it too. Let's start with the easy one. We don't need 10 states. The UTF validation DFA has 5 states and we never enter the other ones. Switching to 32bit states means also going to 5 bits-per-state to keep with the shift semantics, resulting in 6 states. Good enough. But that's only 50% saving on the transition table. Not good enough. We need to cut down on the exponential expansion.

An observant reader might have noticed that our DFA doesn't actually use all the bits. Instead only the top bits have an effect on which transition is taken, with the start of a 4-byte-sequence requiring the most bits to figure out the correct transition. Unfortunately we don't know how many bits we need until we read the data, so we have to go with the conservative 5 bits for the 4-byte-sequence. At 10 bits of relevant data and 4-byte-states, we would need 4 * 2^10 = 4095 bytes, which is only double the original size and would solve our problem.

Getting it

Nothing's ever that easy. We have 16 bits, want only 10 of them and we want them fast. The standard approach would be to mask, shift and bitwise-or to extract the bits we want. Except that's 5 additional instructions and we can only dispatch 4. A more advanced approach would be to replace the shifts and bitwise-or with an upper multiply. Except that after masking the two groups of bits are too close to each other and we would have to mask again after multiply, overshooting our budget again.

First code point	Last code point	Byte 1	Byte 2	Byte 3	Byte 4	Code points
U+0000	U+007F	0xxxxxxx				128
U+0080	U+07FF	110xxxxx	10xxxxxx			1920
U+0800	U+FFFF	1110xxxx	10xxxxxx	10xxxxxx		61440
U+10000	U+10FFFF	11110xxx	10xxxxxx	10xxxxxx	10xxxxxx	1048576

But there's a special instruction in the x86-64 toolbox. pext. Parallel bit extract is a single instruction which extracts arbitrary bits from an integer according to a mask and packs them to the lower bits of the output. This is exactly what we want. We give it 2 input bytes, a mask with ones in the upper 5 bits of each byte and out we get the 10 bits the transition depends on. The only downside is that Zen 1 and 2 implement pext in microcode and its incredibely slow. I don't have a Zen 1/2 machine to test this code on, but I'm sure it will be unusably slow.

pext:
0xxxxxxx0yyyyyyy
1111100011111000
->    0xxxx0yyyy

110xxxxx10yyyyyy
1111100011111000
->    110xx10yyy

Result

I ran the code through the simdutf8 bench suite as before. For context the single-cycle DFA ran at 3.2 GB/s, simdjson's implementation runs at 9.2 GB/s and simdutf8 at 12.3 GB/s. The double DFA approach introduced here reached throughput of 6.6GB/s. As far as I am aware, this is the fastest scalar UTF-8 validation implementation. Furthermore this approach has basically no warmup which makes it the fastest implementation when validating strings under 100 bytes. You can check the code out in Compiler Explorer.

If you wish to use this code, feel free to do so under the terms of the MIT licence (C) Max Hollmann 2023. I would recommend exploring adding a fast path for ASCII and early error detection though.

testcase size	shift-based DFA	2 byte DFA	simdutf8	simdjson	rust std
2	0.563	0.552	0.282	0.102	0.418
8	2.059	1.830	0.561	0.412	0.715
65	3.058	4.506	2.652	2.599	0.935
513	3.283	6.244	9.042	7.471	0.987
4096	3.295	6.523	12.380	9.465	0.834
65536	3.286	6.638	12.497	9.318	0.545
131072	3.250	6.615	12.429	9.288	0.542

The Fastest Scalar Utf8 Validation

Speed Limits

Harder, Better, Faster, Stronger

Less is More

Getting it

Result