2023-03-16, Max Hollmann
This post is a Zig take on Per Vognsen's shift-based DFAs.
In theoretical computer science, DFAs (deterministic finite automata) are simple machines for determining whether a sequence of symbols follows some pattern. A DFA consists of a set of states and a transition function. One state is designated as the initial state, where the automaton begins, and some states are marked as accept states. If the automaton terminates in an accept state, the input string is accepted; otherwise it is rejected. The transition function describes how the automaton moves between states as input symbols are read.
DFAs may not be able to solve all problems, but they have nice properties on the problems they do solve. Each symbol triggers exactly one transition and the automaton halts after the last symbol, so we are guaranteed not only that the automaton will give us a result, but that it will give it quickly. So what can we do with DFAs? Parsing HTML is not on the list, but anything that can be described by a regular expression can be parsed by a DFA.
Unicode is a flexible character encoding standard which solves the problem of encoding non-English characters in computers. Today, Unicode is the lingua franca of text encoding, ending the dark times of locales once and for all. The Unicode standard defines several encodings with different word sizes, but the most widely used variant is UTF-8, a variable-length encoding using 8-bit code units. Wikipedia has a nice table on how Unicode code points are encoded into 8-bit bytes:
First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Code points |
---|---|---|---|---|---|---|
U+0000 | U+007F | 0xxxxxxx | | | | 128 |
U+0080 | U+07FF | 110xxxxx | 10xxxxxx | | | 1920 |
U+0800 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | | 61440 |
U+10000 | U+10FFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | 1048576 |
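As a quick worked example of the table (a sketch, not from the original post): the code point U+3053 ("こ") falls in the third row, so it is encoded as three bytes:

const std = @import("std");

test "encode U+3053 by hand" {
    const cp: u21 = 0x3053;
    const bytes = [3]u8{
        0b1110_0000 | @as(u8, @truncate(u4, cp >> 12)), // 1110xxxx
        0b1000_0000 | @as(u8, @truncate(u6, cp >> 6)), // 10xxxxxx
        0b1000_0000 | @as(u8, @truncate(u6, cp)), // 10xxxxxx
    };
    try std.testing.expectEqualSlices(u8, "こ", &bytes);
}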
Not all sequences of bytes are valid UTF-8 strings. It is therefore important to validate whether a buffer of bytes received from a user is a valid UTF-8 string before operating on it. Using an invalid string as if it were a valid one may result in corrupted text for the user or worse, a security vulnerability.
Looking at the table above, a DFA can be devised which accepts exactly those byte sequences that are a valid encoding of some Unicode code points. Building on UTF-8's concept of continuation bytes, we can design our DFA to count the number of continuation bytes it still expects. In the start state, reading a byte from the Byte 1 column of the table transitions to a state expecting the appropriate number of continuation bytes. Reading a continuation byte, or a byte which does not appear in the table at all, is an error and the DFA transitions to an error state, from which it can never return. Reading an ASCII character transitions back to the start state, since the start state doubles as the state expecting 0 continuation bytes. In a state expecting some number of continuation bytes, reading a continuation byte transitions to the state expecting one fewer; anything else is an error. If, after reading the input data, the DFA is neither expecting a continuation byte nor in the error state (id est it's back in the start state), the data is a valid UTF-8 string.
A canonical method of implementing a DFA in Zig would look like this:
const utfValidator = struct {
    // States are named after the number of continuation bytes still expected.
    const State = enum(u6) {
        ok,
        one,
        two,
        three,
        fail,
    };

    fn step(char: u8, state: State) State {
        return switch (state) {
            .ok => switch (char) {
                0x00...0x7f => .ok, // ASCII
                0xc0...0xdf => .one, // lead byte of a 2-byte sequence
                0xe0...0xef => .two, // lead byte of a 3-byte sequence
                0xf0...0xf7 => .three, // lead byte of a 4-byte sequence
                else => .fail,
            },
            .one => switch (char) {
                0x80...0xbf => .ok,
                else => .fail,
            },
            .two => switch (char) {
                0x80...0xbf => .one,
                else => .fail,
            },
            .three => switch (char) {
                0x80...0xbf => .two,
                else => .fail,
            },
            .fail => .fail,
        };
    }
};
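The ok state is both the initial and the only accepting state. A minimal driver loop (a sketch, not part of the snippet above) would look like this:

fn isValidUtf8(string: []const u8) bool {
    var state: utfValidator.State = .ok;
    for (string) |c| {
        state = utfValidator.step(c, state);
    }
    // Accept only if no continuation bytes are still owed and no error occurred.
    return state == .ok;
}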
A more popular method of implementation is a 2D table of states: one axis is the current state, the other the input symbol, and the transition function becomes a single lookup into this table. This approach is generally faster, while still being readable.
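A minimal sketch of that idea, reusing the utfValidator above (the original post doesn't show this variant, and the table here is generated from step rather than written out by hand):

const table: [5][256]utfValidator.State = blk: {
    @setEvalBranchQuota(5000);
    // 5 == number of states in utfValidator.State.
    var tmp: [5][256]utfValidator.State = undefined;
    for (&tmp, 0..) |*row, s| {
        for (row, 0..) |*cell, c| {
            // Precompute step() for every (state, byte) pair.
            cell.* = utfValidator.step(c, @intToEnum(utfValidator.State, s));
        }
    }
    break :blk tmp;
};

fn matchTable(string: []const u8) utfValidator.State {
    var state: utfValidator.State = .ok;
    for (string) |c| {
        // The per-byte work is a single table lookup.
        state = table[@enumToInt(state)][c];
    }
    return state;
}

The hot loop now does one load per byte instead of two nested switches.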
Now would be a good time to read Per Vognsen's writeup cited at the top of this post. By rearranging the values in the table, merging all the states of one row into a single uint64_t and a bit of bit-twiddling, Per was able to reach speeds of 1 input byte per CPU cycle. Not bad, not bad at all.

However, this achievement came at a cost. The transition table is now a bunch of magic uint64_t constants. That's not good for readability and it's rather error-prone when writing the table by hand. It would be nicer if we didn't have to interact with this table and it was generated automatically. This approach is also limited to 10 states. That is a more severe limitation, but it does not affect our UTF-8 validating automaton.
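To make the packing concrete, here is what the row for a continuation byte (0x80...0xbf) would look like for our validator, written out by hand purely as an illustration of the layout (state indices as in utfValidator.State: ok=0, one=1, two=2, three=3, fail=4):

const continuation_row: u64 =
    ((6 * 4) << (0 * 6)) | // ok    -> fail (unexpected continuation byte)
    ((6 * 0) << (1 * 6)) | // one   -> ok
    ((6 * 1) << (2 * 6)) | // two   -> one
    ((6 * 2) << (3 * 6)) | // three -> two
    ((6 * 4) << (4 * 6)); // fail  -> fail

Each state owns a 6-bit lane holding 6 times the next state's index, and the current state is carried around as a shift amount, so row >> state leaves the next shift amount in the low six bits.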
comptime
A distinguishing feature of the Zig programming language is its approach to reflection and compile-time evaluation. Could we employ it here? Perhaps evaluate a readable transition function for all possible inputs and generate the transition table programmatically? Well, of course!
const std = @import("std");

pub fn shiftDfa(comptime states: type, comptime transition: fn (u8, states) states) type {
    comptime std.debug.assert(@typeInfo(states) == .Enum);
    // Each state occupies a 6-bit lane of a u64, so at most 10 states fit.
    inline for (@typeInfo(states).Enum.fields) |f| {
        comptime std.debug.assert(f.value < 10);
    }
    return struct {
        // One u64 per input byte; the lane for state s holds 6 * next_state,
        // i.e. the shift amount selecting the next state's lane.
        const transitions = init: {
            var tmp = [_]u64{0} ** 256;
            @setEvalBranchQuota(5000);
            for (&tmp, 0..) |*e, i| {
                for (@typeInfo(states).Enum.fields) |f| {
                    e.* |= (6 * @as(u64, @enumToInt(transition(i, @intToEnum(states, f.value))))) << (f.value * 6);
                }
            }
            break :init tmp;
        };

        pub fn match(string: []const u8) states {
            // The current state is carried as a shift amount (6 * state index).
            var state: u64 = 0;
            for (string) |c| {
                state = transitions[c] >> @truncate(u6, state);
            }
            return @intToEnum(states, @truncate(u6, state) / 6);
        }
    };
}
Now plug in the DFA we implemented before and we're off to the races.
const validator = shiftDfa(utfValidator.State, utfValidator.step);
const str = "こんにちは";
std.debug.print("'{s}' is valid utf8: {}\n", .{ str, validator.match(str) == .ok });
Before we start the race it's a good idea to check if we got the code we wanted. Plugging our code into Compiler Explorer we see that the generated assembly is essentially the same as in Per Vognsen's original gist. Zig uses LLVM for its optimized binaries and LLVM likes unrolling loops, so we don't even have to do that ourselves. Unrolling 8 iterations should get us most of the way towards full saturation, but we could get closer by unrolling more iterations manually. Zig includes a function for validating UTF-8 strings in its standard library (std.unicode.utf8ValidateSlice), so that will be our point of comparison.
So how fast is it? Using an existing setup for benchmarking UTF-8 validation, this implementation settled in at 3.2 GB/s on a 3.6 GHz CPU. This is close to the theoretical 1 byte/cycle, but is likely held back by unrolling the loop only 8 times. More importantly, this is significantly faster than Zig's stdlib implementation, which peaked at 1 GB/s, with much slower performance on Latin and Cyrillic datasets.
This is not the fastest UTF-8 validation approach available today. That crown belongs to simdutf8, which is also where the benchmarking harness comes from. As the name suggests, simdutf8 takes advantage of parallel processing using SIMD instructions (AVX on AMD64, NEON on AArch64) and achieves around 12.5 GB/s of UTF-8 processing, going up to 90 GB/s on pure ASCII text. There is no way a DFA approach can compete with that.
We have an inferior solution to an already solved problem. Except that is not entirely true. SIMD algorithms are faster, but they need large datasets to reap the benefits of parallelization, which leaves room where the DFA approach is superior. On inputs smaller than ~70 bytes, the DFA came out faster than simdutf8. It also has the advantage of not touching vector registers, which is a requirement for kernel code and is a nice bonus when interleaved with some other vector-heavy code.
Another benefit is that this code works on any DFA with up to 10 states. Whatever your problem is, if you can solve it with a 10 state DFA, you can solve it at 1 byte/cycle.
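For instance, here is a hypothetical second automaton (not from the original post) that accepts optionally negated decimal integers, i.e. the regular expression -?[0-9]+, dropped into the same shiftDfa:

const intValidator = struct {
    const State = enum(u6) {
        start,
        minus,
        digits,
        fail,
    };

    fn step(char: u8, state: State) State {
        return switch (state) {
            .start => switch (char) {
                '-' => .minus,
                '0'...'9' => .digits,
                else => .fail,
            },
            .minus, .digits => switch (char) {
                '0'...'9' => .digits,
                else => .fail,
            },
            .fail => .fail,
        };
    }
};

const intMatcher = shiftDfa(intValidator.State, intValidator.step);
// intMatcher.match("-42") == .digits, intMatcher.match("4-2") == .fail

The only convention to respect is that the enum value 0 is the start state, since match begins with the shift amount set to 0.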
In his original article, Per Vognsen discusses the limitations of this approach, ways to support more states, and ways to push the throughput limit further. When writing this code I had an idea in the opposite direction: limit the DFA to 6 states and transition using only a limited number of bits from the input symbol. That might sound counterintuitive at first, but it is enough for a UTF-8 validating DFA and should unlock an optimization that would take the DFA to 2 bytes/cycle, getting dangerously close to SIMD approaches on non-ASCII inputs.
The code presented here is available under the MIT license should you wish to use it in your own projects. If there's enough interest, I'll try to upstream this code to Zig's standard library.