Getting started
SimdUnicode is a small, dependency-free C# library that validates UTF-8 with SIMD instructions. It targets .NET 8 or newer and runs on x64 and ARM64.
Requirements
- .NET 8 SDK or newer.
- A 64-bit x64 or ARM64 CPU for the SIMD kernels (a portable scalar fallback covers everything else).
Build & reference
Clone the repository and build the library:
git clone https://github.com/simdutf/SimdUnicode.git
cd SimdUnicode/src
dotnet build -c Release
Then add a project reference to src/SimdUnicode.csproj from your own project:
dotnet add reference path/to/SimdUnicode/src/SimdUnicode.csproj
Validating a buffer
The core entry point is UTF8.GetPointerToFirstInvalidByte.
It scans the input buffer (pInputBuffer) and returns a pointer to the first invalid byte, or a pointer to the end of the buffer when the input is well-formed.
using SimdUnicode;

static unsafe bool IsValidUtf8(ReadOnlySpan<byte> data)
{
    fixed (byte* p = data)
    {
        byte* end = UTF8.GetPointerToFirstInvalidByte(
            p, data.Length,
            out _ /* utf16 code-unit adjustment */,
            out _ /* scalar code-unit adjustment */);
        // Valid input: the returned pointer is one past the last byte.
        return end == p + data.Length;
    }
}
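Because the return value points at the first invalid byte, subtracting the start pointer gives the byte offset of the error. The helper below is a minimal sketch of that idea, not part of the library: the names Utf8Check and FirstInvalidOffset are hypothetical, and the project is assumed to enable unsafe code (AllowUnsafeBlocks) and to reference SimdUnicode.

```csharp
using System;
using SimdUnicode;

public static class Utf8Check
{
    // Hypothetical helper: returns the byte offset of the first invalid
    // byte, or -1 when the whole buffer is well-formed UTF-8.
    public static unsafe int FirstInvalidOffset(ReadOnlySpan<byte> data)
    {
        // An empty buffer is trivially valid; skip pinning it.
        if (data.IsEmpty) return -1;

        fixed (byte* p = data)
        {
            byte* firstInvalid = UTF8.GetPointerToFirstInvalidByte(
                p, data.Length, out _, out _);

            // Pointer difference = number of bytes before the error.
            long offset = firstInvalid - p;
            return offset == data.Length ? -1 : (int)offset;
        }
    }
}
```

For example, FirstInvalidOffset(new byte[] { 0x61, 0xFF, 0x62 }) should report offset 1, since 0xFF can never appear in UTF-8.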
The out parameters
| Parameter | Meaning |
|---|---|
| Utf16CodeUnitCountAdjustment | Add this to the byte count to get the number of UTF-16 code units. Counts -1 for each 2-byte character and -2 for each 3- or 4-byte character. |
| ScalarCodeUnitCountAdjustment | Add this to the UTF-16 code-unit count to get the number of Unicode scalar values. Counts -1 for each 4-byte character (one scalar value, but two UTF-16 code units). |

These adjustments let you compute the UTF-16 length and the scalar (code-point) count of a valid buffer during validation, with no second pass.

unsafe
{
    fixed (byte* p = data)
    {
        byte* end = UTF8.GetPointerToFirstInvalidByte(p, data.Length,
            out int utf16Adjust, out int scalarAdjust);
        if (end == p + data.Length)
        {
            // Bytes -> UTF-16 code units.
            int utf16Length = data.Length + utf16Adjust;
            // UTF-16 code units -> scalar values (each surrogate pair counts once).
            int codePointCount = utf16Length + scalarAdjust;
        }
    }
}
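To make the arithmetic concrete, here is a small sketch that recomputes both adjustments with a plain loop and checks them against the .NET base class library. It does not call SimdUnicode; the class AdjustmentExample and its methods are hypothetical names used only for illustration, and the loop assumes the input is already known to be valid UTF-8. Taking "-1 for each 4-byte character" at face value, the scalar count is derived from the UTF-16 length rather than from the byte count.

```csharp
using System;
using System.Linq;
using System.Text;

public static class AdjustmentExample
{
    // Reference (non-SIMD) computation of the two adjustments.
    // Assumes utf8 is valid UTF-8; illustration only, not library code.
    public static (int Utf16Adjust, int ScalarAdjust) Compute(ReadOnlySpan<byte> utf8)
    {
        int utf16Adjust = 0, scalarAdjust = 0;
        foreach (byte b in utf8)
        {
            // ASCII and continuation bytes contribute nothing.
            if (b < 0x80 || (b & 0xC0) == 0x80) continue;

            if (b < 0xE0) utf16Adjust -= 1;                // 2-byte lead
            else if (b < 0xF0) utf16Adjust -= 2;           // 3-byte lead
            else { utf16Adjust -= 2; scalarAdjust -= 1; }  // 4-byte lead
        }
        return (utf16Adjust, scalarAdjust);
    }

    public static void Demo()
    {
        // One character of each encoded length: 1 + 2 + 3 + 4 = 10 bytes.
        string text = "a\u00E9\u20AC\U0001F600";
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        var (utf16Adjust, scalarAdjust) = Compute(bytes);  // -5 and -1
        int utf16Length = bytes.Length + utf16Adjust;      // 10 - 5 = 5
        int scalarCount = utf16Length + scalarAdjust;      // 5 - 1 = 4

        Console.WriteLine(utf16Length == text.Length);                    // True
        Console.WriteLine(scalarCount == text.EnumerateRunes().Count());  // True
    }
}
```

The 4-byte character is the only one that produces a surrogate pair, which is why it alone contributes to the scalar adjustment.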
Choosing a specific kernel
GetPointerToFirstInvalidByte dispatches to the fastest kernel your CPU supports.
You can also call a specific implementation directly — useful for testing or pinning behaviour:
- GetPointerToFirstInvalidByteAvx512: AMD Zen 4 / Intel Ice Lake and newer.
- GetPointerToFirstInvalidByteAvx2: current x64.
- GetPointerToFirstInvalidByteSse: older x64 (SSE4.2 / SSSE3).
- GetPointerToFirstInvalidByteArm64: ARM NEON (Apple Silicon, Graviton…).
- GetPointerToFirstInvalidByteScalar: portable fallback.
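If you do call a kernel directly, it is worth checking first that the CPU supports it. The sketch below shows one way to probe for the relevant instruction sets with the standard .NET intrinsics IsSupported properties. It is only an illustration: KernelProbe and DescribeBestKernel are hypothetical names, and the exact feature checks and ordering the library uses internally are assumptions here and may differ.

```csharp
using System;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class KernelProbe
{
    // Returns the suffix of the kernel that would likely be the best fit
    // on this machine. The feature checks are assumptions, not the
    // library's documented dispatch rules.
    public static string DescribeBestKernel()
    {
        if (AdvSimd.Arm64.IsSupported) return "Arm64";   // ARM NEON on ARM64
        if (Avx512Vbmi.IsSupported) return "Avx512";     // assumed AVX-512 requirement
        if (Avx2.IsSupported) return "Avx2";
        if (Ssse3.IsSupported) return "Sse";             // assumed SSE-class requirement
        return "Scalar";                                 // portable fallback
    }
}
```

The IsSupported properties are treated as constants by the JIT, so a guard of this kind costs nothing at run time.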
Finding the first non-ASCII byte
The Ascii helper class offers fast ASCII scanning,
including GetIndexOfFirstNonAsciiByte and architecture-specific variants.
using SimdUnicode;

unsafe
{
    fixed (byte* p = data)
    {
        nuint idx = Ascii.GetIndexOfFirstNonAsciiByte(p, (nuint)data.Length);
        // The index equals the length when every byte is ASCII.
        bool allAscii = idx == (nuint)data.Length;
    }
}
Continue to How it works or jump to the API reference.