Benchmarks indicated that obfuscating multiple bytes already gives an order of magnitude speed-up, but:
* GCC still emitted scalar code;
* Clang’s auto-vectorized loop ran on the slow unaligned-load path.
Fix contains:
* peeling the misaligned head enabled the hot loop starting at an 8-byte address;
* `std::assume_aligned<8>` tells the optimizer the promise holds - required to keep Apple Clang happy;
* manually unrolling the body to 64 bytes enabled GCC to auto-vectorize.
Note that `target.size() > KEY_SIZE` condition is just an optimization, the aligned and unaligned loops work without it as well - it's why the alignment calculation still contains `std::min`.
> C++ compiler .......................... GNU 14.2.0
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.03 | 32,464,658,919.11 | 0.0% | 0.50 | 0.11 | 4.474 | 0.08 | 0.0% | 5.29 | `ObfuscationBench`
> C++ compiler .......................... Clang 20.1.7
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.02 | 41,231,547,045.17 | 0.0% | 0.30 | 0.09 | 3.463 | 0.02 | 0.0% | 5.47 | `ObfuscationBench`
Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
All former `std::vector<std::byte>` keys were replaced with `uint64_t` (we still serialize them as vectors but convert immediately to `uint64_t` on load).
This is why some tests still generate vector keys and convert them to `uint64_t` later instead of generating them directly.
In `Obfuscation::Unserialize` we can safely throw an `std::ios_base::failure` since during mempool fuzzing `mempool_persist.cpp#L141` catches and ignored these errors.
> C++ compiler .......................... GNU 14.2.0
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.04 | 28,365,698,819.44 | 0.0% | 0.34 | 0.13 | 2.714 | 0.07 | 0.0% | 5.33 | `ObfuscationBench`
> C++ compiler .......................... Clang 20.1.7
| ns/byte | byte/s | err% | ins/byte | cyc/byte | IPC | bra/byte | miss% | total | benchmark
|--------------------:|--------------------:|--------:|----------------:|----------------:|-------:|---------------:|--------:|----------:|:----------
| 0.08 | 13,012,464,203.00 | 0.0% | 0.65 | 0.28 | 2.338 | 0.13 | 0.8% | 5.50 | `ObfuscationBench`
Co-authored-by: Hodlinator <172445034+hodlinator@users.noreply.github.com>
Co-authored-by: Ryan Ofsky <ryan@ofsky.org>
This is meant to focus the usages to narrow the scope of the obfuscation optimization.
`Obfuscation::Xor` is mostly a move.
Co-authored-by: maflcko <6399679+maflcko@users.noreply.github.com>
Since 31 byte xor-keys are not used in the codebase, using the common size (8 bytes) makes the benchmarks more realistic.
Co-authored-by: maflcko <6399679+maflcko@users.noreply.github.com>