
enhance raxLowWalk: use memchr() for child-edge lookup in non-compressed nodes#3472

Merged

madolson merged 2 commits into valkey-io:unstable from charsyam:feature/enhance_rax on Apr 14, 2026


Conversation

@charsyam
Contributor

@charsyam charsyam commented Apr 9, 2026

Replace the open-coded byte-by-byte loop in raxLowWalk() with memchr(). libc implementations of memchr() on common platforms are SIMD-optimized (SSE2/AVX2 on x86_64, NEON on arm64), which significantly outperforms a scalar loop while remaining faster than a binary search at the small fan-out sizes (<= 256) that rax nodes can have.

On macOS arm64 (Apple M4 Pro):

Lookup Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 15.62             | 30.19  | 1.93x   |
| 100K  | 4      | 8.18              | 20.53  | 2.51x   |
| 200K  | 16     | 5.00              | 13.71  | 2.74x   |
| 500K  | 20     | 3.33              | 7.63   | 2.29x   |
| 1M    | 16     | 2.92              | 6.53   | 2.24x   |
| 200K  | 64     | 3.59              | 5.05   | 1.41x   |

Insert Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 5.14              | 5.79   | 1.13x   |
| 100K  | 4      | 4.60              | 6.11   | 1.33x   |
| 200K  | 16     | 4.08              | 5.20   | 1.28x   |
| 500K  | 20     | 3.52              | 5.06   | 1.44x   |
| 1M    | 16     | 3.68              | 4.73   | 1.29x   |
| 200K  | 64     | 3.85              | 4.86   | 1.26x   |

On Linux x86_64 (Ryzen 7 8845HS, GCC 13.3):

Lookup Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 16.68             | 33.99  | 2.04x   |
| 100K  | 4      | 9.55              | 21.13  | 2.21x   |
| 200K  | 16     | 5.62              | 8.18   | 1.46x   |
| 500K  | 20     | 3.84              | 5.47   | 1.42x   |
| 1M    | 16     | 3.55              | 4.93   | 1.39x   |
| 200K  | 64     | 3.54              | 4.40   | 1.24x   |

Insert Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 4.97              | 6.32   | 1.27x   |
| 100K  | 4      | 4.37              | 5.29   | 1.21x   |
| 200K  | 16     | 4.00              | 4.85   | 1.21x   |
| 500K  | 20     | 3.79              | 4.79   | 1.26x   |
| 1M    | 16     | 3.50              | 4.36   | 1.25x   |
| 200K  | 64     | 3.50              | 4.29   | 1.23x   |

I also tested a binary-search variant (used when h->size exceeds a threshold T of 8, 16, or 32) and a hand-written SIMD implementation. The SIMD version (especially NEON on arm64) is slightly faster than this PR, but memchr() is stable, robust, and more readable.

Thanks.

Replace the open-coded byte-by-byte loop in raxLowWalk() with memchr().
libc implementations of memchr() on common platforms are SIMD-optimized
(SSE2/AVX2 on x86_64, NEON on arm64), which significantly outperforms a
scalar loop while remaining faster than a binary search at the small
fan-out sizes (<= 256) that rax nodes can have.

Microbenchmark on Apple Silicon (ns/op for finding a byte in a sorted
N-byte array):

  N    scalar   memchr   speedup
  16    4.46     1.56     2.86x
  32    8.43     1.86     4.53x
  64   15.32     2.30     6.66x
 128   31.27     3.11    10.05x
 256   38.09     4.03     9.45x

The change is purely algorithmic: tree structure and memory layout are
unchanged. Workloads where every node has small fan-out still benefit
because memchr() is faster than the scalar loop even for short scans.

Signed-off-by: charsyam <charsyam@naver.com>
@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.26%. Comparing base (a3a8399) to head (7ffe43f).
⚠️ Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3472      +/-   ##
============================================
- Coverage     76.40%   76.26%   -0.14%     
============================================
  Files           159      159              
  Lines         79809    79809              
============================================
- Hits          60977    60868     -109     
- Misses        18832    18941     +109     
| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| src/rax.c | 83.55% <100.00%> (ø) |

... and 19 files with indirect coverage changes


Member

@madolson madolson left a comment


Wanted to actually validate your claim, and micro benchmarking did show it's faster. Neat!

@madolson madolson merged commit c4db208 into valkey-io:unstable Apr 14, 2026
58 of 59 checks passed
@madolson madolson added the release-notes This issue should get a line item in the release notes label Apr 14, 2026
@ahmadbelb
Contributor

Nice work on the benchmarks and getting this merged, @charsyam! I proposed the same memchr() change in #3386 but didn't get around to opening the PR, glad to see it land!
