
enhance raxLowWalk: use memchr() for child-edge lookup in non-compressed nodes#3472

Merged

madolson merged 2 commits into valkey-io:unstable from charsyam:feature/enhance_rax on Apr 14, 2026


Conversation

@charsyam
Contributor

@charsyam charsyam commented Apr 9, 2026

Replace the open-coded byte-by-byte loop in raxLowWalk() with memchr(). libc implementations of memchr() on common platforms are SIMD-optimized (SSE2/AVX2 on x86_64, NEON on arm64), which significantly outperforms a scalar loop while remaining faster than a binary search at the small fan-out sizes (<= 256) that rax nodes can have.

On macOS arm64 (Apple M4 Pro):

Lookup Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 15.62             | 30.19  | 1.93x   |
| 100K  | 4      | 8.18              | 20.53  | 2.51x   |
| 200K  | 16     | 5.00              | 13.71  | 2.74x   |
| 500K  | 20     | 3.33              | 7.63   | 2.29x   |
| 1M    | 16     | 2.92              | 6.53   | 2.24x   |
| 200K  | 64     | 3.59              | 5.05   | 1.41x   |

Insert Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 5.14              | 5.79   | 1.13x   |
| 100K  | 4      | 4.60              | 6.11   | 1.33x   |
| 200K  | 16     | 4.08              | 5.20   | 1.28x   |
| 500K  | 20     | 3.52              | 5.06   | 1.44x   |
| 1M    | 16     | 3.68              | 4.73   | 1.29x   |
| 200K  | 64     | 3.85              | 4.86   | 1.26x   |

On Linux x86_64 (Ryzen 7 8845HS, GCC 13.3):

Lookup Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 16.68             | 33.99  | 2.04x   |
| 100K  | 4      | 9.55              | 21.13  | 2.21x   |
| 200K  | 16     | 5.62              | 8.18   | 1.46x   |
| 500K  | 20     | 3.84              | 5.47   | 1.42x   |
| 1M    | 16     | 3.55              | 4.93   | 1.39x   |
| 200K  | 64     | 3.54              | 4.40   | 1.24x   |

Insert Performance (Mops/s, Avg with 5 Repeats)

| nkeys | keylen | unstable (scalar) | memchr | speedup |
|------:|-------:|------------------:|-------:|--------:|
| 10K   | 8      | 4.97              | 6.32   | 1.27x   |
| 100K  | 4      | 4.37              | 5.29   | 1.21x   |
| 200K  | 16     | 4.00              | 4.85   | 1.21x   |
| 500K  | 20     | 3.79              | 4.79   | 1.26x   |
| 1M    | 16     | 3.50              | 4.36   | 1.25x   |
| 200K  | 64     | 3.50              | 4.29   | 1.23x   |

I also tested a binary-search variant (used when h->size exceeds a threshold T of 8, 16, or 32) and a hand-written SIMD implementation. The SIMD version (especially NEON on arm64) is slightly faster than this PR, but memchr() is stable, robust, and more readable.

Thanks.

Replace the open-coded byte-by-byte loop in raxLowWalk() with memchr().
libc implementations of memchr() on common platforms are SIMD-optimized
(SSE2/AVX2 on x86_64, NEON on arm64), which significantly outperforms a
scalar loop while remaining faster than a binary search at the small
fan-out sizes (<= 256) that rax nodes can have.

Microbenchmark on Apple Silicon (ns/op for finding a byte in a sorted
N-byte array):

  N    scalar   memchr   speedup
  16    4.46     1.56     2.86x
  32    8.43     1.86     4.53x
  64   15.32     2.30     6.66x
 128   31.27     3.11    10.05x
 256   38.09     4.03     9.45x

The change is purely algorithmic: tree structure and memory layout are
unchanged. Workloads where every node has small fan-out still benefit
because memchr() is faster than the scalar loop even for short scans.

Signed-off-by: charsyam <charsyam@naver.com>
@codecov

codecov Bot commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.26%. Comparing base (a3a8399) to head (7ffe43f).
⚠️ Report is 1 commit behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3472      +/-   ##
============================================
- Coverage     76.40%   76.26%   -0.14%     
============================================
  Files           159      159              
  Lines         79809    79809              
============================================
- Hits          60977    60868     -109     
- Misses        18832    18941     +109     
| Files with missing lines | Coverage Δ |
|--------------------------|------------|
| src/rax.c | 83.55% <100.00%> (ø) |

... and 19 files with indirect coverage changes


Member

@madolson madolson left a comment


Wanted to actually validate your claim, and micro benchmarking did show it's faster. Neat!

@madolson madolson merged commit c4db208 into valkey-io:unstable Apr 14, 2026
58 of 59 checks passed
@madolson madolson added the release-notes This issue should get a line item in the release notes label Apr 14, 2026
@ahmadbelb
Contributor

Nice work on the benchmarks and getting this merged, @charsyam! I proposed the same memchr() change in #3386 but didn't get around to opening the PR, glad to see it land!
