Skip to content

ByteString: extract shared private helpers to reduce code duplication#2859

Closed
pjfanning wants to merge 5 commits intoapache:mainfrom
pjfanning:copilot/optimize-code-reuse-private-functions
Closed

ByteString: extract shared private helpers to reduce code duplication#2859
pjfanning wants to merge 5 commits intoapache:mainfrom
pjfanning:copilot/optimize-code-reuse-private-functions

Conversation

@pjfanning
Copy link
Copy Markdown
Member

@pjfanning pjfanning commented Apr 13, 2026

ByteString1C and ByteString1 contained near-identical SWAR search loops and unrolled byte-scan helpers duplicated verbatim across both classes. ByteStrings similarly had identical find inner functions in its generic and Byte-typed indexOf/lastIndexOf overloads.

New shared helpers in object ByteString

  • unrolledFirstIndexOf(bytes, fromIndex, byteCount, value) / unrolledLastIndexOf(...) — unrolled 1–7 byte scans, previously duplicated in both ByteString1C and ByteString1
  • swarFirstIndexOf(bytes, fromOffset, searchLength, elem) — SWAR forward-search loop; eliminates duplication between indexOf(Byte, Int) and indexOf(Byte, Int, Int) in both classes; returns absolute index
  • swarLastIndexOf(bytes, baseOffset, searchLength, elem) — SWAR reverse-search loop; same rationale; returns absolute index (consistent with swarFirstIndexOf)

Per-class changes

  • ByteString1 gains a private absToLogical(absIdx) helper to eliminate the repeated if (absResult == -1) -1 else absResult - startIndex pattern in all four index-search methods
  • ByteString1C / ByteString1 / ByteStrings: indexOf[B >: Byte] now pattern-matches on elem and dispatches to the Byte-specific SWAR overload for Byte inputs — removing the duplicate find inner function in ByteStrings and improving performance for Byte inputs (avoids boxing)
  • ByteStrings: lastIndexOf[B >: Byte] follows the same dispatch pattern

All public APIs are unchanged. Net: −80 lines. All helpers are small and static, keeping HotSpot inlining characteristics equivalent to the original inlined code.

Copilot AI and others added 5 commits April 13, 2026 20:34
… reuse

- Add `unrolledFirstIndexOf(bytes, fromIndex, byteCount, value)` and
  `unrolledLastIndexOf(bytes, fromIndex, byteCount, value)` shared helpers to
  `object ByteString`, replacing identical private methods that were duplicated in
  `ByteString1C` and `ByteString1`.
- Add `swarFirstIndexOf(bytes, fromOffset, searchLength, elem)` helper that
  contains the SWAR search loop, eliminating duplication between
  `indexOf(Byte, Int)` and `indexOf(Byte, Int, Int)` in both `ByteString1C`
  and `ByteString1`.
- Add `swarLastIndexOf(bytes, baseOffset, searchLength, elem)` helper that
  contains the reverse SWAR search loop, eliminating duplication between the
  `lastIndexOf(Byte, Int)` implementations in `ByteString1C` and `ByteString1`.
- Make `indexOf[B >: Byte]` in `ByteString1C`, `ByteString1`, and `ByteStrings`
  dispatch to the Byte-specific SWAR overload for Byte inputs (pattern match on
  elem), eliminating the near-identical `find` inner function that was duplicated
  between `indexOf[B]` and `indexOf(Byte, Int)` in `ByteStrings`, and improving
  performance for Byte searches in all three classes.
- Make `lastIndexOf[B >: Byte]` in `ByteStrings` likewise dispatch to
  `lastIndexOf(Byte, Int)` for Byte inputs, removing the duplicate `find`
  function.
- Net result: -89 lines of code with no behaviour change; all public APIs
  unchanged; HotSpot-friendly (small, static, easily-inlineable helpers).

Agent-Logs-Url: https://github.com/pjfanning/incubator-pekko/sessions/3ba37c44-b65b-4094-9f11-7541417d7c37

Co-authored-by: pjfanning <11783444+pjfanning@users.noreply.github.com>
…l helper in ByteString1

Agent-Logs-Url: https://github.com/pjfanning/incubator-pekko/sessions/3ba37c44-b65b-4094-9f11-7541417d7c37

Co-authored-by: pjfanning <11783444+pjfanning@users.noreply.github.com>
@pjfanning pjfanning marked this pull request as draft April 13, 2026 20:50
@pjfanning
Copy link
Copy Markdown
Member Author

pjfanning commented Apr 14, 2026

My computer is getting a bit old and is not great at providing consistent conditions during JMH runs.
I got these results and with the changes shows some variability than I'm willing to put down to the machine having some other stuff running during the early benchmarks. Ultimately, I think these results are close enough to say that it is unlikely that there is a slowdown caused by this PR's changes.

With Changes
[info] Benchmark                                                           Mode  Cnt          Score          Error  Units
[info] ByteString_indexOf_Benchmark.bs1_indexOf_from                      thrpt    3  150359434.050 ±  59507989.846  ops/s
[info] ByteString_indexOf_Benchmark.bs1_indexOf_from_byte                 thrpt    3  174120693.736 ± 64645037.735  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_best_case            thrpt    3  217852446.284 ± 28875894.270  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_far_index_case       thrpt    3   18300391.307 ±  4958872.389  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_far_index_case_byte  thrpt    3   16102869.070 ±  9775298.895  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_worst_case           thrpt    3    5574890.131 ±   573776.254  ops/s

Without Changes
[info] Benchmark                                                           Mode  Cnt          Score          Error  Units
[info] ByteString_indexOf_Benchmark.bs1_indexOf_from                      thrpt    3  153101714.175 ± 29261792.393  ops/s
[info] ByteString_indexOf_Benchmark.bs1_indexOf_from_byte                 thrpt    3  186251713.877 ± 88143603.248  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_best_case            thrpt    3  218559185.909 ± 36443255.263  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_far_index_case       thrpt    3   17241268.500 ±  3343535.671  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_far_index_case_byte  thrpt    3   16227186.055 ±  2458904.387  ops/s
[info] ByteString_indexOf_Benchmark.bss_indexOf_from_worst_case           thrpt    3    5570419.443 ±  1216122.560  ops/s

@He-Pin
Copy link
Copy Markdown
Member

He-Pin commented Apr 14, 2026

I can verify it later

@@ -1141,6 +975,92 @@ object ByteString {
def SerializationIdentity: Byte
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Consider @inline for swarFirstIndexOf / swarLastIndexOf / absToLogical

These helpers are on the hot path for every indexOf/lastIndexOf call. After extraction from in-class methods to companion-object private methods, the JVM JIT may still inline them eventually, but:

  1. absToLogical is tiny (if (absIdx == -1) -1 else absIdx - startIndex) — a perfect candidate for @inline to avoid the call overhead in tight loops
  2. swarFirstIndexOf and swarLastIndexOf are called from three different ByteString subclasses — without @inline, each call site adds a frame

The original code had these methods inline within each class (effectively monomorphic), so the JIT had an easier time. Extracting to shared helpers introduces polymorphic call sites.

Suggestion:

@inline private def absToLogical(absIdx: Int): Int = ...
@inline private def swarFirstIndexOf(...): Int = ...
@inline private def swarLastIndexOf(...): Int = ...

This ensures zero overhead from the deduplication — best of both worlds.

@pjfanning
Copy link
Copy Markdown
Member Author

  • Fair point about the extra stack frame.
  • Scala 2 and Scala 3 declare inlining differently (@inline def vs inline def) and I don't want to go back to having separate Scala 2 and 3 code so I think it is better to keep the code duplication.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants