Skip to content

perf(core): cut candidate scoring and hash sequence overhead#105

Merged
danielplohmann merged 2 commits into
danielplohmann:masterfrom
r0ny123:codex/upstream-pr-46-perf
May 28, 2026
Merged

perf(core): cut candidate scoring and hash sequence overhead#105
danielplohmann merged 2 commits into
danielplohmann:masterfrom
r0ny123:codex/upstream-pr-46-perf

Conversation

@r0ny123
Copy link
Copy Markdown
Contributor

@r0ny123 r0ny123 commented May 26, 2026

Summary

Three related, behavior-preserving hot-path tweaks. Single optimization class: avoidable per-call Python overhead in Intel candidate scoring and PIC/OPC hash sequence assembly.

1. PIC/OPC hash byte conversion (SmdaFunction, SmdaBasicBlock)

Replace bytes([ord(c) for c in "".join(seqs)]) with "".join(seqs).encode("ascii") in the four hash-sequence helpers (getPicHashSequence, getOpcHashSequence, getPicBlockHashSequence, getOpcBlockHashSequence).

The escaper only emits ASCII (hex chars + ?), so the encoded bytes are bit-identical. The per-character Python loop is gone.

2. Hoist sorted prologue lengths (FunctionCandidate)

sorted([int(length_str) for length_str in COMMON_PROLOGUES], reverse=True) was rebuilt on every call to hasCommonFunctionStart / getFunctionStartScore. Both are reached from calculateScore, getCharacteristics, __str__, and toJson, so every candidate touched the list many times. Lifted to module-level _COMMON_PROLOGUE_LENGTHS.

3. call_ref_sources list → set (FunctionCandidate)

The inner CFG-recovery loop does addr not in call_ref_sources on every call instruction; with a list that's O(n) per call and quadratic for hot targets (popular runtime stubs accumulate many sources). Set makes add/discard/membership O(1). Verified no external order dependence (grep across src/ and tests/): only len() and truthiness are used externally. The single [0] read was inside __str__'s len == 1 branch and is now next(iter(...)).

Measurements (focused bench, asprox fixture, Python 3.11.15)

Path Before After Δ
calculateScore × 200_000 322.2 ms 179.9 ms −44%
getPicBlockHashSequence × 2140 blocks 36.3 ms 31.1 ms −14%
getOpcHashSequence × 105 funcs 47.3 ms 43.7 ms −8%
getOpcBlockHashSequence × 2140 blocks 50.3 ms 47.6 ms −5%
getPicHashSequence × 105 funcs 31.0 ms 30.1 ms −3%
disassembleBuffer (e2e, asprox) 1617 ms 1579 ms −2.4%

Microbench on a representative 3.7 KB escaped sequence: bytes([ord(c) for c in s]) vs s.encode("ascii") is ~600x slower per call — the e2e gain looks small because Capstone/escaper work dominates a single call, but the cost scales linearly with function/block count, so larger binaries see proportionally larger absolute savings.

Behavior compatibility

  • pic_hash, opc_hash, and serialized report sha256 unchanged on asprox.
  • Public report schema unchanged.
  • All 90 tests + 43 subtests pass.
  • make lint (ruff check + format check) clean.

Test plan

  • python -m pytest tests/test* — 90 passed, 43 subtests passed in 13.56 s
  • python -m ruff check . — All checks passed
  • python -m ruff format --check . — 94 files already formatted
  • Before/after benchmark on asprox fixture (numbers above)
  • Verified report serialization assertions in testIntegration.testAsproxMarshalling / testCutwailMarshalling still hold

Residual perf risk

None expected. Set iteration order isn't insertion order, but no caller reads call_ref_sources in an order-dependent way. The only read of an element by index was the __str__ single-element case, which by definition has one element.

Out of scope for this branch (noted during the sweep, deferred)

  • IndirectCallAnalyzer.searchBlock is O(blocks × instructions) inside a 3-deep recursion — a one-shot addr → block index would help, but is more invasive.
  • SmdaFunction.getNormalizedBlockRefs is recomputed unconditionally; safe caching requires tracking architecture_metadata mutation.
  • BinaryInfo.getImportedFunctions calls PeSymbolProvider(None).parseSymbols(lief_result) and discards the result before re-instantiating for parseImports — looks like wasted work but warrants a separate audit of provider side effects.
  • Loaders re-lief.parse() the same buffer across getArchitecture / getCodeAreas calls. BinaryInfo already caches, but the static *FileLoader accessors do not.

claude added 2 commits May 26, 2026 18:35
…nces

Three related hot-path tweaks, all behavior-preserving:

1. SmdaFunction / SmdaBasicBlock: replace
   `bytes([ord(c) for c in "".join(seqs)])` with
   `"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence
   helpers. The output is byte-identical (the escaper emits ASCII-only
   strings), but the per-character Python loop is gone. Microbench on a
   ~3.7 KB escaped sequence shows ~600x speedup for the conversion step
   alone; on the asprox fixture (105 funcs, 2140 blocks) block hash
   sequence assembly drops ~15%.

2. FunctionCandidate: hoist
   `sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of
   `hasCommonFunctionStart` / `getFunctionStartScore` to a module-level
   constant. Both methods are called from `calculateScore` /
   `getCharacteristics` / `__str__` / `toJson`, so on every candidate
   the prologue length list was being rebuilt and re-sorted from
   scratch. calculateScore over 200k iterations drops from 322ms to
   180ms (~44%) in a focused bench.

3. FunctionCandidate.call_ref_sources: switch from list to set. The
   inner CFG-recovery loop does `addr not in call_ref_sources` on every
   call instruction; with a list this is O(n) per call and quadratic
   for hot targets (popular runtime stubs can accumulate many sources).
   With a set, add/discard/membership are O(1). The only order-sensitive
   read was the single-element branch in `__str__`, which now uses
   `next(iter(...))`. No external code depends on ordering — only `len`
   and truthiness (verified across src/ and tests/).

Validation:
- `make lint` (ruff check + format check) clean.
- `pytest tests/test*` 90 passed, 43 subtests passed.
- pic_hash / opc_hash / serialized report sha256 unchanged on asprox.
…lRefs

Follow-up to the call_ref_sources list-to-set switch: collapse the
Python-level "for addr in source_addrs: discard()" loop into a single
set.difference_update() call. The semantics are identical (both no-op on
missing elements) but the work is now done in optimized C, which matters
because removeCallRefs is called from FunctionCandidateManager.updateCandidates
whenever HIGH_ACCURACY is on (the default) and conflicts are detected
during CFG recovery.

Validation:
- pytest tests/test* -> 90 passed, 43 subtests passed
- ruff check + format --check clean
@r0ny123 r0ny123 marked this pull request as ready for review May 26, 2026 13:23
@danielplohmann danielplohmann merged commit b52d429 into danielplohmann:master May 28, 2026
23 checks passed
@r0ny123 r0ny123 deleted the codex/upstream-pr-46-perf branch May 28, 2026 08:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants