perf(core): cut candidate scoring and hash sequence overhead #105
Merged
danielplohmann merged 2 commits on May 28, 2026
…nces
Three related hot-path tweaks, all behavior-preserving:
1. SmdaFunction / SmdaBasicBlock: replace
`bytes([ord(c) for c in "".join(seqs)])` with
`"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence
helpers. The output is byte-identical (the escaper emits ASCII-only
strings), but the per-character Python loop is gone. Microbench on a
~3.7 KB escaped sequence shows ~600x speedup for the conversion step
alone; on the asprox fixture (105 funcs, 2140 blocks) block hash
sequence assembly drops ~15%.
2. FunctionCandidate: hoist
`sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of
`hasCommonFunctionStart` / `getFunctionStartScore` to a module-level
constant. Both methods are called from `calculateScore` /
`getCharacteristics` / `__str__` / `toJson`, so on every candidate
the prologue length list was being rebuilt and re-sorted from
scratch. calculateScore over 200k iterations drops from 322ms to
180ms (~44%) in a focused bench.
3. FunctionCandidate.call_ref_sources: switch from list to set. The
inner CFG-recovery loop does `addr not in call_ref_sources` on every
call instruction; with a list this is O(n) per call and quadratic
for hot targets (popular runtime stubs can accumulate many sources).
With a set, add/discard/membership are O(1). The only order-sensitive
read was the single-element branch in `__str__`, which now uses
`next(iter(...))`. No external code depends on ordering — only `len`
and truthiness (verified across src/ and tests/).
Validation:
- `make lint` (ruff check + format check) clean.
- `pytest tests/test*` 90 passed, 43 subtests passed.
- pic_hash / opc_hash / serialized report sha256 unchanged on asprox.
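The byte-identity claim behind the unchanged hashes can be checked in isolation. The following is a rough sketch, not the bench used for this PR: the `seqs` content is a made-up sample that only assumes the escaper's alphabet is hex characters plus `?`, and the timings it prints will vary by machine.

```python
import hashlib
import timeit

# Hypothetical escaped sequence: hex characters plus "?" wildcards, i.e. an
# ASCII-only alphabet as described for the escaper.
seqs = ["558bec", "83ec??", "e8????????", "c3"] * 200
joined = "".join(seqs)

old_bytes = bytes([ord(c) for c in joined])  # per-character Python loop
new_bytes = joined.encode("ascii")           # single C-level call

# Same bytes, therefore same digest for any hash computed over them.
assert old_bytes == new_bytes
assert hashlib.sha256(old_bytes).digest() == hashlib.sha256(new_bytes).digest()

old_time = timeit.timeit(lambda: bytes([ord(c) for c in joined]), number=200)
new_time = timeit.timeit(lambda: joined.encode("ascii"), number=200)
print("old: %.4fs, new: %.4fs" % (old_time, new_time))
```

As a general Python fact, the two expressions only diverge on non-ASCII input: for code points 128 to 255 the old form succeeds while `encode("ascii")` raises `UnicodeEncodeError`. With an ASCII-only escaper that case does not arise.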
…lRefs
Follow-up to the call_ref_sources list-to-set switch: collapse the
Python-level "for addr in source_addrs: discard()" loop into a single
set.difference_update() call. The semantics are identical (both no-op
on missing elements) but the work is now done in optimized C, which
matters because removeCallRefs is called from
FunctionCandidateManager.updateCandidates whenever HIGH_ACCURACY is on
(the default) and conflicts are detected during CFG recovery.
Validation:
- pytest tests/test* -> 90 passed, 43 subtests passed
- ruff check + format --check clean
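The equivalence claimed in this follow-up can be sketched on plain sets. The addresses below are made up for illustration, and the variable names are not taken from `removeCallRefs` itself.

```python
call_ref_sources = {0x401000, 0x401010, 0x401020}
source_addrs = [0x401010, 0x409999]  # the second address was never registered

# Before: one Python-level discard() call per address.
loop_result = set(call_ref_sources)
for addr in source_addrs:
    loop_result.discard(addr)

# After: a single call that accepts any iterable and ignores missing elements.
bulk_result = set(call_ref_sources)
bulk_result.difference_update(source_addrs)

assert loop_result == bulk_result == {0x401000, 0x401020}
```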
Summary
Three related, behavior-preserving hot-path tweaks. Single optimization class: avoidable per-call Python overhead in Intel candidate scoring and PIC/OPC hash sequence assembly.

1. PIC/OPC hash byte conversion (`SmdaFunction`, `SmdaBasicBlock`)
Replace `bytes([ord(c) for c in "".join(seqs)])` with `"".join(seqs).encode("ascii")` in the four hash-sequence helpers (`getPicHashSequence`, `getOpcHashSequence`, `getPicBlockHashSequence`, `getOpcBlockHashSequence`). The escaper only emits ASCII (hex chars + `?`), so the encoded bytes are bit-identical. The per-character Python loop is gone.

2. Hoist sorted prologue lengths (`FunctionCandidate`)
`sorted([int(length_str) for length_str in COMMON_PROLOGUES], reverse=True)` was rebuilt on every call to `hasCommonFunctionStart` / `getFunctionStartScore`. Both are reached from `calculateScore`, `getCharacteristics`, `__str__`, and `toJson`, so every candidate touched the list many times. Lifted to module-level `_COMMON_PROLOGUE_LENGTHS`.

3. `call_ref_sources` list → set (`FunctionCandidate`)
The inner CFG-recovery loop does `addr not in call_ref_sources` on every call instruction; with a list that's O(n) per call and quadratic for hot targets (popular runtime stubs accumulate many sources). A set makes add/discard/membership O(1). Verified no external order dependence (`grep` across `src/` and `tests/`): only `len()` and truthiness are used externally. The single `[0]` read was inside `__str__`'s `len == 1` branch and is now `next(iter(...))`.

Measurements (focused bench, asprox fixture, Python 3.11.15)
Benchmarked operations:
- `calculateScore` × 200_000
- `getPicBlockHashSequence` × 2140 blocks
- `getOpcHashSequence` × 105 funcs
- `getOpcBlockHashSequence` × 2140 blocks
- `getPicHashSequence` × 105 funcs
- `disassembleBuffer` (e2e, asprox)

Microbench on a representative 3.7 KB escaped sequence: `bytes([ord(c) for c in s])` vs `s.encode("ascii")` is ~600x slower per call. The e2e gain looks small because Capstone/escaper work dominates a single call, but the cost scales linearly with function/block count, so larger binaries see proportionally larger absolute savings.

Behavior compatibility
- `pic_hash`, `opc_hash`, and serialized report `sha256` unchanged on asprox.
- `make lint` (ruff check + format check) clean.

Test plan
- `python -m pytest tests/test*`: 90 passed, 43 subtests passed in 13.56 s
- `python -m ruff check .`: All checks passed
- `python -m ruff format --check .`: 94 files already formatted
- `testIntegration.testAsproxMarshalling` / `testCutwailMarshalling` still hold

Residual perf risk
None expected. Set iteration order isn't insertion order, but no caller reads `call_ref_sources` in an order-dependent way. The only read of an element by index was the `__str__` single-element case, which by definition has one element.

Out of scope for this branch (noted during the sweep, deferred)
- `IndirectCallAnalyzer.searchBlock` is O(blocks × instructions) inside a 3-deep recursion; a one-shot `addr → block` index would help, but is more invasive.
- `SmdaFunction.getNormalizedBlockRefs` is recomputed unconditionally; safe caching requires tracking `architecture_metadata` mutation.
- `BinaryInfo.getImportedFunctions` calls `PeSymbolProvider(None).parseSymbols(lief_result)` and discards the result before re-instantiating for `parseImports`; this looks like wasted work but warrants a separate audit of provider side effects.
- `lief.parse()` is called on the same buffer across `getArchitecture` / `getCodeAreas` calls. `BinaryInfo` already caches, but the static `*FileLoader` accessors do not.
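For the first deferred item, the one-shot `addr → block` index could look roughly like the sketch below. This is an assumption-laden illustration, not a proposed patch: the `blocks` layout (block start address mapped to a list of instruction addresses) and the helper name `build_addr_to_block_index` are hypothetical and do not reflect how `IndirectCallAnalyzer` actually stores its state.

```python
def build_addr_to_block_index(blocks):
    """Map every instruction address to the start address of its block.

    `blocks` is a hypothetical dict: block start -> list of instruction
    addresses. Building the index costs O(instructions) once; each later
    lookup is then an O(1) average dict access instead of a scan.
    """
    index = {}
    for block_start, instruction_addrs in blocks.items():
        for ins_addr in instruction_addrs:
            index[ins_addr] = block_start
    return index


blocks = {
    0x1000: [0x1000, 0x1002, 0x1005],
    0x1010: [0x1010, 0x1013],
}
index = build_addr_to_block_index(blocks)

assert index[0x1005] == 0x1000
assert index[0x1013] == 0x1010
assert index.get(0x2000) is None  # unknown address: no containing block
```

The index would have to be rebuilt or updated whenever blocks change, which is part of why the change is described above as more invasive.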