perf(core): cut candidate scoring and hash sequence overhead by r0ny123 · Pull Request #105 · danielplohmann/smda

r0ny123 · 2026-05-26T13:05:26Z

Summary

Three related, behavior-preserving hot-path tweaks. Single optimization class: avoidable per-call Python overhead in Intel candidate scoring and PIC/OPC hash sequence assembly.

1. PIC/OPC hash byte conversion (`SmdaFunction`, `SmdaBasicBlock`)

Replace bytes([ord(c) for c in "".join(seqs)]) with "".join(seqs).encode("ascii") in the four hash-sequence helpers (getPicHashSequence, getOpcHashSequence, getPicBlockHashSequence, getOpcBlockHashSequence).

The escaper only emits ASCII (hex chars + ?), so the encoded bytes are bit-identical. The per-character Python loop is gone.

2. Hoist sorted prologue lengths (`FunctionCandidate`)

sorted([int(length_str) for length_str in COMMON_PROLOGUES], reverse=True) was rebuilt on every call to hasCommonFunctionStart / getFunctionStartScore. Both are reached from calculateScore, getCharacteristics, __str__, and toJson, so every candidate touched the list many times. Lifted to module-level _COMMON_PROLOGUE_LENGTHS.

3. `call_ref_sources` list → set (`FunctionCandidate`)

The inner CFG-recovery loop does addr not in call_ref_sources on every call instruction; with a list that's O(n) per call and quadratic for hot targets (popular runtime stubs accumulate many sources). Set makes add/discard/membership O(1). Verified no external order dependence (grep across src/ and tests/): only len() and truthiness are used externally. The single [0] read was inside __str__'s len == 1 branch and is now next(iter(...)).

Measurements (focused bench, asprox fixture, Python 3.11.15)

Path	Before	After	Δ
`calculateScore × 200_000`	322.2 ms	179.9 ms	−44%
`getPicBlockHashSequence × 2140 blocks`	36.3 ms	31.1 ms	−14%
`getOpcHashSequence × 105 funcs`	47.3 ms	43.7 ms	−8%
`getOpcBlockHashSequence × 2140 blocks`	50.3 ms	47.6 ms	−5%
`getPicHashSequence × 105 funcs`	31.0 ms	30.1 ms	−3%
`disassembleBuffer` (e2e, asprox)	1617 ms	1579 ms	−2.4%

Microbench on a representative 3.7 KB escaped sequence: bytes([ord(c) for c in s]) vs s.encode("ascii") is ~600x slower per call — the e2e gain looks small because Capstone/escaper work dominates a single call, but the cost scales linearly with function/block count, so larger binaries see proportionally larger absolute savings.

Behavior compatibility

pic_hash, opc_hash, and serialized report sha256 unchanged on asprox.
Public report schema unchanged.
All 90 tests + 43 subtests pass.
make lint (ruff check + format check) clean.

Test plan

python -m pytest tests/test* — 90 passed, 43 subtests passed in 13.56 s
python -m ruff check . — All checks passed
python -m ruff format --check . — 94 files already formatted
Before/after benchmark on asprox fixture (numbers above)
Verified report serialization assertions in testIntegration.testAsproxMarshalling / testCutwailMarshalling still hold

Residual perf risk

None expected. Set iteration order isn't insertion order, but no caller reads call_ref_sources in an order-dependent way. The only read of an element by index was the __str__ single-element case, which by definition has one element.

Out of scope for this branch (noted during the sweep, deferred)

IndirectCallAnalyzer.searchBlock is O(blocks × instructions) inside a 3-deep recursion — a one-shot addr → block index would help, but is more invasive.
SmdaFunction.getNormalizedBlockRefs is recomputed unconditionally; safe caching requires tracking architecture_metadata mutation.
BinaryInfo.getImportedFunctions calls PeSymbolProvider(None).parseSymbols(lief_result) and discards the result before re-instantiating for parseImports — looks like wasted work but warrants a separate audit of provider side effects.
Loaders re-lief.parse() the same buffer across getArchitecture / getCodeAreas calls. BinaryInfo already caches, but the static *FileLoader accessors do not.

…nces Three related hot-path tweaks, all behavior-preserving: 1. SmdaFunction / SmdaBasicBlock: replace `bytes([ord(c) for c in "".join(seqs)])` with `"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence helpers. The output is byte-identical (the escaper emits ASCII-only strings), but the per-character Python loop is gone. Microbench on a ~3.7 KB escaped sequence shows ~600x speedup for the conversion step alone; on the asprox fixture (105 funcs, 2140 blocks) block hash sequence assembly drops ~15%. 2. FunctionCandidate: hoist `sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of `hasCommonFunctionStart` / `getFunctionStartScore` to a module-level constant. Both methods are called from `calculateScore` / `getCharacteristics` / `__str__` / `toJson`, so on every candidate the prologue length list was being rebuilt and re-sorted from scratch. calculateScore over 200k iterations drops from 322ms to 180ms (~44%) in a focused bench. 3. FunctionCandidate.call_ref_sources: switch from list to set. The inner CFG-recovery loop does `addr not in call_ref_sources` on every call instruction; with a list this is O(n) per call and quadratic for hot targets (popular runtime stubs can accumulate many sources). With a set, add/discard/membership are O(1). The only order-sensitive read was the single-element branch in `__str__`, which now uses `next(iter(...))`. No external code depends on ordering — only `len` and truthiness (verified across src/ and tests/). Validation: - `make lint` (ruff check + format check) clean. - `pytest tests/test*` 90 passed, 43 subtests passed. - pic_hash / opc_hash / serialized report sha256 unchanged on asprox.

…lRefs Follow-up to the call_ref_sources list-to-set switch: collapse the Python-level "for addr in source_addrs: discard()" loop into a single set.difference_update() call. The semantics are identical (both no-op on missing elements) but the work is now done in optimized C, which matters because removeCallRefs is called from FunctionCandidateManager.updateCandidates whenever HIGH_ACCURACY is on (the default) and conflicts are detected during CFG recovery. Validation: - pytest tests/test* -> 90 passed, 43 subtests passed - ruff check + format --check clean

claude added 2 commits May 26, 2026 18:35

r0ny123 marked this pull request as ready for review May 26, 2026 13:23

danielplohmann merged commit b52d429 into danielplohmann:master May 28, 2026
23 checks passed

r0ny123 deleted the codex/upstream-pr-46-perf branch May 28, 2026 08:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(core): cut candidate scoring and hash sequence overhead#105

perf(core): cut candidate scoring and hash sequence overhead#105
danielplohmann merged 2 commits into
danielplohmann:masterfrom
r0ny123:codex/upstream-pr-46-perf

r0ny123 commented May 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

r0ny123 commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

1. PIC/OPC hash byte conversion (SmdaFunction, SmdaBasicBlock)

2. Hoist sorted prologue lengths (FunctionCandidate)

3. call_ref_sources list → set (FunctionCandidate)

Measurements (focused bench, asprox fixture, Python 3.11.15)

Behavior compatibility

Test plan

Residual perf risk

Out of scope for this branch (noted during the sweep, deferred)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

r0ny123 commented May 26, 2026 •

edited

Loading

1. PIC/OPC hash byte conversion (`SmdaFunction`, `SmdaBasicBlock`)

2. Hoist sorted prologue lengths (`FunctionCandidate`)

3. `call_ref_sources` list → set (`FunctionCandidate`)