perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108
Draft
r0ny123 wants to merge 31 commits into
Draft
perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108r0ny123 wants to merge 31 commits into
r0ny123 wants to merge 31 commits into
Conversation
Updates the requirements on [setuptools](https://github.com/pypa/setuptools) to permit the latest version. - [Release notes](https://github.com/pypa/setuptools/releases) - [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst) - [Commits](pypa/setuptools@v64.0.0...v82.0.1) --- updated-dependencies: - dependency-name: setuptools dependency-version: 82.0.1 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com>
- Restore explanatory comment on 21h sign-extension semantics. - Use 0x0F instead of 15 for nibble masks and parenthesize shifts (11n, 12x, 22c, 22s, 22t). - Switch 45cc/4rcc operand formatting to f-strings for consistency. - Drop redundant operands="" reassignment in _decode_fmt_10x. - Remove dead local initialization in decode_instruction; read fields directly from the decoder result dict.
Each decoder previously initialized seven locals to defaults and returned
a full dict, regardless of which fields it actually populated. Switch the
contract so decoders return only the keys they set; decode_instruction
reads each field from the result with dict.get and a sensible default.
_decode_fmt_10x collapses to `return {}`.
Also drop the manual sign branch in _decode_fmt_21h — hex() already emits
the leading '-' for negative literals, so the output is unchanged.
Net effect: 340 fewer lines, identical behaviour (80/80 tests pass).
- Pass base_addr through _calculate_boundaries so the bogus-section check is consistent with _map_sections and avoids 32-bit overflow edge cases on systems where sys.maxsize is relative to the base. - Drop the redundant sys.maxsize warning in mapBinary; virtual_size is already validated against SmdaConfig.MAX_IMAGE_SIZE, and the prior warning text was misleading after the refactor. - Remove the outdated Python 2.x cStringIO comment in mapBinary.
Harden the CodeRabbit-reported runtime-safety and API-correctness issues across the disassembly facade, Intel/CIL backends, label providers, loaders, report DTOs, and shared utilities. Grouped fixes: - guard optional timeout callbacks and malformed parser/provider inputs - correct byte/string comparisons, alignment ordering, and provider call contracts - make report DTO construction, statistics addition, and mutable defaults safer - replace fragile/private APIs, direct prints, deprecated logger calls, and assert-only validation - convert malformed Rust demangler inputs to demangler-specific errors - preserve explicit backend selection during DEX autodetect - move regression coverage into subsystem test suites instead of a CodeRabbit-specific test file Validation: - python -m pytest tests/testDalvikDisassembler.py tests/testEscaper.py tests/testRustSymbolProvider.py tests/testFileFormatParsers.py tests/testIntelDisassembler.py tests/testIndirectCallAnalyzer.py -q - make test - make lint - python -m ruff check src/smda tests/testDalvikDisassembler.py - git diff --check
…nces
Three related hot-path tweaks, all behavior-preserving:
1. SmdaFunction / SmdaBasicBlock: replace
`bytes([ord(c) for c in "".join(seqs)])` with
`"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence
helpers. The output is byte-identical (the escaper emits ASCII-only
strings), but the per-character Python loop is gone. Microbench on a
~3.7 KB escaped sequence shows ~600x speedup for the conversion step
alone; on the asprox fixture (105 funcs, 2140 blocks) block hash
sequence assembly drops ~15%.
2. FunctionCandidate: hoist
`sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of
`hasCommonFunctionStart` / `getFunctionStartScore` to a module-level
constant. Both methods are called from `calculateScore` /
`getCharacteristics` / `__str__` / `toJson`, so on every candidate
the prologue length list was being rebuilt and re-sorted from
scratch. calculateScore over 200k iterations drops from 322ms to
180ms (~44%) in a focused bench.
3. FunctionCandidate.call_ref_sources: switch from list to set. The
inner CFG-recovery loop does `addr not in call_ref_sources` on every
call instruction; with a list this is O(n) per call and quadratic
for hot targets (popular runtime stubs can accumulate many sources).
With a set, add/discard/membership are O(1). The only order-sensitive
read was the single-element branch in `__str__`, which now uses
`next(iter(...))`. No external code depends on ordering — only `len`
and truthiness (verified across src/ and tests/).
Validation:
- `make lint` (ruff check + format check) clean.
- `pytest tests/test*` 90 passed, 43 subtests passed.
- pic_hash / opc_hash / serialized report sha256 unchanged on asprox.
…lRefs Follow-up to the call_ref_sources list-to-set switch: collapse the Python-level "for addr in source_addrs: discard()" loop into a single set.difference_update() call. The semantics are identical (both no-op on missing elements) but the work is now done in optimized C, which matters because removeCallRefs is called from FunctionCandidateManager.updateCandidates whenever HIGH_ACCURACY is on (the default) and conflicts are detected during CFG recovery. Validation: - pytest tests/test* -> 90 passed, 43 subtests passed - ruff check + format --check clean
resolveRegisterCalls() resolves each "call <register>" by walking the
CFG backward through up to block_depth (=3) levels of incoming refs.
At every level, searchBlock was doing a linear scan over every block in
the function and, for each block, a list comprehension over every
instruction:
for block in analysis_state.getBlocks():
if address in [i[0] for i in block]:
return block
So one call to searchBlock is O(B*I) — and the recursive descent into
processBlock calls it once per incoming ref at every depth. Functions
with many register calls (the file already mentions a Go sample with
130k of them) hit this hot.
This commit:
* Seeds an {instruction_addr: containing_block} dict once at the start
of resolveRegisterCalls(), so every searchBlock lookup is O(1).
* Preserves "first matching block wins" by using `if addr not in index`
during construction — important because FunctionAnalysisState.getBlocks
can place the same instruction in multiple overlapping blocks via
the sorted potential_starts walk.
* Clears the index in a finally so a reused analyzer instance never
serves a stale index after the function completes.
* Keeps a slim linear-scan fallback in searchBlock for direct callers
(e.g. existing unit tests that drive processBlock without going
through resolveRegisterCalls).
Microbench (80 blocks × 15 instructions, 1200 lookups):
legacy linear scan: 17.04 ms
indexed O(1) lookup: 0.18 ms
-> 92x faster, bit-identical block-object references returned.
End-to-end on asprox is unchanged (it has few register calls); the win
scales with the number of indirect calls in the binary.
Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
- asprox sha256 / num_instructions / function count unchanged
Address Gemini review on PR #47: stash the {instruction_addr: block} index on analysis_state instead of self. analysis_state has the right lifetime (one per function analysis) so the cache is naturally re-entrancy-safe and can't outlive what it indexes, the analyzer keeps no transient state, and the explicit seed + try/finally in resolveRegisterCalls goes away. searchBlock now lazy-builds on first call, so the legacy fallback branch is also gone — every caller (including direct unit-test callers) gets the O(1) path automatically. contextlib.suppress(AttributeError) guards the cache write so that test doubles or hypothetical __slots__-locked states still work; the freshly built dict is returned in that case. Re-ran the focused micro-bench (80 blocks x 15 instructions, 1200 lookups): ~107x faster than the legacy scan, 0/1200 parity mismatches. End-to-end asprox sha256/num_instructions/function count unchanged. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean
…rtedFunctions getImportedFunctions() instantiated PeSymbolProvider(None) and called parseSymbols(lief_result) only to throw the return value away, then spun up a second PeSymbolProvider(None) just to call parseImports. parseSymbols walks lief_binary.symbols and builds a function_symbols dict; with no `self` mutation (it returns the dict), that work is purely wasted whenever getImportedFunctions is called. This removes the dead call. parseSymbols still runs from getSymbols() and from PeSymbolProvider.update(), so symbol parsing is unchanged for the callers that actually use it. ELF path is untouched. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean
…nager _logCandidateStats() has no callers (grep src/ tests/) and silently contained a latent bug — the third list comprehension counted scores == 2 again instead of == 0, so the dead-code stats would have been wrong if anyone wired it in. Each list comprehension also called getScore() once per candidate and ran five separate passes over candidates.values(), so reviving it would be performance-hostile. Removed the method outright. The reraise_non_operational_exception import is still used elsewhere in the file, so the import stays. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean
PE/ELF/MachO file loaders previously each called lief.parse(binary) inside several of their static accessors. FileLoader._loadFile calls 6 of these accessors in sequence (mapBinary, getBaseAddress, getBitness, getCodeAreas, getArchitecture, getAbi), so a single PE binary was re-parsed up to 2 times, ELF up to 4 times, and MachO up to 5 times on every load. lief.parse is the single most expensive step for any non-trivial binary. Refactor: - Each LIEF-aware loader now exposes parseBinary(binary) -> lief obj | None as a single parse entry point. - Every accessor on PE / ELF / MachO accepts an optional parsed=None kwarg and reuses it when supplied (falling back to lief.parse(binary) when None, preserving the legacy direct-call contract). - ELF's existing elffile= and MachO's macho_file= kwargs are renamed to parsed= for uniform API. grep across src/ and tests/ confirms no external caller passed those by name; the only callers were in-file (mapBinary -> getBaseAddress) and were updated. - FileLoader._loadFile parses once via loader.parseBinary(self._raw_data) and threads the parsed object through every accessor via **kw. Delphi/Dex loaders don't have parseBinary, so kw stays empty for them. - PE accessors that never used lief (mapBinary/getBaseAddress/getBitness/ getAbi) accept parsed= for API uniformity and `del parsed` it. Cheap to pass, keeps the FileLoader call site uniform. Net effect: 1 lief.parse per binary load instead of 2/4/5. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - testFileFormatParsers (PE/ELF/MachO/CIL paths) all pass - testPeFileLoader passes - ruff check + format --check clean - No external callers reference elffile= or macho_file= by keyword
Addresses code-review finding on the lief-parse-sharing refactor:
when loader.parseBinary(...) returns None (corrupt or unsupported
binary), pre-fix we still constructed kw={"parsed": None} and
threaded it through every accessor, which then fell back to calling
lief.parse(binary) again — one wasted parse attempt on every accessor
in the failure path.
Now we only seed kw["parsed"] when the parse succeeded. Behaviour is
identical on the happy path; failure path no longer racks up
re-parses.
…ure path Address Gemini review on PR #50: when loader.parseBinary() returns None (corrupt/unsupported binary), the previous fix dropped the kwarg entirely so each accessor's parsed=None default sent it back into a fresh lief.parse(binary) attempt — N pointless re-parses per failed load. Switch each LIEF-aware loader to a module-private sentinel default: _NOT_PROVIDED = object() def getCodeAreas(binary, parsed=_NOT_PROVIDED): macho_file = lief.parse(binary) if parsed is _NOT_PROVIDED else parsed Now parsed=None is a distinct, valid signal meaning "I already tried and got None, do not retry"; the accessor sees None and returns its documented failure default. Legacy direct callers (no parsed=) still trigger an internal lief.parse exactly as before. FileLoader._loadFile is simplified to always pass kw = {"parsed": loader.parseBinary(self._raw_data)} when the loader is LIEF-aware. No conditional drop. Verified instrumentation: on bogus PE bytes (MZ + zeros), parseBinary returns None and the 6 accessors do **0** further lief.parse calls (previously: 6). Legacy direct calls (no parsed=) still parse — confirmed with a spy. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean
…omments The PE accessors that don't use lief had a 2-line comment each explaining why they accept and discard parsed=. The signature already matches the other loaders (uniform API) and `del parsed` already says "intentionally unused" — the comments restated both without adding information. Removed. The load-bearing comments stay: the _NOT_PROVIDED sentinel docstring (explains why a sentinel instead of None) and the FileLoader._loadFile note about always threading parsed= through (prevents a future maintainer from reintroducing the failure-path re-parse).
SmdaFunction.__init__ and SmdaFunction.fromDict each call getNormalizedBlockRefs(); the result then gets recomputed two more times if CALCULATE_SCC and CALCULATE_NESTING are on (both default to True) because _calculateSccs and _calculateNestingDepth each call getNormalizedBlockRefs again. The normalization is idempotent (re-normalizing already-normalized blockrefs returns the same dict), so the redundant passes are pure wasted work for every analyzed function. This memoizes the result keyed by (id(self.blocks), id(self.blockrefs), id(self.architecture_metadata)). Reassigning any of those input attributes invalidates the cache automatically because the id changes; SMDA only ever assigns those during __init__ / _parseBlocks / fromDict, so the cache is built once per function and reused. Callers that need to recompute can clear self._normalized_blockrefs_cache explicitly. Microbench on the asprox fixture (105 functions, 200 normalization passes each): cached: 7.0 ms total forced recompute: 154.3 ms total -> 22x speedup, same dict object returned on subsequent calls Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean - asprox sha256 / num_instructions unchanged
…tion Address Gemini review on PR #51: __init__ does self.blockrefs = self.getNormalizedBlockRefs() right after the first call. The id() of self.blockrefs then changes to the id() of the cached return value. The original cache key included id(self.blockrefs), so subsequent calls (from _calculateSccs and _calculateNestingDepth inside __init__) saw a cache miss and recomputed. Loosen the cache check: accept a hit when (id(self.blocks), id(self.architecture_metadata)) match AND self.blockrefs is either the original input dict OR the cached normalized dict itself. Any other reassignment still misses and recomputes — the contract for external mutation is unchanged. Verified with an instrumentation spy: across the 3-call sequence (__init__, _calculateSccs, _calculateNestingDepth), the recompute counter is now 1 (was 2 before the fix). Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean
…and Macho
Both ElfFileLoader and MachoFileLoader used a while-loop that spliced
merged_code_areas with list slicing on every adjacent-pair merge:
merged_code_areas = merged[:i] + [[a, b]] + merged[i+2:]
Each splice copies the list, so k merges cost O(n·k) ≤ O(n²). The loop
also built a `result` list that was appended to but never returned (dead
code). PeFileLoader already used the correct O(n) linear pattern: seed
result with the first area, walk forward, extend result[-1] in place on
adjacency, otherwise append.
Both loaders are replaced with that pattern. Behaviour is identical for
all inputs (empty, single, all-adjacent, no-adjacent, unsorted); verified
by trace and reviewer subagent.
Pattern: OC-2 mergeCodeAreas-O(n²)
Instances: ElfFileLoader.py, MachoFileLoader.py (2 files, same signature)
Validation: make lint, make test (111 passed)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Branch and base
Branch: claude/perf-sweep-20260527-merge-code-areas
Commit: 8433f00
Base: master (c1227d7)
Optimization class
OC-2 — mergeCodeAreas-O(n²)
Root cause
ElfFileLoader.mergeCodeAreasandMachoFileLoader.mergeCodeAreasused an identicalwhile-loop that merged adjacent intervals by list slicing on every merge step:
Each splice copies the entire list, making k merges cost O(n·k) ≤ O(n²). The loop also
maintained a
resultlist that was appended to but never returned — dead code.PeFileLoader.mergeCodeAreasalready used the correct O(n) pattern: seedresultwith thefirst area, walk forward linearly, extend
result[-1]in-place on adjacency, otherwiseappend.
Change
Both ELF and Macho implementations are replaced with the PE pattern. No logic changes —
same merge condition (
area_a[1] == area_b[0]), same output for all inputs. The deadresultvariable is removed.Before / after evidence
Targeted test suite (testFileFormatParsers + testIntegration, 22 tests):
The O(n²) pathology only manifests when k adjacent-section merges occur (k ≥ 2).
Typical binaries have 1–10 executable sections, so the practical improvement is in
allocation pressure (no temporary list copies) rather than raw iteration count.
Behavior compatibility
Reviewer subagent (fresh context) returned PASS. Confirmed:
getCodeAreasunchanged. ✓Validation commands
Residual risk
None. Pure algorithmic replacement of a static utility method with a semantically identical
but asymptotically better implementation. Not serialized; not part of the public report schema.