perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho by r0ny123 · Pull Request #108 · danielplohmann/smda

r0ny123 · 2026-05-26T19:39:32Z

Branch and base

Branch: claude/perf-sweep-20260527-merge-code-areas
Commit: 8433f00
Base: master (c1227d7)

Optimization class

OC-2 — mergeCodeAreas-O(n²)

Root cause

ElfFileLoader.mergeCodeAreas and MachoFileLoader.mergeCodeAreas used an identical
while-loop that merged adjacent intervals by list slicing on every merge step:

merged_code_areas = merged_code_areas[:index] + [[a, b]] + merged_code_areas[index + 2:]

Each splice copies the entire list, making k merges cost O(n·k) ≤ O(n²). The loop also
maintained a result list that was appended to but never returned — dead code.

PeFileLoader.mergeCodeAreas already used the correct O(n) pattern: seed result with the
first area, walk forward linearly, extend result[-1] in-place on adjacency, otherwise
append.

Change

Both ELF and Macho implementations are replaced with the PE pattern. No logic changes —
same merge condition (area_a[1] == area_b[0]), same output for all inputs. The dead
result variable is removed.

Before / after evidence

Targeted test suite (testFileFormatParsers + testIntegration, 22 tests):

baseline:  22 passed in 2.15s
after:     22 passed in 1.99s

The O(n²) pathology only manifests when k adjacent-section merges occur (k ≥ 2).
Typical binaries have 1–10 executable sections, so the practical improvement is in
allocation pressure (no temporary list copies) rather than raw iteration count.

Behavior compatibility

Reviewer subagent (fresh context) returned PASS. Confirmed:

Semantic equivalence for: empty, single, all-adjacent, no-adjacent, unsorted, overlapping inputs. ✓
Caller contract in getCodeAreas unchanged. ✓
All three loaders now use byte-for-byte identical implementations — divergence reduced. ✓
No tests removed. ✓

Validation commands

make lint && make test   # 111 passed

Residual risk

None. Pure algorithmic replacement of a static utility method with a semantically identical
but asymptotically better implementation. Not serialized; not part of the public report schema.

Updates the requirements on [setuptools](https://github.com/pypa/setuptools) to permit the latest version. - [Release notes](https://github.com/pypa/setuptools/releases) - [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst) - [Commits](pypa/setuptools@v64.0.0...v82.0.1) --- updated-dependencies: - dependency-name: setuptools dependency-version: 82.0.1 dependency-type: direct:development ... Signed-off-by: dependabot[bot] <support@github.com>

- Restore explanatory comment on 21h sign-extension semantics. - Use 0x0F instead of 15 for nibble masks and parenthesize shifts (11n, 12x, 22c, 22s, 22t). - Switch 45cc/4rcc operand formatting to f-strings for consistency. - Drop redundant operands="" reassignment in _decode_fmt_10x. - Remove dead local initialization in decode_instruction; read fields directly from the decoder result dict.

Each decoder previously initialized seven locals to defaults and returned a full dict, regardless of which fields it actually populated. Switch the contract so decoders return only the keys they set; decode_instruction reads each field from the result with dict.get and a sensible default. _decode_fmt_10x collapses to `return {}`. Also drop the manual sign branch in _decode_fmt_21h — hex() already emits the leading '-' for negative literals, so the output is unchanged. Net effect: 340 fewer lines, identical behaviour (80/80 tests pass).

- Pass base_addr through _calculate_boundaries so the bogus-section check is consistent with _map_sections and avoids 32-bit overflow edge cases on systems where sys.maxsize is relative to the base. - Drop the redundant sys.maxsize warning in mapBinary; virtual_size is already validated against SmdaConfig.MAX_IMAGE_SIZE, and the prior warning text was misleading after the refactor. - Remove the outdated Python 2.x cStringIO comment in mapBinary.

…ivision opcodes

Harden the CodeRabbit-reported runtime-safety and API-correctness issues across the disassembly facade, Intel/CIL backends, label providers, loaders, report DTOs, and shared utilities. Grouped fixes: - guard optional timeout callbacks and malformed parser/provider inputs - correct byte/string comparisons, alignment ordering, and provider call contracts - make report DTO construction, statistics addition, and mutable defaults safer - replace fragile/private APIs, direct prints, deprecated logger calls, and assert-only validation - convert malformed Rust demangler inputs to demangler-specific errors - preserve explicit backend selection during DEX autodetect - move regression coverage into subsystem test suites instead of a CodeRabbit-specific test file Validation: - python -m pytest tests/testDalvikDisassembler.py tests/testEscaper.py tests/testRustSymbolProvider.py tests/testFileFormatParsers.py tests/testIntelDisassembler.py tests/testIndirectCallAnalyzer.py -q - make test - make lint - python -m ruff check src/smda tests/testDalvikDisassembler.py - git diff --check

…nces Three related hot-path tweaks, all behavior-preserving: 1. SmdaFunction / SmdaBasicBlock: replace `bytes([ord(c) for c in "".join(seqs)])` with `"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence helpers. The output is byte-identical (the escaper emits ASCII-only strings), but the per-character Python loop is gone. Microbench on a ~3.7 KB escaped sequence shows ~600x speedup for the conversion step alone; on the asprox fixture (105 funcs, 2140 blocks) block hash sequence assembly drops ~15%. 2. FunctionCandidate: hoist `sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of `hasCommonFunctionStart` / `getFunctionStartScore` to a module-level constant. Both methods are called from `calculateScore` / `getCharacteristics` / `__str__` / `toJson`, so on every candidate the prologue length list was being rebuilt and re-sorted from scratch. calculateScore over 200k iterations drops from 322ms to 180ms (~44%) in a focused bench. 3. FunctionCandidate.call_ref_sources: switch from list to set. The inner CFG-recovery loop does `addr not in call_ref_sources` on every call instruction; with a list this is O(n) per call and quadratic for hot targets (popular runtime stubs can accumulate many sources). With a set, add/discard/membership are O(1). The only order-sensitive read was the single-element branch in `__str__`, which now uses `next(iter(...))`. No external code depends on ordering — only `len` and truthiness (verified across src/ and tests/). Validation: - `make lint` (ruff check + format check) clean. - `pytest tests/test*` 90 passed, 43 subtests passed. - pic_hash / opc_hash / serialized report sha256 unchanged on asprox.

…lRefs Follow-up to the call_ref_sources list-to-set switch: collapse the Python-level "for addr in source_addrs: discard()" loop into a single set.difference_update() call. The semantics are identical (both no-op on missing elements) but the work is now done in optimized C, which matters because removeCallRefs is called from FunctionCandidateManager.updateCandidates whenever HIGH_ACCURACY is on (the default) and conflicts are detected during CFG recovery. Validation: - pytest tests/test* -> 90 passed, 43 subtests passed - ruff check + format --check clean

resolveRegisterCalls() resolves each "call <register>" by walking the CFG backward through up to block_depth (=3) levels of incoming refs. At every level, searchBlock was doing a linear scan over every block in the function and, for each block, a list comprehension over every instruction: for block in analysis_state.getBlocks(): if address in [i[0] for i in block]: return block So one call to searchBlock is O(B*I) — and the recursive descent into processBlock calls it once per incoming ref at every depth. Functions with many register calls (the file already mentions a Go sample with 130k of them) hit this hot. This commit: * Seeds an {instruction_addr: containing_block} dict once at the start of resolveRegisterCalls(), so every searchBlock lookup is O(1). * Preserves "first matching block wins" by using `if addr not in index` during construction — important because FunctionAnalysisState.getBlocks can place the same instruction in multiple overlapping blocks via the sorted potential_starts walk. * Clears the index in a finally so a reused analyzer instance never serves a stale index after the function completes. * Keeps a slim linear-scan fallback in searchBlock for direct callers (e.g. existing unit tests that drive processBlock without going through resolveRegisterCalls). Microbench (80 blocks × 15 instructions, 1200 lookups): legacy linear scan: 17.04 ms indexed O(1) lookup: 0.18 ms -> 92x faster, bit-identical block-object references returned. End-to-end on asprox is unchanged (it has few register calls); the win scales with the number of indirect calls in the binary. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean - asprox sha256 / num_instructions / function count unchanged

Address Gemini review on PR #47: stash the {instruction_addr: block} index on analysis_state instead of self. analysis_state has the right lifetime (one per function analysis) so the cache is naturally re-entrancy-safe and can't outlive what it indexes, the analyzer keeps no transient state, and the explicit seed + try/finally in resolveRegisterCalls goes away. searchBlock now lazy-builds on first call, so the legacy fallback branch is also gone — every caller (including direct unit-test callers) gets the O(1) path automatically. contextlib.suppress(AttributeError) guards the cache write so that test doubles or hypothetical __slots__-locked states still work; the freshly built dict is returned in that case. Re-ran the focused micro-bench (80 blocks x 15 instructions, 1200 lookups): ~107x faster than the legacy scan, 0/1200 parity mismatches. End-to-end asprox sha256/num_instructions/function count unchanged. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean

…rtedFunctions getImportedFunctions() instantiated PeSymbolProvider(None) and called parseSymbols(lief_result) only to throw the return value away, then spun up a second PeSymbolProvider(None) just to call parseImports. parseSymbols walks lief_binary.symbols and builds a function_symbols dict; with no `self` mutation (it returns the dict), that work is purely wasted whenever getImportedFunctions is called. This removes the dead call. parseSymbols still runs from getSymbols() and from PeSymbolProvider.update(), so symbol parsing is unchanged for the callers that actually use it. ELF path is untouched. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean

…nager _logCandidateStats() has no callers (grep src/ tests/) and silently contained a latent bug — the third list comprehension counted scores == 2 again instead of == 0, so the dead-code stats would have been wrong if anyone wired it in. Each list comprehension also called getScore() once per candidate and ran five separate passes over candidates.values(), so reviving it would be performance-hostile. Removed the method outright. The reraise_non_operational_exception import is still used elsewhere in the file, so the import stays. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean

PE/ELF/MachO file loaders previously each called lief.parse(binary) inside several of their static accessors. FileLoader._loadFile calls 6 of these accessors in sequence (mapBinary, getBaseAddress, getBitness, getCodeAreas, getArchitecture, getAbi), so a single PE binary was re-parsed up to 2 times, ELF up to 4 times, and MachO up to 5 times on every load. lief.parse is the single most expensive step for any non-trivial binary. Refactor: - Each LIEF-aware loader now exposes parseBinary(binary) -> lief obj | None as a single parse entry point. - Every accessor on PE / ELF / MachO accepts an optional parsed=None kwarg and reuses it when supplied (falling back to lief.parse(binary) when None, preserving the legacy direct-call contract). - ELF's existing elffile= and MachO's macho_file= kwargs are renamed to parsed= for uniform API. grep across src/ and tests/ confirms no external caller passed those by name; the only callers were in-file (mapBinary -> getBaseAddress) and were updated. - FileLoader._loadFile parses once via loader.parseBinary(self._raw_data) and threads the parsed object through every accessor via **kw. Delphi/Dex loaders don't have parseBinary, so kw stays empty for them. - PE accessors that never used lief (mapBinary/getBaseAddress/getBitness/ getAbi) accept parsed= for API uniformity and `del parsed` it. Cheap to pass, keeps the FileLoader call site uniform. Net effect: 1 lief.parse per binary load instead of 2/4/5. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - testFileFormatParsers (PE/ELF/MachO/CIL paths) all pass - testPeFileLoader passes - ruff check + format --check clean - No external callers reference elffile= or macho_file= by keyword

Addresses code-review finding on the lief-parse-sharing refactor: when loader.parseBinary(...) returns None (corrupt or unsupported binary), pre-fix we still constructed kw={"parsed": None} and threaded it through every accessor, which then fell back to calling lief.parse(binary) again — one wasted parse attempt on every accessor in the failure path. Now we only seed kw["parsed"] when the parse succeeded. Behaviour is identical on the happy path; failure path no longer racks up re-parses.

…ure path Address Gemini review on PR #50: when loader.parseBinary() returns None (corrupt/unsupported binary), the previous fix dropped the kwarg entirely so each accessor's parsed=None default sent it back into a fresh lief.parse(binary) attempt — N pointless re-parses per failed load. Switch each LIEF-aware loader to a module-private sentinel default: _NOT_PROVIDED = object() def getCodeAreas(binary, parsed=_NOT_PROVIDED): macho_file = lief.parse(binary) if parsed is _NOT_PROVIDED else parsed Now parsed=None is a distinct, valid signal meaning "I already tried and got None, do not retry"; the accessor sees None and returns its documented failure default. Legacy direct callers (no parsed=) still trigger an internal lief.parse exactly as before. FileLoader._loadFile is simplified to always pass kw = {"parsed": loader.parseBinary(self._raw_data)} when the loader is LIEF-aware. No conditional drop. Verified instrumentation: on bogus PE bytes (MZ + zeros), parseBinary returns None and the 6 accessors do **0** further lief.parse calls (previously: 6). Legacy direct calls (no parsed=) still parse — confirmed with a spy. Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean

…omments The PE accessors that don't use lief had a 2-line comment each explaining why they accept and discard parsed=. The signature already matches the other loaders (uniform API) and `del parsed` already says "intentionally unused" — the comments restated both without adding information. Removed. The load-bearing comments stay: the _NOT_PROVIDED sentinel docstring (explains why a sentinel instead of None) and the FileLoader._loadFile note about always threading parsed= through (prevents a future maintainer from reintroducing the failure-path re-parse).

SmdaFunction.__init__ and SmdaFunction.fromDict each call getNormalizedBlockRefs(); the result then gets recomputed two more times if CALCULATE_SCC and CALCULATE_NESTING are on (both default to True) because _calculateSccs and _calculateNestingDepth each call getNormalizedBlockRefs again. The normalization is idempotent (re-normalizing already-normalized blockrefs returns the same dict), so the redundant passes are pure wasted work for every analyzed function. This memoizes the result keyed by (id(self.blocks), id(self.blockrefs), id(self.architecture_metadata)). Reassigning any of those input attributes invalidates the cache automatically because the id changes; SMDA only ever assigns those during __init__ / _parseBlocks / fromDict, so the cache is built once per function and reused. Callers that need to recompute can clear self._normalized_blockrefs_cache explicitly. Microbench on the asprox fixture (105 functions, 200 normalization passes each): cached: 7.0 ms total forced recompute: 154.3 ms total -> 22x speedup, same dict object returned on subsequent calls Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean - asprox sha256 / num_instructions unchanged

…tion Address Gemini review on PR #51: __init__ does self.blockrefs = self.getNormalizedBlockRefs() right after the first call. The id() of self.blockrefs then changes to the id() of the cached return value. The original cache key included id(self.blockrefs), so subsequent calls (from _calculateSccs and _calculateNestingDepth inside __init__) saw a cache miss and recomputed. Loosen the cache check: accept a hit when (id(self.blocks), id(self.architecture_metadata)) match AND self.blockrefs is either the original input dict OR the cached normalized dict itself. Any other reassignment still misses and recomputes — the contract for external mutation is unchanged. Verified with an instrumentation spy: across the 3-call sequence (__init__, _calculateSccs, _calculateNestingDepth), the recompute counter is now 1 (was 2 before the fix). Validation: - pytest tests/test* -> 111 passed, 79 subtests passed - ruff check + format --check clean

…and Macho Both ElfFileLoader and MachoFileLoader used a while-loop that spliced merged_code_areas with list slicing on every adjacent-pair merge: merged_code_areas = merged[:i] + [[a, b]] + merged[i+2:] Each splice copies the list, so k merges cost O(n·k) ≤ O(n²). The loop also built a `result` list that was appended to but never returned (dead code). PeFileLoader already used the correct O(n) linear pattern: seed result with the first area, walk forward, extend result[-1] in place on adjacency, otherwise append. Both loaders are replaced with that pattern. Behaviour is identical for all inputs (empty, single, all-adjacent, no-adjacent, unsorted); verified by trace and reviewer subagent. Pattern: OC-2 mergeCodeAreas-O(n²) Instances: ElfFileLoader.py, MachoFileLoader.py (2 files, same signature) Validation: make lint, make test (111 passed) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dependabot Bot and others added 30 commits May 26, 2026 00:21

build: update build tool constraints

4290515

build: address build tool review feedback

8d688f2

perf(intel): cache language regex match counts

1ce398c

perf(intel): scan call destinations once for bitness

709881e

perf: stream summed comprehensions

f2432a4

perf: address review performance suggestions

fc0c4dc

🧹 Refactor DalvikOpcodeDecoder instruction decode format handlers

5977aa8

🧹 Refactor mapBinary in ElfFileLoader to reduce complexity

022501d

fix(security): avoid swallowing non-operational exceptions

1e4be8b

fix(dalvik): recover exception-edges for throwable static field and d…

d3f5917

…ivision opcodes

fix(build): mark Makefile operational targets as .PHONY

73c923d

fix(dalvik): optimize _mark_can_throw and add docstring

6208ee7

fix(core): address PR review follow-ups

7c3deae

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108

perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108
r0ny123 wants to merge 31 commits into
danielplohmann:masterfrom
r0ny123:claude/perf-sweep-20260527-merge-code-areas

r0ny123 commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

r0ny123 commented May 26, 2026

Branch and base

Optimization class

Root cause

Change

Before / after evidence

Behavior compatibility

Validation commands

Residual risk

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants