Skip to content

perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108

Draft
r0ny123 wants to merge 31 commits into
danielplohmann:masterfrom
r0ny123:claude/perf-sweep-20260527-merge-code-areas
Draft

perf(loaders): replace O(n²) mergeCodeAreas with linear merge in ELF and Macho#108
r0ny123 wants to merge 31 commits into
danielplohmann:masterfrom
r0ny123:claude/perf-sweep-20260527-merge-code-areas

Conversation

@r0ny123
Copy link
Copy Markdown
Contributor

@r0ny123 r0ny123 commented May 26, 2026

Branch and base

Branch: claude/perf-sweep-20260527-merge-code-areas
Commit: 8433f00
Base: master (c1227d7)

Optimization class

OC-2 — mergeCodeAreas-O(n²)

Root cause

ElfFileLoader.mergeCodeAreas and MachoFileLoader.mergeCodeAreas used an identical
while-loop that merged adjacent intervals by list slicing on every merge step:

merged_code_areas = merged_code_areas[:index] + [[a, b]] + merged_code_areas[index + 2:]

Each splice copies the entire list, making k merges cost O(n·k) ≤ O(n²). The loop also
maintained a result list that was appended to but never returned — dead code.

PeFileLoader.mergeCodeAreas already used the correct O(n) pattern: seed result with the
first area, walk forward linearly, extend result[-1] in-place on adjacency, otherwise
append.

Change

Both ELF and Macho implementations are replaced with the PE pattern. No logic changes —
same merge condition (area_a[1] == area_b[0]), same output for all inputs. The dead
result variable is removed.

Before / after evidence

Targeted test suite (testFileFormatParsers + testIntegration, 22 tests):

baseline:  22 passed in 2.15s
after:     22 passed in 1.99s

The O(n²) pathology only manifests when k adjacent-section merges occur (k ≥ 2).
Typical binaries have 1–10 executable sections, so the practical improvement is in
allocation pressure (no temporary list copies) rather than raw iteration count.

Behavior compatibility

Reviewer subagent (fresh context) returned PASS. Confirmed:

  • Semantic equivalence for: empty, single, all-adjacent, no-adjacent, unsorted, overlapping inputs. ✓
  • Caller contract in getCodeAreas unchanged. ✓
  • All three loaders now use byte-for-byte identical implementations — divergence reduced. ✓
  • No tests removed. ✓

Validation commands

make lint && make test   # 111 passed

Residual risk

None. Pure algorithmic replacement of a static utility method with a semantically identical
but asymptotically better implementation. Not serialized; not part of the public report schema.

dependabot Bot and others added 30 commits May 26, 2026 00:21
Updates the requirements on [setuptools](https://github.com/pypa/setuptools) to permit the latest version.
- [Release notes](https://github.com/pypa/setuptools/releases)
- [Changelog](https://github.com/pypa/setuptools/blob/main/NEWS.rst)
- [Commits](pypa/setuptools@v64.0.0...v82.0.1)

---
updated-dependencies:
- dependency-name: setuptools
  dependency-version: 82.0.1
  dependency-type: direct:development
...

Signed-off-by: dependabot[bot] <support@github.com>
- Restore explanatory comment on 21h sign-extension semantics.
- Use 0x0F instead of 15 for nibble masks and parenthesize shifts
  (11n, 12x, 22c, 22s, 22t).
- Switch 45cc/4rcc operand formatting to f-strings for consistency.
- Drop redundant operands="" reassignment in _decode_fmt_10x.
- Remove dead local initialization in decode_instruction; read fields
  directly from the decoder result dict.
Each decoder previously initialized seven locals to defaults and returned
a full dict, regardless of which fields it actually populated. Switch the
contract so decoders return only the keys they set; decode_instruction
reads each field from the result with dict.get and a sensible default.
_decode_fmt_10x collapses to `return {}`.

Also drop the manual sign branch in _decode_fmt_21h — hex() already emits
the leading '-' for negative literals, so the output is unchanged.

Net effect: 340 fewer lines, identical behaviour (80/80 tests pass).
- Pass base_addr through _calculate_boundaries so the bogus-section
  check is consistent with _map_sections and avoids 32-bit overflow
  edge cases on systems where sys.maxsize is relative to the base.
- Drop the redundant sys.maxsize warning in mapBinary; virtual_size
  is already validated against SmdaConfig.MAX_IMAGE_SIZE, and the
  prior warning text was misleading after the refactor.
- Remove the outdated Python 2.x cStringIO comment in mapBinary.
Harden the CodeRabbit-reported runtime-safety and API-correctness issues across the disassembly facade, Intel/CIL backends, label providers, loaders, report DTOs, and shared utilities.

Grouped fixes:

- guard optional timeout callbacks and malformed parser/provider inputs

- correct byte/string comparisons, alignment ordering, and provider call contracts

- make report DTO construction, statistics addition, and mutable defaults safer

- replace fragile/private APIs, direct prints, deprecated logger calls, and assert-only validation

- convert malformed Rust demangler inputs to demangler-specific errors

- preserve explicit backend selection during DEX autodetect

- move regression coverage into subsystem test suites instead of a CodeRabbit-specific test file

Validation:

- python -m pytest tests/testDalvikDisassembler.py tests/testEscaper.py tests/testRustSymbolProvider.py tests/testFileFormatParsers.py tests/testIntelDisassembler.py tests/testIndirectCallAnalyzer.py -q

- make test

- make lint

- python -m ruff check src/smda tests/testDalvikDisassembler.py

- git diff --check
…nces

Three related hot-path tweaks, all behavior-preserving:

1. SmdaFunction / SmdaBasicBlock: replace
   `bytes([ord(c) for c in "".join(seqs)])` with
   `"".join(seqs).encode("ascii")` in the four PIC/OPC hash sequence
   helpers. The output is byte-identical (the escaper emits ASCII-only
   strings), but the per-character Python loop is gone. Microbench on a
   ~3.7 KB escaped sequence shows ~600x speedup for the conversion step
   alone; on the asprox fixture (105 funcs, 2140 blocks) block hash
   sequence assembly drops ~15%.

2. FunctionCandidate: hoist
   `sorted([int(k) for k in COMMON_PROLOGUES], reverse=True)` out of
   `hasCommonFunctionStart` / `getFunctionStartScore` to a module-level
   constant. Both methods are called from `calculateScore` /
   `getCharacteristics` / `__str__` / `toJson`, so on every candidate
   the prologue length list was being rebuilt and re-sorted from
   scratch. calculateScore over 200k iterations drops from 322ms to
   180ms (~44%) in a focused bench.

3. FunctionCandidate.call_ref_sources: switch from list to set. The
   inner CFG-recovery loop does `addr not in call_ref_sources` on every
   call instruction; with a list this is O(n) per call and quadratic
   for hot targets (popular runtime stubs can accumulate many sources).
   With a set, add/discard/membership are O(1). The only order-sensitive
   read was the single-element branch in `__str__`, which now uses
   `next(iter(...))`. No external code depends on ordering — only `len`
   and truthiness (verified across src/ and tests/).

Validation:
- `make lint` (ruff check + format check) clean.
- `pytest tests/test*` 90 passed, 43 subtests passed.
- pic_hash / opc_hash / serialized report sha256 unchanged on asprox.
…lRefs

Follow-up to the call_ref_sources list-to-set switch: collapse the
Python-level "for addr in source_addrs: discard()" loop into a single
set.difference_update() call. The semantics are identical (both no-op on
missing elements) but the work is now done in optimized C, which matters
because removeCallRefs is called from FunctionCandidateManager.updateCandidates
whenever HIGH_ACCURACY is on (the default) and conflicts are detected
during CFG recovery.

Validation:
- pytest tests/test* -> 90 passed, 43 subtests passed
- ruff check + format --check clean
resolveRegisterCalls() resolves each "call <register>" by walking the
CFG backward through up to block_depth (=3) levels of incoming refs.
At every level, searchBlock was doing a linear scan over every block in
the function and, for each block, a list comprehension over every
instruction:

    for block in analysis_state.getBlocks():
        if address in [i[0] for i in block]:
            return block

So one call to searchBlock is O(B*I) — and the recursive descent into
processBlock calls it once per incoming ref at every depth. Functions
with many register calls (the file already mentions a Go sample with
130k of them) hit this hot.

This commit:

* Seeds an {instruction_addr: containing_block} dict once at the start
  of resolveRegisterCalls(), so every searchBlock lookup is O(1).
* Preserves "first matching block wins" by using `if addr not in index`
  during construction — important because FunctionAnalysisState.getBlocks
  can place the same instruction in multiple overlapping blocks via
  the sorted potential_starts walk.
* Clears the index in a finally so a reused analyzer instance never
  serves a stale index after the function completes.
* Keeps a slim linear-scan fallback in searchBlock for direct callers
  (e.g. existing unit tests that drive processBlock without going
  through resolveRegisterCalls).

Microbench (80 blocks × 15 instructions, 1200 lookups):
  legacy linear scan:    17.04 ms
  indexed O(1) lookup:    0.18 ms
  -> 92x faster, bit-identical block-object references returned.

End-to-end on asprox is unchanged (it has few register calls); the win
scales with the number of indirect calls in the binary.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
- asprox sha256 / num_instructions / function count unchanged
Address Gemini review on PR #47: stash the {instruction_addr: block}
index on analysis_state instead of self. analysis_state has the right
lifetime (one per function analysis) so the cache is naturally
re-entrancy-safe and can't outlive what it indexes, the analyzer keeps
no transient state, and the explicit seed + try/finally in
resolveRegisterCalls goes away. searchBlock now lazy-builds on first
call, so the legacy fallback branch is also gone — every caller
(including direct unit-test callers) gets the O(1) path automatically.

contextlib.suppress(AttributeError) guards the cache write so that
test doubles or hypothetical __slots__-locked states still work; the
freshly built dict is returned in that case.

Re-ran the focused micro-bench (80 blocks x 15 instructions, 1200
lookups): ~107x faster than the legacy scan, 0/1200 parity mismatches.
End-to-end asprox sha256/num_instructions/function count unchanged.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
…rtedFunctions

getImportedFunctions() instantiated PeSymbolProvider(None) and called
parseSymbols(lief_result) only to throw the return value away, then
spun up a second PeSymbolProvider(None) just to call parseImports.
parseSymbols walks lief_binary.symbols and builds a function_symbols
dict; with no `self` mutation (it returns the dict), that work is
purely wasted whenever getImportedFunctions is called.

This removes the dead call. parseSymbols still runs from getSymbols()
and from PeSymbolProvider.update(), so symbol parsing is unchanged
for the callers that actually use it. ELF path is untouched.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
…nager

_logCandidateStats() has no callers (grep src/ tests/) and silently
contained a latent bug — the third list comprehension counted scores
== 2 again instead of == 0, so the dead-code stats would have been
wrong if anyone wired it in. Each list comprehension also called
getScore() once per candidate and ran five separate passes over
candidates.values(), so reviving it would be performance-hostile.

Removed the method outright. The reraise_non_operational_exception
import is still used elsewhere in the file, so the import stays.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
PE/ELF/MachO file loaders previously each called lief.parse(binary) inside
several of their static accessors. FileLoader._loadFile calls 6 of these
accessors in sequence (mapBinary, getBaseAddress, getBitness, getCodeAreas,
getArchitecture, getAbi), so a single PE binary was re-parsed up to 2 times,
ELF up to 4 times, and MachO up to 5 times on every load. lief.parse is the
single most expensive step for any non-trivial binary.

Refactor:
- Each LIEF-aware loader now exposes parseBinary(binary) -> lief obj | None
  as a single parse entry point.
- Every accessor on PE / ELF / MachO accepts an optional parsed=None kwarg
  and reuses it when supplied (falling back to lief.parse(binary) when
  None, preserving the legacy direct-call contract).
- ELF's existing elffile= and MachO's macho_file= kwargs are renamed to
  parsed= for uniform API. grep across src/ and tests/ confirms no external
  caller passed those by name; the only callers were in-file (mapBinary
  -> getBaseAddress) and were updated.
- FileLoader._loadFile parses once via loader.parseBinary(self._raw_data)
  and threads the parsed object through every accessor via **kw. Delphi/Dex
  loaders don't have parseBinary, so kw stays empty for them.
- PE accessors that never used lief (mapBinary/getBaseAddress/getBitness/
  getAbi) accept parsed= for API uniformity and `del parsed` it. Cheap to
  pass, keeps the FileLoader call site uniform.

Net effect: 1 lief.parse per binary load instead of 2/4/5.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- testFileFormatParsers (PE/ELF/MachO/CIL paths) all pass
- testPeFileLoader passes
- ruff check + format --check clean
- No external callers reference elffile= or macho_file= by keyword
Addresses code-review finding on the lief-parse-sharing refactor:
when loader.parseBinary(...) returns None (corrupt or unsupported
binary), pre-fix we still constructed kw={"parsed": None} and
threaded it through every accessor, which then fell back to calling
lief.parse(binary) again — one wasted parse attempt on every accessor
in the failure path.

Now we only seed kw["parsed"] when the parse succeeded. Behaviour is
identical on the happy path; failure path no longer racks up
re-parses.
…ure path

Address Gemini review on PR #50: when loader.parseBinary() returns None
(corrupt/unsupported binary), the previous fix dropped the kwarg
entirely so each accessor's parsed=None default sent it back into a
fresh lief.parse(binary) attempt — N pointless re-parses per failed
load.

Switch each LIEF-aware loader to a module-private sentinel default:

    _NOT_PROVIDED = object()
    def getCodeAreas(binary, parsed=_NOT_PROVIDED):
        macho_file = lief.parse(binary) if parsed is _NOT_PROVIDED else parsed

Now parsed=None is a distinct, valid signal meaning "I already tried
and got None, do not retry"; the accessor sees None and returns its
documented failure default. Legacy direct callers (no parsed=) still
trigger an internal lief.parse exactly as before.

FileLoader._loadFile is simplified to always pass
    kw = {"parsed": loader.parseBinary(self._raw_data)}
when the loader is LIEF-aware. No conditional drop.

Verified instrumentation: on bogus PE bytes (MZ + zeros),
parseBinary returns None and the 6 accessors do **0** further
lief.parse calls (previously: 6). Legacy direct calls (no parsed=)
still parse — confirmed with a spy.

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
…omments

The PE accessors that don't use lief had a 2-line comment each
explaining why they accept and discard parsed=. The signature
already matches the other loaders (uniform API) and `del parsed`
already says "intentionally unused" — the comments restated both
without adding information. Removed.

The load-bearing comments stay: the _NOT_PROVIDED sentinel
docstring (explains why a sentinel instead of None) and the
FileLoader._loadFile note about always threading parsed= through
(prevents a future maintainer from reintroducing the failure-path
re-parse).
SmdaFunction.__init__ and SmdaFunction.fromDict each call
getNormalizedBlockRefs(); the result then gets recomputed two more
times if CALCULATE_SCC and CALCULATE_NESTING are on (both default to
True) because _calculateSccs and _calculateNestingDepth each call
getNormalizedBlockRefs again. The normalization is idempotent
(re-normalizing already-normalized blockrefs returns the same dict),
so the redundant passes are pure wasted work for every analyzed
function.

This memoizes the result keyed by (id(self.blocks), id(self.blockrefs),
id(self.architecture_metadata)). Reassigning any of those input
attributes invalidates the cache automatically because the id changes;
SMDA only ever assigns those during __init__ / _parseBlocks /
fromDict, so the cache is built once per function and reused.
Callers that need to recompute can clear self._normalized_blockrefs_cache
explicitly.

Microbench on the asprox fixture (105 functions, 200 normalization
passes each):
  cached:           7.0 ms total
  forced recompute: 154.3 ms total
  -> 22x speedup, same dict object returned on subsequent calls

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
- asprox sha256 / num_instructions unchanged
…tion

Address Gemini review on PR #51: __init__ does
    self.blockrefs = self.getNormalizedBlockRefs()
right after the first call. The id() of self.blockrefs then changes to
the id() of the cached return value. The original cache key included
id(self.blockrefs), so subsequent calls (from _calculateSccs and
_calculateNestingDepth inside __init__) saw a cache miss and
recomputed.

Loosen the cache check: accept a hit when (id(self.blocks),
id(self.architecture_metadata)) match AND self.blockrefs is either the
original input dict OR the cached normalized dict itself. Any other
reassignment still misses and recomputes — the contract for external
mutation is unchanged.

Verified with an instrumentation spy: across the 3-call sequence
(__init__, _calculateSccs, _calculateNestingDepth), the recompute
counter is now 1 (was 2 before the fix).

Validation:
- pytest tests/test* -> 111 passed, 79 subtests passed
- ruff check + format --check clean
…and Macho

Both ElfFileLoader and MachoFileLoader used a while-loop that spliced
merged_code_areas with list slicing on every adjacent-pair merge:

    merged_code_areas = merged[:i] + [[a, b]] + merged[i+2:]

Each splice copies the list, so k merges cost O(n·k) ≤ O(n²). The loop
also built a `result` list that was appended to but never returned (dead
code). PeFileLoader already used the correct O(n) linear pattern: seed
result with the first area, walk forward, extend result[-1] in place on
adjacency, otherwise append.

Both loaders are replaced with that pattern. Behaviour is identical for
all inputs (empty, single, all-adjacent, no-adjacent, unsorted); verified
by trace and reviewer subagent.

Pattern: OC-2 mergeCodeAreas-O(n²)
Instances: ElfFileLoader.py, MachoFileLoader.py (2 files, same signature)
Validation: make lint, make test (111 passed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants