Call-edge provenance mislabels CodeQL-resolved edges as "jedi" #28

@rahlk

Summary

jedi_call_graph_edges() hardcodes provenance=["jedi"] on every edge it emits, regardless of which resolver actually filled the underlying PyCallsite.callee_signature. As a result, edges that were only resolvable because of CodeQL (or the constructor heuristic) are attributed to Jedi. CodeQL's true contribution to the call graph is systematically under-reported.

Observed at commit 7392fed.

Pipeline (as built in core.py:379-398)

  1. _build_symbol_table() (core.py:379) — SymbolTableBuilder runs Jedi per file (syntactic_analysis/symbol_table_builder.py:9-11,119) and writes callee_signature into each PyCallsite during symbol-table construction.
  2. _get_call_graph(symbol_table, augment_sites=True) (core.py:395) — CodeQL runs, filling callee_signature in-place for sites Jedi left empty, and separately emits an explicit edge list with provenance=["codeql"] (semantic_analysis/codeql/codeql_analysis.py:321).
  3. resolve_unresolved_constructors() (core.py:396) — heuristic pass fills more callee_signatures in-place.
  4. jedi_call_graph_edges(symbol_table) (core.py:397) — reader only: emits an edge for every site with a non-empty callee_signature (semantic_analysis/call_graph.py:181-183) and unconditionally tags it provenance=["jedi"] (semantic_analysis/call_graph.py:186).
  5. merge_edges(jedi_edges, codeql_edges) (core.py:398) — unions provenance for shared (source, target) (semantic_analysis/call_graph.py:263).

Root cause

Step 4 derives edges from a symbol table whose callee_signatures were filled by three different mechanisms (Jedi in step 1, CodeQL in step 2, constructor heuristic in step 3), but stamps all of them ["jedi"]. An edge only surfaces codeql in its provenance if CodeQL also emitted it as a standalone object in step 2's codeql_edges list, which step 5 then unions in. CodeQL contributions that manifest only as in-place callee_signature fills are mislabeled ["jedi"].

This makes the provenance field answer "which backend emitted this exact edge object" rather than the intended "which backend's resolution made this edge possible."
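The mislabeling can be shown with a toy model. This is only a sketch: Site, emit_edges_hardcoded and merge are hypothetical stand-ins for PyCallsite, jedi_call_graph_edges and merge_edges, and the real signatures and edge types in the repository may differ.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Toy stand-in for PyCallsite (hypothetical, only the fields used here)."""
    caller: str
    callee_signature: str = ""


def emit_edges_hardcoded(sites):
    """Emit one edge per resolved site and tag each one "jedi" unconditionally."""
    return {
        (s.caller, s.callee_signature): {"jedi"}
        for s in sites
        if s.callee_signature
    }


def merge(left, right):
    """Union provenance sets for edges sharing the same (source, target)."""
    merged = {edge: set(tags) for edge, tags in left.items()}
    for edge, tags in right.items():
        merged.setdefault(edge, set()).update(tags)
    return merged


sites = [Site("m.f"), Site("m.g"), Site("m.h")]

sites[0].callee_signature = "pkg.a"  # resolved by Jedi during symbol-table build
sites[1].callee_signature = "pkg.b"  # filled in place by CodeQL, no standalone edge
sites[2].callee_signature = "pkg.C.__init__"  # filled by the constructor heuristic

# Standalone CodeQL edge list: here it only contains an edge unrelated to m.g.
codeql_edges = {("m.k", "pkg.d"): {"codeql"}}

merged = merge(emit_edges_hardcoded(sites), codeql_edges)
for edge, tags in sorted(merged.items()):
    print(edge, sorted(tags))
```

In this model the m.g and m.h edges end up tagged only "jedi" even though Jedi resolved neither, which is the behavior described above.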

Reproduction

Running call-graph analysis with using_codeql=True over the cldk/ package of codellm-devkit/python-sdk yields 618 edges:

provenance          edges
["jedi"]            532
["codeql"]          80
["codeql","jedi"]   6

The 86 codeql-tagged edges (80 + 6) are a lower bound on CodeQL's real contribution; an unknown share of the 532 ["jedi"] edges is only resolvable because CodeQL (step 2) or the constructor heuristic (step 3) filled their callee_signature.
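A tally like the one above could be produced along these lines. This is a sketch that assumes each edge is a dict with a "provenance" list; the actual edge type in the project may be different, and the edge list here is synthetic, built only to match the reported counts.

```python
from collections import Counter


def provenance_histogram(edges):
    """Count edges per provenance combination, ignoring tag order."""
    return Counter(tuple(sorted(e["provenance"])) for e in edges)


# Synthetic edges reproducing the counts reported in this issue.
edges = (
    [{"provenance": ["jedi"]}] * 532
    + [{"provenance": ["codeql"]}] * 80
    + [{"provenance": ["codeql", "jedi"]}] * 6
)

hist = provenance_histogram(edges)
total = sum(hist.values())
codeql_tagged = sum(n for tags, n in hist.items() if "codeql" in tags)

print(total, codeql_tagged)
```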

Suggested fix

Track provenance at the point callee_signature is set, not at edge-emission time. Options:

  • Record the resolving backend on PyCallsite when each pass fills callee_signature (Jedi in symbol-table build, CodeQL in augment_sites, heuristic in resolve_unresolved_constructors), and have jedi_call_graph_edges read provenance from the site instead of hardcoding ["jedi"].
  • At minimum, rename jedi_call_graph_edges to reflect that it derives edges from the combined symbol table, so the hardcoded ["jedi"] tag isn't mistaken for an actual Jedi attribution.
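The first option might look roughly like the sketch below. Everything here is an assumption for illustration: the resolved_by field, the fill helper, the edges_from_sites name and the "constructor_heuristic" tag do not exist in the codebase as far as this issue shows, and Site is a toy stand-in for PyCallsite.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Site:
    """Toy stand-in for PyCallsite with a hypothetical resolved_by field."""
    caller: str
    callee_signature: str = ""
    resolved_by: Optional[str] = None  # assumed new field, set at fill time


def fill(site, signature, backend):
    """Fill callee_signature only if still empty and record which pass did it."""
    if not site.callee_signature:
        site.callee_signature = signature
        site.resolved_by = backend
    return site


def edges_from_sites(sites):
    """Derive edges from resolved sites, reading provenance from each site."""
    edges = {}
    for s in sites:
        if s.callee_signature:
            key = (s.caller, s.callee_signature)
            edges.setdefault(key, set()).add(s.resolved_by or "unknown")
    return edges


sites = [Site("m.f"), Site("m.g"), Site("m.h")]
fill(sites[0], "pkg.a", "jedi")                    # symbol-table build
fill(sites[1], "pkg.b", "codeql")                  # augment_sites pass
fill(sites[0], "pkg.other", "codeql")              # no-op: Jedi already resolved it
fill(sites[2], "pkg.C.__init__", "constructor_heuristic")

fixed_edges = edges_from_sites(sites)
for edge, tags in sorted(fixed_edges.items()):
    print(edge, sorted(tags))
```

Because later passes only fill sites that are still empty, the first resolver to succeed keeps the attribution, which matches the "fills sites Jedi left empty" behavior described in the pipeline.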

Impact

Any downstream consumer using provenance to measure or compare resolver coverage (e.g. the Java/Python parity work in python-sdk) gets misleading numbers — CodeQL looks far less effective than it is.
