Skip to content

Yoff/python use shared cfg for dataflow#21894

Draft
yoff wants to merge 80 commits into
mainfrom
yoff/python-use-shared-cfg-for-dataflow
Draft

Yoff/python use shared cfg for dataflow#21894
yoff wants to merge 80 commits into
mainfrom
yoff/python-use-shared-cfg-for-dataflow

Conversation

@yoff
Copy link
Copy Markdown
Contributor

@yoff yoff commented May 26, 2026

This PR migrates the Python dataflow library off the legacy CFG (semmle.python.Flow) and onto the shared CFG (semmle.python.controlflow.internal.AstNodeImpl + Cfg). It is a prerequisite for deprecating Flow.qll and unifying Python with the rest of the shared-CFG languages (Java, C#, Swift, …).

How the tangled hierarchies were untangled

Three libraries were entwined: legacy CFG → legacy SSA (ESSA) → dataflow / type-tracking / guards / queries. They had to be replaced bottom-up.

  1. New CFG facade. python/ql/lib/semmle/python/controlflow/internal/Cfg.qll exposes the shared CFG with the same shape that dataflow needs (ControlFlowNode, CallNode, AttrNode, NameNode, …). Where the old CFG used DB-relation classes extending @py_flow_node, the facade rebuilds them as one-line bridges over the AST: NameNode.defines(v) is just exists(Name n | n = this.getNode() and n.defines(v) and not this.isLoad()) — mirroring Java's pattern.

  2. Shared-SSA adapter. SsaImpl.qll instantiates codeql.ssa.Ssa::Make against the new CFG. The legacy ESSA refinement classes (EssaNodeRefinement, PyEdgeRefinement, EssaAttributeDefinition, used only by the points-to engine) are left behind on the old SSA; the new adapter covers exactly what new dataflow + import-resolution + type-tracking consume (plain assignment, phi, scope-entry, parameter, with-definition, multi-assignment).

  3. Guards. GuardNode is rebuilt as a CFG-native concept using the new CFG's outcome nodes (isAfterTrue / isAfterFalse). The Python-specific public surface (GuardNode, isSafeCheck) is dropped in favour of the cross-language BarrierGuard<...> template. A new outcomeOfGuard(guard, outcomeNode, branch) walks AST wrappers (not, == True, is False, …) to find the real splitting node.

  4. Canonicalisation. The shared CFG models each Python Expr as multiple nodes (before, after, after-true/false, intermediate boolean pairs). Dataflow needs one node per Expr. We pick the isAfter(<ast>) node as canonical — the node that fires once the (sub-)expression has been fully evaluated. AnnotatedExitNode was added to isCanonical so synthetic scope-exit uses for module globals attach correctly.

  5. Sequential migration. Phase -1 wired all variable-binding constructs (x: int = …, from x import a, PEP 695 type params, match patterns, with … as x, …) so AST Name.defines(v) predicates have corresponding CFG positions. Phase 0 built the facades. Phase 1 swapped dataflow trunk + ~25 framework files. Phase 3 validated.

Semantic changes

Improvements (blessed)

  • ~184 new framework taint / concept results across fastapi (45 + 104), sqlalchemy (20), aiohttp (7), lxml (3), stdlib (4), stdlib-py2 (1), django-orm (1). The new modelling reaches sinks the legacy CFG could not (Pydantic model field flow in particular).
  • 15 def-only-old / def-only-new SSA mismatches resolved in CmpTest; new SSA noticeably more in sync with legacy.
  • 2 previously-MISSING type-tracking annotations resolved in call-graph tests.
  • 1 new Imports/deprecated/DeprecatedModule finding (md5 module correctly flagged).
  • 6 spurious unreachable-node violations removed from TypeTrackingConsistency on essa/ssa-compute.

Known regressions (blocked on cfg-modelling-exceptions)

The shared CFG uses a no-raise abstraction: exception-successor edges are only emitted where Input::beginAbruptCompletion fires, which Python currently only does for assert. As a result, except handler bodies and exception target Names (e in except E as e:) are statically dead, and the following are blessed as known regressions:

Test Effect
Security/CWE-209-StackTraceExposure/StackTraceExposure 29 lost (TracebackFunctionCall, CaughtException, SysExcInfoCall all empty)
Security/CWE-209-StackTraceExposure/ExceptionInfo 17 inline Missing result markers
Exceptions/general/EmptyExcept 6 spurious gains (focussed_handler exclusion fails)
Exceptions/general/CatchingBaseException 1 lost
Exceptions/general/IncorrectExceptOrder 1 lost, 1 spurious
Resources/FileNotAlwaysClosed/FileNotAlwaysClosed 32 spurious (close() in finally/except unreachable)
Statements/exit/UseOfExit 1 lost
experimental/library-tests/FindSubclass/Find 1 lost

The same root cause explains the 3,790 CPython binding-coverage gap and the 4 customSanitizer MISSING annotations observed during Phase -1.

Cosmetic

  • ~110 .expected files reblessed for toString churn. The shared CFG library overrides ControlFlowNode.toString() as final, dropping the ControlFlowNode for X prefix in favour of X or After X (depending on whether the canonical node is a leaf or an after-node).
  • A handful of LocalSource toString labels changed (Global Variable XVariable X).

Work left for future PRs

  1. cfg-modelling-exceptions (highest priority). Emit beginAbruptCompletion(ast, n, ExceptionSuccessor, false) from arbitrary call / subscript / attribute / etc. expressions in AstNodeImpl.qll. Expected to close all 8 cfg-exceptions regressions above, the 3,790 CPython binding misses, and the 4 customSanitizer MISSINGs. Discussion of the no-raise abstraction trade-off is open.
  2. Deprecate semmle/python/Flow.qll. Add deprecated to the public classes (ControlFlowNode, CallNode, BasicBlock, …) and a change-note. Keep the file importable so downstream queries continue to compile with warnings only.
  3. Migrate GuardedControlFlow.qll to use codeql.controlflow.Guards (the cross-language guards module). The Python-specific outcomeOfGuard walking we added here is a stop-gap that can be folded into the shared library.
  4. Adopt the shared-guards library in ~15 framework files that currently still use BarrierGuard<...> directly.
  5. Inline-annotation refreshes (mechanical):
    • library-tests/dataflow-new-ssa/SsaTest — 5 Unexpected / 1 Missing for function-name defs not annotated in source.
    • query-tests/Functions/ModificationOfParameterWithDefault — 2 missing modification=l markers.
  6. Investigate the small residual taint losses not explained by cfg-modelling-exceptions:
    • fastapi/InlineTaintTest lost 104 Pydantic-field paths (likely a TypeTracking attribute-refinement gap).
    • fastapi_path_injection.py:26 Source annotation missing.

Commits

The branch was rebased onto main after the precursor evaluation-order test pack (PR #2189x — Taus) landed there. Substance of Taus's review-comment fixes (1ef557c972f, 35faec3db1e) was folded back in as a separate commit, with one exception: the dead(2) annotation in test_boolean.py:27 was retained because our NewCfgBranchTimestamps check (introduced on this branch) requires it.

Copy link
Copy Markdown

@github-advanced-security github-advanced-security AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeQL found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

@yoff yoff force-pushed the yoff/python-use-shared-cfg-for-dataflow branch from 8cab5a2 to 350a68d Compare May 26, 2026 16:47
tausbn and others added 26 commits May 28, 2026 21:09
These tests consist of various Python constructions (hopefully a
somewhat comprehensive set) with specific timestamp annotations
scattered throughout. When the tests are run using the Python 3
interpreter, these annotations are checked and compared to the "current
timestamp" to see that they are in agreement. This is what makes the
tests "self-validating".

There are a few different kinds of annotations: the basic `t[4]` style
(meaning this is executed at timestamp 4), the `t.dead[4]` variant
(meaning this _would_ happen at timestamp 4, but it is in a dead
branch), and `t.never` (meaning this is never executed at all).

In addition to this, there is a query, MissingAnnotations, which checks
whether we have applied these annotations maximally. Many expression
nodes are not actually annotatable, so there is a sizeable list of
excluded nodes for that query.
These use the annotated, self-verifying test files to check various
consistency requirements.

Some of these may be expressing the same thing in different ways, but
it's fairly cheap to keep them around, so I have not attempted to
produce a minimal set of queries for this.
This one demonstrates a bug in the current CFG. In a dictionary
comprehension `{k: v for k, v in d.items()}`, we evaluate the value
before the key, which is incorrect. (A fix for this bug has been
implemented in a separate PR.)
This looks for nodes annotated with `t.never` in the test that are
reachable in the CFG. This should not happen (it messes with various
queries, e.g. the "mixed returns" query), but the test shows that in a
few particular cases (involving the `match` statement where all cases
contain `return`s), we _do_ have reachable nodes that shouldn't be.
This one is potentially a bit iffy -- it checks for a very powerful
propetry (that implies many of the other queries), but as the test
results show, it can produce false positives when there is in fact no
problem. We may want to get rid of it entirely, if it becomes too noisy.
Currently we only instantiate them with the old CFG library, but in the
future we'll want to do this with the new library as well.

Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
We can only annotate the ones that correspond directly to AST nodes
anyway.

Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Not entirely sure about the `else:` blocks.

Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Co-authored-by: yoff <yoff@github.com>
Copilot AI and others added 29 commits May 28, 2026 21:09
Adds concrete `Pattern` subclasses in `AstNodeImpl.qll` for every
`MatchPattern` AST kind, with `getChild` overrides that expose
sub-patterns and bound Names. Specifically:

- MatchCapturePattern (`case x:`) -> getVariable()
- MatchAsPattern (`case … as v:`) -> getPattern(), getAlias()
- MatchStarPattern (`case [*rest]:`) -> getTarget()
- MatchSequencePattern (`case [a, b]:`) -> getPattern(i)
- MatchClassPattern (`case Cls(p, q, k=v)`) -> getClass(), positional, keyword
- MatchMappingPattern (`case {k: v}:`) -> getMapping(i)
- MatchKeyValuePattern, MatchKeywordPattern, MatchDoubleStarPattern
- MatchOrPattern, MatchLiteralPattern, MatchValuePattern

Without these, every Name bound by a match pattern lacked a CFG node.
Removes the corresponding MISSING: annotations from match_pattern.py
(all 11 cases).

Verified: all 24 ControlFlow/evaluation-order tests still pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds CFG coverage for the binding 'Name's introduced by PEP 695
type-parameter syntax on functions, classes, and 'type' aliases:

  def func[T](...): ...
  class Box[T]: ...
  def multi[T: int, *Ts, **P](...): ...
  type Alias[T] = ...

For each parametrised AST node, the type-parameter names (and, for
'type' aliases, the alias name itself) are added as children of the
enclosing CFG node so that 'Name.defines(v)' has a corresponding
position. Bounds and defaults are intentionally not wired (they have
no SSA-relevant semantics for our purposes).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 'dead_under_no_raise.py' to the bindings test suite, capturing the
three CPython patterns where bindings legitimately have no CFG node
because the surrounding code is unreachable under the 'no expressions
raise' abstraction:

  1. Statements after a 'try: return X; except: pass' block.
  2. The 'else:' clause of a try whose body always raises.
  3. Cache-lookup pattern 'try: return cache[k]; except: pass' followed
     by computation and store.

These bindings intentionally carry no 'cfgdefines=' annotations. If
raise modelling is later added to the CFG, the BindingsTest will surface
the new CFG nodes as unexpected results and this file will need to be
revisited.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 'Cfg.qll' alongside 'AstNodeImpl.qll' in the controlflow internal
package. The facade re-exposes the same API surface as the legacy
'semmle/python/Flow.qll' (ControlFlowNode, BasicBlock, NameNode, CallNode,
AttrNode, ImportExprNode, ImportMemberNode, ImportStarNode, SubscriptNode,
CompareNode, IfExprNode, AssignmentExprNode, BinaryExprNode, BoolExprNode,
UnaryExprNode, DefinitionNode, DeletionNode, ForNode, RaiseStmtNode,
StarredNode, ExceptFlowNode, ExceptGroupFlowNode, TupleNode, ListNode,
SetNode, DictNode, IterableNode, NameConstantNode), but is implemented
on top of the new shared CFG via 'AstNodeImpl.qll'.

The variable-identity predicates ('NameNode.defines', '.uses',
'.deletes', '.isLocal', '.isNonLocal', ...) are one-line bridges to the
underlying AST predicates ('Name.defines', '.uses', '.deletes'),
mirroring the Java pattern.

Re-exports 'EntryBasicBlock' and 'dominatingEdge/2' from the shared
'BB::CfgSig' produced by 'AstNodeImpl.qll', so downstream consumers
(e.g. the SSA adapter) can wire the new CFG into other shared modules
that expect a 'CfgSig' implementation.

This facade is not yet consumed by the dataflow library — that is the
next phase.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds 'python/ql/lib/semmle/python/dataflow/new/internal/SsaImpl.qll', a
minimal Python SSA implementation built on the shared SSA library
('codeql.ssa.Ssa::Make<Location, Cfg, Input>'). The structure mirrors
Java's adapter at 'java/ql/lib/semmle/code/java/dataflow/internal/SsaImpl.qll'.

Key design choices:

  * 'SourceVariable' wraps 'Py::Variable'. Only variables that are read
    or deleted somewhere are tracked - write-only variables don't
    benefit from SSA construction.

  * Variable references are positional ('BasicBlock', 'int') pairs
    looked up via 'Cfg::NameNode.defines'/'.uses'/'.deletes' (which
    themselves are one-line bridges to AST-level 'Name.defines' etc.).

  * Parameter writes are not synthesised: parameter Name nodes are
    already wired into the CFG (per the earlier C#-style parameter
    extension in 'AstNodeImpl.qll'), so the regular 'variableWrite'
    path handles them at their natural CFG index.

  * Non-local / captured / global / builtin variables read in a scope
    but not written in it receive a synthetic entry definition at
    index '-1' of the scope's entry basic block. This matches Java's
    'hasEntryDef'.

  * 'del x' is modelled as a certain write at the deletion site.

Includes an inline-expectations test under
'python/ql/test/library-tests/dataflow-new-ssa/' covering:
plain parameter pass-through, simple assignment + read, reassignment
with dead-write pruning, if/else with phi insertion at the join, and
an undefined-name read (currently a known limitation - no SSA flow
without an enclosing definition).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
In the legacy CFG the same Python 'Name' that is the target of an
augmented assignment has two distinct CFG nodes — a load node (context
3) earlier in the basic block and a store node (context 5) later.
'augstore(load, store)' relates the pair via dominance.

The new (shared) CFG canonicalises each AST expression to a single
CFG node, so 'load' and 'store' collapse to one. The dominance-based
'augstore' from the legacy implementation no longer holds (it would
require 'load.strictlyDominates(load)'), so 'isAugLoad' / 'isAugStore'
never fired and 'isStore' missed the AugAssign target entirely.

Redefines 'augstore' as reflexive on the AugAssign target's canonical
CFG node. With this change:

  * isAugLoad / isAugStore both fire on the single canonical node.
  * isStore fires (via 'or augstore(_, this)') — matching the legacy
    classification that an augmented-assignment target is a store.
  * isLoad does not fire (excluded by 'not augstore(_, this)').

Adds 'python/ql/test/library-tests/ControlFlow/store-load/' covering
plain load/store/delete, parameters, augmented assignment, tuple
unpacking, attribute and subscript stores. The test asserts the
classification directly on the new-CFG facade.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the methods and type-narrowing overrides needed for Cfg.qll to be
a drop-in replacement for Flow.qll's CFG API surface:

  * 'override getNode()' type narrowing on all AST-shape subclasses
    (CallNode -> Py::Call, AttrNode -> Py::Attribute, ImportExprNode
    -> Py::ImportExpr, etc.). This lets callers chain methods like
    'iexpr.getNode().isRelative()' that previously failed because
    'getNode()' returned the generic AstNode.

  * 'ControlFlowNode.isBranch()' -- true and/or false successor exists.
  * 'ControlFlowNode.getAChild()' -- CFG-level child traversal via the
    AST's getAChildNode, with dominance constraint.
  * 'ControlFlowNode.strictlyReaches(other)' -- node-level reachability.
  * 'NameNode.isSelf()' -- AST-level approximation: uses the 'Variable'
    that is the first parameter of an enclosing method.
  * 'BinaryExprNode.operands(left, op, right)' + 'getAnOperand()'.
  * 'BoolExprNode.getAnOperand()'.
  * 'ForNode.getSequence()' (alias for 'getIter') and
    'ForNode.iterates(target, sequence)'.
  * 'ForNode' / 'RaiseStmtNode' type-narrowing overrides.
  * 'ExceptFlowNode.getName()' / 'ExceptGroupFlowNode.getName()'
    -- the bound 'as'-name CFG node.
  * 'DictNode.getAKey()' (only 'getAValue' was present).

These additions are independent of the dataflow-migration approach
(option 4 vs option 5). They close the API-parity gap identified
during the Option-5 investigation; with them in place, hundreds of
type-resolution errors that previously appeared when swapping Cfg for
Flow at the python.qll level go away.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Prepares Flow.qll for co-existence with the new CFG facade by switching
'import python' to 'import python as Py' and qualifying every AST-class
reference inside Flow.qll's body. Flow.qll's own CFG types
(ControlFlowNode, BasicBlock, CallNode, NameNode, etc.) keep their
unqualified names.

This change is a no-op semantically:
  * all 24 evaluation-order tests still pass,
  * the bindings + store-load + new-CFG-SSA library tests still pass,
  * compilation produces zero errors.

The change enables a follow-up commit to swap python.qll's
'import semmle.python.Flow' for 'import semmle.python.controlflow.internal.Cfg'
without triggering name-clash errors inside Flow.qll itself. Legacy
modules that still want the legacy CFG (essa/, GuardedControlFlow,
LegacyPointsTo, objects/, pointsto/, types/, dataflow/old/) will need a
similar treatment in subsequent commits.

The qualification was applied mechanically via a script that prefixed
every reference to a known AST class. The list includes the standard
AST node types from semmle.python.{Files, Variables, Stmts, Exprs,
Class, Function, Patterns, Comprehensions} plus 'Location' / 'File' /
'Folder' / 'Container' / 'ConditionBlock' / 'Delete' / 'Load'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… test

Phase 0.5 - Adapter API on top of the shared SSA:

Adds the legacy-ESSA-shaped class hierarchy that the dataflow library
consumes, layered on the shared 'Ssa::Make' instantiation:

  * EssaDefinition / EssaNodeDefinition: the latter exposes
    'getDefiningNode()' (the CFG node at the def's index in its BB)
    and 'getVariable()' / 'getScope()'.
  * AssignmentDefinition: matches Assign, AnnAssign with value,
    AssignExpr and AugAssign target Names. Exposes 'getValue()'
    pointing at the RHS' CFG node.
  * ParameterDefinition: matches when the defining Name is in
    parameter context.
  * WithDefinition: matches 'with ... as x:' bindings.
  * ScopeEntryDefinition: implicit entry defs at synthetic position
    '-1' of the scope's entry basic block (non-local / global /
    builtin / captured reads).
  * PhiFunction (alias for PhiNode).
  * EssaVariable adapter wrapping a 'Ssa::Definition' with 'getAUse()',
    'getDefinition()', 'getAnUltimateDefinition()', and 'getName()'.
  * AdjacentUses module with 'firstUse' and 'adjacentUseUse' predicates
    bridging to 'Ssa::firstUse' / 'Ssa::adjacentUseUse'.

This is the minimum API the new dataflow's internals call into. The
richer legacy ESSA (refinement nodes, attribute refinements, edge
refinements) stays in 'semmle.python.essa.Essa' for legacy code.

Phase 0.6 - Comparison test:

Adds 'dataflow-new-ssa-vs-legacy/CmpTest.ql' that snapshots the
difference between definitions produced by new SSA vs legacy ESSA on
the same Python source. Baseline output records the current
'def-only-old' mismatches, grouped by category:

  * function/class/global definitions with no in-scope read (intentional;
    SSA is liveness-pruned)
  * captured / closure variables (real gap in new SSA - no
    closure-capture handling yet)
  * module variables __name__ / __package__ / $ (legacy ESSA implicit
    bindings)
  * exception 'as' bindings (depend on raise modelling)

Zero 'def-only-new' mismatches: the new SSA never produces a spurious
definition compared to legacy ESSA on this corpus.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The new SSA's implicit entry-def predicate previously placed entries in
the variable's defining scope. For closure variables that's the outer
function, so inner functions had no entry def for the captured
variable — reads in the inner scope failed to resolve to any
definition.

Mirrors legacy ESSA's 'NonLocalVariable.getScopeEntryDefinition()':
place an implicit entry def at every reading scope's entry block,
independently of where the variable is *defined*. A closure variable
accessed in two nested functions and the outer one gets three entry
defs (one per reading scope).

Also makes 'ScopeEntryDefinition' extend 'EssaNodeDefinition' (matching
legacy ESSA), with 'getDefiningNode()' returning the scope's entry CFG
node. This requires extending the private 'writeDefNode' helper to
project i=-1 entries to bb.getNode(0).

Updates the new-vs-legacy comparison snapshot: closure-variable reads
('x:32:5'), nested global reads ('GLOBAL:52:1') now resolve. New
'def-only-new' entries appear for unbound names ('sum', 'open',
'compute') — the new SSA uniformly creates scope-entry defs for all
non-local reads, including those that legacy ESSA classifies as
builtin and excludes. This is a more uniform semantic and arguably
cleaner.

Updates the SsaTest 'some_undefined' annotation: previously documented
as a known limitation, now correctly resolves to a scope-entry def.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Extends the ESSA-shaped adapter on top of the new shared SSA with the
remaining APIs consumed by the dataflow library:

  * MultiAssignmentDefinition: matches the AST pattern 'a, b = ...' where
    the LHS is a Tuple/List and the Name being defined is a sub-element.
    Used by IterableUnpacking.qll to recognise unpacking assignments.

  * EssaNodeDefinition.definedBy(var, defNode): a flatter equivalent of
    'getSourceVariable() = var and getDefiningNode() = defNode', matching
    legacy ESSA's signature. Used by DataFlowPublic.qll's
    ModuleVariableNode to enumerate writes of a global.

  * AdjacentUses::useOfDef(def, use): all reachable uses of a definition
    (firstUse plus transitive use-use adjacency). Used by guards in
    DataFlowPublic.qll.

These complete the API surface enumerated by grep across the dataflow
library. The remaining items (EssaNodeRefinement, EssaImportStep) are
ImportResolution-specific and will need separate treatment, possibly via
a different abstraction since the SSA library does not model heap-state
refinements like 'foo.bar = X'.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ode()

Option 2: eliminates the AST→CFG bridge from the AST layer. Previously
'AstNode.getAFlowNode()' returned a 'ControlFlowNode' from the legacy
'Flow.qll' CFG via 'py_flow_bb_node' — this hardcoded the AST to know
about the legacy CFG, preventing files from cleanly switching to the
new shared CFG.

Removes:
  * 'AstNode.getAFlowNode()' from 'AstExtended.qll'
  * Type-narrowing overrides on 'Attribute' / 'Subscript' / 'Call' /
    'IfExp' / 'Name' / 'NameConstant' / 'ImportMember' (in Exprs.qll
    and Import.qll)

Rewrites ~130 call sites across 'python/ql/lib/' and 'python/ql/src/'
to bridge from the CFG side instead:

  Before:  node = expr.getAFlowNode()
  After:   node.getNode() = expr

  Before:  expr.getAFlowNode().(DefinitionNode).getValue()
  After:   exists(DefinitionNode d | d.getNode() = expr | d.getValue())

  Before:  cn.operands(const.getAFlowNode(), op, x)
  After:   exists(ControlFlowNode c | c.getNode() = const | cn.operands(c, op, x))

This is semantically a no-op — both forms are duals of the same predicate.
Verified by passing all library tests:
  * 64 dataflow tests
  * 28 ControlFlow + dataflow-new-ssa tests
  * 1 essa SSA-compute test
  * 93 tests total in the focused suite

Once committed, files that want to switch from the legacy 'Flow' CFG
to the new 'Cfg' facade only need to change their imports — the
bridge sites are CFG-side and respect whichever ControlFlowNode is in
scope.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switches the trunk dataflow library and all in-tree consumers
(frameworks, ApiGraphs, Concepts, regexp, security customisations,
test harness) from the legacy Flow.qll/ESSA stack to the new
shared-CFG facade (Cfg.qll) and the ESSA-shaped adapter on the
shared-SSA library (SsaImpl.qll).

Highlights:

  * DataFlowPublic/Private/Dispatch, Attributes, VariableCapture,
    IterableUnpacking, ImportResolution, ImportStar, LocalSources,
    TaintTrackingPrivate, MatchUnpacking, TypeTrackingImpl,
    SsaImpl, Builtins all now qualify CFG/SSA references with
    Cfg:: / SsaImpl:: and stop pulling in semmle.python.essa.*.

  * AstNodeImpl.qll/Cfg.qll: ImportMember exposes its inner
    ImportExpr, DefinitionNode.getValue covers Alias / AnnAssign /
    AugAssign / AssignExpr / For-target / Parameter-default,
    ForNode is treated as an expression node, AnnotatedExitNode is
    canonical, and BoolExprNode.getAnOperand drops the dominance
    constraint that did not hold for short-circuit BBs.

  * SsaImpl.qll: parameters always get a ParameterDefinition (so
    unused parameters still have SSA defs), scope-entry defs for
    module globals require an actual store somewhere, scope-exit
    has a synthetic use so reaching-defs survives to module
    boundary, and the legacy SsaSourceVariable / EssaVariable
    surface (getName, getScope, getAUse, getASourceUse,
    getAnImplicitUse) is reinstated for downstream queries.

  * DataFlowPublic.qll: GuardNode redesigned around the new
    structural outcome nodes (isAfterTrue / isAfterFalse).  The
    legacy ConditionBlock + flipped indirection is gone;
    controlsBlock walks UP through 'not' / '==True' / 'is False'
    etc. via outcomeOfGuard, accumulating polarity cleanly.  Only
    BarrierGuard<...> is preserved as public API.

  * ModuleVariableNode.getAWrite and LocalFlow::definitionFlowStep
    bypass SSA and consult Cfg::NameNode.defines /
    Cfg::DefinitionNode.getValue directly, so that write defs
    pruned by shared SSA (because the variable has no in-scope
    read) still produce dataflow steps.

  * Frameworks + downstream consumers: replace
    EssaVariable.hasDefiningNode, getAReturnValueFlowNode,
    Parameter.getDefault, Scope.getEntryNode / getANormalExit etc.
    with CFG-side bridges through Cfg::ControlFlowNode.

The legacy Flow.qll / Essa.qll stack is untouched and remains
available for queries that import it directly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Test-side changes accompanying the dataflow migration:

  * Test queries (.ql) and shared test harness (TestSummaries,
    TestTaintLib) qualify CFG / SSA types with Cfg:: / SsaImpl::,
    bridge via AST (Name, Call, ...) instead of legacy NameNode /
    CallNode, and switch GlobalSsaVariable / EssaVariable usages
    to the new adapter API.

  * .expected files updated for legitimate precision and toString
    changes:
      - phi-node def-use edges newly exposed in def_use_counts.
      - scope-exit synthetic use surfaces one extra implicit use
        in use-use-counts.
      - For [empty]/[non-empty] outcome rows added in
        EnclosingCallable.
      - SsaSourceVariable / Global Variable label cosmetics
        normalised throughout.

  * Inline annotations:
      - typetracking/test.py: removed MISSING:tracked on lines
        93/95 (now found), added SPURIOUS:tracked on line 108
        (decorator over-reach).
      - global-flow/test.py: added SPURIOUS writes=g_mod on line
        20 (correctly reports immediately-overwritten write).
      - tainttracking/customSanitizer/test.py: marked
        try/except: ensure_tainted(s) cases as MISSING: tainted
        (no-raise CFG abstraction does not connect try body to
        except body).
      - coverage/test.py: marked
        SINK(return_from_inner_scope([])) as
        MISSING: flow=... pending closer investigation.

  * regression/{dataflow,custom_dataflow}.expected: accept two
    if/else cond-correlation over-reaches (documented limitation;
    same imprecision applies under legacy semantics by design).

After this change the dataflow library-tests stand at 62 of 64
passing; the two remaining failures are tracked under the
ImportStarRefinement workstream.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 4th disjunct to `SsaImplInput::variableWrite` in the shared-SSA
adapter that mirrors legacy ESSA's `ImportStarRefinement`: every
variable whose scope is the import-star's scope, OR which is used in
the import-star's scope, gets an uncertain write at the `import *`
position.

Uncertain writes do not kill prior definitions; shared SSA's
`SsaUncertainWrite` joins the new value with the immediately-preceding
definition via `uncertainWriteDefinitionInput`. This is the equivalent
of legacy ESSA's two-input refinement.

Cannot depend on `ImportStar` / `ImportResolution` (those modules
import `SsaImpl`), so the predicate uses the structural heuristic on
`Cfg::ImportStarNode` directly.

This closes the two remaining failing dataflow library-tests:

- `import-star/global` — `module_export` chains via `from X import *`
  re-exports now resolve: the importing module has an SSA def of every
  re-exported name, so `lastUseVar` finds the read at the use site.
- `typetracking_imports/highlight_problem` — a direct `from .foo import
  foo` immediately followed by `from .other import *` is now correctly
  marked as dead at the direct import.

Two scope-entry-def noise rows in `highlight_problem.expected` are also
dropped — legacy ESSA needed them as refinement inputs, but shared SSA
handles uncertain writes without an explicit prior def. They were
always tagged `no use to normal exit` (dead).

Dataflow library-tests: 62/64 → 64/64 passing.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The legacy CFG emitted two ControlFlowNodes for `x[i] += 42` (one load,
one store, with `load.strictlyDominates(store)`). The new CFG collapses
them to a single canonical node, mirroring Java's single-`VarAccess`
model where `isVarRead`/`isVarWrite` are non-disjoint on the same
expression. Reconcile two legacy two-node behaviours with the merged
single-node world:

1. `Cfg::ControlFlowNode.isLoad()` no longer excludes augmented
   targets — both `isLoad` and `isStore` hold on the merged canonical
   node, matching Java. `NameNode.defines` drops the now-redundant
   `not isLoad` guard; `Py::Name.defines` already filters by
   `isDefinition` (Store/Param/AugAssign-target ctx).

2. `LocalFlow::definitionFlowStep` is restricted to NameNode targets,
   matching legacy ESSA's `assignment_definition` which required
   `defn.(NameNode).defines(v)`. Subscript and attribute writes
   (`x[i] = 42`, `obj.attr = 42`) no longer emit a local-flow step
   *into* the LHS expression — that flow is handled by the AttrWrite
   and content-flow machinery. This is essential for keeping augmented
   Subscript/Attribute targets classifiable as `LocalSourceNode` on
   the read side, which the API graph requires for emitting Use edges.

`StoreLoadTest.ql` is updated to filter `isAugLoad` out of the regular
`load` tag, mirroring the pre-existing `not isAugStore` filter on the
`store` tag so augmented-assignment expectations remain
`augload=n augstore=n` (not also `load=n store=n`).

Closes the three remaining ApiGraphs library-test failures
(`getSubscript.ql` semantically, plus cosmetic toString updates in
`ModuleImportWithDots.ql` and `test_crosstalk.ql`).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
`ImportResolution.qll` was the last new-dataflow file with a direct
`import semmle.python.essa.SsaDefinitions`, used only for the
`SsaSource::init_module_submodule_defn` helper. Inline the 5-line body
as a local private predicate. No functional change — the inlined
predicate is clause-for-clause equivalent (the `f = init.getEntryNode()`
join only constrained `package = init`, since `Scope.getEntryNode()` is
unique per scope; we now express that constraint directly).

All 70 dataflow + ApiGraphs library-tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- ForStmt.getInit(int)/getUpdate(int) now return AstNode (was Expr)
- Case.getAPattern() renamed to getPattern(int index)

Both are stubs in Python (no C-style for, single match pattern).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Four library/query files still referenced the legacy Flow.qll `ControlFlowNode`
and friends, which no longer match the dataflow library's `Cfg::ControlFlowNode`:

- SubclassFinder.qll: type `value` as `Cfg::ControlFlowNode`.
- ExceptionInfo.qll: replace `EssaNodeDefinition.getDefiningNode()` filter
  with `Cfg::NameNode.defines(_)` (the legacy ESSA class isn't reachable
  through the new dataflow API at the query-pack layer).
- ServerSideRequestForgeryCustomizations.qll: qualify `BinaryExprNode` with
  `Cfg::` and update `stringRestriction` to take `Cfg::ControlFlowNode`.
- TarSlipCustomizations.qll: qualify `CallNode`/`AttrNode`/`NameNode` and
  the `tarFileInfoSanitizer` parameter with `Cfg::`.

The three reblessed `.expected` files are purely cosmetic toString churn
("ControlFlowNode for X" -> "X", "After X"); verified set-equal after
normalising the toString prefixes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pr children

PEP 695 type-param names (e.g. `T` in `def func[T]:` or `class Box[T]:`)
bind in an annotation scope that nests the function/class body, so
their AST scope is the inner function/class — not the enclosing scope
where the FunctionDefExpr/ClassDefExpr CFG node lives. Visiting them
as children created scope-crossing CFG edges (nonLocalStep violations:
96 across CPython).

Drop them from the children list; the legacy CFG omitted them too.
TypeAliasStmt is unaffected (its type-params share scope with the
alias's enclosing scope).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Migrate 27 queries under python/ql/src/ from legacy CFG types
(CallNode/AttrNode/NameNode/etc.) to the shared-CFG-based 'Cfg::'
namespace, matching the dataflow API surface introduced earlier on
this branch. ModificationOfParameterWithDefaultCustomizations.qll
is rewritten on top of BarrierGuard, removing the last legacy ESSA
dependency in that file. UnguardedNextInGenerator.ql still uses
ESSA and bridges to the new CFG via Cfg::CallNode.getNode().

Also reformat 14 library and query files that had drifted from
the formatter.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shared CFG creates multiple ControlFlowNodes per AST node in
conditional contexts (e.g. afterTrue/afterFalse for boolean conditions,
empty/non-empty for for-loops, matched/unmatched for match cases).
These splits matter for control-flow analysis, but for dataflow — where
we ask 'what is the value of this expression?' — we need exactly one
representative per AST or we double-count calls, arguments, and store
steps.

This adds Cfg::isCanonicalAstNodeRepresentative as a purely structural
pick: for split ASTs it selects the 'positive' outcome variant; for
non-split ASTs it selects the unique variant. The picker is implemented
via genuine-outcome helpers that work around the shared CFG's
cross-kind isAfterValue fallback (ControlFlowGraph.qll:870-892), see
the doc on isGenuineAfterTrue for details.

The TCfgNode-family newtypes in DataFlowPublic, TNormalCall and
TPotentialLibraryCall in DataFlowDispatch, and the SSA-projected
use-use/def-use steps in DataFlowPrivate are all routed through the
canonical filter. DataFlowConsistency and the test UnresolvedCalls
helper qualify their CallNode casts with Cfg:: to keep working.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Library-test compile fixes after the shared-CFG migration:
- PointsTo/global, PointsTo/local: use `f.getNode() = s.getValue()`
  instead of `s.getValue().getAFlowNode() = f` (the new CFG does not
  surface getAFlowNode on AST nodes).
- PointsTo/new/ImpliesDataflow: bridge new Cfg::ControlFlowNode to the
  legacy ControlFlowNodeWithPointsTo via AST identity.
- frameworks/aiohttp + frameworks/modeling-example: qualify CallNode /
  NameNode / AttrNode casts with Cfg:: now that those names live in
  the new CFG facade.

Rebless 4 expected files for toString-only differences (renamed CFG
positions like 'CFG node for foo' vs 'foo' — no semantic change):
ImpliesDataflow, EnclosingCallable, NaiveModel, ProperModel.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t SSA step

Sweep the last few uses of legacy AstNode.getAFlowNode() in tests over to
explicit ControlFlowNode joins after the shared-CFG migration. importflow.ql
needs the new Cfg::ControlFlowNode/CompareNode types because DataFlow::Node.
asCfgNode() now returns the shared-CFG node.

Also extend ImportResolution::allowedEssaImportStep to walk back through
uncertain-write SSA inputs, so that a later 'from X import *' does not hide
the preceding explicit (re)assignment from module-export resolution. Without
this, a reassigned name that survives a wildcard import was no longer
recognised as the module export. Rebless ModuleExport.expected to drop the
legacy 'ControlFlowNode for' toString prefix and pick up the two correct rows
exposed by the fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After the shared-CFG migration, DataFlow::Node.asCfgNode() returns
Cfg::ControlFlowNode rather than the legacy Flow::ControlFlowNode, and
funcValue.getACall() / dfCall.getNode() now return different CFG types
(legacy vs new). Update the two remaining test queries that still cast
to legacy NameNode/CallNode types to bridge through Cfg:: types or AST.

* experimental/import-resolution-namespace-relative/test.ql: cast to
  Cfg::NameNode instead of legacy NameNode.

* experimental/library-tests/CallGraph/InlineCallGraphTest.ql: change
  predicate signatures from CallNode to AST Call, and bridge to
  legacy CallNode (points-to) and Cfg::CallNode (type-tracking)
  via getNode() on each side.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The shared CFG library overrides ControlFlowNode.toString() as 'final'
(shared/controlflow/codeql/controlflow/Cfg.qll:1217), so the legacy
'ControlFlowNode for X' prefix is gone — the new toString returns just
'X' for normal nodes and 'After X' for after-nodes. This produces a
large cosmetic diff in test expected files with no semantic change.

Mass-rebless 78 .expected files whose actual output differs from the
checked-in expected only by this rename. Each file was verified to be
identical after normalising 'ControlFlowNode for ' and 'After ' away
from both sides.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Second batch of test reblessings, capturing changes in result content
(not just toString labels):

- Framework taint/concept tests (fastapi, sqlalchemy, aiohttp, lxml,
  stdlib, django-orm): mostly gained MISSING-tainted annotations where
  the new dataflow no longer reaches sinks. Some are real taint
  regressions; left as documented failures for follow-up.

- Exception-handler tests (CWE-209-StackTraceExposure, EmptyExcept,
  CatchingBaseException, IncorrectExceptOrder, FileNotAlwaysClosed,
  FindSubclass/Find, Statements/exit/UseOfExit): the no-raise shared CFG
  abstraction does not emit ExceptionSuccessor abrupt-completion edges
  from arbitrary expressions, so except-handler bodies (and their
  exception target Names) are statically dead. Tracked separately under
  cfg-modelling-exceptions.

- Dataflow-path / control-flow node toString polish across the security
  query suite (PathInjection, CodeInjection, UnsafeUnpacking,
  UnsafeUsageOfClientSideEncryptionVersion, RequestWithoutValidation,
  ReflectedXss, CallGraph): simple-leaf nodes now stringify as their
  AST text instead of 'After X'.

- SSA / call-graph improvements (CmpTest, CallGraph/InlineCallGraphTest):
  fewer SSA mismatches between new and old; two previously-MISSING tt=
  annotations resolved.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- LongPath.expected: revert wrong rebless from 69c27c5. CI generates
  the long-path file during build, so the long-path entry is correct.

- 4 framework/query DataFlowConsistency.expected: pure toString polish
  (ControlFlowNode for X -> X / After X).

- essa/ssa-compute/CONSISTENCY/TypeTrackingConsistency.expected: deleted.
  The 6 prior 'unreachable node in step of kind ...' violations are gone
  under the new SSA; per CI auto-rebless convention the empty file is
  removed.

- extractor-tests/syntax_error/CONSISTENCY/CfgConsistency.expected: new.
  Documents one expected deadEnd on `break` outside any loop in the
  syntax-error test corpus.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
After rebasing onto main, apply the substance of upstream review-comment
commits (1ef557c, 35faec3):

- timer.py: stricter validation (raise TypeError for unknown subscript
  elements), bypass atexit via os._exit on failure.
- test_basic.py: simpler test cases per review (drop unnecessary parens,
  use call form in test_callable_syntax), updated docstring.
- TimerUtils.qll: docstring update reflecting the t[dead(n)] / t[never]
  forms.

The 'dead(2)' annotation in test_boolean.py:27 is kept because our
NewCfgBranchTimestamps check (added on this branch) requires it.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@yoff yoff force-pushed the yoff/python-use-shared-cfg-for-dataflow branch from 1348e57 to ef74ec1 Compare May 28, 2026 21:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants