Skip to content

feat: Implement two-sided verification check with check modes#487

Open
MikaelMayer wants to merge 115 commits intomainfrom
feat/two-sided-verification-check
Open

feat: Implement two-sided verification check with check modes#487
MikaelMayer wants to merge 115 commits intomainfrom
feat/two-sided-verification-check

Conversation

@MikaelMayer
Copy link
Contributor

@MikaelMayer MikaelMayer commented Feb 26, 2026

Summary

Replaces the single-sided reachCheck flag with a two-sided verification framework using orthogonal check mode and check amount flags. Each proof obligation now produces a VCOutcome with independent satisfiability and validity properties, enabling richer diagnostic feedback.

Problem

We want to perform richer checks on assert statements beyond simple validity. Covers are existential checks where forking into two means the results are linked by an OR, so they are not suitable for detecting assertions that surely fail along a path. To find such failures, checks must be encoded as assertions, and we need extended diagnostics for them.

A previous PR opened the way by adding a reachability check, demonstrating that two checks per command are feasible. However, the reachability check missed an important case for bug-finding mode: from reachability + validity alone, we cannot derive the result of reachability + satisfiability. By testing both P ∧ Q (satisfiability) and P ∧ ¬Q (validity) where P is the path condition and Q is the property, we get two checks that together determine the validity and satisfiability of Q given P, and also derive reachability.

Solution

Two orthogonal flags replace reachCheck:

  • Check Mode (--check-mode): deductive (default) or bugFinding
  • Check Amount (--check-amount): minimal (default) or full

A per-statement @[fullCheck] annotation can override the global check amount.

Possible outcomes by mode

Default mode (deductive, minimal): validity check only for asserts, satisfiability check only for covers.

For assert statements (validity only, satisfiability masked to unknown):

Emoji Label Meaning
✔️ always true if reachable Validity passed, property always true if reachable
can be false and is reachable Validity failed, solver found a reachable counterexample (with model)
unknown Solver could not determine validity

For cover statements (satisfiability only, validity masked to unknown):

Emoji Label Meaning
can be true and is reachable Satisfiability passed, property can be true and is reachable from declaration entry
✖️ refuted if reachable Satisfiability failed, property always false if reachable
unknown Solver could not determine satisfiability

Bug-finding mode (bugFinding, minimal): satisfiability check only for all statement types. Same as the cover table above.

Full mode (full): both checks run, all 9 outcomes possible. The last two columns show the error reporting level in SARIF output for each mode (✅ = pass, 🔴 = error, 🟡 = warning, 🔵 = note).

Emoji Label P ∧ Q P ∧ ¬Q Inferred reachability Meaning Deductive mode error level Bug finding mode error level
always true and is reachable sat unsat Property always true, reachable from declaration entry
always false and is reachable unsat sat Property always false, reachable from declaration entry 🔴 🔴
🔶 indecisive and reachable sat sat Reachable from declaration entry, solver found models for both the property and its negation 🔴 🔵
⛔/❌ unreachable unsat unsat Dead code, path unreachable (⛔ warning for assert, ❌ error for cover) 🟡/🔴 🟡/🔴
can be true and is reachable sat unknown Property can be true and is reachable from declaration entry, validity unknown 🔴 🔵
✖️ refuted if reachable unsat unknown Property always false if reachable, reachability unknown 🔴 🔴
can be false and is reachable unknown sat Solver found a model for P ∧ ¬Q: path is reachable from declaration entry and Q can be false, but satisfiability of Q is unknown 🔴 🔵
✔️ always true if reachable unknown unsat Property always true if reachable, reachability unknown
unknown unknown unknown Both checks inconclusive 🔴 🔵

Testing

All existing tests updated. New tests cover the full outcome matrix including per-statement @[fullCheck] annotations. BoogieToStrata integration tests, Python analysis tests, and SARIF output all updated.

Implement the two-sided verification check design that distinguishes between
'always true', 'always false', 'indecisive', and 'unreachable' outcomes.

Key changes:
- Add checkSatAssuming to SMT Solver for assumption-based queries
- Replace Outcome inductive with VCOutcome structure containing two SMT.Result fields
- Add CheckMode enum (full/validity/satisfiability) to Options
- Update encoder to emit two check-sat-assuming commands
- Update SARIF output to handle nine possible outcome combinations
- Default to validity mode for backward compatibility

The two-sided check asks:
1. Can the property be true? (satisfiability check)
2. Can the property be false? (validity check)

This enables distinguishing:
- pass (sat, unsat): always true and reachable
- refuted (unsat, sat): always false and reachable
- indecisive (sat, sat): true or false depending on inputs
- unreachable (unsat, unsat): path condition contradictory
- Five partial outcomes when one check returns unknown

Breaking change: VCResult API changed, all consumers must be updated.
Tests need updating to reflect new default behavior (validity mode only).

See TWO_SIDED_CHECK_IMPLEMENTATION.md for complete implementation details.
- Add CLI parsing for --check-mode flag (full/validity/satisfiability)
- Remove deprecated --reach-check flag
- Update help message with check mode documentation
- Fix StrataVerify to use 'outcome' field instead of 'result'
- Update emoji symbols for better visual distinction:
  - ✅ for pass (valid and reachable)
  - ✔️ for always true if reachable
  - ✖️ for refuted if reachable
  - ❌ for refuted (always false and reachable)
  - ⛔ for unreachable
  - 🔶 for indecisive
  - ➕ for satisfiable
  - ➖ for reachable and can be false
- Add metadata fields: fullCheck, validityCheck, satisfiabilityCheck
- Add helper methods to check for these annotations
- Update verifySingleEnv to check metadata before using global checkMode
- Annotations override global --check-mode flag for specific statements
- Add VCOutcomeTests.lean with all 9 outcome combinations
- Test both predicate methods and emoji/label rendering
- Use named arguments for clarity
- Update SMTEncoderTests to use full check mode for existing tests
- Ensures backward compatibility with expected 'pass' outcome
- Add VCOutcomeTests.lean with all 9 outcome combinations
- Each test shows emoji and label in output for easy verification
- Use named arguments for clarity
- Update SMTEncoderTests to use full check mode for existing tests
- Ensures backward compatibility with expected 'pass' outcome
- Add VCOutcomeTests.lean with all 9 outcome combinations
- Use formatOutcome helper to avoid repetition
- Each test shows emoji and label in output
- Use named arguments for clarity
- Update SMTEncoderTests to use full check mode
- Ensures backward compatibility with expected 'pass' outcome
- Document CLI flag integration
- Document per-statement annotations
- Document emoji updates
- Document comprehensive test suite
- Document test fixes for backward compatibility
- Fix StrataVerify to properly format Except String VCOutcome
- Update StrataMain to use vcResult.outcome instead of vcResult.result
- Use isRefuted/isRefutedIfReachable predicates for failure detection
- Format outcomes with emoji and label
Clarifies that refuted outcome means reachable and always false
…ters

- Rename isRefuted -> isRefutedAndReachable
- Rename isIndecisive -> isIndecisiveAndReachable
- Rename isRefutedIfReachable -> isAlwaysFalseIfReachable
- Add backward compatibility aliases
- Add cross-cutting predicates: isAlwaysFalse, isAlwaysTrue, isReachable
- Enables filtering outcomes by properties across multiple cases
…ariants

- isPass: true if validityProperty is unsat (always true), regardless of reachability
- isPassAndReachable: true if (sat, unsat) - proven reachable and always true
- isPassIfReachable: true if (unknown, unsat) - always true if reachable
- Update label/emoji to use isPassAndReachable and isPassIfReachable
- Update test comments to reflect new naming
- Add backward compatibility alias isAlwaysTrueIfReachable
…overs all sat cases

- isSatisfiable: true for any sat satisfiabilityProperty
- isSatisfiableValidityUnknown: specific case (sat, unknown)
- Rename isPassIfReachable -> isPassReachabilityUnknown
- Rename isAlwaysFalseIfReachable -> isAlwaysFalseReachabilityUnknown
- Rename isReachableAndCanBeFalse -> isCanBeFalseAndReachable
- All predicates now have reachability info at the end
- Add backward compatibility aliases for all old names
- Nine base cases without 'is': passAndReachable, refutedAndReachable, etc.
- Derived predicates with 'is': isPass, isSatisfiable, isReachable, etc.
- Base cases represent exact outcome combinations
- Derived predicates check properties across multiple outcomes
- Update SarifOutput to use base cases in outcomeToLevel/outcomeToMessage
- Update label/emoji functions to use base cases
- Maintain backward compatibility aliases for all old names
- Add VerificationMode enum: deductive vs bugFinding
- Deductive mode: only pass is success, anything not proven is error/warning
- Bug finding mode: refuted is error, unknown is acceptable warning
- Group outcomes by severity (one .none, one .error, one .warning, one .note per mode)
- Default to deductive mode for backward compatibility
…e isAlwaysFalse

- Deductive mode: only pass/unreachable are success/note, everything else is error
- Bug finding mode: use isAlwaysFalse predicate instead of listing base cases
- Cleaner and more maintainable
…achable is warning in deductive

- Consistent naming: use 'alwaysFalse' instead of 'refuted' in base cases
- Deductive mode: unreachable is warning (dead code detection)
- Update all references in Verifier.lean and SarifOutput.lean
- Maintain backward compatibility aliases
- Replace isAlwaysFalse with explicit base cases: alwaysFalseAndReachable, alwaysFalseReachabilityUnknown
- Add comment listing all error cases in deductive mode
- Clearer mapping from base cases to severity levels
- Remove 'Verification succeeded/failed' language
- Use neutral descriptions: 'Always true and reachable', 'Always false and reachable'
- Messages work for any property type (assertion, invariant, requires, etc.)
- Shorter and clearer messages
…nknown outcomes

- alwaysFalseReachabilityUnknown has validityProperty = unknown (not sat), no counterexample
- unknown outcome can have models from either satisfiability or validity property
- Show models from both properties when available for unknown outcome
- alwaysFalseReachabilityUnknown has validityProperty = unknown (no model)
- unknown outcome also has no models (Result.unknown carries no data)
- Only Result.sat carries counterexample models
…rties

- Eliminates redundant predicate checks in outcomeToMessage
- Single exhaustive match covers all 9 base cases plus error cases
- More concise and easier to verify correctness
- Test predicates, messages, and severity levels for each outcome
- Verify deductive and bug finding modes produce correct SARIF levels
- Self-contained test outputs with no numbered comments
- Tests ensure SARIF output matches predicate semantics
- Add missing validityCheck parameter (now takes satisfiabilityCheck and validityCheck)
- Use Except.ok/Except.error to avoid ambiguity
@andrewmwells-amazon
Copy link
Contributor

I think we should copy the explanation table somewhere more permanent than the PR description.

@MikaelMayer
Copy link
Contributor Author

I have two questions about the table contents:

Emoji Label Sat P ∧ Q Val P ∧ ¬Q Inferred reachability Meaning Deductive mode error level Bug finding mode error level
❌ refuted and reachable unsat sat ✅ Property always false, reachable from declaration entry 🔴 🔴
Shouldn't "Property always false" be "Property can be false" ?

On this line, we proved that P ∧ Q is unsat, and P ∧ ¬Q is sat. Hence, P is sat, meaning the path condition makes Q reachable.
That P ∧ Q is unsat means none of these assignments can make Q be true, so Q is proved to always be false. This is a bug for bug finding we will want to report.

Emoji Label Sat P ∧ Q Val P ∧ ¬Q Inferred reachability Meaning Deductive mode error level Bug finding mode error level
🔶 indecisive and reachable sat sat ✅ Reachable from declaration entry, solver found models for both the property and its negation 🔴 🔵
How can that happen? Seems impossible to me

Here is the following.

procedure m(x: int) {
   assert x > 0
}

In this one, we have P == true, Q == (x > 0), ¬Q == !(x > 0).
Both P ∧ Q on one side and P ∧ ¬Q are satisfiable.
Funny, on my side, I thought that (unsat, unsat) was the impossible case two weeks ago until I realized it encoded path condition reachability

Emoji Label Sat P ∧ Q Val P ∧ ¬Q Inferred reachability Meaning Deductive mode error level Bug finding mode error level
✖️ refuted if reachable unsat unknown ❓ Property always false if reachable, reachability unknown 🔴 🔴
Given that we don't know whether the check is even reachable, should this really be red for bug finding mode? A check that can't be reached is not a bug.

That's an excellent question. What this case proves is that, if the line is reached, it will surely cause a failure in the assertion. Since we assume in a well-formed codebase that every line is eventually reachable, and that assertions should never fail if reachable, I consider it as a bug, but I'm open to other narratives.

Emoji Label Sat P ∧ Q Val P ∧ ¬Q Inferred reachability Meaning Deductive mode error level Bug finding mode error level
➖ can be false if reachable unknown sat ✅ Solver found a counterexample reachable from declaration entry, reachability unknown 🔴 🔵
Why is this not double red? I would think that any sat result for P ∧ ¬Q proves there is a bug. Also, why does the text say reachability unknown but the icon says it is known? I would think reachability is proven here. Instead of "can be false if reachable" shouldn't this say "refuted and reachable" ?

The problem is modularity. If we prove sat for P ∧ ¬Q , we have proved that the path condition is reachable from the procedure entry for a particular set of inputs and that with these inputs, Q will be false. This is a problem for deductive verification since it's the only place where we have to prove that Q is always true.
However, for bug finding mode, we cannot guarantee based on the verification condition that P will ever be called with these inputs, perhaps at every call site there are guards that will prevent these inputs.

You are right there was an issue it has to be "can be false and reachable". Excellent finding. Let me fix that.

I'm rather confused about the meaning of both the --check-mode and --check-amount flags. IMO the user-facing output should answer these two questions:

Let's continue the conversation to see how I could best rephrase those

What would your table look like if you asked not P ^ Q and P ^ !Q, but instead P ^ !Q (the first user-facing question) and P (the second user-facing question) ? It seems simpler to me to encode the user-facing questions as directly as you can to SMT questions.

If you check P and P ^ !Q, you will not get the satisfiability of P ^ Q. At best, you can know that P ^ Q is unsat if P is unsat, but if P is sat, whatever the result of P ^ !Q, you cannot deduce the satisfiability of P ^ Q.
There is redundancy of checking the satisfiability of the two conjuncts you mentioned because if P ^ !Q is sat, you automatically get that P is sat.

@MikaelMayer
Copy link
Contributor Author

MikaelMayer commented Mar 3, 2026

I think this would be the table and I find it easier to interpret:

Emoji Label Val P ∧ ¬Q Reach P Inferred reachability Meaning Deductive mode error level Bug finding mode error level
✅ pass and reachable unsat sat ✅ Property always true, reachable from declaration entry ✅ ✅
✔️ pass if reachable unsat unknown ❓ Property always true if reachable, reachability unknown ✅ ✅
➕ satisfiable unknown sat ✅ Property can be true and is reachable from declaration entry 🔴 🔵
How can you state that the property Q can be true here?
⛔ unreachable unsat unsat ❌ Dead code, path unreachable 🟡 🟡
⛔ unreachable unknown unsat ❌ Dead code, path unreachable 🟡 🟡
❌ refuted and reachable sat sat ✅ Property can be false, reachable from declaration entry 🔴 🔴
❌ refuted and reachable sat unknown ✅ Property can be false, reachable from declaration entry 🔴 🔴
This is an unlikely outcome, as solvers tend to move from sat to unknown, not the other way round 👀
❓ unknown unknown unknown ❓ Both checks inconclusive 🔴 🔵
🔶 impossible sat unsat _ impossible _ _

your table misses one case for bug finding, which is the case when something is refuted if reachable, which indicates either dead code or wrong code, but that's exactly my point in the previous discussion and I'm happy to discuss it further.

Obligation: unreach_cover
Property: cover
Result: ❌ fail (❗path unreachable)
Result: unreachable
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert and this cover should produce pass and fail respectively, even if they are both reachable.

Obligation: pe_assert_pass
Property: assert
Result: ✅ pass (❗path unreachable)
Result: ✅ pass and reachable from declaration entry
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about this. We say that the assertion is suddenly provably reachable when before this PR it was proved to be unreachable. Investigate the test and figure out what is wrong and what is right.


/--
info: #["assertion holds vacuously (path unreachable)", "cover property is unreachable"]
info: #[]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the test stops producing something that was meaningful. Why?

@keyboardDrummer
Copy link
Contributor

keyboardDrummer commented Mar 4, 2026

Thanks a lot for the detailed answer Mikael!

Btw, I find this PR really cool! I have no idea how you came up with all this. The table is impressive as well.

However, for bug finding mode, we cannot guarantee based on the verification condition that P will ever be called with these inputs, perhaps at every call site there are guards that will prevent these inputs.

I see, makes sense. I think I would phrase this as that the user is unsure whether they have all the requires clauses they need, so they want us to differentiate between a bug that occurs for every input and one that only occurs on some inputs.

Maybe the UX could be:

Flags: --fail-on: 0 (could not exclude bugs) | 1 (found a bug if preconditions are correct) | 2 (found a bug)

The success or failure of the verification then depends on the answer to: is there a runtime violation of an assertion?

  • Yes for every input. (sat P^!Q, unsat P^Q)
    • Always fail
  • Yes for an input. (sat P^!Q, sat/unknown P^Q)
    • Fail on fail-on < 2
  • Maybe. (unknown P^!Q)
    • Fail on fail-on < 1
  • No. (unsat P^!Q)
    • Never fail

Note that the above means we never report Yes .. if we're not sure whether the assertion can be reached. I think that makes sense for the bug finding modes. Users want us to find proven runtime issues, not situations where there is a runtime issue ONLY if the check can be reached.

@MikaelMayer
Copy link
Contributor Author

Maybe the UX could be:

Flags: --fail-on: 0 (could not exclude bugs) | 1 (found a bug if preconditions are correct) | 2 (found a bug)
The success or failure of the verification then depends on the answer to: is there a runtime violation of an assertion?

deductive mode has the notion of passing = all assertions are true at minimum if reachable.
bug finding mode has the notion of failing = assertions are provably false if reachable.
In the base case, it's all verified locally: if reachable.

But more information is usually useful to fix bugs (and also prove code), such as reachability (we need to fix the path) or refutability for validity (wait perhaps we were asserting the opposite of what is true!),

This is why there are 2 modes and 2 levels of reporting. Having only three levels means we are not offering one of the original 4 combinations:

  • bugFinding error reporting
  • bugFinding error reporting + more information about validity checks
  • deductive error reporting (stricter)
  • deductive error report + more information about satisfiability checks

Any tool using Strata could however skip one of these modes if they want to.

Maybe I'm understanding something else than what you are bringing thought. Is your remark as being about the error level of always false if reachable in the case of bugFinding mode? That it should be decided by the user?

I was hoping that if that is the case, since the strata tool would typically be invoked by tools, they would do the filtering themselves.

- pass and reachable → always true and is reachable
- refuted and reachable → always false and is reachable
- satisfiable → can be true and is reachable
- pass if reachable → always true if reachable
def defaultSolver : String := "cvc5"

/-- Check amount: how much information to gather -/
inductive CheckAmount where
Copy link
Contributor

@shigoel shigoel Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smallest of nits/feel free to ignore: can we say CheckLevel instead? Amount to me signals that something numeric is involved...

A model expressed as Core `LExpr` values, suitable for display
using Core's expression formatter and for future use as program metadata.
-/
abbrev LExprModel := List (Expression.Ident × LExpr CoreLParams.mono)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did LExprModel go away in this PR, or am I missing something?

@keyboardDrummer
Copy link
Contributor

keyboardDrummer commented Mar 5, 2026

But more information is usually useful to fix bugs (and also prove code), such as reachability (we need to fix the path) or refutability for validity (wait perhaps we were asserting the opposite of what is true!),

I agree that knowing that something is not reachable is useful information, but I think this information is significantly less important then answering the question "can my program violate an assertion at runtime?" and I would consider these as two separate questions.

This is why there are 2 modes and 2 levels of reporting. Having only three levels means we are not offering one of the original 4 combinations:

I suggested having 3 levels for the main question, and two levels for the reachability question (on/off), so a combination of 6 options. I think with the 4 modes you describe you're leaving out one of the three options of the first question: testing for all proven bugs assuming that the preconditions are complete. It's in between your deductive and bug finding mode. In this mode, the "indecisive and reachable" and "can be false and is reachable" rows from your table would report an error, and otherwise it would behave like the bug finding mode.

Do you think this is not a valid use-case? Maybe it's not. If you consider the preconditions to be part of the correctness proof, and the "bug finding mode" is one where you don't want to fail when the proof is incomplete, then indeed you don't care about that case.

However, if we think providing complete preconditions is simpler than providing a correctness proof, then I think "test for all proven bugs assuming that the preconditions are complete" could be useful. There are many bugs that only trigger on some inputs, so we could find a lot more bugs with this mode than with the bug finding mode that you're describing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants