Skip to content

Materialize ancestor information#2408

Draft
ctron wants to merge 14 commits into
guacsec:release/0.4.zfrom
ctron:feature/atlas_perf_1_mat
Draft

Materialize ancestor information#2408
ctron wants to merge 14 commits into
guacsec:release/0.4.zfrom
ctron:feature/atlas_perf_1_mat

Conversation

@ctron

@ctron ctron commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Note: this PR builds on top of #2407, only the last few commits are relevant.

It creates a table materializing the ancestor information of the RH model. Thus saving us from re-exploring that over and over again.

This brings down the "pypi request" query from days to ~13 mins:

➜  trustify git:(feature/atlas_perf_1_mat) time xh GET localhost:8080/api/v2/analysis/latest/component?q=purl=pkg%3Apypi%2Frequests (oidc token -Hf trusty)
HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Expose-Headers: content-type
Content-Encoding: br
Content-Type: application/json
Date: Fri, 19 Jun 2026 06:49:53 GMT
Transfer-Encoding: chunked
Vary: Origin, Access-Control-Request-Method, Access-Control-Request-Headers
Vary: accept-encoding

{
    "items": [],
    "total": 0
}



________________________________________________________
Executed in  747.02 secs      fish           external
   usr time   24.16 millis    0.00 micros   24.16 millis
   sys time   20.16 millis  929.00 micros   19.23 millis

However, this is a draft and it may break stuff. So let's be optimistic, but cautious.

Summary by Sourcery

Materialize SBOM ancestor and describing-CPE relationships to avoid expensive runtime graph walks and improve analysis query performance.

New Features:

  • Introduce materialized sbom_describing_cpe and sbom_ancestor tables plus corresponding ORM entities to store SBOM describing CPEs and cross-SBOM ancestor links.
  • Populate describing CPEs and ancestor links during SBOM ingestion for CycloneDX and SPDX documents.
  • Add a Containerfile and .containerignore for building and packaging the trustd binary into a container image.

Enhancements:

  • Refactor SBOM CPE and ancestor resolution to use batched queries and the new materialized tables instead of recursive in-memory graph traversal.
  • Optimize product advisory and context CPE queries to read from sbom_describing_cpe, simplifying SQL and reducing join cost.
  • Improve logging with truncated iterable formatting and richer instrumentation for graph loading and ranking.
  • Add a helper to migrate the database up to a specific migration and extend the trustd db CLI to support targeted migrations.
  • Update several dependencies (moka, petgraph, sea-orm, tokio, uuid) and align uuid usage in the storage module.
  • Adjust tests and entities to reflect the new ancestor and describing-CPE materialization behavior.

Tests:

  • Add tests verifying sbom_describing_cpe semantics and that ingesting linked SBOMs correctly populates the sbom_ancestor table.

Summary by Sourcery

Materialize cross-SBOM ancestor relationships and tighten read-only transaction handling to speed up analysis queries and avoid connection pool exhaustion.

New Features:

  • Introduce a materialized sbom_ancestor table and ORM entity to store cross-SBOM ancestor links between SBOMs.
  • Populate sbom_ancestor entries during CycloneDX and SPDX SBOM ingestion based on checksum-linked external nodes.

Enhancements:

  • Refactor SBOM CPE resolution to read ancestor CPEs via sbom_ancestor and sbom_describing_cpe instead of on-the-fly graph walking and checksum-based DFS.
  • Centralize the begin_read helper in the common db module and adopt consistent read-only transactions across analysis, SBOM, advisory, product, purl, vulnerability, organization, and license endpoints.
  • Improve graph loading logging with truncated iterator formatting and minor utility cleanups (e.g., timestamp helper, error wiring).

Tests:

  • Add tests ensuring sbom_ancestor entries are populated for linked SBOMs and that deep descendant queries only use a single lazy pool connection.
  • Add a LazyPool test context and new SPDX fixtures to simulate many descendant SBOMs for pool exhaustion testing.

@sourcery-ai

sourcery-ai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Reviewer's Guide

Materializes cross-SBOM ancestor and describing-CPE data into dedicated tables, refactors CPE resolution to use those tables instead of runtime graph walks, and introduces read-only transactional endpoints and tests to ensure performant, connection-efficient analysis queries.

Sequence diagram for resolve_sbom_cpes using materialized ancestors

sequenceDiagram
    participant Endpoint as AnalysisEndpoint
    participant Service as RankService
    participant DB as Database

    Endpoint->>Service: resolve_sbom_cpes(cpe_search, rows, connection)
    alt cpe_search == true
        Service->>DB: batch_resolve_direct_cpe_matches(rows)
        DB-->>Service: Vec<RankedSbom> (direct CPEs)
    end

    Service->>DB: batch_resolve_ancestor_cpes(sbom_ids)
    Note over Service,DB: Phase 2a: direct ancestors via sbom_ancestor + sbom_describing_cpe
    DB-->>Service: phase2a: Vec<AncestorCpeRow>

    Note over Service,DB: Phase 2b: recursive ancestors via sbom_ancestor + sbom_describing_cpe
    DB-->>Service: phase2b: Vec<AncestorCpeRow>

    Service->>Service: build rows_by_sbom
    Service->>Service: map phase2a + phase2b to RankedSbom (with dedup)

    Service-->>Endpoint: Vec<RankedSbom>
Loading

Entity relationship diagram for materialized sbom_ancestor and describing CPEs

erDiagram
    SBOM {
        uuid sbom_id PK
    }

    SBOM_ANCESTOR {
        uuid sbom_id PK
        uuid ancestor_sbom_id PK
    }

    SBOM_DESCRIBING_CPE {
        uuid sbom_id FK
        uuid cpe_id
    }

    SBOM_EXTERNAL_NODE {
        uuid sbom_id FK
        text external_node_ref
    }

    SBOM_NODE_CHECKSUM {
        uuid sbom_id FK
        text node_id
        text value
    }

    SBOM ||--o{ SBOM_ANCESTOR : has_ancestors
    SBOM ||--o{ SBOM_DESCRIBING_CPE : has_cpes
    SBOM ||--o{ SBOM_EXTERNAL_NODE : has_external_nodes
    SBOM ||--o{ SBOM_NODE_CHECKSUM : has_checksums

    SBOM_ANCESTOR }o--|| SBOM : ancestor_sbom
    SBOM_EXTERNAL_NODE ||--o{ SBOM_NODE_CHECKSUM : matches_by_node
    SBOM_NODE_CHECKSUM ||--o{ SBOM_NODE_CHECKSUM : shares_value
Loading

File-Level Changes

Change Details Files
Replace recursive ancestor/external SBOM resolution in analysis ranking with queries over a new materialized sbom_ancestor table plus existing sbom_describing_cpe data.
  • Remove recursive CTE-based ancestor walking, checksum-based external SBOM DFS, and their caching context from the ranking loader.
  • Introduce AncestorCpeRow and batch_resolve_ancestor_cpes to fetch direct and recursive ancestor CPEs via sbom_ancestor and sbom_describing_cpe.
  • Adjust resolve_sbom_cpes to run only batched direct CPE matching plus phase-2 ancestor CPE resolution, deduping recursive results against direct ones.
  • Simplify graph loading logs to use TruncatedIter and clean up unused debug helpers and cache structs.
modules/analysis/src/service/load/rank.rs
modules/analysis/src/service/load/mod.rs
Materialize cross-SBOM ancestor relationships during SBOM ingestion and expose them via new ORM entities and migrations.
  • Add sbom_ancestor entity model and relation from sbom to sbom_ancestor in the entity crate.
  • Create migration m0002130_sbom_ancestor to define sbom_ancestor table, indexes, and backfill data using sbom_external_node and sbom_node_checksum.
  • Implement SbomContext::populate_ancestors that inserts (child, ancestor) rows for the current SBOM using checksum matches around sbom_external_node references.
  • Call populate_ancestors from both CycloneDX and SPDX SBOM ingestion paths.
  • Add tests ensuring ingesting linked SBOMs populates sbom_ancestor with a single unidirectional child→ancestor link.
entity/src/lib.rs
entity/src/sbom.rs
entity/src/sbom_ancestor.rs
migration/src/lib.rs
migration/src/m0002130_sbom_ancestor.rs
modules/ingestor/src/graph/sbom/mod.rs
modules/ingestor/src/graph/sbom/cyclonedx.rs
modules/ingestor/src/graph/sbom/spdx.rs
modules/analysis/src/service/load/rank.rs
Introduce a shared DatabaseExt::begin_read helper and adopt read-only repeatable-read transactions across many HTTP endpoints.
  • Move DatabaseExt::begin_read trait into common db module returning DbErr and re-export from fundamental db module.
  • Update numerous analysis, SBOM, advisory, product, purl, vulnerability, organization, license, and label endpoints to use begin_read() and return the module Error type from handlers.
  • Wire ExternalReferenceQueryParseError and authorization errors into the fundamental Error enum and ResponseError implementation so handler signatures can use Error.
  • Add a test using a LazyPool context to assert deep descendant queries use a single pooled connection when using read-only transactions.
common/src/db/mod.rs
modules/fundamental/src/db.rs
modules/analysis/src/endpoints/mod.rs
modules/analysis/src/endpoints/tests/mod.rs
modules/fundamental/src/sbom/endpoints/mod.rs
modules/fundamental/src/sbom/endpoints/query.rs
modules/fundamental/src/error.rs
modules/fundamental/src/advisory/endpoints/mod.rs
modules/fundamental/src/advisory/endpoints/label.rs
modules/fundamental/src/product/endpoints/mod.rs
modules/fundamental/src/purl/endpoints/mod.rs
modules/fundamental/src/purl/endpoints/base.rs
modules/fundamental/src/sbom/endpoints/label.rs
modules/fundamental/src/vulnerability/endpoints/mod.rs
modules/fundamental/src/organization/endpoints/mod.rs
modules/fundamental/src/license/endpoints/mod.rs
Extend test infrastructure to support lazy connection pools and add data to simulate pool exhaustion scenarios.
  • Add LazyPool test wrapper that creates a separate Database with min_conn=0 and small max_conn and delegates teardown.
  • Expose TrustifyContext::lazy_pool constructor to build a new Database pointing to the same Postgres instance/database with adjusted pool settings.
  • Enable reqwest's stream feature in test-context to support new tests.
  • Add SPDX test data under etc/test-data/spdx/pool-exhaustion/* used by deep_descendants_lazy_pool test.
test-context/Cargo.toml
test-context/src/ctx/mod.rs
test-context/src/ctx/lazy_pool.rs
test-context/src/lib.rs
etc/test-data/spdx/pool-exhaustion/main.json
etc/test-data/spdx/pool-exhaustion/leaf-1.json
etc/test-data/spdx/pool-exhaustion/leaf-2.json
etc/test-data/spdx/pool-exhaustion/leaf-3.json
etc/test-data/spdx/pool-exhaustion/leaf-4.json
etc/test-data/spdx/pool-exhaustion/leaf-5.json
etc/test-data/spdx/pool-exhaustion/leaf-6.json
etc/test-data/spdx/pool-exhaustion/leaf-7.json
etc/test-data/spdx/pool-exhaustion/leaf-8.json
etc/test-data/spdx/pool-exhaustion/leaf-9.json
etc/test-data/spdx/pool-exhaustion/leaf-10.json
etc/test-data/spdx/pool-exhaustion/leaf-11.json
etc/test-data/spdx/pool-exhaustion/leaf-12.json
etc/test-data/spdx/pool-exhaustion/leaf-13.json
etc/test-data/spdx/pool-exhaustion/leaf-14.json
etc/test-data/spdx/pool-exhaustion/leaf-15.json
etc/test-data/spdx/pool-exhaustion/leaf-16.json
etc/test-data/spdx/pool-exhaustion/leaf-17.json
etc/test-data/spdx/pool-exhaustion/leaf-18.json
etc/test-data/spdx/pool-exhaustion/leaf-19.json
etc/test-data/spdx/pool-exhaustion/leaf-20.json

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

@ctron ctron force-pushed the feature/atlas_perf_1_mat branch 5 times, most recently from 93b20d5 to d4c51fa Compare June 23, 2026 13:01
ctron and others added 7 commits June 23, 2026 15:02
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
@ctron ctron force-pushed the feature/atlas_perf_1_mat branch from d4c51fa to 5745bdc Compare June 23, 2026 13:02
Traversing the nodes and chasing external documents might require
loading those documents. So far, this used on DB connection each, as
we've been passing in the connection pool. This may lead to pool
exhaustion and the API call failing.

We now pass in a read-only transaction, which will lead to the DB
connection acting as a mutex, only consuming one connection.

It also adds a test ensuring that's the case.

Co-authored-by: Claude <noreply@anthropic.com>
@ctron ctron force-pushed the feature/atlas_perf_1_mat branch from 5745bdc to eb9d047 Compare June 23, 2026 13:09
Co-authored-by: Claude <noreply@anthropic.com>
@ctron ctron marked this pull request as ready for review June 23, 2026 13:26
@rh-jfuller rh-jfuller self-requested a review June 23, 2026 13:27

@sourcery-ai sourcery-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey - I've found 1 issue, and left some high level feedback:

  • In resolve_sbom_cpes the top_ancestor_sbom for ancestor-derived CPEs is always set to Uuid::nil(), which drops information that previously came from the resolved ancestor SBOM; consider threading the actual ancestor SBOM id through batch_resolve_ancestor_cpes and into RankedSbom so callers can still distinguish which SBOM provided the CPE.
  • The recursive ancestor CPE CTE in batch_resolve_ancestor_cpes uses a hardcoded depth < 10 limit; it would be safer to either document this bound clearly and centralize it as a constant/config or make it configurable so future changes to graph depth expectations don’t require hunting for magic numbers in SQL strings.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `resolve_sbom_cpes` the `top_ancestor_sbom` for ancestor-derived CPEs is always set to `Uuid::nil()`, which drops information that previously came from the resolved ancestor SBOM; consider threading the actual ancestor SBOM id through `batch_resolve_ancestor_cpes` and into `RankedSbom` so callers can still distinguish which SBOM provided the CPE.
- The recursive ancestor CPE CTE in `batch_resolve_ancestor_cpes` uses a hardcoded `depth < 10` limit; it would be safer to either document this bound clearly and centralize it as a constant/config or make it configurable so future changes to graph depth expectations don’t require hunting for magic numbers in SQL strings.

## Individual Comments

### Comment 1
<location path="modules/fundamental/src/sbom/endpoints/mod.rs" line_range="95" />
<code_context>
     web::Query(paginated): web::Query<Paginated>,
     _: Require<ReadSbom>,
-) -> actix_web::Result<impl Responder> {
+) -> actix_web::Result<impl Responder, Error> {
     let query = OwnedComponentReference::try_from(key.as_str())?;
+    let tx = db.begin_read().await?;
</code_context>
<issue_to_address>
**issue (bug_risk):** Endpoint return types use `actix_web::Result<_, Error>`, which does not match Actix's `Result` alias and likely won't compile.

`actix_web::Result` is defined as `type Result<T> = std::result::Result<T, actix_web::Error>`, so it only accepts one type parameter. Writing `actix_web::Result<impl Responder, Error>` adds a second parameter and will not compile.

This pattern shows up across many endpoints (e.g. `get_unique_licenses`, `get_license_export`, `all`, `all_related`, `count_related`, `get`, `get_sbom_advisories`, `delete`, `packages`, `related`, `upload`, `download`, and similar advisory/product/purl/vulnerability/organization/analysis handlers).

To keep using your explicit `Error` type, either:
- change signatures to `-> Result<impl Responder, Error>` (without the `actix_web::Result` alias), or
- revert to `-> actix_web::Result<impl Responder>` and rely on `Error: ResponseError` to convert into `actix_web::Error`.

As written, the mixed form will fail to type-check.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

id: web::Path<String>,
_: Require<ReadSbom>,
) -> actix_web::Result<impl Responder> {
) -> actix_web::Result<impl Responder, Error> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

issue (bug_risk): Endpoint return types use actix_web::Result<_, Error>, which does not match Actix's Result alias and likely won't compile.

actix_web::Result is defined as type Result<T> = std::result::Result<T, actix_web::Error>, so it only accepts one type parameter. Writing actix_web::Result<impl Responder, Error> adds a second parameter and will not compile.

This pattern shows up across many endpoints (e.g. get_unique_licenses, get_license_export, all, all_related, count_related, get, get_sbom_advisories, delete, packages, related, upload, download, and similar advisory/product/purl/vulnerability/organization/analysis handlers).

To keep using your explicit Error type, either:

  • change signatures to -> Result<impl Responder, Error> (without the actix_web::Result alias), or
  • revert to -> actix_web::Result<impl Responder> and rely on Error: ResponseError to convert into actix_web::Error.

As written, the mixed form will fail to type-check.

ctron and others added 2 commits June 23, 2026 15:49
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
use trustify_test_context::TrustifyContext;

/// Compute the maximum depth of the descendant tree.
fn max_descendant_depth(nodes: &[model::Node]) -> u64 {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice ;)

@rh-jfuller rh-jfuller left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM ... there are some minor corner cases (that became apparent during review) related to dedup and cycle detection but they have always been there and I do not think impacts.

@rh-jfuller

Copy link
Copy Markdown
Contributor

/scale-test

@github-actions

Copy link
Copy Markdown

🛠️ Scale test has started! Follow the progress here: Workflow Run

@rh-jfuller

Copy link
Copy Markdown
Contributor

note: scale-tests use datasets that are prob tied to main (and will not work against release branch)

ctron and others added 2 commits June 24, 2026 08:28
The node_map in batch_resolve_direct_cpe_matches was keyed by node_id
alone, causing silent HashMap collisions when multiple SBOMs shared
the same node_id string (e.g. common bom-refs like "shared-ref").

Collect all matching sbom_node entries per node_id and prefer a match
from a non-source SBOM to avoid picking up the source's own
external-node entry.

Co-authored-by: Claude <noreply@anthropic.com>
The field is only used in tests for debugging and assertions, not by
production code. Gate it behind #[cfg(test)] to avoid carrying unused
data in release builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
actix_web::Result is just an alias for std::result::Result with a
default error type. Using it with an explicit error argument is
redundant.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ctron ctron force-pushed the feature/atlas_perf_1_mat branch from fa0f7f5 to 97c4294 Compare June 24, 2026 08:07
@ctron ctron marked this pull request as draft June 24, 2026 08:10
@ctron

ctron commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

The idea is to merge this onto main. Including the changes from #2407 ... I'll leave this PR open, and create a new one, targeting main with the other changes as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants