demo, maybe: Subset lineage YAML files to observed values#6287

Draft
theosanderson wants to merge 1 commit into main from claude/filter-silo-lineage-import-tMpvU
@theosanderson (Member) commented Apr 16, 2026

Everything below was written by Claude and is not intended for real use.

Description

This PR subsets lineage definition YAML files so that they are smaller and faster to load. Instead of loading complete lineage hierarchies, SILO now loads only the lineages actually present in the data, along with the parent lineages needed for hierarchy resolution.

Key Changes

  1. Lineage YAML Subsetting (lineage.py)

    • Added subset_lineage_yaml() function that filters lineage definitions to only include:
      • Lineages whose canonical names or aliases appear in the data
      • All transitive parent lineages (to preserve hierarchy queries)
    • Handles alias resolution to map aliases back to canonical names
    • Returns statistics on how many entries were kept vs. total
  2. Database Config Parsing (lineage.py)

    • Added read_lineage_field_mapping() to extract metadata field-to-lineage-system mappings from SILO's database config
    • Parses generateLineageIndex attributes to identify which metadata fields hold lineage values
    • Gracefully handles missing or malformed configs
  3. Data Scanning (decompressor.py)

    • Enhanced analyze_ndjson() to collect unique lineage values observed in records
    • Takes optional lineage_field_mapping parameter to know which fields to scan
    • Returns collected values per lineage system in NdjsonAnalysis.lineage_values
  4. Integration (runner.py, download_manager.py)

    • Updated download pipeline to pass lineage field mapping through to data analysis
    • Modified update_lineage_definitions() to accept observed lineage values and subset accordingly
    • Falls back to writing full upstream file if no field mapping is available (preserves backward compatibility)
  5. Dependencies

    • Added PyYAML dependency for YAML parsing and generation
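
As a rough illustration of the config parsing in item 2, the sketch below extracts a field-to-lineage-system mapping from an already-parsed database config. It is not the PR's `read_lineage_field_mapping()`: the function name `lineage_field_mapping`, the `schema.metadata` layout, and the assumption that `generateLineageIndex` holds the lineage system name as a string are all hypothetical.

```python
def lineage_field_mapping(config):
    """Map metadata field name -> lineage system name.

    Assumes (hypothetically) a parsed config shaped like
    {"schema": {"metadata": [{"name": ..., "generateLineageIndex": ...}]}}.
    Returns an empty dict for missing or malformed configs.
    """
    mapping = {}
    schema = config.get("schema") if isinstance(config, dict) else None
    metadata = schema.get("metadata") if isinstance(schema, dict) else None
    if not isinstance(metadata, list):
        return mapping
    for field in metadata:
        if not isinstance(field, dict):
            continue  # skip malformed entries instead of failing
        name = field.get("name")
        system = field.get("generateLineageIndex")
        # Only non-empty string values are treated as lineage system names.
        if isinstance(name, str) and isinstance(system, str) and system:
            mapping[name] = system
    return mapping


example_config = {
    "schema": {
        "metadata": [
            {"name": "lineage", "type": "string", "generateLineageIndex": "pango"},
            {"name": "country", "type": "string"},
            "not-a-dict",
        ]
    }
}
example_mapping = lineage_field_mapping(example_config)
```

Returning an empty mapping for a malformed config keeps the fallback path simple: with no mapping, the caller can write the upstream file unchanged.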

Testing

Comprehensive test coverage added in test_lineage.py:

  • Unit tests for subset_lineage_yaml() covering alias resolution, parent traversal, and edge cases
  • Unit tests for read_lineage_field_mapping() with various config structures
  • Integration tests verifying the full pipeline from data import to subsetted lineage files
  • End-to-end test demonstrating the complete workflow

All existing tests updated to work with the refactored _download_lineage_text() function signature.

Backward Compatibility

The changes are fully backward compatible:

  • If no database config is available, lineage files are written unchanged
  • If a lineage system has no metadata field referencing it, the full upstream file is used
  • Existing configurations without lineage definitions continue to work as before

https://claude.ai/code/session_01NqAoS8sLdfSzTSz9ooo5zj

🚀 Preview: https://claude-filter-silo-lineag.loculus.org

When downloading the lineage system definition file, the SILO importer
now reduces it to only the entries actually referenced by the data,
together with the transitive closure of their parents. Aliases are
resolved to their canonical names, and unknown values are dropped.
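
The subsetting described above can be sketched as follows. This is a minimal illustration, not the PR's `subset_lineage_yaml()`: the function name `subset_definitions` is hypothetical, and it assumes the lineage file has already been parsed into a dict keyed by canonical name, where each entry may carry `aliases` and `parents` lists.

```python
from collections import deque


def subset_definitions(definitions, observed):
    """Return (subset, kept, total) for the observed lineage values.

    `definitions` is assumed to look like
    {"A.1": {"aliases": ["X"], "parents": ["A"]}, ...}.
    """
    # Canonical names first, so an alias can never shadow a real name.
    to_canonical = {name: name for name in definitions}
    for name, entry in definitions.items():
        for alias in (entry or {}).get("aliases") or []:
            to_canonical.setdefault(alias, name)

    keep = set()
    queue = deque()
    for value in observed:
        canonical = to_canonical.get(value)
        if canonical is not None and canonical not in keep:  # unknown values are dropped
            keep.add(canonical)
            queue.append(canonical)

    # Breadth-first walk up the parent links: the transitive closure.
    while queue:
        name = queue.popleft()
        for parent in (definitions[name] or {}).get("parents") or []:
            parent = to_canonical.get(parent, parent)
            if parent in definitions and parent not in keep:
                keep.add(parent)
                queue.append(parent)

    # Preserve the upstream ordering of entries.
    subset = {name: entry for name, entry in definitions.items() if name in keep}
    return subset, len(subset), len(definitions)


example_definitions = {
    "A": {"aliases": [], "parents": []},
    "A.1": {"aliases": ["X"], "parents": ["A"]},
    "A.1.1": {"aliases": [], "parents": ["A.1"]},
    "B": {"aliases": [], "parents": []},
}
example_subset, kept, total = subset_definitions(example_definitions, {"X", "unknown"})
```

In this example the alias `X` resolves to `A.1`, its parent `A` is pulled in, and the unobserved `A.1.1` and `B` are left out.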

The importer reads the SILO database config to discover which metadata
fields use each lineage system (via ``generateLineageIndex``), scans
the freshly downloaded NDJSON for the unique values appearing in those
fields, and rewrites each ``<lineage>.yaml`` to the minimal subtree
SILO needs. Lineage systems that no metadata field references fall
back to writing the upstream file unchanged so existing setups keep
working.
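
The NDJSON scan can be sketched in a similar way. This is an illustration only, not the PR's `analyze_ndjson()`: the function name `collect_lineage_values` is hypothetical, and it assumes each line is a JSON object with the metadata fields as top-level keys.

```python
import json
from collections import defaultdict


def collect_lineage_values(lines, field_to_system):
    """Collect the unique values seen per lineage system.

    `field_to_system` maps a metadata field name to the lineage system it
    uses, for example {"lineage": "pango"}.
    """
    values = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        for field, system in field_to_system.items():
            value = record.get(field)
            # Ignore missing, null, empty, and non-string values.
            if isinstance(value, str) and value:
                values[system].add(value)
    return dict(values)


example_lines = [
    '{"lineage": "A.1", "country": "CH"}',
    "",
    '{"lineage": "B"}',
    '{"lineage": null}',
    '{"lineage": "A.1"}',
]
example_values = collect_lineage_values(example_lines, {"lineage": "pango"})
```

Because the function consumes any iterable of lines, it can run over a streamed, decompressed file without holding all records in memory.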
@claude claude bot added the performance label Apr 16, 2026
@theosanderson theosanderson changed the title from "Subset lineage YAML files to observed values" to "demo, maybe: Subset lineage YAML files to observed values" Apr 16, 2026
@theosanderson theosanderson added the preview label (Triggers a deployment to argocd) Apr 16, 2026