demo, maybe: Subset lineage YAML files to observed values #6287
Draft
theosanderson wants to merge 1 commit into main
When downloading the lineage system definition file, the SILO importer now reduces it to only the entries actually referenced by the data, together with the transitive closure of their parents. Aliases are resolved to their canonical names, and unknown values are dropped. The importer reads the SILO database config to discover which metadata fields use each lineage system (via `generateLineageIndex`), scans the freshly downloaded NDJSON for the unique values appearing in those fields, and rewrites each `<lineage>.yaml` to the minimal subtree SILO needs. Lineage systems that no metadata field references fall back to writing the upstream file unchanged so existing setups keep working.
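The subsetting step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `subset_lineages` is hypothetical, and it assumes the definitions have already been parsed from YAML into a dict of the form `{name: {"aliases": [...], "parents": [...]}}`.

```python
from collections import deque


def subset_lineages(definitions, observed):
    """Keep only observed lineages plus all of their ancestors.

    definitions: {name: {"aliases": [...], "parents": [...]}} (assumed layout)
    observed: iterable of lineage values seen in the data (names or aliases)
    """
    # Map every alias to its canonical name so observed aliases can be resolved.
    alias_to_name = {}
    for name, entry in definitions.items():
        for alias in (entry or {}).get("aliases") or []:
            alias_to_name[alias] = name

    def canonical(value):
        # Canonical names win; otherwise try the alias table; else None.
        return value if value in definitions else alias_to_name.get(value)

    keep = set()
    queue = deque()
    for value in observed:
        name = canonical(value)
        if name is None:
            continue  # unknown value: dropped
        if name not in keep:
            keep.add(name)
            queue.append(name)

    # Follow parent links until no new ancestors appear (transitive closure).
    while queue:
        current = queue.popleft()
        for parent in (definitions[current] or {}).get("parents") or []:
            parent_name = canonical(parent)
            if parent_name is not None and parent_name not in keep:
                keep.add(parent_name)
                queue.append(parent_name)

    # Preserve the upstream ordering of the surviving entries.
    return {name: definitions[name] for name in definitions if name in keep}
```

Working on a parsed dict keeps the sketch independent of any particular YAML library; the real importer would still need to parse the upstream file and serialise the result back to YAML.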
Everything below was written by Claude and is not intended for real use.
Description
This PR implements intelligent subsetting of lineage definition YAML files to reduce their size and improve performance. Instead of loading complete lineage hierarchies, SILO now only loads the lineages actually present in the data, along with their parent lineages needed for hierarchy resolution.
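To illustrate how the "lineages actually present in the data" might be discovered, here is a rough sketch of the two preparatory steps: reading the field mapping from the database config, then scanning NDJSON lines. The function names are hypothetical, and the layouts are assumptions: a parsed config with a `schema.metadata` list whose entries carry `name` and a string-valued `generateLineageIndex`, and NDJSON records whose metadata sits either under a `metadata` key or at the top level. Decompression of the downloaded file is omitted.

```python
import json
from collections import defaultdict


def lineage_field_mapping(database_config):
    """Return {field_name: lineage_system} from a parsed database config."""
    mapping = {}
    for field in database_config.get("schema", {}).get("metadata", []):
        system = field.get("generateLineageIndex")
        # Assumption: only a non-empty string names a lineage system.
        if isinstance(system, str) and system:
            mapping[field["name"]] = system
    return mapping


def collect_lineage_values(lines, mapping):
    """Return {lineage_system: set of values} seen in an iterable of NDJSON lines."""
    observed = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Assumption: metadata is nested under "metadata"; else use the record itself.
        metadata = record.get("metadata", record)
        if not isinstance(metadata, dict):
            continue
        for field, system in mapping.items():
            value = metadata.get(field)
            if isinstance(value, str) and value:
                observed[system].add(value)
    return dict(observed)
```

Keying the result by lineage system rather than by field means two metadata fields that share one lineage system contribute to a single set of values.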
Key Changes
Lineage YAML Subsetting (`lineage.py`)
- `subset_lineage_yaml()` function that filters lineage definitions to only include the lineages observed in the data and the transitive closure of their parents
Database Config Parsing (`lineage.py`)
- `read_lineage_field_mapping()` to extract metadata field-to-lineage-system mappings from SILO's database config
- Reads `generateLineageIndex` attributes to identify which metadata fields hold lineage values
Data Scanning (`decompressor.py`)
- `analyze_ndjson()` to collect unique lineage values observed in records
- `lineage_field_mapping` parameter to know which fields to scan
- `NdjsonAnalysis.lineage_values` holds the collected values
Integration (`runner.py`, `download_manager.py`)
- `update_lineage_definitions()` to accept observed lineage values and subset accordingly
Dependencies
Testing
Comprehensive test coverage added in `test_lineage.py`:
- `subset_lineage_yaml()` covering alias resolution, parent traversal, and edge cases
- `read_lineage_field_mapping()` with various config structures
All existing tests updated to work with the refactored `_download_lineage_text()` function signature.
Backward Compatibility
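The changes are fully backward compatible: lineage systems that no metadata field references fall back to writing the upstream file unchanged, so existing setups keep working.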
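The fallback for lineage systems that no metadata field references could be expressed as a small decision like the one below. This is only a sketch under assumptions: `choose_lineage_text` and the `subset` callable are hypothetical names, and using `None` to mean "no field references this system" is a choice made here for illustration.

```python
def choose_lineage_text(upstream_text, observed_values, subset):
    """Pick the text to write for one lineage system.

    observed_values: None when no metadata field references the system,
        otherwise the (possibly empty) set of values seen in the data.
    subset: hypothetical callable (text, values) -> subsetted text.
    """
    if observed_values is None:
        # No field uses this system: write the upstream file unchanged.
        return upstream_text
    return subset(upstream_text, observed_values)
```

Distinguishing `None` from an empty set keeps "no field uses this system" (write the full file) separate from "a field uses it but no values were observed" (write an empty subset).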
https://claude.ai/code/session_01NqAoS8sLdfSzTSz9ooo5zj
🚀 Preview: https://claude-filter-silo-lineag.loculus.org