feat(taxonomy): Host organism categories endpoint #6272
Draft
maverbiest wants to merge 7 commits into main from
As a follow-up to host validation, we would like to add functionality to loculus to assign host organisms to configurable categories.
These categories could then be used for filtering sequences in the front-end; e.g., it would be nice for arboviruses if people are able to select "Mosquito" from a drop-down menu to get all sequences that were obtained from any mosquito species.
Implementation & configuration
This PR adds a `/taxa/{tax_id}/host-categories` endpoint to the taxonomy-service. This endpoint returns a list of host category labels that apply to the provided `tax_id`.
The labels are configured like this (obviously not labels we'd use):
The keys in `organism_categories` are NCBI taxon IDs; the values are host category labels. When provided with a `tax_id`, the endpoint will return all labels associated with taxa in `organism_categories` that are an ancestor of `tax_id`.
For this to be usable in loculus/pathoplexus, we'd need to add functionality to assign host categories to each sequence during preprocessing (would be a separate PR).
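To make the lookup rule concrete, here is a minimal sketch of the ancestor walk, not the code in this PR. Everything in it is an assumption for illustration: the `PARENT` map is a tiny hand-written stand-in for the NCBI taxonomy with intermediate ranks skipped, the `ORGANISM_CATEGORIES` mapping and the function name `host_categories` are hypothetical, and only the "I'm a cellular organism!" label is taken from the usage output in this description. The sketch also assumes that a taxon counts as its own ancestor and that labels come back ordered from nearest to most distant taxon; the real endpoint may behave differently on both points.

```python
# Hypothetical child -> parent map standing in for the NCBI taxonomy.
# Intermediate ranks are skipped; only the nodes needed here are listed.
PARENT = {
    7159: 7157,     # Aedes aegypti -> Culicidae
    7157: 2759,     # Culicidae -> Eukaryota
    9606: 2759,     # Homo sapiens -> Eukaryota
    2759: 131567,   # Eukaryota -> cellular organisms
    632: 2,         # Yersinia pestis -> Bacteria
    2: 131567,      # Bacteria -> cellular organisms
    131567: 1,      # cellular organisms -> root
}

# Hypothetical category configuration: NCBI taxon ID -> host category label.
ORGANISM_CATEGORIES = {
    7157: "Mosquito",
    2759: "Eukaryote",
    131567: "I'm a cellular organism!",
}


def host_categories(tax_id, parent=PARENT, categories=ORGANISM_CATEGORIES):
    """Collect labels of all configured taxa on the path from tax_id to the root.

    Assumption: the queried taxon itself is included in the walk.
    """
    labels = []
    seen = set()  # guards against cycles, e.g. a root that is its own parent
    current = tax_id
    while current is not None and current not in seen:
        seen.add(current)
        if current in categories:
            labels.append(categories[current])
        current = parent.get(current)  # None once we step past the root
    return labels


if __name__ == "__main__":
    for taxon in (7159, 9606, 632, 1):
        print(taxon, host_categories(taxon))
```

Walking upwards from the query taxon keeps each lookup proportional to the depth of the lineage rather than to the number of configured categories, which is one reason a sketch like this stays cheap even with many labels.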
Usage
When using the example configuration shown above, the following results are returned (taxonomy-service running locally in a docker container):
Aedes aegypti (7159) and Culex pipiens (7157) are mosquitos:
Humans (9606) are not mosquitos:
Yersinia pestis (632) is not a eukaryote:
➜ taxonomy_service git:(host-organism-categories) ✗ curl localhost:5000/taxa/632/host-categories
["I'm a cellular organism!"]
The root is nothing:
Alternatives
An alternative way to get this functionality may be to use custom taxonomic lineage files and filter via SILO/LAPIS. This is something I'm scoping out currently.
PR Checklist
🚀 Preview: Add `preview` label to enable