SAMap v2: samap.io onboarding + samap.analysis interpretation#176
Open
atarashansky wants to merge 17 commits into
Open
SAMap v2: samap.io onboarding + samap.analysis interpretation#176atarashansky wants to merge 17 commits into
atarashansky wants to merge 17 commits into
Conversation
added 15 commits
April 30, 2026 20:58
SAMAP.__init__ now logs the per-species fraction of var_names that intersect the homology graph, and emits a WARNING (with 5 example unmatched IDs from each side) when overlap < 30%. This catches the single most common onboarding failure — FASTA-header / var_names namespace mismatch — at the moment both ID sets are in hand. Adds HOMOLOGY_OVERLAP_WARN_THRESHOLD constant and a focused unit test.
…pace
Classifies a sample of identifiers as ensembl_{gene,tx,protein},
refseq_{rna,protein}, ncbi_geneid, uniprot, flybase/wormbase/zfin/
mgi/hgnc/wbps, symbol (heuristic), unknown, or mixed. Reports
dominant flavor + confidence + per-pattern counts + version-suffix
flag + examples. Downstream io.fetch / io.match dispatch on this.
Backends: Ensembl REST (ensembl_gene/tx/protein, batch=50, longest- isoform-per-gene by default, version-suffix stripped for the request but preserved in output headers, tx/protein → gene names map via /lookup/id) and NCBI Datasets v2 (ncbi_geneid → protein.faa, headers parsed back via [GeneID=N]). Output FASTA headers ARE the input ids, so they match adata.var_names by construction. Stdlib urllib only — no new runtime deps. Adds `network` pytest marker; live smoke test passes against rest.ensembl.org.
Runs a transform cascade (identity, first_token, strip_version, uniprot sp|ACC|, lcl| strip, gene[:=]/gene_id[:=], transcript[:=], before/after pipe; plus optional GTF tx→gene and explicit mapping) and reports per-transform overlap fraction. Picks the best, builds the names[] array for SAMAP(names=...), and optionally writes a renamed FASTA whose headers are the var_names (longest seq per var_name). Tested against bundled planarian/schistosome data.
gnnm_from_pairs is the primitive: any (N,2) ortholog table → (csr_matrix, gns, gns_dict) for SAMAP(gnnm=...). homology_from_eggnog parses emapper TSVs, takes OG co-membership at a chosen taxon level (default Eukaryota 2759), and emits cross-species edges via gnnm_from_pairs. Works for non-model species — emapper takes raw sequences. Live-tested on bundled hy/pl/sc eggnog TSVs (358k edges, 0.7s); resulting gnnm plugs straight into SAMAP init. Compara is intentionally not REST-wrapped (one request per gene); the docstring points BioMart bulk-export users at gnnm_from_pairs.
run_blast ports map_genes.sh to Python: all N-choose-2 reciprocal alignments in one call, DIAMOND-first (very-sensitive, falls back to BLAST+ for tblastn/tblastx), -outfmt 6, output layout matches _calculate_blast_graph. Skip-existing, per-direction engine resolution, informative no-aligner error. save_gnnm/load_gnnm round-trip the (gnnm, gns, gns_dict) tuple to a single .npz so SAMAP init can skip the ~38s BLAST-table parse on repeat runs.
…a / blast Thin argparse shim around samap.io. Registers [project.scripts] samap = 'samap.__main__:main'. `samap blast --cache gnnm.npz` runs all reciprocal alignments and writes the npz cache in one shot.
engine='auto' now resolves DIAMOND for protein-DB targets and MMseqs2 for nucleotide-DB targets (--search-type 2 for nucl↔nucl), with NCBI BLAST+ as last resort. MMseqs2 path adds --gpu 1 when nvidia-smi is present (override via run_blast(gpu=...) / `samap blast --gpu/--no-gpu`). DBs are now built lazily per (species, engine). All three engines emit the same 12-col outfmt-6 table that _calculate_blast_graph reads. All aligners are conda-installable: `conda install -c bioconda diamond mmseqs2 blast`. Sensitivity flag maps to MMseqs2 -s (5.7/7.5/8.5).
…generacy, iter-0 scoring, joint_weights flag - get_mapping_scores(..., which_iter): score against any iteration's manifold; SAMAP.run now stores obsp['connectivities_iter0'] + sm.nnm_per_iter[] - analysis/null.py: permutation_null_scores — label-shuffle null, no rerun - analysis/homology_delta.py: per-edge seq-vs-corr residuals + paralog-sub finder - analysis/modules.py: gene_modules (Leiden on hom graph) + module_factored_scores - analysis/degeneracy.py: cluster_to_k (binary-search res→k) + mapping_degeneracy - SAMAP.run(joint_weights=True): experimental cross-species SAM weight recompute - tests: tiny_samap fixture (2-species planted structure, ~15s) + 16 unit tests - fix: hoist NUMBA_NUM_THREADS pinning to root conftest (collection-order race)
…ex order GenePairFinder.find_all lexically sorts each (ct1,ct2) pair by full species-prefixed label, so column-name species order need not match the keys-dict order. Surfaced on real data (nv↔dr: KeyError 'dr_neuronal'); the synthetic fixture's species codes (aa,bb) happen to be lex-ordered so unit tests didn't catch it.
…oexpression The reweighted homology graph is bipartite-per-species-pair (edges only between species), so Leiden fragments it into orthogroup-sized components (~3k size-2 modules on real pairs) and module_factored_scores degenerates to a gene-pair count. Two new densification options: - snn=k: Jaccard shared-nearest-neighbour over the homology adjacency (cheap, sequence-driven; insufficient alone — components stay disconnected) - with_coexpression=k: per-species gene-gene Pearson kNN on kNN-averaged expression (bridges within-species; produces program-sized modules, ~30-40 modules of median size ~300 on ml↔mm / nv↔mm) This is the gene-level analogue of SAMap's cell-level cross∪within-kNN manifold. With coexpr densification, ml:Colloblast_1→mm:cardiac is 76% one module (the sarcomere program: Myh*/Myl*/Tnn*/Ttn/Actc1/Ryr2/Nkx2-5) while nv:retractor→mm:cardiac is 26% top across 5 modules — the single-program-vs-multi-program discrimination the function was designed for now reads out.
… builder persist_pair / score_from_connectivities / build_union_graph / cluster_families. Recomputes get_mapping_scores' output from a persisted obsp + per-species labels in ~0.2s/pair, no SAM/SAMAP objects in memory. Validated to r=0.9999 over 105 corpus pairs. The persisted obs.parquet carries the species column (do NOT infer species from obs-name prefix — not guaranteed); the connectivities matrix is directed kNN, so both X[ia,ib] and X[ib,ia] are read separately (transpose is wrong when cell counts differ). Enables: rebuild a 600-node 15-species cell-type union graph at any label set in <30s; Leiden on RBH-only edges → cell-type families; housekeeper-edge filtering via the global gene-module decomposition (to be added in a follow-up) raises modularity 0.55→0.61 and gives each family a defining gene program (sarcomere, ciliated, ECM, lysosomal, ...).
…core_from_connectivities SAMAP's combined adata calls obs_names_make_unique(), which appends -1/-2/... to obs-names that collide across species (common when two 10x datasets share raw barcodes). score_from_connectivities reindexes user-supplied per-species labels against the *persisted* (suffixed) obs-names, so a direct reindex misses those cells. Now retry missed lookups with the trailing -N stripped (only for cells that missed on direct lookup, so legitimate '-1' suffixes are preserved). Found on the 21-species corpus: blsy had 4904 sy cells (WASP barcodes colliding with bl) and chha had 13602 ch cells unmatched.
…ram classifier For each cell-type family from cluster_families, correlate the per-species- pair within-family max alignment score with phylogenetic divergence. A family with negative ρ behaves like a conserved cell-type lineage; positive ρ marks a generic-program bin (deeper pairs score higher because the program is the only signal left). exclude_pairs control removes same-study clades that otherwise dominate the rank correlation. On the 21-species/210-pair corpus this classifies 4 families lineage-like (muscle, cnidocyte/secretory, gastrodermis, proliferative; ρ_ex −0.24 to −0.50), 3 program-bins (fibre/ECM, epithelial, gland-like; ρ_ex +0.16 to +0.29), 18 ambiguous. Validates to ρ=1.0000 against the in-kernel reference. 0.5s for 25 families × 12,731 edges.
08fa2df to
10a6852
Compare
Mirrors docs/io.md: workflow diagram, per-function rationale, and a quick-reference table covering iter-0 vs converged scores, permutation null, degeneracy/tile-ability, module-factored decomposition, paralog substitutions, disk-based re-scoring, and lineage-vs-program family classification. Updates CHANGELOG to reference the 21-species 210-pair benchmark and the new doc.
Replace recalled '~3,000 size-2 components' with measured numbers from the 3-species mm-hs-dr graph (6,697 components: 1,295 singletons, ~3,000 size 2-3, one giant of ~20k genes). The qualitative point (coexpression bridging is load-bearing) is unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
SAMap v2:
samap.ioonboarding +samap.analysisinterpretationThis PR adds two modules that bracket the core SAMap pipeline:
samap.iomakes it easy to get into SAMap (resolving the var_names ↔ FASTA-header
mismatch that is the most common onboarding failure), and
samap.analysismakes it possible to get something out of SAMap (mappingdiagnostics, gene-program decomposition, and a disk-based ontology builder
for many-species corpora).
Both modules were driven by a 21-species, 210-pair pan-metazoan benchmark
(Ctenophora → Mammalia, ~3.9 M cells). See
docs/io.mdanddocs/analysis.mdfor full usage guides.samap.io— preparing inputsdetect_id_flavor— regex classifier forvar_namesnamespace(Ensembl/RefSeq/NCBI GeneID/UniProt/model-org DBs/symbol/unknown).
fetch_proteome— derive a protein FASTA whose headers are yourvar_names (Ensembl REST, NCBI Datasets), one record per gene, longest
isoform.
match_fasta— score a header-transform cascade (first-token,strip-version,
gene:=/gene_id=extraction, GTF tx→gene, explicitmapping) against your var_names and write a renamed FASTA.
run_blast— DIAMOND / MMseqs2 / NCBI BLAST+ with engine auto-detect(DIAMOND for prot↔prot, MMseqs2 for translated nucl↔nucl with GPU
auto-detect, BLAST+ fallback).
gnnm_from_pairs/homology_from_eggnog— bypass BLAST entirely ifyou already have an orthology table.
save_gnnm/load_gnnm— npz cache for the homology graph.samapCLI —detect-ids,fetch-proteome,match-fasta,blast.var_names ↔ BLASToverlap is below 70%(catches the most common silent failure).
samap.analysis— interpreting mappingsget_mapping_scores(..., which_iter=0|'final')— score against theiter-0 (raw BLAST homology) manifold as well as the converged one.
SAMAP.runnow storesobsp['connectivities_iter0']andnnm_per_iter[]so the "find" vs "measure" axes are separable.permutation_null_scores— label-permutation null for mapping scores(no manifold rerun; cheap empirical p-values per pair).
cluster_to_k/mapping_degeneracy— leiden-to-target-k forgranularity-matched comparison; reciprocal-best fraction, entropy, and
effective-rank summaries of the score matrix. These are the metrics that
track phylogenetic distance — they decay cleanly to ≈550 Mya, while
max_scoresaturates almost immediately.gene_modules/module_factored_scores— Leiden-partition thehomology graph and decompose each cell-type alignment by how many
independent gene modules support it.
with_coexpression=kisload-bearing: the BLAST graph alone fragments into ~3,000 size-2
components; bridging it with within-species gene–gene correlation kNN
produces interpretable modules (sarcomere, ciliome, ECM, neural, …).
homology_graph_delta/find_paralog_substitutions— per-edgesequence-similarity vs expression-correlation residuals; ranked
paralog-substitution candidates without an external orthology DB.
samap.analysis.ontology— disk-based re-scoring of persistedobsp['connectivities']against arbitrary per-species label sets(~0.2 s/pair, no SAM/SAMAP objects); union-graph assembly and Leiden
community detection for many-species cell-type-family discovery;
family_phylogenetic_signalclassifies families as "lineage" vs"program" by within-family score–divergence correlation with
same-study-clade exclusion control.
SAMAP.run(..., joint_weights=True)— experimental: recompute SAMgene weights on the joint manifold after iteration 1. Off by default;
no behaviour change.
Tests
tests/regression/test_golden_3species):gnnm_refinedmismatches the recorded golden (max abs diff 1.34e−2 vstol 1e-4) — hnswlib
add_itemsis non-deterministic across platformseven with the thin determinism wrapper. Verified to fail identically
on
main(8ffeaec); unrelated tosamap.io/samap.analysis.NUMBA_NUM_THREADSpinning hoisted to roottests/conftest.pyso itapplies before any test module imports numba.
ruffclean.Docs
docs/io.md— onboarding guide with workflow diagram and CLI summary.docs/analysis.md— interpretation guide with workflow diagram,per-function rationale, and a quick-reference table.
CHANGELOG.mdupdated under[Unreleased].Numbers from the benchmark
Used to set defaults and validate API design (not committed to the repo):
2,954 reciprocal-best edges.
housekeeper filter improves modularity 0.65 → 0.67.