SAMap v2: samap.io onboarding + samap.analysis interpretation by atarashansky · Pull Request #176 · atarashansky/SAMap

atarashansky · 2026-04-30T21:25:27Z

SAMap v2: `samap.io` onboarding + `samap.analysis` interpretation

This PR adds two modules that bracket the core SAMap pipeline: samap.io
makes it easy to get into SAMap (resolving the var_names ↔ FASTA-header
mismatch that is the most common onboarding failure), and
samap.analysis makes it possible to get something out of SAMap (mapping
diagnostics, gene-program decomposition, and a disk-based ontology builder
for many-species corpora).

Both modules were driven by a 21-species, 210-pair pan-metazoan benchmark
(Ctenophora → Mammalia, ~3.9 M cells). See docs/io.md and
docs/analysis.md for full usage guides.

`samap.io` — preparing inputs

detect_id_flavor — regex classifier for var_names namespace
(Ensembl/RefSeq/NCBI GeneID/UniProt/model-org DBs/symbol/unknown).
fetch_proteome — derive a protein FASTA whose headers are your
var_names (Ensembl REST, NCBI Datasets), one record per gene, longest
isoform.
match_fasta — score a header-transform cascade (first-token,
strip-version, gene:=/gene_id= extraction, GTF tx→gene, explicit
mapping) against your var_names and write a renamed FASTA.
run_blast — DIAMOND / MMseqs2 / NCBI BLAST+ with engine auto-detect
(DIAMOND for prot↔prot, MMseqs2 for translated nucl↔nucl with GPU
auto-detect, BLAST+ fallback).
gnnm_from_pairs / homology_from_eggnog — bypass BLAST entirely if
you already have an orthology table.
save_gnnm / load_gnnm — npz cache for the homology graph.
samap CLI — detect-ids, fetch-proteome, match-fasta, blast.
Initialization warning when var_names ↔ BLAST overlap is below 70%
(catches the most common silent failure).

`samap.analysis` — interpreting mappings

get_mapping_scores(..., which_iter=0|'final') — score against the
iter-0 (raw BLAST homology) manifold as well as the converged one.
SAMAP.run now stores obsp['connectivities_iter0'] and
nnm_per_iter[] so the "find" vs "measure" axes are separable.
permutation_null_scores — label-permutation null for mapping scores
(no manifold rerun; cheap empirical p-values per pair).
cluster_to_k / mapping_degeneracy — leiden-to-target-k for
granularity-matched comparison; reciprocal-best fraction, entropy, and
effective-rank summaries of the score matrix. These are the metrics that
track phylogenetic distance — they decay cleanly to ≈550 Mya, while
max_score saturates almost immediately.
gene_modules / module_factored_scores — Leiden-partition the
homology graph and decompose each cell-type alignment by how many
independent gene modules support it. with_coexpression=k is
load-bearing: the BLAST graph alone fragments into ~3,000 size-2
components; bridging it with within-species gene–gene correlation kNN
produces interpretable modules (sarcomere, ciliome, ECM, neural, …).
homology_graph_delta / find_paralog_substitutions — per-edge
sequence-similarity vs expression-correlation residuals; ranked
paralog-substitution candidates without an external orthology DB.
samap.analysis.ontology — disk-based re-scoring of persisted
obsp['connectivities'] against arbitrary per-species label sets
(~0.2 s/pair, no SAM/SAMAP objects); union-graph assembly and Leiden
community detection for many-species cell-type-family discovery;
family_phylogenetic_signal classifies families as "lineage" vs
"program" by within-family score–divergence correlation with
same-study-clade exclusion control.
SAMAP.run(..., joint_weights=True) — experimental: recompute SAM
gene weights on the joint manifold after iteration 1. Off by default;
no behaviour change.

Tests

261 passed / 26 skipped.
1 known pre-existing failure (tests/regression/test_golden_3species):
gnnm_refined mismatches the recorded golden (max abs diff 1.34e−2 vs
tol 1e-4) — hnswlib add_items is non-deterministic across platforms
even with the thin determinism wrapper. Verified to fail identically
on main (8ffeaec); unrelated to samap.io / samap.analysis.
NUMBA_NUM_THREADS pinning hoisted to root tests/conftest.py so it
applies before any test module imports numba.
ruff clean.

Docs

docs/io.md — onboarding guide with workflow diagram and CLI summary.
docs/analysis.md — interpretation guide with workflow diagram,
per-function rationale, and a quick-reference table.
CHANGELOG.md updated under [Unreleased].

Numbers from the benchmark

Used to set defaults and validate API design (not committed to the repo):

21 species, 210 pairs, ~3.9 M cells (CELLxGENE / GEO / org-specific DBs).
838 cell-type clusters (k=40 Leiden), 12,731 cross-species edges,
2,954 reciprocal-best edges.
Permutation null q95 ≈ 0.01–0.05; tile-ability horizon ≈ 90–160 Mya.
39 cross-method consensus cell-type families (RBH ∩ program-expression),
housekeeper filter improves modularity 0.65 → 0.67.

SAMAP.__init__ now logs the per-species fraction of var_names that intersect the homology graph, and emits a WARNING (with 5 example unmatched IDs from each side) when overlap < 30%. This catches the single most common onboarding failure — FASTA-header / var_names namespace mismatch — at the moment both ID sets are in hand. Adds HOMOLOGY_OVERLAP_WARN_THRESHOLD constant and a focused unit test.

…pace Classifies a sample of identifiers as ensembl_{gene,tx,protein}, refseq_{rna,protein}, ncbi_geneid, uniprot, flybase/wormbase/zfin/ mgi/hgnc/wbps, symbol (heuristic), unknown, or mixed. Reports dominant flavor + confidence + per-pattern counts + version-suffix flag + examples. Downstream io.fetch / io.match dispatch on this.

Backends: Ensembl REST (ensembl_gene/tx/protein, batch=50, longest- isoform-per-gene by default, version-suffix stripped for the request but preserved in output headers, tx/protein → gene names map via /lookup/id) and NCBI Datasets v2 (ncbi_geneid → protein.faa, headers parsed back via [GeneID=N]). Output FASTA headers ARE the input ids, so they match adata.var_names by construction. Stdlib urllib only — no new runtime deps. Adds `network` pytest marker; live smoke test passes against rest.ensembl.org.

Runs a transform cascade (identity, first_token, strip_version, uniprot sp|ACC|, lcl| strip, gene[:=]/gene_id[:=], transcript[:=], before/after pipe; plus optional GTF tx→gene and explicit mapping) and reports per-transform overlap fraction. Picks the best, builds the names[] array for SAMAP(names=...), and optionally writes a renamed FASTA whose headers are the var_names (longest seq per var_name). Tested against bundled planarian/schistosome data.

gnnm_from_pairs is the primitive: any (N,2) ortholog table → (csr_matrix, gns, gns_dict) for SAMAP(gnnm=...). homology_from_eggnog parses emapper TSVs, takes OG co-membership at a chosen taxon level (default Eukaryota 2759), and emits cross-species edges via gnnm_from_pairs. Works for non-model species — emapper takes raw sequences. Live-tested on bundled hy/pl/sc eggnog TSVs (358k edges, 0.7s); resulting gnnm plugs straight into SAMAP init. Compara is intentionally not REST-wrapped (one request per gene); the docstring points BioMart bulk-export users at gnnm_from_pairs.

run_blast ports map_genes.sh to Python: all N-choose-2 reciprocal alignments in one call, DIAMOND-first (very-sensitive, falls back to BLAST+ for tblastn/tblastx), -outfmt 6, output layout matches _calculate_blast_graph. Skip-existing, per-direction engine resolution, informative no-aligner error. save_gnnm/load_gnnm round-trip the (gnnm, gns, gns_dict) tuple to a single .npz so SAMAP init can skip the ~38s BLAST-table parse on repeat runs.

…a / blast Thin argparse shim around samap.io. Registers [project.scripts] samap = 'samap.__main__:main'. `samap blast --cache gnnm.npz` runs all reciprocal alignments and writes the npz cache in one shot.

engine='auto' now resolves DIAMOND for protein-DB targets and MMseqs2 for nucleotide-DB targets (--search-type 2 for nucl↔nucl), with NCBI BLAST+ as last resort. MMseqs2 path adds --gpu 1 when nvidia-smi is present (override via run_blast(gpu=...) / `samap blast --gpu/--no-gpu`). DBs are now built lazily per (species, engine). All three engines emit the same 12-col outfmt-6 table that _calculate_blast_graph reads. All aligners are conda-installable: `conda install -c bioconda diamond mmseqs2 blast`. Sensitivity flag maps to MMseqs2 -s (5.7/7.5/8.5).

…generacy, iter-0 scoring, joint_weights flag - get_mapping_scores(..., which_iter): score against any iteration's manifold; SAMAP.run now stores obsp['connectivities_iter0'] + sm.nnm_per_iter[] - analysis/null.py: permutation_null_scores — label-shuffle null, no rerun - analysis/homology_delta.py: per-edge seq-vs-corr residuals + paralog-sub finder - analysis/modules.py: gene_modules (Leiden on hom graph) + module_factored_scores - analysis/degeneracy.py: cluster_to_k (binary-search res→k) + mapping_degeneracy - SAMAP.run(joint_weights=True): experimental cross-species SAM weight recompute - tests: tiny_samap fixture (2-species planted structure, ~15s) + 16 unit tests - fix: hoist NUMBA_NUM_THREADS pinning to root conftest (collection-order race)

…ex order GenePairFinder.find_all lexically sorts each (ct1,ct2) pair by full species-prefixed label, so column-name species order need not match the keys-dict order. Surfaced on real data (nv↔dr: KeyError 'dr_neuronal'); the synthetic fixture's species codes (aa,bb) happen to be lex-ordered so unit tests didn't catch it.

…oexpression The reweighted homology graph is bipartite-per-species-pair (edges only between species), so Leiden fragments it into orthogroup-sized components (~3k size-2 modules on real pairs) and module_factored_scores degenerates to a gene-pair count. Two new densification options: - snn=k: Jaccard shared-nearest-neighbour over the homology adjacency (cheap, sequence-driven; insufficient alone — components stay disconnected) - with_coexpression=k: per-species gene-gene Pearson kNN on kNN-averaged expression (bridges within-species; produces program-sized modules, ~30-40 modules of median size ~300 on ml↔mm / nv↔mm) This is the gene-level analogue of SAMap's cell-level cross∪within-kNN manifold. With coexpr densification, ml:Colloblast_1→mm:cardiac is 76% one module (the sarcomere program: Myh*/Myl*/Tnn*/Ttn/Actc1/Ryr2/Nkx2-5) while nv:retractor→mm:cardiac is 26% top across 5 modules — the single-program-vs-multi-program discrimination the function was designed for now reads out.

… builder persist_pair / score_from_connectivities / build_union_graph / cluster_families. Recomputes get_mapping_scores' output from a persisted obsp + per-species labels in ~0.2s/pair, no SAM/SAMAP objects in memory. Validated to r=0.9999 over 105 corpus pairs. The persisted obs.parquet carries the species column (do NOT infer species from obs-name prefix — not guaranteed); the connectivities matrix is directed kNN, so both X[ia,ib] and X[ib,ia] are read separately (transpose is wrong when cell counts differ). Enables: rebuild a 600-node 15-species cell-type union graph at any label set in <30s; Leiden on RBH-only edges → cell-type families; housekeeper-edge filtering via the global gene-module decomposition (to be added in a follow-up) raises modularity 0.55→0.61 and gives each family a defining gene program (sarcomere, ciliated, ECM, lysosomal, ...).

…core_from_connectivities SAMAP's combined adata calls obs_names_make_unique(), which appends -1/-2/... to obs-names that collide across species (common when two 10x datasets share raw barcodes). score_from_connectivities reindexes user-supplied per-species labels against the *persisted* (suffixed) obs-names, so a direct reindex misses those cells. Now retry missed lookups with the trailing -N stripped (only for cells that missed on direct lookup, so legitimate '-1' suffixes are preserved). Found on the 21-species corpus: blsy had 4904 sy cells (WASP barcodes colliding with bl) and chha had 13602 ch cells unmatched.

…ram classifier For each cell-type family from cluster_families, correlate the per-species- pair within-family max alignment score with phylogenetic divergence. A family with negative ρ behaves like a conserved cell-type lineage; positive ρ marks a generic-program bin (deeper pairs score higher because the program is the only signal left). exclude_pairs control removes same-study clades that otherwise dominate the rank correlation. On the 21-species/210-pair corpus this classifies 4 families lineage-like (muscle, cnidocyte/secretory, gastrodermis, proliferative; ρ_ex −0.24 to −0.50), 3 program-bins (fibre/ECM, epithelial, gland-like; ρ_ex +0.16 to +0.29), 18 ambiguous. Validates to ρ=1.0000 against the in-kernel reference. 0.5s for 25 families × 12,731 edges.

Mirrors docs/io.md: workflow diagram, per-function rationale, and a quick-reference table covering iter-0 vs converged scores, permutation null, degeneracy/tile-ability, module-factored decomposition, paralog substitutions, disk-based re-scoring, and lineage-vs-program family classification. Updates CHANGELOG to reference the 21-species 210-pair benchmark and the new doc.

Replace recalled '~3,000 size-2 components' with measured numbers from the 3-species mm-hs-dr graph (6,697 components: 1,295 singletons, ~3,000 size 2-3, one giant of ~20k genes). The qualitative point (coexpression bridging is load-bearing) is unchanged.

operon added 15 commits April 30, 2026 20:58

feat(cli): add samap CLI — detect-ids / fetch-proteome / match-fast…

5285247

…a / blast Thin argparse shim around samap.io. Registers [project.scripts] samap = 'samap.__main__:main'. `samap blast --cache gnnm.npz` runs all reciprocal alignments and writes the npz cache in one shot.

docs: add docs/io.md onboarding guide; CHANGELOG + README for samap.io

fabf7f3

atarashansky force-pushed the io-onboarding branch from 08fa2df to 10a6852 Compare May 4, 2026 06:33

atarashansky changed the title ~~samap.io — lower the BLAST onboarding wall~~ SAMap v2: samap.io onboarding + samap.analysis interpretation May 5, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SAMap v2: samap.io onboarding + samap.analysis interpretation#176

SAMap v2: samap.io onboarding + samap.analysis interpretation#176
atarashansky wants to merge 17 commits into
mainfrom
io-onboarding

atarashansky commented Apr 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

atarashansky commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!