Skip to content

SAMap v2: samap.io onboarding + samap.analysis interpretation#176

Open
atarashansky wants to merge 17 commits into
mainfrom
io-onboarding
Open

SAMap v2: samap.io onboarding + samap.analysis interpretation#176
atarashansky wants to merge 17 commits into
mainfrom
io-onboarding

Conversation

@atarashansky
Copy link
Copy Markdown
Owner

@atarashansky atarashansky commented Apr 30, 2026

SAMap v2: samap.io onboarding + samap.analysis interpretation

This PR adds two modules that bracket the core SAMap pipeline: samap.io
makes it easy to get into SAMap (resolving the var_names ↔ FASTA-header
mismatch that is the most common onboarding failure), and
samap.analysis makes it possible to get something out of SAMap (mapping
diagnostics, gene-program decomposition, and a disk-based ontology builder
for many-species corpora).

Both modules were driven by a 21-species, 210-pair pan-metazoan benchmark
(Ctenophora → Mammalia, ~3.9 M cells). See docs/io.md and
docs/analysis.md for full usage guides.

samap.io — preparing inputs

  • detect_id_flavor — regex classifier for var_names namespace
    (Ensembl/RefSeq/NCBI GeneID/UniProt/model-org DBs/symbol/unknown).
  • fetch_proteome — derive a protein FASTA whose headers are your
    var_names (Ensembl REST, NCBI Datasets), one record per gene, longest
    isoform.
  • match_fasta — score a header-transform cascade (first-token,
    strip-version, gene:=/gene_id= extraction, GTF tx→gene, explicit
    mapping) against your var_names and write a renamed FASTA.
  • run_blast — DIAMOND / MMseqs2 / NCBI BLAST+ with engine auto-detect
    (DIAMOND for prot↔prot, MMseqs2 for translated nucl↔nucl with GPU
    auto-detect, BLAST+ fallback).
  • gnnm_from_pairs / homology_from_eggnog — bypass BLAST entirely if
    you already have an orthology table.
  • save_gnnm / load_gnnm — npz cache for the homology graph.
  • samap CLI — detect-ids, fetch-proteome, match-fasta, blast.
  • Initialization warning when var_names ↔ BLAST overlap is below 70%
    (catches the most common silent failure).

samap.analysis — interpreting mappings

  • get_mapping_scores(..., which_iter=0|'final') — score against the
    iter-0 (raw BLAST homology) manifold as well as the converged one.
    SAMAP.run now stores obsp['connectivities_iter0'] and
    nnm_per_iter[] so the "find" vs "measure" axes are separable.
  • permutation_null_scores — label-permutation null for mapping scores
    (no manifold rerun; cheap empirical p-values per pair).
  • cluster_to_k / mapping_degeneracy — leiden-to-target-k for
    granularity-matched comparison; reciprocal-best fraction, entropy, and
    effective-rank summaries of the score matrix. These are the metrics that
    track phylogenetic distance — they decay cleanly to ≈550 Mya, while
    max_score saturates almost immediately.
  • gene_modules / module_factored_scores — Leiden-partition the
    homology graph and decompose each cell-type alignment by how many
    independent gene modules support it. with_coexpression=k is
    load-bearing: the BLAST graph alone fragments into ~3,000 size-2
    components; bridging it with within-species gene–gene correlation kNN
    produces interpretable modules (sarcomere, ciliome, ECM, neural, …).
  • homology_graph_delta / find_paralog_substitutions — per-edge
    sequence-similarity vs expression-correlation residuals; ranked
    paralog-substitution candidates without an external orthology DB.
  • samap.analysis.ontology — disk-based re-scoring of persisted
    obsp['connectivities'] against arbitrary per-species label sets
    (~0.2 s/pair, no SAM/SAMAP objects); union-graph assembly and Leiden
    community detection for many-species cell-type-family discovery;
    family_phylogenetic_signal classifies families as "lineage" vs
    "program" by within-family score–divergence correlation with
    same-study-clade exclusion control.
  • SAMAP.run(..., joint_weights=True)experimental: recompute SAM
    gene weights on the joint manifold after iteration 1. Off by default;
    no behaviour change.

Tests

  • 261 passed / 26 skipped.
  • 1 known pre-existing failure (tests/regression/test_golden_3species):
    gnnm_refined mismatches the recorded golden (max abs diff 1.34e−2 vs
    tol 1e-4) — hnswlib add_items is non-deterministic across platforms
    even with the thin determinism wrapper. Verified to fail identically
    on main (8ffeaec)
    ; unrelated to samap.io / samap.analysis.
  • NUMBA_NUM_THREADS pinning hoisted to root tests/conftest.py so it
    applies before any test module imports numba.
  • ruff clean.

Docs

  • docs/io.md — onboarding guide with workflow diagram and CLI summary.
  • docs/analysis.md — interpretation guide with workflow diagram,
    per-function rationale, and a quick-reference table.
  • CHANGELOG.md updated under [Unreleased].

Numbers from the benchmark

Used to set defaults and validate API design (not committed to the repo):

  • 21 species, 210 pairs, ~3.9 M cells (CELLxGENE / GEO / org-specific DBs).
  • 838 cell-type clusters (k=40 Leiden), 12,731 cross-species edges,
    2,954 reciprocal-best edges.
  • Permutation null q95 ≈ 0.01–0.05; tile-ability horizon ≈ 90–160 Mya.
  • 39 cross-method consensus cell-type families (RBH ∩ program-expression),
    housekeeper filter improves modularity 0.65 → 0.67.

operon added 15 commits April 30, 2026 20:58
SAMAP.__init__ now logs the per-species fraction of var_names that
intersect the homology graph, and emits a WARNING (with 5 example
unmatched IDs from each side) when overlap < 30%. This catches the
single most common onboarding failure — FASTA-header / var_names
namespace mismatch — at the moment both ID sets are in hand.

Adds HOMOLOGY_OVERLAP_WARN_THRESHOLD constant and a focused unit test.
…pace

Classifies a sample of identifiers as ensembl_{gene,tx,protein},
refseq_{rna,protein}, ncbi_geneid, uniprot, flybase/wormbase/zfin/
mgi/hgnc/wbps, symbol (heuristic), unknown, or mixed. Reports
dominant flavor + confidence + per-pattern counts + version-suffix
flag + examples. Downstream io.fetch / io.match dispatch on this.
Backends: Ensembl REST (ensembl_gene/tx/protein, batch=50, longest-
isoform-per-gene by default, version-suffix stripped for the request
but preserved in output headers, tx/protein → gene names map via
/lookup/id) and NCBI Datasets v2 (ncbi_geneid → protein.faa, headers
parsed back via [GeneID=N]). Output FASTA headers ARE the input ids,
so they match adata.var_names by construction.

Stdlib urllib only — no new runtime deps. Adds `network` pytest
marker; live smoke test passes against rest.ensembl.org.
Runs a transform cascade (identity, first_token, strip_version,
uniprot sp|ACC|, lcl| strip, gene[:=]/gene_id[:=], transcript[:=],
before/after pipe; plus optional GTF tx→gene and explicit mapping)
and reports per-transform overlap fraction. Picks the best, builds
the names[] array for SAMAP(names=...), and optionally writes a
renamed FASTA whose headers are the var_names (longest seq per
var_name). Tested against bundled planarian/schistosome data.
gnnm_from_pairs is the primitive: any (N,2) ortholog table →
(csr_matrix, gns, gns_dict) for SAMAP(gnnm=...). homology_from_eggnog
parses emapper TSVs, takes OG co-membership at a chosen taxon level
(default Eukaryota 2759), and emits cross-species edges via
gnnm_from_pairs. Works for non-model species — emapper takes raw
sequences. Live-tested on bundled hy/pl/sc eggnog TSVs (358k edges,
0.7s); resulting gnnm plugs straight into SAMAP init.

Compara is intentionally not REST-wrapped (one request per gene);
the docstring points BioMart bulk-export users at gnnm_from_pairs.
run_blast ports map_genes.sh to Python: all N-choose-2 reciprocal
alignments in one call, DIAMOND-first (very-sensitive, falls back
to BLAST+ for tblastn/tblastx), -outfmt 6, output layout matches
_calculate_blast_graph. Skip-existing, per-direction engine
resolution, informative no-aligner error.

save_gnnm/load_gnnm round-trip the (gnnm, gns, gns_dict) tuple
to a single .npz so SAMAP init can skip the ~38s BLAST-table
parse on repeat runs.
…a / blast

Thin argparse shim around samap.io. Registers [project.scripts]
samap = 'samap.__main__:main'. `samap blast --cache gnnm.npz` runs
all reciprocal alignments and writes the npz cache in one shot.
engine='auto' now resolves DIAMOND for protein-DB targets and MMseqs2
for nucleotide-DB targets (--search-type 2 for nucl↔nucl), with NCBI
BLAST+ as last resort. MMseqs2 path adds --gpu 1 when nvidia-smi is
present (override via run_blast(gpu=...) / `samap blast --gpu/--no-gpu`).
DBs are now built lazily per (species, engine). All three engines emit
the same 12-col outfmt-6 table that _calculate_blast_graph reads.

All aligners are conda-installable: `conda install -c bioconda diamond
mmseqs2 blast`. Sensitivity flag maps to MMseqs2 -s (5.7/7.5/8.5).
…generacy, iter-0 scoring, joint_weights flag

- get_mapping_scores(..., which_iter): score against any iteration's manifold;
  SAMAP.run now stores obsp['connectivities_iter0'] + sm.nnm_per_iter[]
- analysis/null.py: permutation_null_scores — label-shuffle null, no rerun
- analysis/homology_delta.py: per-edge seq-vs-corr residuals + paralog-sub finder
- analysis/modules.py: gene_modules (Leiden on hom graph) + module_factored_scores
- analysis/degeneracy.py: cluster_to_k (binary-search res→k) + mapping_degeneracy
- SAMAP.run(joint_weights=True): experimental cross-species SAM weight recompute
- tests: tiny_samap fixture (2-species planted structure, ~15s) + 16 unit tests
- fix: hoist NUMBA_NUM_THREADS pinning to root conftest (collection-order race)
…ex order

GenePairFinder.find_all lexically sorts each (ct1,ct2) pair by full
species-prefixed label, so column-name species order need not match the
keys-dict order. Surfaced on real data (nv↔dr: KeyError 'dr_neuronal');
the synthetic fixture's species codes (aa,bb) happen to be lex-ordered
so unit tests didn't catch it.
…oexpression

The reweighted homology graph is bipartite-per-species-pair (edges only
between species), so Leiden fragments it into orthogroup-sized components
(~3k size-2 modules on real pairs) and module_factored_scores degenerates
to a gene-pair count. Two new densification options:

- snn=k: Jaccard shared-nearest-neighbour over the homology adjacency
  (cheap, sequence-driven; insufficient alone — components stay disconnected)
- with_coexpression=k: per-species gene-gene Pearson kNN on kNN-averaged
  expression (bridges within-species; produces program-sized modules,
  ~30-40 modules of median size ~300 on ml↔mm / nv↔mm)

This is the gene-level analogue of SAMap's cell-level cross∪within-kNN
manifold. With coexpr densification, ml:Colloblast_1→mm:cardiac is 76%
one module (the sarcomere program: Myh*/Myl*/Tnn*/Ttn/Actc1/Ryr2/Nkx2-5)
while nv:retractor→mm:cardiac is 26% top across 5 modules — the
single-program-vs-multi-program discrimination the function was designed
for now reads out.
… builder

persist_pair / score_from_connectivities / build_union_graph /
cluster_families. Recomputes get_mapping_scores' output from a persisted
obsp + per-species labels in ~0.2s/pair, no SAM/SAMAP objects in memory.
Validated to r=0.9999 over 105 corpus pairs. The persisted obs.parquet
carries the species column (do NOT infer species from obs-name prefix —
not guaranteed); the connectivities matrix is directed kNN, so both
X[ia,ib] and X[ib,ia] are read separately (transpose is wrong when cell
counts differ).

Enables: rebuild a 600-node 15-species cell-type union graph at any
label set in <30s; Leiden on RBH-only edges → cell-type families;
housekeeper-edge filtering via the global gene-module decomposition
(to be added in a follow-up) raises modularity 0.55→0.61 and gives each
family a defining gene program (sarcomere, ciliated, ECM, lysosomal, ...).
…core_from_connectivities

SAMAP's combined adata calls obs_names_make_unique(), which appends -1/-2/...
to obs-names that collide across species (common when two 10x datasets share
raw barcodes). score_from_connectivities reindexes user-supplied per-species
labels against the *persisted* (suffixed) obs-names, so a direct reindex
misses those cells. Now retry missed lookups with the trailing -N stripped
(only for cells that missed on direct lookup, so legitimate '-1' suffixes
are preserved).

Found on the 21-species corpus: blsy had 4904 sy cells (WASP barcodes
colliding with bl) and chha had 13602 ch cells unmatched.
…ram classifier

For each cell-type family from cluster_families, correlate the per-species-
pair within-family max alignment score with phylogenetic divergence. A
family with negative ρ behaves like a conserved cell-type lineage; positive
ρ marks a generic-program bin (deeper pairs score higher because the
program is the only signal left). exclude_pairs control removes same-study
clades that otherwise dominate the rank correlation.

On the 21-species/210-pair corpus this classifies 4 families lineage-like
(muscle, cnidocyte/secretory, gastrodermis, proliferative; ρ_ex −0.24 to
−0.50), 3 program-bins (fibre/ECM, epithelial, gland-like; ρ_ex +0.16 to
+0.29), 18 ambiguous. Validates to ρ=1.0000 against the in-kernel
reference. 0.5s for 25 families × 12,731 edges.
Mirrors docs/io.md: workflow diagram, per-function rationale, and a
quick-reference table covering iter-0 vs converged scores, permutation
null, degeneracy/tile-ability, module-factored decomposition, paralog
substitutions, disk-based re-scoring, and lineage-vs-program family
classification. Updates CHANGELOG to reference the 21-species 210-pair
benchmark and the new doc.
@atarashansky atarashansky changed the title samap.io — lower the BLAST onboarding wall SAMap v2: samap.io onboarding + samap.analysis interpretation May 5, 2026
Replace recalled '~3,000 size-2 components' with measured numbers
from the 3-species mm-hs-dr graph (6,697 components: 1,295 singletons,
~3,000 size 2-3, one giant of ~20k genes). The qualitative point
(coexpression bridging is load-bearing) is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant