
WIP: Add MSA homologue search and RAG-augmented E1 embedding/scoring pipelines #25

Open

nrafaili wants to merge 2 commits into Synthyra:main from nrafaili:e1-rag
Conversation


@nrafaili nrafaili commented Mar 17, 2026

Summary

  • homologue_search.py: homologue retrieval modules using either local MMseqs2 via Docker or the ColabFold API
  • rag_e1.py: Retrieval-augmented prediction with E1 models using MSA-derived context:
    • MSA loading from local dir or HuggingFace, with fuzzy matching for mutants
    • PPLL scoring with RAG using the 15 context prompts from the E1 preprint
    • Embedding extraction with single-context prompt
  • e1_utils.py: Unmodified contents of io.py, msa_sampling.py, and predictor.py from the E1 repository
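The "fuzzy matching for mutants" mentioned above could work along these lines: given a mutant identifier, pick the closest-named MSA file stem from the local directory. This is only an illustrative sketch using `difflib`; the function name, cutoff, and matching strategy are assumptions, not the PR's actual implementation.

```python
import difflib

def find_msa_for(name, available):
    """Return the closest-matching MSA filename stem, or None if nothing
    is close enough. Hypothetical helper; cutoff of 0.6 is arbitrary."""
    matches = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
    return matches[0] if matches else None

# A mutant name like "GFP_wildtype_A42G" would resolve to the parent MSA.
stems = ["GFP_wildtype", "lysozyme"]
print(find_msa_for("GFP_wildtype_A42G", stems))  # GFP_wildtype
```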

Usage

Homologue search

```python
from e1_fastplms.homologue_search import HomologueSearcher, ColabFoldSearcher

# Local MMseqs2 (requires Docker + target DB)
searcher = HomologueSearcher(target_db="path/to/uniref30")
a3m_path = searcher.search("MKTL...", output_dir="msas/")

# ColabFold API (no local DB needed)
searcher = ColabFoldSearcher(user_agent="your@email.com")
a3m_path = searcher.search("MKTL...", output_dir="msas/")
```

RAG-augmented scoring and embedding

```python
from e1_fastplms.rag_e1 import E1RAGPredictor

predictor = E1RAGPredictor.from_pretrained("Synthyra/Profluent-E1-600M")

# PPLL scoring (15-prompt ensemble)
scores = predictor.score_ppll(["MKTL...", "MKTV..."], a3m_path="msas/query.a3m")

# Embedding with RAG context
embeddings = predictor.embed(["MKTL..."], a3m_path="msas/query.a3m", pooling="mean")

# Dataset embedding with MSA lookup
emb_dict = predictor.embed_dataset(sequences, msa_dir="msas/", batch_size=4)
```

Test plan

  • Verify HomologueSearcher and ColabFoldSearcher return valid .a3m files
  • Confirm PPLL scoring matches the 15-prompt ensemble procedure from the E1 preprint
  • Confirm RAG embeddings differ from no-context embeddings and evaluate performance impact
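The first check ("return valid .a3m files") could start with a lightweight structural sanity check before running the full pipeline. A minimal sketch, assuming one sequence per line per a3m entry; the function name and threshold are illustrative, not part of this PR:

```python
def is_valid_a3m(text, min_sequences=1):
    """Hypothetical sanity check for a3m text: alternating headers and
    sequences, with every entry's aligned columns (uppercase residues
    and '-' gaps; lowercase letters are unaligned insertions) matching
    the query's column count."""
    lines = [ln for ln in text.strip().splitlines() if ln and not ln.startswith("#")]
    headers = [ln for ln in lines if ln.startswith(">")]
    seqs = [ln for ln in lines if not ln.startswith(">")]
    if len(headers) < min_sequences or len(headers) != len(seqs):
        return False
    query_cols = sum(1 for c in seqs[0] if c.isupper() or c == "-")
    return all(
        sum(1 for c in s if c.isupper() or c == "-") == query_cols for s in seqs
    )

# "av" is a lowercase insertion, so hit1 still spans 4 aligned columns.
print(is_valid_a3m(">query\nMKTL\n>hit1\nMK-Lav\n"))  # True
```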


lhallee commented Mar 17, 2026

Hi @nrafaili ,

Thanks for the PR!

Just for clarity: to stay within the goals of this project, the entire model should be a single object loaded via auto model in HuggingFace. These options imported from E1 FastPLMs aren't going to be merged and are not supported, because the expectation is that users do not have to clone this repository to use the classes. We are open to solutions for RAG and homolog search (that sounds great), but they have to exist within the main class inherited in the E1 model, as easy-to-use functions on the base class, like PPLL, embed, etc.

Also, because the embed mixin is already inherited, the .embed function needs a different name if it's going to behave differently from the base last-hidden-state-to-pooling workflow.
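The shape being asked for here might look like the following. This is purely a hypothetical sketch of the suggested API; the class name, method names, and stubbed return values are illustrative and not the actual E1 code:

```python
class E1ForRAG:
    """Stand-in for the single auto-model class loaded from HuggingFace.
    All names here are hypothetical, illustrating the review suggestion."""

    def embed(self, sequences):
        # Inherited-style embed: plain last-hidden-state pooling (stubbed).
        return [[0.0] for _ in sequences]

    def embed_with_context(self, sequences, a3m_path=None):
        # Context-aware variant under a distinct name, so it does not
        # shadow the inherited mixin's embed.
        context = self._load_context(a3m_path)
        return [[float(len(context))] for _ in sequences]  # stubbed

    def _load_context(self, a3m_path):
        # Stub: pretend one context prompt is loaded when an MSA is given.
        return [] if a3m_path is None else ["ctx"]
```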
