
WIP: Add MSA homologue search and RAG-augmented E1 embedding/scoring pipelines #25

Open

nrafaili wants to merge 2 commits into Synthyra:main from nrafaili:e1-rag
Conversation


@nrafaili nrafaili commented Mar 17, 2026

Summary

  • homologue_search.py: homologue retrieval modules using either local MMseqs2 via Docker or the ColabFold API
  • rag_e1.py: Retrieval-augmented prediction with E1 models using MSA-derived context:
    • MSA loading from local dir or HuggingFace, with fuzzy matching for mutants
    • PPLL scoring with RAG using the 15 context prompts from the E1 preprint
    • Embedding extraction with single-context prompt
  • e1_utils.py: Unmodified contents of io.py, msa_sampling.py, and predictor.py from the E1 repository
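The "fuzzy matching for mutants" mentioned above could work along these lines: given a mutant identifier, pick the closest-named MSA file stem from the local directory. This is only an illustrative sketch using `difflib`; the function name, cutoff, and matching strategy are assumptions, not the PR's actual implementation.

```python
import difflib

def find_msa_for(name, available):
    """Return the closest-matching MSA filename stem, or None if nothing
    is close enough. Hypothetical helper; cutoff of 0.6 is arbitrary."""
    matches = difflib.get_close_matches(name, available, n=1, cutoff=0.6)
    return matches[0] if matches else None

# A mutant name like "GFP_wildtype_A42G" would resolve to the parent MSA.
stems = ["GFP_wildtype", "lysozyme"]
print(find_msa_for("GFP_wildtype_A42G", stems))  # GFP_wildtype
```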

Usage

Homologue search

```python
from e1_fastplms.homologue_search import HomologueSearcher, ColabFoldSearcher

# Local MMseqs2 (requires Docker + target DB)
searcher = HomologueSearcher(target_db="path/to/uniref30")
a3m_path = searcher.search("MKTL...", output_dir="msas/")

# ColabFold API (no local DB needed)
searcher = ColabFoldSearcher(user_agent="your@email.com")
a3m_path = searcher.search("MKTL...", output_dir="msas/")
```

RAG-augmented scoring and embedding

```python
from e1_fastplms.rag_e1 import E1RAGPredictor

predictor = E1RAGPredictor.from_pretrained("Synthyra/Profluent-E1-600M")

# PPLL scoring (15-prompt ensemble)
scores = predictor.score_ppll(["MKTL...", "MKTV..."], a3m_path="msas/query.a3m")

# Embedding with RAG context
embeddings = predictor.embed(["MKTL..."], a3m_path="msas/query.a3m", pooling="mean")

# Dataset embedding with MSA lookup
emb_dict = predictor.embed_dataset(sequences, msa_dir="msas/", batch_size=4)
```

Test plan

  • Verify HomologueSearcher and ColabFoldSearcher return valid .a3m files
  • Confirm PPLL scoring matches the 15-prompt ensemble procedure from the E1 preprint
  • Confirm RAG embeddings differ from no-context embeddings and evaluate performance impact
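The first check ("return valid .a3m files") could start with a lightweight structural sanity check before running the full pipeline. A minimal sketch, assuming one sequence per line per a3m entry; the function name and threshold are illustrative, not part of this PR:

```python
def is_valid_a3m(text, min_sequences=1):
    """Hypothetical sanity check for a3m text: alternating headers and
    sequences, with every entry's aligned columns (uppercase residues
    and '-' gaps; lowercase letters are unaligned insertions) matching
    the query's column count."""
    lines = [ln for ln in text.strip().splitlines() if ln and not ln.startswith("#")]
    headers = [ln for ln in lines if ln.startswith(">")]
    seqs = [ln for ln in lines if not ln.startswith(">")]
    if len(headers) < min_sequences or len(headers) != len(seqs):
        return False
    query_cols = sum(1 for c in seqs[0] if c.isupper() or c == "-")
    return all(
        sum(1 for c in s if c.isupper() or c == "-") == query_cols for s in seqs
    )

# "av" is a lowercase insertion, so hit1 still spans 4 aligned columns.
print(is_valid_a3m(">query\nMKTL\n>hit1\nMK-Lav\n"))  # True
```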


lhallee commented Mar 17, 2026

Hi @nrafaili ,

Thanks for the PR!

Just for clarity: to stay within the goals of this project, the entire model should be a single object loaded via auto model in HuggingFace. These options imported from E1 FastPLMs aren't going to be merged and are not supported, because the expectation is that users do not have to clone this repository to use the classes. We are open to solutions for RAG and homolog search (that sounds great), but they have to exist within the main class inherited in the E1 model, as easy-to-use functions on the base class, like PPLL, embed, etc.

Also, because the embed mixin is already inherited, the .embed function needs a different name if it's going to behave differently from the base last-hidden-state-to-pooling workflow.
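The shape being asked for here might look like the following. This is purely a hypothetical sketch of the suggested API; the class name, method names, and stubbed return values are illustrative and not the actual E1 code:

```python
class E1ForRAG:
    """Stand-in for the single auto-model class loaded from HuggingFace.
    All names here are hypothetical, illustrating the review suggestion."""

    def embed(self, sequences):
        # Inherited-style embed: plain last-hidden-state pooling (stubbed).
        return [[0.0] for _ in sequences]

    def embed_with_context(self, sequences, a3m_path=None):
        # Context-aware variant under a distinct name, so it does not
        # shadow the inherited mixin's embed.
        context = self._load_context(a3m_path)
        return [[float(len(context))] for _ in sequences]  # stubbed

    def _load_context(self, a3m_path):
        # Stub: pretend one context prompt is loaded when an MSA is given.
        return [] if a3m_path is None else ["ctx"]
```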
