Describe the bug
_extract_representative_docs() maps selected documents back to their original indices using text membership testing:
doc_ids = [selected_docs_ids[index] for index, doc in enumerate(selected_docs) if doc in docs]
When the same document text appears in multiple topics (common with short texts, boilerplate, or near-duplicates), doc in docs matches the first occurrence regardless of which topic the document actually belongs to. This causes:
- Documents skipped entirely (their index never matched for the right topic)
- Misaligned doc_ids ↔ selected_docs pairs: wrong similarity scores assigned to wrong documents
- representative_docs_ containing documents that don't belong to their assigned topic
This is a separate bug from the replace=True sampling issue (#2491). That one creates duplicates going into the function; this bug is in the output mapping logic.
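To make the failure mode concrete, here is a minimal standalone sketch that does not import BERTopic. It mirrors the quoted one-liner on a hypothetical topic slice in which two documents share the same text. The function names, the sample IDs, and the chosen positions are all assumptions for illustration. The positional variant is only one possible reading of "positional indexing", not necessarily the fix in the fork.

```python
def ids_by_text_membership(selected_docs, selected_docs_ids, docs):
    """Mirror of the quoted one-liner: keep an ID whenever its text is in `docs`."""
    return [selected_docs_ids[i] for i, doc in enumerate(selected_docs) if doc in docs]


def ids_by_position(selected_docs_ids, indices):
    """Hypothetical alternative: reuse the positions that were actually selected."""
    return [selected_docs_ids[i] for i in indices]


# Hypothetical slice for one topic: positions 0 and 2 hold identical text.
selected_docs = [
    "machine learning is great",
    "deep learning neural networks",
    "machine learning is great",
]
selected_docs_ids = [10, 11, 12]  # assumed original row indices

# Assume the selection step picked positions 2 and 1 as representative.
indices = [2, 1]
docs = [selected_docs[i] for i in indices]

by_text = ids_by_text_membership(selected_docs, selected_docs_ids, docs)
by_position = ids_by_position(selected_docs_ids, indices)

print(by_text)      # [10, 11, 12] -> three IDs for two selected documents
print(by_position)  # [12, 11]     -> one ID per selected document, same order
```

In this toy case the text-based lookup returns an extra ID (10) that was never selected, and its order no longer lines up with docs, so anything zipped against it downstream would be paired incorrectly.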
Reproduction
from bertopic import BERTopic
# Create a dataset where some documents appear verbatim in multiple topics
docs = [
    "machine learning is great",  # shared text
    "deep learning neural networks",
    "machine learning is great",  # same text, different context
    "topic modeling with BERTopic",
    # ... enough docs to form 2+ topics with shared text
]
topic_model = BERTopic(min_topic_size=2)
topics, _ = topic_model.fit_transform(docs)
# → representative_docs_ may contain docs mapped to the wrong topic
BERTopic Version
0.17.4
Your contribution
I've already worked through a fix in my fork (replacing the text-based membership lookup with positional indexing), with tests. Happy to open a PR if you're on board with the approach.