_extract_representative_docs uses text-based in matching, causing wrong documents when duplicates exist #2492

@pidefrem

Describe the bug

_extract_representative_docs() maps selected documents back to their original indices using text membership testing:

doc_ids = [selected_docs_ids[index] for index, doc in enumerate(selected_docs) if doc in docs]

When the same document text appears in multiple topics (common with short texts, boilerplate, or near-duplicates), doc in docs succeeds for any entry whose text equals a selected document, regardless of which topic the document actually belongs to, because membership compares text rather than position. This causes:

  • Documents skipped entirely (their index never matched for the right topic)
  • Misaligned doc_ids/selected_docs pairs, so wrong similarity scores are assigned to wrong documents
  • representative_docs_ containing documents that don't belong to their assigned topic
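
The misalignment can be shown without BERTopic at all. The sketch below applies the quoted list comprehension to hypothetical toy data: the ids 10, 11, 12 and the chosen_positions list are assumptions made up for illustration, not values taken from the library. Only one document is selected, yet two ids come back.

```python
# Hypothetical toy data: documents of one group and their original row ids.
selected_docs = [
    "machine learning is great",
    "deep learning neural networks",
    "machine learning is great",  # same text as position 0
]
selected_docs_ids = [10, 11, 12]

# Assume the selection step picked only the document at position 2.
chosen_positions = [2]
docs = [selected_docs[i] for i in chosen_positions]

# Text-membership mapping, as in the line quoted in this issue.
doc_ids = [
    selected_docs_ids[index]
    for index, doc in enumerate(selected_docs)
    if doc in docs
]

print(docs)     # ['machine learning is great']
print(doc_ids)  # [10, 12]
```

Because both copies pass the membership test, doc_ids has two entries for one selected document, so anything paired with it by position is off by one from that point on.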

This is a separate bug from the replace=True sampling issue (#2491). That one creates duplicates going into the function; this bug is in the output mapping logic.

Reproduction

from bertopic import BERTopic

# Create a dataset where some documents appear verbatim in multiple topics
docs = [
    "machine learning is great",  # shared text
    "deep learning neural networks",
    "machine learning is great",  # same text, different context
    "topic modeling with BERTopic",
    # ... enough docs to form 2+ topics with shared text
]
topic_model = BERTopic(min_topic_size=2)
topics, _ = topic_model.fit_transform(docs)
# → representative_docs_ may contain docs mapped to the wrong topic

BERTopic Version

0.17.4

Your contribution

I've already worked through a fix in my fork (replacing the text-based membership lookup with positional indexing), with tests. Happy to open a PR if you're on board with the approach.
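
To illustrate the direction, here is a minimal sketch of positional indexing. It is not the code from the fork: the helper map_by_position and its chosen_positions argument are hypothetical names, and the sketch assumes the selection step can expose the positions it picked rather than only the texts.

```python
def map_by_position(selected_docs, selected_docs_ids, chosen_positions):
    """Return (docs, doc_ids) aligned one-to-one by position, not by text."""
    docs = [selected_docs[i] for i in chosen_positions]
    doc_ids = [selected_docs_ids[i] for i in chosen_positions]
    return docs, doc_ids


# Hypothetical toy data with a duplicated text at positions 0 and 2.
selected_docs = [
    "machine learning is great",
    "deep learning neural networks",
    "machine learning is great",
]
selected_docs_ids = [10, 11, 12]

docs, doc_ids = map_by_position(selected_docs, selected_docs_ids, [2, 1])

print(docs)     # ['machine learning is great', 'deep learning neural networks']
print(doc_ids)  # [12, 11]
```

Since both lists are built from the same positions, they always have equal length and matching order, and duplicated text no longer pulls in unselected copies.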
