
feat: Add StringListBinarizer to encode multi-label strings and lists#916

Open
ankitlade12 wants to merge 7 commits into feature-engine:main from ankitlade12:feat/string-list-binarizer
@ankitlade12
Contributor

Description

This PR introduces the StringListBinarizer to the feature_engine.encoding module.

Datasets from e-commerce, web logs, or NLP metadata often contain columns that hold multiple categories per row. This data usually arrives in one of two forms:

  1. Comma-delimited strings: "action, comedy, thriller"
  2. Python lists (for example, parsed from JSON): ["action", "comedy", "thriller"]
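
The two input forms above can be reduced to one representation before encoding. The following is a minimal sketch using only the standard library; the helper name `to_tags` and its handling of missing values are assumptions made for illustration and are not taken from the PR.

```python
def to_tags(value, separator=","):
    """Return a list of cleaned tags from a delimited string or a list."""
    if isinstance(value, str):
        parts = value.split(separator)
    elif isinstance(value, (list, tuple)):
        parts = value
    else:
        # Assumption: anything else (e.g. None) is treated as "no tags".
        return []
    # Strip surrounding whitespace and drop empty entries.
    return [str(p).strip() for p in parts if str(p).strip()]


rows = ["action, comedy, thriller", ["action", "comedy", "thriller"], None]
normalized = [to_tags(r) for r in rows]
```

Both the delimited string and the list end up as the same list of three tags, so a single encoding step can handle either form.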

Currently, scikit-learn users must either write custom pandas .apply functions or use MultiLabelBinarizer, which returns a plain NumPy array without column names and requires an iterable of iterables as input.
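
For comparison, this is roughly what the MultiLabelBinarizer route looks like today. It is a sketch of the existing workaround, not code from this PR; the `genres_` prefix is added by hand here to mirror the naming used in the description.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({"genres": ["action, comedy", "comedy", "action, thriller"]})

# MultiLabelBinarizer needs an iterable of iterables, so split first.
tag_lists = df["genres"].str.split(", ")

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tag_lists)  # plain NumPy array, no column names

# Column names have to be rebuilt manually from mlb.classes_.
encoded_df = pd.DataFrame(
    encoded,
    columns=[f"genres_{c}" for c in mlb.classes_],
    index=df.index,
)
```

The split, the array-to-DataFrame conversion, and the column naming are all steps the user has to write and maintain.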

StringListBinarizer is a native Feature-engine transformer that splits delimited strings by a given separator and one-hot encodes every tag found in the dataset. It operates directly on pandas DataFrames and returns binary indicator columns named after the variable and tag (e.g., genres_action, genres_comedy).
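
The split-then-encode behaviour described here can be approximated in plain pandas. The sketch below uses `Series.str.get_dummies` to show the intended result; it is not the transformer's implementation, which this description does not show.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"],
})

# Split on the separator and one-hot encode every tag found in the column.
dummies = df["genres"].str.get_dummies(sep=", ").add_prefix("genres_")

# Replace the original column with the indicator columns.
result = pd.concat([df.drop(columns="genres"), dummies], axis=1)
```

Unlike a fitted transformer, this one-off approach does not remember the tags seen during training, so it cannot guarantee the same columns on new data.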

Changes:

  • Added StringListBinarizer class in feature_engine/encoding/string_list_binarizer.py.
  • Exported StringListBinarizer in feature_engine/encoding/__init__.py.
  • Included tests for delimited string formats, Python list formats, unseen categories fallback, and parameter validation.
  • Added full API documentation in docs/api_doc/encoding/StringListBinarizer.rst.
  • Added User Guide explanations and examples in docs/user_guide/encoding/StringListBinarizer.rst.
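
The change list mentions an unseen categories fallback without describing it. One common approach, assumed here purely for illustration, is to ignore tags that were not present at fit time and to fill learned tags that are absent with 0. A pandas sketch of that idea:

```python
import pandas as pd

train = pd.Series(["action, comedy", "comedy"], name="genres")
test = pd.Series(["comedy, horror", "action"], name="genres")

# "Fit": remember the tag columns seen in the training data.
learned_columns = train.str.get_dummies(sep=", ").add_prefix("genres_").columns

# "Transform": encode new data, then align it to the learned columns.
# The unseen tag (horror) is dropped; absent learned tags are filled with 0.
encoded = (
    test.str.get_dummies(sep=", ")
    .add_prefix("genres_")
    .reindex(columns=learned_columns, fill_value=0)
)
```

Whether the transformer in this PR ignores, warns on, or raises for unseen tags should be checked against its tests and documentation.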

Examples:

import pandas as pd
from feature_engine.encoding import StringListBinarizer

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"]
})

encoder = StringListBinarizer(
    variables=["genres"],
    separator=", ",
)

encoder.fit(df)
df_encoded = encoder.transform(df)

# Output:
#    user_id  genres_action  genres_comedy  genres_thriller
# 0        1              1              1                0
# 1        2              0              1                0
# 2        3              1              0                1

@codecov

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.30%. Comparing base (f72a2b7) to head (eae5f5e).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   98.27%   98.30%   +0.02%     
==========================================
  Files         116      117       +1     
  Lines        4978     5063      +85     
  Branches      795      814      +19     
==========================================
+ Hits         4892     4977      +85     
  Misses         55       55              
  Partials       31       31              

…ormat paths, non-str/list rows, get_feature_names_out, _more_tags)