feat: Add StringListBinarizer to encode multi-label strings and lists by ankitlade12 · Pull Request #916 · feature-engine/feature_engine

ankitlade12 · 2026-03-10T16:07:58Z

Description

This PR introduces the StringListBinarizer to the feature_engine.encoding module.

When dealing with modern datasets (e.g., e-commerce, web logs, or NLP metadata), it's extremely common to have columns containing multiple categories per row. This data usually arrives in one of two ways:

Comma-delimited strings: "action, comedy, thriller"
Python lists parsed from JSON: ["action", "comedy", "thriller"]

Currently, users must either write custom pandas .apply functions or work around scikit-learn's MultiLabelBinarizer, which returns raw NumPy arrays, drops feature names, and requires an iterable of iterables as input.
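For reference, the MultiLabelBinarizer workaround looks roughly like the sketch below. It reuses the sample data from the Examples section; the intermediate variable names are only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"],
})

# MultiLabelBinarizer needs an iterable of iterables, so split the strings first.
tags = df["genres"].str.split(", ")

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tags)  # plain NumPy array, no column names

# Column names and the original index have to be restored by hand.
encoded_df = pd.DataFrame(
    encoded,
    columns=[f"genres_{label}" for label in mlb.classes_],
    index=df.index,
)
result = pd.concat([df.drop(columns="genres"), encoded_df], axis=1)
print(result)
```

The splitting, renaming, and re-joining steps are the boilerplate this PR aims to remove.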

StringListBinarizer is a native Feature-engine transformer that splits delimited strings by a given separator and one-hot encodes every tag found in the dataset. It operates directly on pandas DataFrames and returns clearly named binary indicator columns (e.g., genres_action, genres_comedy).
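The split-then-encode idea can be sketched in plain Python as follows. This is not the code in this PR: the helper names (split_row, fit_tags, transform_tags) are hypothetical, and silently dropping tags that were not seen during fit is an assumption about how the unseen-category fallback might behave.

```python
def split_row(row, separator=","):
    # Accept either a delimited string or an existing list of tags.
    items = row.split(separator) if isinstance(row, str) else list(row)
    return [item.strip() for item in items if item.strip()]


def fit_tags(rows, separator=","):
    # Collect the sorted set of tags seen across all rows.
    seen = set()
    for row in rows:
        seen.update(split_row(row, separator))
    return sorted(seen)


def transform_tags(rows, tags, variable, separator=","):
    # Emit one 0/1 indicator per learned tag, named "<variable>_<tag>".
    # Assumption: tags not learned during fit are ignored.
    encoded = []
    for row in rows:
        present = set(split_row(row, separator))
        encoded.append({f"{variable}_{tag}": int(tag in present) for tag in tags})
    return encoded


train = ["action, comedy", "comedy", ["action", "thriller"]]
tags = fit_tags(train)
rows = transform_tags(["comedy, horror"], tags, "genres")
print(tags)
print(rows)
```

Keeping the learned tag list separate from the encoding step mirrors the usual fit/transform split, so new data always produces the same columns as the training data.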

Changes:

Added StringListBinarizer class in feature_engine/encoding/string_list_binarizer.py.
Exported StringListBinarizer in feature_engine/encoding/__init__.py.
Added tests for delimited string formats, Python list formats, the unseen-categories fallback, and parameter validation.
Added full API documentation in docs/api_doc/encoding/StringListBinarizer.rst.
Added User Guide explanations and examples in docs/user_guide/encoding/StringListBinarizer.rst.

Examples:

import pandas as pd
from feature_engine.encoding import StringListBinarizer

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "genres": ["action, comedy", "comedy", "action, thriller"]
})

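# Split each "genres" string on ", " and one-hot encode the resulting tags.
encoder = StringListBinarizer(
    variables=["genres"],
    separator=", ",
)

# Learn the set of tags from the data, then encode it.
encoder.fit(df)
df_encoded = encoder.transform(df)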

# Output:
#    user_id  genres_action  genres_comedy  genres_thriller
# 0        1              1              1                0
# 1        2              0              1                0
# 2        3              1              0                1

codecov · 2026-03-11T21:55:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.30%. Comparing base (f72a2b7) to head (eae5f5e).

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #916      +/-   ##
==========================================
+ Coverage   98.27%   98.30%   +0.02%     
==========================================
  Files         116      117       +1     
  Lines        4978     5063      +85     
  Branches      795      814      +19     
==========================================
+ Hits         4892     4977      +85     
  Misses         55       55              
  Partials       31       31


ankitlade12 added 6 commits March 10, 2026 11:06

feat: Add StringListBinarizer to encode multi-label strings and lists

11081e5

fix: address flake8 for StringListBinarizer

5fc5bf5

fix: support pandas string dtype in StringListBinarizer

8e67332

chore: fix flake8 for StringListBinarizer tests

3edebe0

chore: normalize blank lines in StringListBinarizer tests

9e30848

chore: add missing blank line before first StringListBinarizer test

2842cca

test: add coverage for StringListBinarizer (init validation, ignore_format paths, non-str/list rows, get_feature_names_out, _more_tags)

eae5f5e

ankitlade12 wants to merge 7 commits into feature-engine:main from ankitlade12:feat/string-list-binarizer
