Description
Is your feature request related to a problem? Please describe.
Currently, CountFrequencyEncoder raises a ValueError during the transform() step if it encounters categories that were not seen during the fit() phase. This behavior can interrupt pipelines and make the transformer less flexible when working with real-world datasets where unseen categories frequently occur during inference or deployment.
Describe the solution you'd like
Introduce a parameter to control how unseen categories should be handled during transform(). For example:
unseen_categories: str = "raise"  # options: 'raise', 'warn', 'ignore'

- raise → Keep the current behavior and raise a ValueError.
- warn → Encode unseen categories as NaN (or optionally 0) and emit a UserWarning indicating which categories were unseen.
- ignore → Encode unseen categories as NaN silently, without raising an error.
This would allow the transformer to continue operating while still informing the user when unexpected categories appear.
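To make the three proposed modes concrete, here is a minimal, standard-library-only sketch of how they might behave. It is not the library's implementation: the class name SketchCountEncoder, the list-based fit/transform interface, and the message wording are all assumptions made for illustration. Only the parameter name unseen_categories and its three options come from the proposal above.

```python
import math
import warnings
from collections import Counter


class SketchCountEncoder:
    """Hypothetical stand-in for a count encoder (illustration only)."""

    _OPTIONS = ("raise", "warn", "ignore")

    def __init__(self, unseen_categories="raise"):
        if unseen_categories not in self._OPTIONS:
            raise ValueError(
                f"unseen_categories must be one of {self._OPTIONS}, "
                f"got {unseen_categories!r}"
            )
        self.unseen_categories = unseen_categories

    def fit(self, values):
        # Learn how often each category occurs in the training data.
        self.counts_ = Counter(values)
        return self

    def transform(self, values):
        # Collect categories that were not present during fit().
        unseen = sorted({v for v in values if v not in self.counts_}, key=str)
        if unseen:
            if self.unseen_categories == "raise":
                raise ValueError(f"Unseen categories during transform: {unseen}")
            if self.unseen_categories == "warn":
                warnings.warn(
                    f"Unseen categories encoded as NaN: {unseen}", UserWarning
                )
            # "ignore" falls through silently.
        # Known categories map to their count; unseen ones map to NaN.
        return [
            float(self.counts_[v]) if v in self.counts_ else math.nan
            for v in values
        ]


# Usage: "c" never appeared in the training data.
train = ["a", "a", "b"]
encoded = SketchCountEncoder(unseen_categories="ignore").fit(train).transform(["a", "c"])
```

In this sketch, "warn" and "ignore" produce the same encoded output and differ only in whether a UserWarning is emitted, which matches the description of the options above.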
Describe alternatives you've considered
An alternative approach could be to always encode unseen categories as NaN without providing configuration options. However, this removes user control over strict validation and may hide data issues. Providing a configurable parameter maintains flexibility while preserving the option to enforce strict behavior.
Additional context
This change would align CountFrequencyEncoder with the design pattern being introduced across other transformers in the library that avoid raising errors during transformation and instead provide configurable handling of unexpected values. It would also improve usability in production pipelines where unseen categories are common.