Description
Is your feature request related to a problem? Please describe.
In the current implementation, when drop_last=True, OneHotEncoder always drops the last category (alphabetically). This makes it impossible for users to control which category is used as the reference group. In many modeling scenarios (for example logistic regression or other linear models), the choice of the reference category matters and users may want to drop a different category.
Describe the solution you'd like
Add a drop parameter that allows users to control which dummy category is dropped.
drop: str = "last"  # options: "last", "first", "most_frequent"

- "last" (default): preserves current behaviour; drops the last category alphabetically.
- "first": drops the first category alphabetically.
- "most_frequent": drops the most frequent category found during fit(), which can be a more statistically meaningful reference group.
If drop="most_frequent" and several categories tie for the highest frequency, the transformer should issue a UserWarning and fall back to dropping the first category found.
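To make the proposed semantics concrete, here is a minimal sketch of the selection logic for a single variable. It is only an illustration, not the library's implementation: `choose_dropped_category` is a hypothetical helper, and it assumes that "first category found" means the first tied category in order of appearance in the training data.

```python
import warnings
from collections import Counter


def choose_dropped_category(values, drop="last"):
    """Return the category to drop for one variable (hypothetical helper)."""
    if drop not in ("last", "first", "most_frequent"):
        raise ValueError(f"Unsupported drop option: {drop!r}")
    if len(values) == 0:
        raise ValueError("Cannot choose a category from empty data.")

    if drop in ("last", "first"):
        # "last" and "first" refer to alphabetical order, as described above.
        categories = sorted(set(values))
        return categories[-1] if drop == "last" else categories[0]

    # drop == "most_frequent": Counter keeps first-seen order (Python 3.7+).
    counts = Counter(values)
    top = max(counts.values())
    tied = [cat for cat, n in counts.items() if n == top]
    if len(tied) > 1:
        # Assumption: on a tie, fall back to the first tied category seen.
        warnings.warn(
            f"Categories {tied} share the highest frequency; "
            f"dropping {tied[0]!r}.",
            UserWarning,
        )
    return tied[0]
```

For example, `choose_dropped_category(["b", "a", "c", "a"], drop="most_frequent")` would pick `"a"` as the reference category, while the same data with `drop="last"` would pick `"c"`.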
The existing drop_last parameter should remain for backward compatibility, but a deprecation warning should be issued if it is used together with the new drop parameter.
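One possible reading of the backward-compatibility rule is sketched below. Several details are assumptions, not part of the proposal: `resolve_drop` is a hypothetical helper, a sentinel default replaces `"last"` so that an explicitly passed `drop` can be detected, `drop` takes precedence when both are given, and no category is dropped when neither is set.

```python
import warnings

# Sentinel (assumption): distinguishes "drop not passed" from drop="last".
_UNSET = object()


def resolve_drop(drop_last=False, drop=_UNSET):
    """Return the drop strategy to apply, or None to keep all k dummies."""
    if drop is not _UNSET:
        if drop_last:
            # Both parameters were supplied: warn and let `drop` win.
            warnings.warn(
                "drop_last is deprecated when used together with drop; "
                "use drop alone.",
                DeprecationWarning,
                stacklevel=2,
            )
        return drop
    # Legacy path: drop_last=True keeps the current "last" behaviour.
    return "last" if drop_last else None
```

With this reading, existing code that only passes `drop_last=True` keeps working without warnings, and the warning appears only when both parameters are supplied.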
Describe alternatives you've considered
Users can currently control the reference category only by manually reordering or preprocessing the categorical values before applying the encoder. However, this is inconvenient and error-prone, especially in larger pipelines.
Additional context
Adding this parameter would make OneHotEncoder more flexible and align it better with common machine learning workflows, particularly when building statistical or linear models where the reference category is important.