preprocessing: standardize constant columns to 0.0 (#1058)#1163
Open
jbbqqf wants to merge 1 commit into
Open
Conversation
The constant-column branch of `standardize` pre-zeroed the column before the (x - mean) / std division. With std forced to 1.0 and mean unchanged, that turned a constant column of value `c` into `(0 - c) / 1 = -c` rather than the all-zeros vector promised by the docstring's Notes section. Removing the pre-zeroing lets the existing subtraction collapse the column to exactly 0.0 ((c - c) / 1.0 = 0). Existing tests only used a constant column of value 0, so the bug went unnoticed. Adds three regression tests covering numpy + pandas inputs and the return_params dict, all of which fail on origin/master and pass on this branch. Co-Authored-By: Claude Code <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Code of Conduct
I have read the project's Code of Conduct.
Description
mlxtend.preprocessing.standardizeclaims in itsNotessection that"if all values in a given column are the same, these values are all set
to
0.0." For a constant column whose value is non-zero (e.g.[5, 5, 5]), the function actually returns[-5, -5, -5]instead of[0, 0, 0]. This PR brings the behaviour in line with the documentedcontract.
The current code pre-zeroes the column and then runs
(x - mean) / std. Withstdforced to1.0andmeanleft untouchedat the original constant value, that yields
(0 - c) / 1 = -cratherthan zero. Removing the pre-zero step lets the existing subtraction
collapse the column to exactly
0.0((c - c) / 1.0 = 0), which is theintent.
The existing
test_zero_division_*tests only ever used a constantcolumn of value
0, so the bug never surfaced —(0 - 0) / 1is0either way. The new regression tests here cover a non-zero constant
column.
Related issues or pull requests
Fixes #1058
Pull Request Checklist
./docs/sources/CHANGELOG.mdfile./mlxtend/preprocessing/tests/directorymlxtend/docs/sources/(not applicable — behaviour now matches the existing docstring)PYTHONPATH='.' pytest ./mlxtend/preprocessing -sv— 41/41 passedflake8 ./mlxtend(clean) andblack --check(clean)Reproduce BEFORE/AFTER yourself (copy-paste)
A reviewer can verify this fix end-to-end by pasting the block below.
What I ran locally
PYTHONPATH=. pytest mlxtend/preprocessing/tests/test__scaling__standardizing.py -v→ 12/12 passed (9 existing + 3 new regression)PYTHONPATH=. pytest mlxtend/preprocessing -q→ 41/41 passedflake8 mlxtend/preprocessing/scaling.py mlxtend/preprocessing/tests/test__scaling__standardizing.py→ cleanblack --check mlxtend/preprocessing/scaling.py mlxtend/preprocessing/tests/test__scaling__standardizing.py→ cleanorigin/master'sscaling.py: 3/3 fail with the documented-mean(column)regression — confirming they pin the bug.Edge cases tested
[[0,1,2,5],[1,2,3,5],[3,1,2,5]][0,0,0]test_standardize_constant_column_numpy_issue_1058k=[5,5,5]k→[0,0,0]test_standardize_constant_column_pandas_issue_1058return_params=Trueround-trip[[5,1],[5,2],[5,3]]params['stds'][0] == 1.0test_standardize_constant_column_returns_unit_std_param_issue_1058[[0,...],[0,...],...]test_zero_division_pandas,test_zero_division_numpytest_pandas_standardize,test_numpy_standardizeRisk / blast radius
Minimal. The change only affects rows in the constant-column branch and only widens the scope of "constant value" from "constant value happens to be 0" to "any constant value". No public API change. No new dependency. The previous
parametersdict still hasstds[c] = 1.0for constant columns, so any downstream code reusing those params is unaffected.Release note
PR drafted with assistance from Claude Code. The change was reviewed manually against rasbt/mlxtend's source and the docstring's "Notes" section, which already specified the intended behaviour. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.