[SPARK-34591][ML] Add decision tree pruning as a parameter by WeichenXu123 · Pull Request #55728 · apache/spark

WeichenXu123 · 2026-05-07T08:34:01Z

What changes were proposed in this pull request?

This PR adds a parameter to enable or disable a feature in which LearningNodes are merged after an RF model is trained.

This PR takes over #32813.

Why are the changes needed?

2 Reasons:

1. In addition to basic classification, another use case for decision trees is the probabilities associated with predictions. Once the tree is pruned, these probabilities are lost, which makes the trees and their predictions challenging to work with, if not unusable.
2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default.
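
To make the first reason concrete, here is a minimal, self-contained Python sketch of what merging does to leaf probabilities. It is not Spark's code: the `Node` class, the `finalize` helper, and the class counts are all hypothetical, and the merge rule (collapse two sibling leaves when they predict the same class) is an assumption about how the pruning step behaves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node holding per-class training counts."""
    counts: List[float]
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    @property
    def prediction(self) -> int:
        # Predicted class is the index of the largest count.
        return max(range(len(self.counts)), key=lambda i: self.counts[i])

    @property
    def probabilities(self) -> List[float]:
        total = sum(self.counts)
        return [c / total for c in self.counts]


def finalize(node: Node, prune: bool) -> Node:
    """Copy the tree; if prune is set, merge sibling leaves that agree (assumed rule)."""
    if node.is_leaf:
        return Node(list(node.counts))
    left = finalize(node.left, prune)
    right = finalize(node.right, prune)
    if prune and left.is_leaf and right.is_leaf and left.prediction == right.prediction:
        # Both children predict the same class, so the parent becomes a single leaf.
        return Node(list(node.counts))
    return Node(list(node.counts), left, right)


# Both children predict class 0, but with different confidence.
root = Node([12, 4], Node([7, 1]), Node([5, 3]))

print(finalize(root, prune=True).probabilities)        # [0.75, 0.25]
print(finalize(root, prune=False).left.probabilities)  # [0.875, 0.125]
print(finalize(root, prune=False).right.probabilities) # [0.625, 0.375]
```

With pruning on, the two leaves collapse into one that reports 0.75 for class 0. With pruning off, each leaf keeps its own estimate (0.875 and 0.625), which is the information the description says is lost.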

Please see Jira ticket for more explanation.

Does this PR introduce any user-facing change?

Behavior change:
Default pruning behavior flips from always on to off by default, so all existing decision tree, random forest, and GBT callers produce larger, unpruned trees unless they opt back in.

New params:
Adds a parameter that is exposed on the tree-based classifiers. Tests will be added here to ensure the parameter is exposed correctly.

How was this patch tested?

I modified the two tests introduced with this change to verify positive/negative use of the feature. I also added assertions for the default behavior.

Will add tests that ensure user exposed API is validated.

Locally ran ./build/mvn -pl mllib package and verified tests passed
Additionally, running through git workflow as described here:
https://spark.apache.org/developer-tools.html#github-workflow-tests

### What changes were proposed in this pull request?

This PR disables a feature created in SPARK-3159 where LearningNodes are merged after an RF model is trained.

### Why are the changes needed?

2 Reasons:

1. In addition to basic classification, another use case for decision trees is the probabilities associated with predictions. Once the tree is pruned, these probabilities are lost, which makes the trees and their predictions challenging to work with, if not unusable.
2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Locally ran `./build/mvn -pl mllib package` and verified tests passed.

Additionally, running through git workflow as described here: https://spark.apache.org/developer-tools.html#github-workflow-tests

…are merged after an RF model is trained. 2 Reasons: 1. In addition to basic classification, another use case for decision trees is the probabilities associated with predictions. Once the tree is pruned, these probabilities are lost, which makes the trees and their predictions challenging to work with, if not unusable. 2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default. Please see Jira ticket for more explanation. No, it's dev-only. I modified the two tests introduced with this change to verify positive/negative use of the feature. I also added assertions for the default behavior. Locally ran `./build/mvn -pl mllib package` and verified tests passed. Locally ran `./dev/scalafmt`, which resulted in some minor cosmetic changes. Additionally, running through git workflow as described here: https://spark.apache.org/developer-tools.html#github-workflow-tests

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot

Pull request overview

Adds a configurable switch to control post-training decision tree “pruning” (merging redundant leaf nodes) and wires it through Spark ML (Scala + PySpark) APIs down to the underlying training implementation.

Changes:

Introduce a new pruneTree parameter on Spark ML tree-based classifiers (Scala + PySpark) and propagate it into the old Strategy used by the training code.
Modify ml/tree/impl/RandomForest to use strategy.pruneTree when converting LearningNode to final Node trees (affecting pruning behavior).
Update/extend RandomForest implementation tests and reformat a large portion of the suite.
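
The plumbing described in the list above can be sketched in a few lines of self-contained Python. This is only an illustration of the pattern (a user-facing setter, a configuration object, and a finalization step that reads the flag), not the Spark API: `Strategy` here is a toy stand-in, and `ToyTreeClassifier`, `set_prune_tree`, and `count_final_leaves` are hypothetical names. The merge rule is again an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Strategy:
    """Toy stand-in for the training configuration object."""
    prune_tree: bool = False  # off by default, matching the default this PR proposes


def count_final_leaves(sibling_leaf_pairs: List[Tuple[int, int]], strategy: Strategy) -> int:
    """Toy finalization step: each pair holds the predictions of two sibling leaves."""
    total = 0
    for left_pred, right_pred in sibling_leaf_pairs:
        if strategy.prune_tree and left_pred == right_pred:
            total += 1  # siblings agree, so they merge into a single leaf
        else:
            total += 2  # siblings are kept as separate leaves
    return total


class ToyTreeClassifier:
    """Hypothetical estimator that forwards its flag into the strategy."""

    def __init__(self, prune_tree: bool = False):
        self._prune_tree = prune_tree

    def set_prune_tree(self, value: bool) -> "ToyTreeClassifier":
        self._prune_tree = value
        return self

    def get_prune_tree(self) -> bool:
        return self._prune_tree

    def fit(self, sibling_leaf_pairs: List[Tuple[int, int]]) -> int:
        # The estimator copies its parameter into the strategy used by training.
        strategy = Strategy(prune_tree=self._prune_tree)
        return count_final_leaves(sibling_leaf_pairs, strategy)


pairs = [(0, 0), (0, 1), (1, 1)]
print(ToyTreeClassifier().fit(pairs))                       # 6
print(ToyTreeClassifier().set_prune_tree(True).fit(pairs))  # 4
```

Keeping the flag on the configuration object means the training code has one place to read it from, regardless of which estimator set it.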

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| python/pyspark/ml/tree.py | Adds `pruneTree` param + getter to the shared Python tree classifier params. |
| python/pyspark/ml/classification.py | Exposes `pruneTree` in Python `DecisionTreeClassifier` / `RandomForestClassifier` constructors, setters, and docstrings. |
| mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala | Adds `pruneTree` to Scala ML `TreeClassifierParams` with defaults and docs. |
| mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala | Adds `pruneTree` to the old mllib `Strategy` so training code can read it. |
| mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala | Uses `strategy.pruneTree` to decide whether to prune when finalizing trees (and for early-stop size estimation). |
| mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala | Adds `setPruneTree` and sets `strategy.pruneTree` during training; logs the param. |
| mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala | Same as above for RF classifier. |
| mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala | Reformats tests and adds/updates pruning-related expectations (but currently contains compilation-breaking calls). |

zhengruifeng · 2026-05-07T12:05:41Z

  def setMinInfoGain(value: Double): this.type = set(minInfoGain, value)

+  /** @group setParam */
+  @Since("5.0.0")


@HyukjinKwon do we have 4.3?

zhengruifeng · 2026-05-07T12:09:50Z

      featureSubsetStrategy: String,
      seed: Long,
      instr: Option[Instrumentation],
-      prune: Boolean = true, // exposed for testing only, real trees are always pruned


Should the default value be true to align with the previous implementation?

"default prune = false" is proposed in the Jira: https://issues.apache.org/jira/browse/SPARK-34591

But to keep API compatibility, keeping it as true might be safer.

bribiescas-carlos and others added 15 commits June 8, 2021 12:10

Merge branch 'master' into SPARK-34591

36a3527

Merge branch 'master' into SPARK-34591

b53c0e4

Merge branch 'master' into SPARK-34591

2fc33ec

Exposed pruning parameter accessible in Scala WIP

fb835db

Merge branch 'master' into SPARK-34591

a471d5e

Added to decision tree classifier and to python

43ee852

Merge branch 'master' into SPARK-34591

dcec830

Finished a TODO for comments in Strategy.scala

4bb58f6

Merge branch 'master' into SPARK-34591

ea028d4

Merge branch 'master' into SPARK-34591

f51bbae

merge master

86aa0c2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address conflicts

3ab5b2e

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot AI review requested due to automatic review settings May 7, 2026 08:34

Copilot started reviewing on behalf of WeichenXu123 May 7, 2026 08:41

WeichenXu123 added 2 commits May 7, 2026 16:42

default pruneTree false

85f3da4

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update test

e5a7896

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot AI reviewed May 7, 2026

address comments

84498d2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

zhengruifeng reviewed May 7, 2026

zhengruifeng changed the title ~~[SPARK-34591] Add decision tree pruning as a parameter~~ [SPARK-34591][ML] Add decision tree pruning as a parameter May 7, 2026

zhengruifeng reviewed May 7, 2026

WeichenXu123 wants to merge 18 commits into master from SPARK-34591

4 participants
