Skip to content

[SPARK-42839][SQL] Convert _LEGACY_ERROR_TEMP_2003 to an internal error#56727

Closed
AgenticSpark wants to merge 1 commit into
apache:masterfrom
AgenticSpark:agenticspark/SPARK-42839-on-fork
Closed

[SPARK-42839][SQL] Convert _LEGACY_ERROR_TEMP_2003 to an internal error#56727
AgenticSpark wants to merge 1 commit into
apache:masterfrom
AgenticSpark:agenticspark/SPARK-42839-on-fork

Conversation

@AgenticSpark

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Convert _LEGACY_ERROR_TEMP_2003 into an internal error. mapSizeExceedArraySizeWhenZipMapError now returns SparkException.internalError(...) and the _LEGACY_ERROR_TEMP_2003 entry is removed from error-conditions.json.

Why are the changes needed?

Part of the effort to assign proper names to legacy error classes (SPARK-42839). This error is thrown only by map_zip_with (MapZipWith.assertSizeOfArrayBuffer) when the number of unique keys exceeds ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH (~2^31). That condition is not reachable from user code, so the error is not user-facing. Per the ticket, errors that cannot be reproduced from user space should become internal errors rather than be given a user-facing name.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

No test triggers this error (the array-size limit cannot be reached from user code, so there is nothing to assert with checkError()). Verified that no references to _LEGACY_ERROR_TEMP_2003 remain in the tree and that error-conditions.json is still valid JSON with sorted keys.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.8)

The error is thrown by map_zip_with only when the number of unique keys
exceeds MAX_ROUNDED_ARRAY_LENGTH, which cannot be reached from user code, so
it is not user-facing. Per the ticket, replace the legacy error class with
SparkException.internalError and drop the _LEGACY_ERROR_TEMP_2003 entry from
error-conditions.json.

@MaxGekk MaxGekk left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 blocking, 0 non-blocking, 0 nits. Mechanical SPARK-42839 conversion of _LEGACY_ERROR_TEMP_2003 to an internal error — message preserved verbatim, sole caller (MapZipWith.assertSizeOfArrayBuffer) throws the result so the SparkRuntimeExceptionSparkException return-type widening is safe, no _LEGACY_ERROR_TEMP_2003 references remain, and error-conditions.json stays sorted/valid.

Verification

Confirmed the guard is not reproducible from user space, which is what justifies making it internal. It fires only when accumulated unique keys exceed MAX_ROUNDED_ARRAY_LENGTH (~2^31), and three independent walls block that via the public map_zip_with: (1) no scalar amplifier — the count comes only from the materialized keys1/keys2, unlike array_repeat(x, count) which is why the reproducible COLLECTION_SIZE_LIMIT_EXCEEDED.* siblings keep user-facing names; (2) each map's key ArrayData is itself capped at MAX_ROUNDED_ARRAY_LENGTH and a ~2^31-entry map can't fit one UnsafeArrayData/row field; (3) the guard lives only on the O(n²) brute-force path (the fast hash paths don't call it), so reaching size 2^31 is ~2^61 comparisons. Consistent with the existing line between reproducible (user-facing) and non-reproducible (internal) array-size overflows.

@MaxGekk MaxGekk closed this in be87c1e Jun 24, 2026
MaxGekk pushed a commit that referenced this pull request Jun 24, 2026
### What changes were proposed in this pull request?

Convert `_LEGACY_ERROR_TEMP_2003` into an internal error. `mapSizeExceedArraySizeWhenZipMapError` now returns `SparkException.internalError(...)` and the `_LEGACY_ERROR_TEMP_2003` entry is removed from `error-conditions.json`.

### Why are the changes needed?

Part of the effort to assign proper names to legacy error classes (SPARK-42839). This error is thrown only by `map_zip_with` (`MapZipWith.assertSizeOfArrayBuffer`) when the number of unique keys exceeds `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` (~2^31). That condition is not reachable from user code, so the error is not user-facing. Per the ticket, errors that cannot be reproduced from user space should become internal errors rather than be given a user-facing name.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No test triggers this error (the array-size limit cannot be reached from user code, so there is nothing to assert with `checkError()`). Verified that no references to `_LEGACY_ERROR_TEMP_2003` remain in the tree and that `error-conditions.json` is still valid JSON with sorted keys.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.8)

Closes #56727 from AgenticSpark/agenticspark/SPARK-42839-on-fork.

Authored-by: AgenticSpark <AgenticSpark@users.noreply.github.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit be87c1e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk

MaxGekk commented Jun 24, 2026

Copy link
Copy Markdown
Member

+1, LGTM. Merged to master/4.x.
Thank you, @AgenticSpark. Could you give me your JIRA id to assign the JIRA ticket to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants