[SPARK-42839][SQL] Convert _LEGACY_ERROR_TEMP_2003 to an internal error#56727
[SPARK-42839][SQL] Convert _LEGACY_ERROR_TEMP_2003 to an internal error#56727AgenticSpark wants to merge 1 commit into
Conversation
The error is thrown by map_zip_with only when the number of unique keys exceeds MAX_ROUNDED_ARRAY_LENGTH, which cannot be reached from user code, so it is not user-facing. Per the ticket, replace the legacy error class with SparkException.internalError and drop the _LEGACY_ERROR_TEMP_2003 entry from error-conditions.json.
MaxGekk
left a comment
There was a problem hiding this comment.
0 blocking, 0 non-blocking, 0 nits. Mechanical SPARK-42839 conversion of _LEGACY_ERROR_TEMP_2003 to an internal error — message preserved verbatim, sole caller (MapZipWith.assertSizeOfArrayBuffer) throws the result so the SparkRuntimeException → SparkException return-type widening is safe, no _LEGACY_ERROR_TEMP_2003 references remain, and error-conditions.json stays sorted/valid.
Verification
Confirmed the guard is not reproducible from user space, which is what justifies making it internal. It fires only when accumulated unique keys exceed MAX_ROUNDED_ARRAY_LENGTH (~2^31), and three independent walls block that via the public map_zip_with: (1) no scalar amplifier — the count comes only from the materialized keys1/keys2, unlike array_repeat(x, count) which is why the reproducible COLLECTION_SIZE_LIMIT_EXCEEDED.* siblings keep user-facing names; (2) each map's key ArrayData is itself capped at MAX_ROUNDED_ARRAY_LENGTH and a ~2^31-entry map can't fit one UnsafeArrayData/row field; (3) the guard lives only on the O(n²) brute-force path (the fast hash paths don't call it), so reaching size 2^31 is ~2^61 comparisons. Consistent with the existing line between reproducible (user-facing) and non-reproducible (internal) array-size overflows.
### What changes were proposed in this pull request? Convert `_LEGACY_ERROR_TEMP_2003` into an internal error. `mapSizeExceedArraySizeWhenZipMapError` now returns `SparkException.internalError(...)` and the `_LEGACY_ERROR_TEMP_2003` entry is removed from `error-conditions.json`. ### Why are the changes needed? Part of the effort to assign proper names to legacy error classes (SPARK-42839). This error is thrown only by `map_zip_with` (`MapZipWith.assertSizeOfArrayBuffer`) when the number of unique keys exceeds `ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH` (~2^31). That condition is not reachable from user code, so the error is not user-facing. Per the ticket, errors that cannot be reproduced from user space should become internal errors rather than be given a user-facing name. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? No test triggers this error (the array-size limit cannot be reached from user code, so there is nothing to assert with `checkError()`). Verified that no references to `_LEGACY_ERROR_TEMP_2003` remain in the tree and that `error-conditions.json` is still valid JSON with sorted keys. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: GitHub Copilot CLI (Claude Opus 4.8) Closes #56727 from AgenticSpark/agenticspark/SPARK-42839-on-fork. Authored-by: AgenticSpark <AgenticSpark@users.noreply.github.com> Signed-off-by: Max Gekk <max.gekk@gmail.com> (cherry picked from commit be87c1e) Signed-off-by: Max Gekk <max.gekk@gmail.com>
|
+1, LGTM. Merged to master/4.x. |
What changes were proposed in this pull request?
Convert
_LEGACY_ERROR_TEMP_2003into an internal error.mapSizeExceedArraySizeWhenZipMapErrornow returnsSparkException.internalError(...)and the_LEGACY_ERROR_TEMP_2003entry is removed fromerror-conditions.json.Why are the changes needed?
Part of the effort to assign proper names to legacy error classes (SPARK-42839). This error is thrown only by
map_zip_with(MapZipWith.assertSizeOfArrayBuffer) when the number of unique keys exceedsByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH(~2^31). That condition is not reachable from user code, so the error is not user-facing. Per the ticket, errors that cannot be reproduced from user space should become internal errors rather than be given a user-facing name.Does this PR introduce any user-facing change?
No.
How was this patch tested?
No test triggers this error (the array-size limit cannot be reached from user code, so there is nothing to assert with
checkError()). Verified that no references to_LEGACY_ERROR_TEMP_2003remain in the tree and thaterror-conditions.jsonis still valid JSON with sorted keys.Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (Claude Opus 4.8)