arrow-ipc: Write 0 offset buffer for length-0 variable-size arrays#9717
arrow-ipc: Write 0 offset buffer for length-0 variable-size arrays#9717atwam wants to merge 2 commits intoapache:mainfrom
Conversation
I don't understand this statement. The spec you reference in #9716 is for the in memory structure (not the ipc format) It almost sounds like a bug in polars -- shouldn't it accept zero length IPC buffers ? |
|
For anyone else following this PR, please see discussion on the issue. |
alamb
left a comment
There was a problem hiding this comment.
Thank you @atwam -- after the discussion on this change makes sense to me
I think the only thing this PR should have is a test showing that the old style (zero length offset buffers) are still accepted.
Maybe we can make a small file with such a buffer and have a test that verifies it can be read successfully
| assert_eq!(child_data.len(), 0, "child data should remain empty"); | ||
| } | ||
|
|
||
| #[test] |
There was a problem hiding this comment.
Can we please also add a test that shows IPC files without any offets are still accepted?
There was a problem hiding this comment.
Added two tests for legacy files.
Creating files with the legacy format means I could not rely on the writer (unless I added options to enable some legacy format, which was ugly), so I just went back to raw flatbuffers for the purpose of these tests.
Another alternative would be to generate such files once with the current version and embed some raw files, but that amounts to embedding small binary files, which I found equally repulsive.
I hope my approach will do.
Which issue does this PR close?
Rationale for this change
Current version serializes a length-0 offsets buffer, and relies on the array constructor to set offsets to
[0]for empty arrays. This does not conform with spec, which specifies that the offsets buffer should havelength + 1elements.This correctly serializes a length-1 offsets buffer containing
[0]. I have left the current behavior of filling-in offsets with[0]for empty arrays, so that future versions can still read IPC files serialized by previous versions.What changes are included in this PR?
Added tests for serialization of empty arrays, ensuring length-1 offsets buffer. Fixed serialization of empty arrays.
This fixes serialization of empty Binary/Utf8/List/LargeList (and Map) arrays.
Are these changes tested?
Yes
Are there any user-facing changes?
There should not be any breaking change for users. Serialized files for empty record batches will be slightly different. Older serialized files should still be read fine.