Parquet dictionary encoding fallback is sub-optimal, may violate writer parameters #9739

The `dict_fallback` method of `GenericColumnWriter` writes the dictionary page to the output even when the fallback conditions have been reached, that is, when dictionary encoding has already been judged unsuitable for encoding the entire column chunk. This presents a few minor problems:

1. Under the current `should_dict_fallback` logic, the fallback triggers once the dictionary has met or exceeded the page size limit configured in the column properties. Oversized dictionary pages do not violate any format constraints, but they may surprise the user.
2. The data pages of the column chunk are then encoded piecemeal, first with the dictionary and then with a fallback encoding, which is again legal but weird. More importantly, a larger-than-expected dictionary often results from high cardinality of the values, so encoding all data pages with the fallback encoding may produce a more compact result.
3. More fallback strategies may be added in the future, as proposed in #9699 and implemented in #9700. In such cases, dictionary encoding is judged inefficient based on the size of a partial encoding, so it does not make sense to write out the first, inefficiently encoded pages and then continue with the better encoding.
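To make the cardinality argument in point 2 concrete, here is a rough back-of-the-envelope estimate. It is only a sketch: the function names are hypothetical and not part of the parquet crate, and the estimate assumes fixed-width values and bit-packed indices while ignoring page headers, RLE runs, and compression.

```rust
/// Rough size, in bytes, of dictionary-encoding `n` fixed-width values that
/// contain `distinct` unique entries: one dictionary page plus bit-packed
/// indices. Hypothetical helper; ignores page headers, RLE runs, compression.
fn dict_size_estimate(n: u64, distinct: u64, value_width: u64) -> u64 {
    // Bits needed to address every dictionary entry (at least 1).
    let bit_width = if distinct <= 1 {
        1
    } else {
        64 - (distinct - 1).leading_zeros() as u64
    };
    let dictionary_page = distinct * value_width;
    let index_bytes = (n * bit_width + 7) / 8; // round up to whole bytes
    dictionary_page + index_bytes
}

/// Size, in bytes, of writing the same `n` values with plain encoding.
fn plain_size_estimate(n: u64, value_width: u64) -> u64 {
    n * value_width
}
```

Under these assumptions, one million unique 8-byte values cost about 10.5 MB dictionary-encoded (8 MB of dictionary plus 2.5 MB of 20-bit indices) versus 8 MB plain, whereas the same number of values with only 16 distinct entries cost about 0.5 MB dictionary-encoded.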

For comparison, the `FallbackValuesWriter` implementation in parquet-java extracts all values from the dictionary encoder to be re-encoded by the fallback encoder.
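The re-encoding approach can be sketched with a toy model. This is not the arrow-rs or parquet-java code: `ToyDictEncoder`, `ToyColumnWriter`, and all of their methods are hypothetical names, the values are limited to `i64`, and the sketch keeps every index in memory and ignores page boundaries, which a real writer would have to deal with.

```rust
use std::collections::HashMap;

/// Toy dictionary encoder for i64 values (hypothetical, not the arrow-rs type).
struct ToyDictEncoder {
    dictionary: Vec<i64>,      // unique values in first-seen order
    lookup: HashMap<i64, u32>, // value -> index into `dictionary`
    indices: Vec<u32>,         // indices buffered so far
}

impl ToyDictEncoder {
    fn new() -> Self {
        Self { dictionary: Vec::new(), lookup: HashMap::new(), indices: Vec::new() }
    }

    fn put(&mut self, v: i64) {
        let next = self.dictionary.len() as u32;
        let idx = *self.lookup.entry(v).or_insert(next);
        if idx == next {
            self.dictionary.push(v); // first time this value is seen
        }
        self.indices.push(idx);
    }

    fn dict_bytes(&self) -> usize {
        self.dictionary.len() * 8
    }

    /// Reconstruct the original values from the buffered indices and reset.
    fn drain_values(&mut self) -> Vec<i64> {
        let values: Vec<i64> =
            self.indices.iter().map(|&i| self.dictionary[i as usize]).collect();
        self.indices.clear();
        self.dictionary.clear();
        self.lookup.clear();
        values
    }
}

/// Toy column writer: dictionary-encodes until the dictionary meets or exceeds
/// `dict_page_limit`, then re-encodes everything as plain little-endian bytes.
struct ToyColumnWriter {
    dict: Option<ToyDictEncoder>,
    plain: Vec<u8>,
    dict_page_limit: usize,
}

impl ToyColumnWriter {
    fn new(dict_page_limit: usize) -> Self {
        Self { dict: Some(ToyDictEncoder::new()), plain: Vec::new(), dict_page_limit }
    }

    fn write(&mut self, v: i64) {
        match self.dict.as_mut() {
            Some(dict) => dict.put(v),
            None => {
                // Already fell back: append directly with the plain encoding.
                self.plain.extend_from_slice(&v.to_le_bytes());
                return;
            }
        }
        let over_limit = self
            .dict
            .as_ref()
            .map_or(false, |d| d.dict_bytes() >= self.dict_page_limit);
        if over_limit {
            // Fallback: no dictionary page is emitted; the values seen so far
            // are re-encoded with the fallback (plain) encoding instead.
            if let Some(mut dict) = self.dict.take() {
                for value in dict.drain_values() {
                    self.plain.extend_from_slice(&value.to_le_bytes());
                }
            }
        }
    }

    fn is_dictionary_encoded(&self) -> bool {
        self.dict.is_some()
    }

    fn plain_bytes(&self) -> &[u8] {
        &self.plain
    }
}
```

With a 32-byte limit, writing `1, 1, 2, 3` keeps the toy writer in dictionary mode; writing `4` brings the dictionary to four entries, triggers the fallback, and leaves all five values plain-encoded with no dictionary retained.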
