Parquet dictionary encoding fallback is sub-optimal, may violate writer parameters #9739

@mzabaluev

Description

The dict_fallback method of GenericColumnWriter writes the dictionary page to the output even after the fallback conditions have been met, that is, after dictionary encoding has been found unsuitable for encoding the entire column chunk. This causes a few minor problems:

  1. In the current should_dict_fallback logic, fallback is triggered once the dictionary has met or exceeded the dictionary page size limit configured in the column properties. The resulting oversized dictionary page does not violate any format constraint, but it may surprise the user.
  2. The data pages of the column chunk are then encoded piecemeal, first with the dictionary and then with a fallback encoding, which is again legal but odd. More importantly, a larger-than-expected dictionary may arise from high cardinality of the values, so encoding all data pages with the fallback encoding may produce a more compact result.
  3. More fallback strategies may be added in the future, as proposed in Parquet dictionary fallback heuristics #9699 and implemented in feat(parquet): dictionary fallback heuristic based on compression efficiency #9700. In such cases, dictionary encoding is judged inefficient based on the size of a partial encoding, so it does not make sense to write out the first, inefficiently encoded pages and then continue with the better encoding.
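To make the size argument in point 2 concrete, here is a rough back-of-the-envelope sketch. It is not code from the parquet crate: the function names (`index_bit_width`, `mixed_size`, `plain_size`) are hypothetical, values are assumed to be fixed-width, and compression, RLE runs, and page headers are ignored.

```rust
/// Bits needed for an index into a dictionary with `distinct` entries.
/// Simplification: pure bit-packing, no RLE runs.
fn index_bit_width(distinct: usize) -> u32 {
    if distinct <= 1 {
        0
    } else {
        usize::BITS - (distinct - 1).leading_zeros()
    }
}

/// Estimated bytes when the first `dict_encoded` values are written as
/// dictionary indices (plus the dictionary page itself) and the remaining
/// values are written in plain encoding after the fallback.
fn mixed_size(total: usize, dict_encoded: usize, distinct: usize, width: usize) -> usize {
    let dict_page = distinct * width;
    let index_bytes = (dict_encoded * index_bit_width(distinct) as usize + 7) / 8;
    let plain_tail = (total - dict_encoded) * width;
    dict_page + index_bytes + plain_tail
}

/// Estimated bytes when every value is written in plain encoding.
fn plain_size(total: usize, width: usize) -> usize {
    total * width
}
```

Under these assumptions, a column of 1,000,000 eight-byte values whose first 131,072 values are all distinct costs more in the mixed layout than in plain encoding alone, because the dictionary page repeats every value that the index pages then reference once.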

For comparison, the FallbackValuesWriter implementation in parquet-java extracts all values from the dictionary encoder to be re-encoded by the fallback encoder.
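A minimal sketch of that re-encoding approach, for illustration only. It uses a toy in-memory encoder for `i64` values; `DictEncoder`, `take_values`, `plain_encode`, `encode_chunk`, and `Encoded` are hypothetical names and do not correspond to the arrow-rs or parquet-java APIs. The assumption is that the dictionary-encoded values are still buffered when the fallback condition is detected.

```rust
use std::collections::HashMap;

/// Toy dictionary encoder that buffers the dictionary and the indices in memory.
struct DictEncoder {
    dict: Vec<i64>,
    lookup: HashMap<i64, u32>,
    indices: Vec<u32>,
}

impl DictEncoder {
    fn new() -> Self {
        Self { dict: Vec::new(), lookup: HashMap::new(), indices: Vec::new() }
    }

    /// Record one value, adding it to the dictionary if it is new.
    fn put(&mut self, v: i64) {
        let next = self.dict.len() as u32;
        let idx = *self.lookup.entry(v).or_insert(next);
        if idx == next {
            self.dict.push(v);
        }
        self.indices.push(idx);
    }

    /// Size of the dictionary page in bytes (plain-encoded i64 entries).
    fn dict_bytes(&self) -> usize {
        self.dict.len() * std::mem::size_of::<i64>()
    }

    /// Materialize the buffered values in their original order and reset,
    /// so they can be handed to the fallback encoder.
    fn take_values(&mut self) -> Vec<i64> {
        let values: Vec<i64> = self.indices.iter().map(|&i| self.dict[i as usize]).collect();
        self.dict.clear();
        self.lookup.clear();
        self.indices.clear();
        values
    }
}

/// Stand-in for the fallback encoder: little-endian plain encoding.
fn plain_encode(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

enum Encoded {
    Dictionary { dict: Vec<i64>, indices: Vec<u32> },
    Plain(Vec<u8>),
}

/// Encode a whole chunk. If the dictionary reaches `dict_limit` bytes,
/// discard it and re-encode everything, including values already seen,
/// with the fallback encoding.
fn encode_chunk(values: &[i64], dict_limit: usize) -> Encoded {
    let mut enc = DictEncoder::new();
    let mut iter = values.iter();
    while let Some(&v) = iter.next() {
        enc.put(v);
        if enc.dict_bytes() >= dict_limit {
            let mut all = enc.take_values();
            all.extend(iter.by_ref().copied());
            return Encoded::Plain(plain_encode(&all));
        }
    }
    Encoded::Dictionary { dict: enc.dict, indices: enc.indices }
}
```

With this shape, no dictionary page is emitted once fallback is decided, and the chunk ends up in a single encoding. The trade-off is that the encoder must keep the dictionary-encoded values in memory until the decision is made, which a real writer that has already flushed data pages could not do without further changes.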
