The dict_fallback method of GenericColumnWriter writes the dictionary page to the output even once the fallback conditions have been reached, that is, when dictionary encoding has proven unsuitable for encoding the entire column chunk. This presents a few minor problems:
- Under the current should_dict_fallback logic, fallback is triggered only once the dictionary has met or exceeded its page size limit as configured in the column properties, so the dictionary page that gets written is oversized. Oversized dictionary pages do not violate any format constraints, but they may surprise the user.
- The data pages of the column chunk are then encoded piecemeal, first with the dictionary and then with a fallback encoding, which is again legal but odd. More importantly, a larger-than-expected dictionary may stem from high cardinality of the values, in which case encoding all data pages with the fallback encoding may be more compact.
For comparison, the FallbackValuesWriter implementation in parquet-java extracts all values from the dictionary encoder to be re-encoded by the fallback encoder.
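To make the proposed behavior concrete, here is a minimal sketch of re-encoding on fallback. It is only an illustration: ToyDictEncoder, ToyColumnWriter, drain_values and fall_back are hypothetical names invented for this example, not the arrow-rs or parquet-java API, and the "fallback encoding" is simply little-endian PLAIN bytes for i64 values.

```rust
use std::collections::HashMap;

/// Hypothetical toy dictionary encoder for i64 values (not the arrow-rs type).
struct ToyDictEncoder {
    dict: Vec<i64>,           // unique values in insertion order
    index: HashMap<i64, u32>, // value -> dictionary index
    indices: Vec<u32>,        // indices of all values buffered so far
}

impl ToyDictEncoder {
    fn new() -> Self {
        ToyDictEncoder { dict: Vec::new(), index: HashMap::new(), indices: Vec::new() }
    }

    fn put(&mut self, v: i64) {
        let next = self.dict.len() as u32;
        let idx = *self.index.entry(v).or_insert(next);
        if idx == next {
            // The value was not in the dictionary yet.
            self.dict.push(v);
        }
        self.indices.push(idx);
    }

    fn dict_size_bytes(&self) -> usize {
        self.dict.len() * std::mem::size_of::<i64>()
    }

    /// Materialize every buffered value and reset the encoder,
    /// so that no dictionary page needs to be written.
    fn drain_values(&mut self) -> Vec<i64> {
        let values: Vec<i64> = self.indices.iter().map(|&i| self.dict[i as usize]).collect();
        self.indices.clear();
        self.dict.clear();
        self.index.clear();
        values
    }
}

/// Hypothetical column writer: dictionary-encodes until the size limit is
/// met or exceeded, then re-encodes everything with the fallback encoding.
struct ToyColumnWriter {
    dict_limit_bytes: usize,
    dict: Option<ToyDictEncoder>, // None once fallback has happened
    plain: Vec<u8>,               // fallback output (little-endian i64 bytes)
}

impl ToyColumnWriter {
    fn new(dict_limit_bytes: usize) -> Self {
        ToyColumnWriter { dict_limit_bytes, dict: Some(ToyDictEncoder::new()), plain: Vec::new() }
    }

    fn write(&mut self, v: i64) {
        match self.dict.as_mut() {
            Some(enc) => enc.put(v),
            None => {
                self.plain.extend_from_slice(&v.to_le_bytes());
                return;
            }
        }
        let limit = self.dict_limit_bytes;
        let over_limit = self.dict.as_ref().map_or(false, |e| e.dict_size_bytes() >= limit);
        if over_limit {
            self.fall_back();
        }
    }

    /// Discard the dictionary and re-encode the buffered values with the
    /// fallback encoding, instead of flushing an oversized dictionary page.
    fn fall_back(&mut self) {
        if let Some(mut enc) = self.dict.take() {
            for v in enc.drain_values() {
                self.plain.extend_from_slice(&v.to_le_bytes());
            }
        }
    }
}
```

With a 16-byte limit, writing 1, 1, 2 reaches the limit on the third value. The sketch then drops the dictionary and holds all three values in the fallback buffer. A real writer would also have to deal with data pages already flushed before the limit was hit, which this toy model ignores by keeping everything buffered.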