Description
The toParquetMetadata method converts org.apache.parquet.hadoop.metadata.ParquetMetadata to org.apache.parquet.format.FileMetaData, but it does not set the dictionary page offset bit in the FileMetaData.
When the FileMetaData object is serialized into the footer and later deserialized, the dictionary page offset is lost because its bit was never set.
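To illustrate why an unset bit loses the value on a round trip, here is a minimal sketch. FakeColumnMetaData, serialize, and deserialize are hypothetical stand-ins written for this example, not Parquet or Thrift classes. The sketch assumes only that an optional field is omitted from the serialized form unless its "is set" flag is on, which is how Thrift-generated Java code treats optional fields.

```java
import java.util.OptionalLong;

public class IssetSketch {
    /** Hypothetical stand-in for a generated struct with one optional long field. */
    static final class FakeColumnMetaData {
        private long dictionaryPageOffset;        // raw value, only meaningful if the flag is on
        private boolean dictionaryPageOffsetSet;  // plays the role of the "is set" bit

        void setDictionaryPageOffset(long offset) {
            this.dictionaryPageOffset = offset;
            this.dictionaryPageOffsetSet = true;  // the setter is the only place the bit is turned on
        }

        boolean isSetDictionaryPageOffset() {
            return dictionaryPageOffsetSet;
        }

        long getDictionaryPageOffset() {
            return dictionaryPageOffset;
        }
    }

    /** Toy serialization: an unset optional field is simply left out. */
    static OptionalLong serialize(FakeColumnMetaData md) {
        return md.isSetDictionaryPageOffset()
                ? OptionalLong.of(md.getDictionaryPageOffset())
                : OptionalLong.empty();
    }

    /** Toy deserialization: the field is set only if it was present in the input. */
    static FakeColumnMetaData deserialize(OptionalLong wire) {
        FakeColumnMetaData md = new FakeColumnMetaData();
        wire.ifPresent(md::setDictionaryPageOffset);
        return md;
    }

    public static void main(String[] args) {
        FakeColumnMetaData neverSet = new FakeColumnMetaData();
        FakeColumnMetaData set = new FakeColumnMetaData();
        set.setDictionaryPageOffset(4L);

        System.out.println(deserialize(serialize(neverSet)).isSetDictionaryPageOffset()); // false
        System.out.println(deserialize(serialize(set)).getDictionaryPageOffset());        // 4
    }
}
```

In this model the reader has no way to recover an offset that the writer never marked as set, which is the situation described above.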
PARQUET-1850 attempted to fix this, but the fix is only partial.
It calls setDictionary_page_offset only if getEncodingStats() returns a non-null value:
if (columnMetaData.getEncodingStats() != null
    && columnMetaData.getEncodingStats().hasDictionaryPages()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
However, it should also call setDictionary_page_offset when the encoding stats are absent but the encodings are present.
It should use the existing implementation in ColumnChunkMetaData, shown below:
public boolean hasDictionaryPage() {
  EncodingStats stats = getEncodingStats();
  if (stats != null) {
    return stats.hasDictionaryPages() && stats.hasDictionaryEncodedPages();
  }
  Set<Encoding> encodings = getEncodings();
  return (encodings.contains(PLAIN_DICTIONARY) || encodings.contains(RLE_DICTIONARY));
}
So the new change in ParquetMetadataConverter should look like:
if (columnMetaData.hasDictionaryPage()) {
  metaData.setDictionary_page_offset(columnMetaData.getDictionaryPageOffset());
}
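The difference between the current condition and the proposed one can be sketched without a Parquet dependency. Everything in the block below (the Encoding enum, Stats, Column, currentCheck, proposedCheck) is a hypothetical stand-in written for illustration. Only the logic of hasDictionaryPage() mirrors the snippet quoted above.

```java
import java.util.EnumSet;
import java.util.Set;

public class DictionaryOffsetCheck {
    // Hypothetical subset of encodings, enough for this example.
    enum Encoding { PLAIN, PLAIN_DICTIONARY, RLE, RLE_DICTIONARY }

    /** Hypothetical stand-in for EncodingStats, exposing only the two queries used here. */
    static final class Stats {
        final boolean hasDictionaryPages;
        final boolean hasDictionaryEncodedPages;

        Stats(boolean hasDictionaryPages, boolean hasDictionaryEncodedPages) {
            this.hasDictionaryPages = hasDictionaryPages;
            this.hasDictionaryEncodedPages = hasDictionaryEncodedPages;
        }
    }

    /** Hypothetical stand-in for a column chunk's metadata. */
    static final class Column {
        final Stats stats;              // may be null, e.g. for files written without encoding stats
        final Set<Encoding> encodings;

        Column(Stats stats, Set<Encoding> encodings) {
            this.stats = stats;
            this.encodings = encodings;
        }

        // Same logic as the hasDictionaryPage() snippet quoted in this issue.
        boolean hasDictionaryPage() {
            if (stats != null) {
                return stats.hasDictionaryPages && stats.hasDictionaryEncodedPages;
            }
            return encodings.contains(Encoding.PLAIN_DICTIONARY)
                    || encodings.contains(Encoding.RLE_DICTIONARY);
        }
    }

    /** Condition used by the partial fix: requires encoding stats to be present. */
    static boolean currentCheck(Column c) {
        return c.stats != null && c.stats.hasDictionaryPages;
    }

    /** Condition proposed in this issue: falls back to the encodings set. */
    static boolean proposedCheck(Column c) {
        return c.hasDictionaryPage();
    }

    public static void main(String[] args) {
        // No encoding stats, but a dictionary encoding is listed.
        Column noStats = new Column(null, EnumSet.of(Encoding.PLAIN_DICTIONARY, Encoding.RLE));
        System.out.println(currentCheck(noStats));   // false: the offset would not be written
        System.out.println(proposedCheck(noStats));  // true: the offset would be written
    }
}
```

Note that, as quoted, hasDictionaryPage() also requires hasDictionaryEncodedPages() when stats are present, so the proposed condition is slightly stricter than the current one in that branch.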
Reporter: Abhishek Dixit
PRs and other links:
Note: This issue was originally created as PARQUET-2464. Please see the migration documentation for further details.