Parquet allows the data block inside dictionary pages and data pages to be compressed for better space efficiency. The Parquet format supports several compression codecs covering different points in the compression-ratio / processing-cost spectrum.
The detailed specifications of compression codecs are maintained externally by their respective authors or maintainers, which we reference hereafter.
For all compression codecs except the deprecated LZ4 codec, the raw data of a (data or dictionary) page is fed as-is to the underlying compression library, without any additional framing or padding. The information required for precise allocation of compressed and decompressed buffers is written in the page header (the uncompressed_page_size and compressed_page_size fields).
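As a sketch of this contract, with Python's gzip module standing in for an arbitrary codec library: the writer passes the raw page bytes directly to the library and records both sizes so that a reader can allocate its buffers exactly before decompressing.

```python
import gzip

# Raw (already encoded) bytes of a data or dictionary page.
page_data = b"\x00\x01" * 1000

# The page data is fed as-is to the compression library;
# Parquet adds no framing or padding of its own.
compressed = gzip.compress(page_data)

# These two values correspond to the sizes recorded in the page header.
uncompressed_page_size = len(page_data)
compressed_page_size = len(compressed)

# A reader allocates buffers from those sizes, then decompresses.
assert gzip.decompress(compressed) == page_data
```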
UNCOMPRESSED: a no-op codec. Data is left uncompressed.
A codec based on the GZIP format (not the closely-related “zlib” or “deflate” formats) defined by RFC 1952. If any ambiguity arises when implementing this format, the implementation provided by the zlib compression library is authoritative.
Readers should support reading pages containing multiple GZIP members. However, as this has historically not been supported by all implementations, it is recommended that writers refrain from creating such pages by default, for better interoperability.
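As an illustration of the recommended reader behaviour, Python's gzip module already decompresses a stream of concatenated GZIP members in sequence, so a page body built from two members round-trips cleanly:

```python
import gzip

# A page body consisting of two concatenated GZIP members.
member1 = gzip.compress(b"first half,")
member2 = gzip.compress(b" second half")
page_body = member1 + member2

# A conforming reader decompresses all members, not just the first.
assert gzip.decompress(page_body) == b"first half, second half"
```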
A codec based on or interoperable with the LZO compression library.
A deprecated codec loosely based on the LZ4 compression algorithm, but with an additional undocumented framing scheme. The framing is part of the original Hadoop compression library and was historically copied first in parquet-mr, then emulated with mixed results by parquet-cpp.
It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable LZ4_RAW codec.