Overview

All about Parquet.

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides high-performance compression and encoding schemes to handle complex data in bulk and is supported in many programming languages and analytics tools.

This documentation contains information about both the parquet-mr and parquet-format repositories.

parquet-format

The parquet-format repository hosts the official specification of the Apache Parquet file format, defining how data is structured and stored. This specification, together with the Thrift metadata definitions, is what developers need in order to read and write Parquet files correctly.
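
For a flavor of those Thrift definitions, here is an abridged excerpt of the FileMetaData struct from parquet.thrift (comments and later fields omitted; see the repository for the authoritative, complete definition):

    struct FileMetaData {
      /** Version of this file **/
      1: required i32 version

      /** Parquet schema for this file, stored as a flattened tree **/
      2: required list<SchemaElement> schema

      /** Number of rows in this file **/
      3: required i64 num_rows

      /** Row groups in this file **/
      4: required list<RowGroup> row_groups

      /** Optional key/value metadata **/
      5: optional list<KeyValue> key_value_metadata

      /** String for the application that wrote this file **/
      6: optional string created_by
    }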

Because it is focused on the specification, the parquet-format repository contains no implementation source code; implementations live in separate repositories such as parquet-mr.

parquet-mr

The parquet-mr repository is part of the Apache Parquet project and provides Java tools for handling the Parquet file format within the Hadoop ecosystem (the “mr” stands for MapReduce). It includes the Java libraries and modules that developers need to read and write Apache Parquet files.

The parquet-mr repository contains an implementation of the Apache Parquet format. There are a number of other Parquet format implementations, which are listed below.

Included in parquet-mr:

  • Java Implementation: the core Java implementation of the Apache Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop.

  • Utilities and APIs: utilities and APIs for working with Apache Parquet files, including tools for data import/export, schema management, and data conversion.
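
As a minimal sketch of what writing and reading a file looks like with parquet-mr's Avro bindings (the User schema and file path are invented for illustration; the parquet-avro module and Hadoop client libraries are assumed to be on the classpath):

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class ParquetRoundTrip {
      public static void main(String[] args) throws Exception {
        // Hypothetical schema and path, purely for illustration.
        Schema schema = SchemaBuilder.record("User").fields()
            .requiredString("name")
            .requiredInt("age")
            .endRecord();
        Path path = new Path("users.parquet");

        // Write one record, compressed with Snappy.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
            .<GenericRecord>builder(path)
            .withSchema(schema)
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .build()) {
          GenericRecord user = new GenericData.Record(schema);
          user.put("name", "Ada");
          user.put("age", 36);
          writer.write(user);
        }

        // Read the records back.
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
            .<GenericRecord>builder(path)
            .build()) {
          GenericRecord record;
          while ((record = reader.read()) != null) {
            System.out.println(record);
          }
        }
      }
    }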

Other Clients / Libraries / Tools

The Parquet ecosystem is rich and varied, encompassing a wide array of tools, libraries, and clients with differing levels of feature support. Not all implementations support the same features of the Parquet format, so when integrating multiple Parquet implementations within your workflow, test thoroughly to ensure compatibility and performance across platforms and tools.

Here is a non-exhaustive list of Parquet implementations:

  • Apache Arrow (C++, with bindings for Python, R, Ruby, and other languages)

  • arrow-rs (Rust)

  • fastparquet (Python)

  • DuckDB (built-in reader and writer)

  • cuDF (CUDA)

  • Apache Impala (C++)

Motivation

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.

Parquet is built from the ground up with complex nested data structures in mind, and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to the simple flattening of nested namespaces.
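
As a rough illustration of what nesting looks like in practice, parquet-mr can parse a schema such as the Document example from the Dremel paper; each leaf field becomes its own column on disk, and repetition/definition levels let records be shredded into those columns and reassembled losslessly:

    import org.apache.parquet.schema.MessageType;
    import org.apache.parquet.schema.MessageTypeParser;

    public class NestedSchemaExample {
      public static void main(String[] args) {
        // A nested schema in Parquet's message-type notation, modeled on
        // the "Document" record from the Dremel paper.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message Document {\n"
          + "  required int64 DocId;\n"
          + "  optional group Links {\n"
          + "    repeated int64 Backward;\n"
          + "    repeated int64 Forward;\n"
          + "  }\n"
          + "}");

        // Leaf fields such as Links.Backward each become a column chunk.
        System.out.println(schema);
      }
    }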

Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.

Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive, difficult-to-set-up dependencies.